Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures

arXiv cs.LG Papers

Summary

This paper introduces Test-Time Personalization (TTP), a framework that improves LLM personalization by scaling inference-time computation through candidate sampling and reward-based selection. It diagnoses failure modes in standard reward models and proposes a probabilistic personalized reward model to mitigate them.

arXiv:2605.10991v1 Announce Type: new Abstract: Existing approaches to LLM personalization focus on constructing better personalized models or inputs, while treating inference as a single-shot process. In this work, we study Test-Time Personalization (TTP) along an unexplored axis: scaling inference-time computation by sampling N candidates from a personalized policy model and selecting the best with a personalized reward model. We prove that oracle selection yields expected utility growing logarithmically with the number of sampled candidates, establishing a theoretical ceiling for test-time scaling. However, standard reward models fail to realize this potential. To diagnose why, we derive a unified scaling law that decomposes any reward model's Best-of-N curve into four measurable quantities and reveals two failure modes, user-level collapse (near-constant prediction for some users) and query-level reward hacking (negative correlation with true quality for some queries). Guided by this law, we propose a probabilistic personalized reward model whose learned variance effectively mitigates both failure modes. Experiments confirm both elements of our framework: TTP delivers consistent scaling across multiple policy models and personalized text generation tasks, and our scaling law closely matches observed scaling curves across reward-model variants.
Original Article
View Cached Full Text

Cached at: 05/13/26, 06:26 AM

# Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures
Source: [https://arxiv.org/html/2605.10991](https://arxiv.org/html/2605.10991)
Linhai Zhang King’s College London linhai\.zhang@kcl\.ac\.uk&Yulan He King’s College London The Alan Turing Institute yulan\.he@kcl\.ac\.uk

###### Abstract

Existing approaches to LLM personalization focus on constructing better personalized models or inputs, while treating inference as a single\-shot process\. In this work, we studyTest\-Time Personalization \(TTP\)along an unexplored axis: scaling inference\-time computation by samplingNNcandidates from a personalized policy model and selecting the best with a personalized reward model\. We prove that oracle selection yields expected utility growing logarithmically with the number of sampled candidates, establishing a theoretical ceiling for test\-time scaling\. However, standard reward models fail to realize this potential\. To diagnose why, we derive a unified scaling law that decomposes any reward model’s Best\-of\-NNcurve into four measurable quantities and reveals two failure modes,*user\-level collapse*\(near\-constant prediction for some users\) and*query\-level reward hacking*\(negative correlation with true quality for some queries\)\. Guided by this law, we propose a probabilistic personalized reward model whose learned variance effectively mitigates both failure modes\. Experiments confirm both elements of our framework: TTP delivers consistent scaling across multiple policy models and personalized text generation tasks, and our scaling law closely matches observed scaling curves across reward\-model variants\.

## 1Introduction

Large language models excel across diverse tasks, yet they predominantly produce*one\-size\-fits\-all*responses that ignore individual user preferences\[[7](https://arxiv.org/html/2605.10991#bib.bib1)\]\. This limitation has motivated growing research interest in LLM personalization\. Existing approaches fall into three categories\.*Personalized Prompting*augments input context with retrieved user history\[[15](https://arxiv.org/html/2605.10991#bib.bib3),[17](https://arxiv.org/html/2605.10991#bib.bib2)\]\.*Personalized Adaptation*directly fine\-tunes model parameters on user data\[[21](https://arxiv.org/html/2605.10991#bib.bib4),[25](https://arxiv.org/html/2605.10991#bib.bib5)\]\.*Personalized Alignment*combines multi\-objective reward models with user\-specific weights\[[3](https://arxiv.org/html/2605.10991#bib.bib6),[1](https://arxiv.org/html/2605.10991#bib.bib7)\]\. Despite their differences, these approaches share a common paradigm: they focus on constructing better personalized models or inputs, while treating inference as a single\-shot generation process\.

Meanwhile, scaling test\-time computation has emerged as a powerful axis for improving LLM performance, particularly in reasoning tasks\[[22](https://arxiv.org/html/2605.10991#bib.bib8),[19](https://arxiv.org/html/2605.10991#bib.bib9),[18](https://arxiv.org/html/2605.10991#bib.bib10)\]\. Recent work has begun to combine test\-time scaling with personalization, but only along two axes:*scaling user interactions*to learn each user’s preferences online\[[11](https://arxiv.org/html/2605.10991#bib.bib28)\], and*scaling reward\-model reasoning*via generative reward models\[[26](https://arxiv.org/html/2605.10991#bib.bib30)\]\. A third, equally natural axis,*scaling the policy model itself*, has not been studied for personalization\. This raises a natural question:*Can we improve personalization by scaling test\-time computation for a weak personalized model?*

A straightforward instantiation of test\-time personalization is parallel sampling\[[17](https://arxiv.org/html/2605.10991#bib.bib2),[4](https://arxiv.org/html/2605.10991#bib.bib12)\]: a personalized policy generatesNNcandidates per query, and a personalized reward model selects the best\. We first establish the theoretical promise of this approach: when the number of candidates increases, the expected utility of the optimal sample grows logarithmically \(Section[3](https://arxiv.org/html/2605.10991#S3)\)\. This provides a theoretical ceiling for what test\-time personalization can achieve\. Realizing this ceiling in practice, however, is far from trivial\. As Figure[1](https://arxiv.org/html/2605.10991#S1.F1)shows, both a global reward model trained on pooled data and a per\-user reward model trained on individual histories perform near random selection acrossNN, leaving the oracle gap wide open\.*What prevents standard reward models from scaling?*

![Refer to caption](https://arxiv.org/html/2605.10991v1/x1.png)Figure 1:Test\-time personalization on LaMP\-4: News Headline Generation\. Oracle shows logarithmic scaling that surpasses training\-based baselines, while standard Reward Models \(RMs\) fail to scale, performing close to, or worse than, random selection\.To diagnose this gap, we develop an analytical framework that connects a reward model’s correlation with the golden score to its Best\-of\-NNscaling behavior \(Section[4](https://arxiv.org/html/2605.10991#S4)\)\. The analysis surfaces two distinct failure patterns,*user\-level collapse*where the reward model degenerates to near\-constant predictions for some users and*query\-level reward hacking*where the reward model’s predictions negatively correlate with quality for some queries\. We then generalize the oracle scaling law into a*general expression*that forecasts the scaling curve of any reward model from four measurable quantities\. Guided by this expression, we propose aprobabilistic personalized reward modelthat mitigates both failure patterns via learned variance, enabling stable test\-time scaling in practice \(Section[5](https://arxiv.org/html/2605.10991#S5)\)\.

Experiments on two benchmarks covering five personalized text generation tasks confirm both elements of the framework: our probabilistic reward model scales reliably and surpasses training\-based baselines on most tasks, and the general expression closely matches observed scaling curves\. In summary, our contributions are three\-fold:

- •We introduce Test\-Time Personalization \(TTP\), a new paradigm focused on scaling policy\-model compute for personalized text generation\.
- •We detect two failure patterns and a predictive scaling law, a general expression that forecasts any reward model’s scaling curve from four measurable quantities\.
- •We propose a probabilistic personalized reward model that mitigates both failure patterns and scales reliably across multiple policy types and tasks\.

## 2Preliminaries

We focus on*personalized text generation*tasks, where each user has demonstrated stylistic preferences through written examples and the system must mimic these preferences at inference time\.

### 2\.1Problem Formulation

Consider a set of users𝒰\\mathcal\{U\}, where each useru∈𝒰u\\in\\mathcal\{U\}has an underlying preference functionru∗:𝒬×𝒳→ℝr\_\{u\}^\{\*\}:\\mathcal\{Q\}\\times\\mathcal\{X\}\\to\\mathbb\{R\}that maps a query\-response pair\(q,x\)\(q,x\)to a scalar reward reflecting how wellxxaligns with the user’s preferences for queryqq\. Each user is associated with historical data𝒟u=\{\(qi,xi\)\}i=1nu\\mathcal\{D\}\_\{u\}=\\\{\(q\_\{i\},x\_\{i\}\)\\\}\_\{i=1\}^\{n\_\{u\}\}\.

Given a queryqq, a personalized policyπu\\pi\_\{u\}conditioned on user history generatesNNcandidate responses:\{x1,x2,…,xN\}∼πu\(⋅∣q\)\\\{x\_\{1\},x\_\{2\},\\ldots,x\_\{N\}\\\}\\sim\\pi\_\{u\}\(\\cdot\\mid q\)\. A personalized reward modelr^u\\hat\{r\}\_\{u\}trained on𝒟u\\mathcal\{D\}\_\{u\}selects the best candidate:x∗=arg⁡maxxi⁡r^u​\(q,xi\)x^\{\*\}=\\arg\\max\_\{x\_\{i\}\}\\hat\{r\}\_\{u\}\(q,x\_\{i\}\)\. The effectiveness of TTP is measured by the expected true reward of the selected response:

U​\(N\)=𝔼q,x1:N∼πu​\[ru∗​\(q,x∗\)\]\.U\(N\)=\\mathbb\{E\}\_\{q,\\,x\_\{1:N\}\\sim\\pi\_\{u\}\}\\left\[r\_\{u\}^\{\*\}\(q,x^\{\*\}\)\\right\]\.\(1\)The*oracle*strategy selects candidates using the true preference functionru∗r\_\{u\}^\{\*\}, yielding the optimal utilityUoracle​\(N\)U\_\{\\text\{oracle\}\}\(N\)\. The goal of TTP is to approach this oracle performance\.

### 2\.2Experimental Setup

We use five personalized text generation tasks from LaMP\[[17](https://arxiv.org/html/2605.10991#bib.bib2)\]and LongLaMP\[[4](https://arxiv.org/html/2605.10991#bib.bib12)\], covering news headlines, scholarly titles, abstracts, product reviews, and topical posts\. The personalized policy follows a retrieval\-augmented generation \(RAG\) recipe\[[17](https://arxiv.org/html/2605.10991#bib.bib2)\], conditioning generation on user\-history examples retrieved per query\. Sinceru∗r\_\{u\}^\{\*\}is unobservable, we use ROUGE between generated responses and user\-written references as a ground\-truth\-reward proxy\. For each user, we sample candidate responses from the policy model and compute their ROUGE scores based on the golden response to construct training data\. We consider two standard reward models:Global RM, trained on pooled data across users, andUser RM, trained per user on𝒟u\\mathcal\{D\}\_\{u\}\. For evaluation, the policy model generatesNNcandidates per query and different reward models select the best candidate\. We report ROUGE scores averaged across users\.

## 3The Promise of Test\-Time Personalization

We begin by establishing the theoretical foundation of TTP: given access to an oracle reward function, the expected utility of selected responses grows logarithmically with the number of candidates\. We then empirically validate this scaling law and show that oracle TTP can surpass training\-based methods\[[21](https://arxiv.org/html/2605.10991#bib.bib4)\]\.

### 3\.1Theoretical Foundation

Our theoretical analysis establishes a scaling law for oracle selection:

###### Theorem 3\.1\(Oracle Scaling Law\)\.

Assume the true reward of responses sampled fromπu\(⋅∣q\)\\pi\_\{u\}\(\\cdot\\mid q\)is sub\-Gaussian with meanμu\\mu\_\{u\}and variance proxyσu2\\sigma\_\{u\}^\{2\}\. Letxoracle∗=arg⁡maxi⁡ru∗​\(q,xi\)x^\{\*\}\_\{\\mathrm\{oracle\}\}=\\arg\\max\_\{i\}r\_\{u\}^\{\*\}\(q,x\_\{i\}\)denote the oracle selection fromNNi\.i\.d\. samples\. Then the expected population\-level utility satisfies:

U¯oracle​\(N\)=𝔼u​\[Uoracle,u​\(N\)\]≤μ¯\+σ¯⋅c​ln⁡N\\bar\{U\}\_\{\\mathrm\{oracle\}\}\(N\)=\\mathbb\{E\}\_\{u\}\\left\[U\_\{\\mathrm\{oracle\},u\}\(N\)\\right\]\\leq\\bar\{\\mu\}\+\\bar\{\\sigma\}\\cdot c\\sqrt\{\\ln N\}\(2\)forN≥2N\\geq 2, whereμ¯=𝔼u​\[μu\]\\bar\{\\mu\}=\\mathbb\{E\}\_\{u\}\[\\mu\_\{u\}\],σ¯=𝔼u​\[σu\]\\bar\{\\sigma\}=\\mathbb\{E\}\_\{u\}\[\\sigma\_\{u\}\], andc\>0c\>0is a universal constant\.

The formal proof is provided in Appendix[A\.1](https://arxiv.org/html/2605.10991#A1.SS1)\. The sub\-Gaussian assumption is mild and generally satisfied when the policy generates diverse but bounded\-quality responses, which is typical for LLM outputs under temperature sampling; we empirically verify this property on our data in Appendix[C\.2](https://arxiv.org/html/2605.10991#A3.SS2)\. Theln⁡N\\sqrt\{\\ln N\}form arises from the well\-known result on the expected maximum of sub\-Gaussian random variables, which grows asO​\(ln⁡N\)O\(\\sqrt\{\\ln N\}\)forNNi\.i\.d\. samples\.

Theorem[3\.1](https://arxiv.org/html/2605.10991#S3.Thmtheorem1)establishes a theoretical ceiling for test\-time personalization: even a weak policy \(lowμ¯\\bar\{\\mu\}\) can yield high\-quality outputs through scaled sampling, provided the reward model can identify the best candidates\. The key question becomes whether we can build reward models that approach this oracle performance\.

### 3\.2Empirical Validation

We validate the oracle scaling law on two representative personalization tasks: scholarly title generation \(LaMP\-5\) and product review generation \(LongLaMP\)\. For each task, we sampleN∈\{1,5,10,15,20,30\}N\\in\\\{1,5,10,15,20,30\\\}candidates per query and select the best one using the ground\-truth reward \(average ROUGE\-1 and ROUGE\-L scores against reference\)\.

![Refer to caption](https://arxiv.org/html/2605.10991v1/x2.png)Figure 2:Oracle scaling on \(a\) LaMP\-5 and \(b\) LongLaMP\-Product\. Oracle selection \(red solid\) closely follows the theoretical predictionμ¯\+σ¯​c​ln⁡N\\bar\{\\mu\}\+\\bar\{\\sigma\}c\\sqrt\{\\ln N\}\(orange dashed\); the horizontal teal dotted line marks the per\-user training\-based baseline\.Figure[2](https://arxiv.org/html/2605.10991#S3.F2)presents the results\. The oracle scaling curves exhibit clear logarithmic growth, closely matching the theoretical prediction of Theorem[3\.1](https://arxiv.org/html/2605.10991#S3.Thmtheorem1)\. Notably, oracle TTP surpasses training\-based methods at modest candidate counts \(N≈5​–​10N\\approx 5\\text\{\-\-\}10\), showing that substantial gains are achievable with moderate computational overhead\. These results establish that test\-time personalization offers a promising pathway to performance beyond training\-based methods, provided we can construct effective reward models\.

## 4The Challenges of Test\-Time Personalization

Having established that TTP can substantially improve personalization with oracle reward models, we now investigate whether this potential can be realized with learned reward models\. We find that standard approaches fail in surprising ways, then develop a theoretical framework to diagnose the underlying causes\.

### 4\.1Standard Reward Models Fail to Scale

We evaluate the two reward model approaches defined in Section[2](https://arxiv.org/html/2605.10991#S2): Global RM trained on population\-level data, and User\-specific RM trained individually per user\. Figure[3](https://arxiv.org/html/2605.10991#S4.F3)presents the scaling curves on two representative tasks\.

![Refer to caption](https://arxiv.org/html/2605.10991v1/x3.png)Figure 3:Scaling curves for standard reward models\. \(a\) On LaMP\-5, Global RM performs no better than random selection\. \(b\) On LongLaMP\-Product Review, User RM underperforms Global RM despite being explicitly trained on personalized user data\. Both fall far short of the oracle upper bound\.Two unexpected phenomena emerge\. First, on LaMP\-5, Global RM achieves nearly identical performance to random selection; it fails to provide any meaningful signal for candidate selection\. Second, on LongLaMP, User RM underperforms Global RM despite being explicitly personalized\. This contradicts the intuition that user\-specific training should improve personalization\. Both approaches fall far short of the oracle upper bound, indicating that realizing the promise of TTP requires understanding why these standard methods fail\.

### 4\.2From Correlation to Scaling

To analyze reward model quality, we introduce the correlation between learned and true rewards:

###### Definition 4\.1\(Reward Model Correlation\)\.

For useruu, the correlation between a learned reward modelr^u\\hat\{r\}\_\{u\}and the true preferenceru∗r\_\{u\}^\{\*\}is:

ρu=Corr​\(r^u​\(q,x\),ru∗​\(q,x\)\)\.\\rho\_\{u\}=\\mathrm\{Corr\}\\left\(\\hat\{r\}\_\{u\}\(q,x\),\\,r\_\{u\}^\{\*\}\(q,x\)\\right\)\.\(3\)

This correlation directly determines scaling behavior:

###### Lemma 4\.2\(Correlation\-Scaling Relationship\)\.

Under uniformity assumption, for a reward model with correlationρu\\rho\_\{u\}, the Best\-of\-N utility satisfies:

Uu​\(N\)≈μu\+ρu⋅σu⋅c​ln⁡NU\_\{u\}\(N\)\\approx\\mu\_\{u\}\+\\rho\_\{u\}\\cdot\\sigma\_\{u\}\\cdot c\\sqrt\{\\ln N\}\(4\)

The proof is provided in Appendix[A\.2](https://arxiv.org/html/2605.10991#A1.SS2), and the bivariate\-linearity assumption it relies on is empirically verified in Appendix[C\.2](https://arxiv.org/html/2605.10991#A3.SS2)\. Intuitively, correlation acts as a scaling coefficient: positiveρ\\rhoyields increasing curves,ρ≈0\\rho\\approx 0yields flat curves, and negativeρ\\rhoyields decreasing curves\. This framework allows us to diagnose reward model failures by examining correlation distributions\.

### 4\.3Diagnosing Failure Modes

Lemma[4\.2](https://arxiv.org/html/2605.10991#S4.Thmtheorem2)motivates two diagnostic correlations: per\-userρu\\rho\_\{u\}\(averaged over queries\) and per\-queryρq\\rho\_\{q\}\(across candidates within a query\)\. Figure[4](https://arxiv.org/html/2605.10991#S4.F4)reveals two failure modes, each explaining one phenomenon in Figure[3](https://arxiv.org/html/2605.10991#S4.F3)\.Query\-level reward hacking\(ρq<0\\rho\_\{q\}<0\): on LaMP\-5, 46% of queries exhibit negative correlation under Global RM \(37% under User RM\), causing it to systematically prefer worse candidates; this explains why Global RM cannot beat random in Figure[3](https://arxiv.org/html/2605.10991#S4.F3)\(a\)\.User\-level collapse\(ρu<0\.1\\rho\_\{u\}<0\.1\): User RM achieves higher mean correlation than Global RM \(0\.37 vs 0\.24 on LaMP\-5; 0\.43 vs 0\.40 on LongLaMP\-Product\), but 25% of LongLaMP\-Product users collapse under User RM \(vs 0% under Global RM\), where the reward model emits near\-constant predictions and provides no useful signal; this explains why User RM falls below Global RM in Figure[3](https://arxiv.org/html/2605.10991#S4.F3)\(b\)\.

![Refer to caption](https://arxiv.org/html/2605.10991v1/x4.png)Figure 4:Correlation diagnostics on LaMP\-5 and LongLaMP\-Product Review\.\(a, b\)Per\-user correlationρu\\rho\_\{u\}: User RM achieves higher mean correlation but exhibits a heavy lower tail crossing the red dotted line atρu=0\.1\\rho\_\{u\}\\\!=\\\!0\.1\(user\-level collapse threshold\); Global RM remains tightly concentrated\.\(c, d\)Per\-query correlationρq\\rho\_\{q\}: a substantial fraction of queries lie below the red dotted line atρq=0\\rho\_\{q\}\\\!=\\\!0\(query\-level reward hacking\)\.
### 4\.4Unified Scaling Law

We formalize these failure modes and derive their joint effect on scaling behavior:

###### Definition 4\.3\(Failure Rates\)\.

For a reward modeling approach, we define:

- •Collapse rateα\\alpha: Fraction of users withρu<0\.1\\rho\_\{u\}<0\.1;
- •Reward hacking rateβ\\beta: Fraction of queries withρq<0\\rho\_\{q\}<0\.

###### Proposition 4\.4\(Unified Scaling Law\)\.

The population\-level Best\-of\-N utility can be approximated as:

U¯​\(N\)≈μ¯\+\(1−α\)⏟non\-collapsed⋅\[\(1−β\)​ρ¯\+−β​\|ρ¯−\|\]⏟effective correlation⋅σ¯⋅c​ln⁡N\\bar\{U\}\(N\)\\approx\\bar\{\\mu\}\+\\underbrace\{\(1\-\\alpha\)\}\_\{\\text\{non\-collapsed\}\}\\cdot\\underbrace\{\\left\[\(1\-\\beta\)\\bar\{\\rho\}\_\{\+\}\-\\beta\|\\bar\{\\rho\}\_\{\-\}\|\\right\]\}\_\{\\text\{effective correlation\}\}\\cdot\\bar\{\\sigma\}\\cdot c\\sqrt\{\\ln N\}\(5\)whereρ¯\+\>0\\bar\{\\rho\}\_\{\+\}\>0is the mean correlation on non\-hacked queries andρ¯−<0\\bar\{\\rho\}\_\{\-\}<0is the mean correlation on hacked queries\.

This formula reveals thepersonalization\-stability trade\-off: Global RM achieves stability \(lowα\\alphaandβ\\beta\) by averaging across users, but this same averaging limits its ability to capture individual preferences\. User RM achieves higher correlation for many users but suffers from both collapse and reward hacking due to limited per\-user data\.

Proposition[4\.4](https://arxiv.org/html/2605.10991#S4.Thmtheorem4)\(proof in Appendix[A\.3](https://arxiv.org/html/2605.10991#A1.SS3)\) suggests that an effective approach must simultaneously achieve: \(1\) high base correlation through personalization, \(2\) low collapse rate for user\-level stability, and \(3\) low reward hacking rate for query\-level stability\. We empirically validate the formula in Section[6\.2](https://arxiv.org/html/2605.10991#S6.SS2)and Appendix[C\.1](https://arxiv.org/html/2605.10991#A3.SS1): the four measured quantities forecast observed Best\-of\-NNscaling curves at relMAE below3%3\\%across all tasks\. In the next section, we show that probabilistic reward modeling is the unique design that achieves the three desiderata above\.

## 5Probabilistic Reward Modeling

The analysis in Section[4](https://arxiv.org/html/2605.10991#S4)reveals that effective TTP requires high correlation, low collapse rateα\\alpha, and low reward hacking rateβ\\beta\. We now show that probabilistic reward modeling achieves all three by introducing a learned varianceσ2\\sigma^\{2\}that absorbs uncertainty rather than forcing premature commitment\.

### 5\.1Theoretical Analysis

Consider the Gaussian negative log\-likelihood \(NLL\) loss

ℒNLL=12​log⁡σ2​\(x\)\+\(y−μ​\(x\)\)22​σ2​\(x\),\\mathcal\{L\}\_\{\\text\{NLL\}\}=\\tfrac\{1\}\{2\}\\log\\sigma^\{2\}\(x\)\+\\frac\{\(y\-\\mu\(x\)\)^\{2\}\}\{2\\sigma^\{2\}\(x\)\},\(6\)whose gradient with respect toμ\\muis\(μ−y\)/σ2\(\\mu\-y\)/\\sigma^\{2\}\. The learnedσ2\\sigma^\{2\}thus acts as an adaptive per\-input weight, giving two complementary effects\.

###### Lemma 5\.1\(Gradient Buffering\)\.

For inputs whereyyis poorly predictable fromxx\(collapse\-prone users\), increasingσ2\\sigma^\{2\}attenuates the gradient onμ\\mu, preventing the constant\-prediction equilibrium that traps deterministic models\. This reduces the collapse rate:αprob<αdet\\alpha^\{\\text\{prob\}\}<\\alpha^\{\\text\{det\}\}\.

###### Lemma 5\.2\(Implicit Regularization\)\.

For inputs with inconsistent training signals, NLL drivesσ2\\sigma^\{2\}large, down\-weighting their contribution to the gradient onμ\\mu\. This prevents overfitting to spurious patterns that produce negative per\-query correlation, reducing the hacking rate:βprob<βdet\\beta^\{\\text\{prob\}\}<\\beta^\{\\text\{det\}\}\.

Together \(proofs in Appendix[A\.4](https://arxiv.org/html/2605.10991#A1.SS4)\), the two lemmas give probabilistic User RM the unique combination required by Proposition[4\.4](https://arxiv.org/html/2605.10991#S4.Thmtheorem4): lowα\\alpha, lowβ\\beta, and highρ¯\+\\bar\{\\rho\}\_\{\+\}\. This produces the monotonic Best\-of\-NNscaling we observe empirically \(Section[6\.2](https://arxiv.org/html/2605.10991#S6.SS2)\), in contrast to the flat or non\-monotonic curves of deterministic models when failure modes dominate\.

### 5\.2Implementation

We extend a language model backbone with two heads predicting per\-input mean and variance:

μ​\(x\)=Sigmoid​\(MLPμ​\(𝐡\)\),σ2​\(x\)=Softplus​\(MLPσ​\(𝐡\)\),\\mu\(x\)=\\text\{Sigmoid\}\(\\text\{MLP\}\_\{\\mu\}\(\\mathbf\{h\}\)\),\\qquad\\sigma^\{2\}\(x\)=\\text\{Softplus\}\(\\text\{MLP\}\_\{\\sigma\}\(\\mathbf\{h\}\)\),\(7\)where𝐡\\mathbf\{h\}is the masked\-mean\-pooled hidden state from the backbone, which we adapt per user via LoRA for parameter efficiency\.

The training objective combines the NLL loss in Eq\. \([6](https://arxiv.org/html/2605.10991#S5.E6)\) with a margin loss that sharpens ranking among high\-quality candidates, the regime that matters for Best\-of\-NNselection:

ℒcontrast=∑i,j:yi\>yjyi\>τmax⁡\(0,m−\(μi−μj\)\),\\mathcal\{L\}\_\{\\text\{contrast\}\}=\\sum\_\{\\begin\{subarray\}\{c\}i,j:\\,y\_\{i\}\>y\_\{j\}\\\\ y\_\{i\}\>\\tau\\end\{subarray\}\}\\max\(0,m\-\(\\mu\_\{i\}\-\\mu\_\{j\}\)\),\(8\)whereτ\\tauis a high\-score threshold andmmis the margin, givingℒ=ℒNLL\+λ​ℒcontrast\\mathcal\{L\}=\\mathcal\{L\}\_\{\\text\{NLL\}\}\+\\lambda\\mathcal\{L\}\_\{\\text\{contrast\}\}\. At inference, we score candidates byμ\\mualone; the variance is used only at training time\. Backbone, hyperparameters, and prompt templates appear in Appendix[D](https://arxiv.org/html/2605.10991#A4), an ablation across architectural components and the loss decomposition is in Appendix[B\.3](https://arxiv.org/html/2605.10991#A2.SS3), and computational\-cost measurements \(training and inference timing\) are in Appendix[B\.7](https://arxiv.org/html/2605.10991#A2.SS7)\.

## 6Experiments

We evaluate Test\-Time Personalization on five personalized text\-generation tasks, validating both the predictive scaling law of Proposition[4\.4](https://arxiv.org/html/2605.10991#S4.Thmtheorem4)and the practical effectiveness of probabilistic RM\.

![Refer to caption](https://arxiv.org/html/2605.10991v1/x5.png)Figure 5:Main TTP results under the RAG policy across five tasks \(a\)–\(e\)\.Top row: observed Best\-of\-NNROUGE \(solid lines with markers\) and theory predictions \(dash\-dot lines\), where each method’s prediction is computed by substituting its four diagnostic quantities\(α,β,ρ¯\+,ρ¯−\)\(\\alpha,\\beta,\\bar\{\\rho\}\_\{\+\},\\bar\{\\rho\}\_\{\-\}\)into Proposition[4\.4](https://arxiv.org/html/2605.10991#S4.Thmtheorem4)\(oracle uses Theorem[3\.1](https://arxiv.org/html/2605.10991#S3.Thmtheorem1)\); theory and observation agree at relMAE<<3%\.Middle row: per\-user correlationρu\\rho\_\{u\}violin distributions\.Bottom row: per\-query correlationρq\\rho\_\{q\}violin distributions; the red dotted line marksρq=0\\rho\_\{q\}\\\!=\\\!0\. Probabilistic User RM \(green\) is the only method whose curve scales monotonically, and the right\-shifted, tightened correlation distributions confirm that this gain is driven by the elimination of user\-level collapse and query\-level reward hacking\.### 6\.1Setup

#### Datasets\.

We use five tasks from LaMP\[[17](https://arxiv.org/html/2605.10991#bib.bib2)\]\(4\-News Headline Generation, 5\-Scholarly Title Generation\) and LongLaMP\[[4](https://arxiv.org/html/2605.10991#bib.bib12)\]\(Abstract Generation, Product Review Writing, Topic Writing\)\. FollowingTanet al\.\[[21](https://arxiv.org/html/2605.10991#bib.bib4)\], we select the top\-3030users with the longest profiles for LaMP and the top\-2020for LongLaMP\.

#### Policy Families\.

Our main results use a retrieval\-augmented generation \(RAG\) policy\[[17](https://arxiv.org/html/2605.10991#bib.bib2)\]that conditions on user\-history examples retrieved per query\. To probe whether TTP gains depend on the choice of generator, we additionally evaluate two alternatives in Appendix[B\.1](https://arxiv.org/html/2605.10991#A2.SS1):*Persona prompting*\[[16](https://arxiv.org/html/2605.10991#bib.bib31)\]\(a user\-style description synthesised from1515history samples by an external LLM\) and*Persona\+RAG*\(concatenating both\)\. All three useQwen3\-4B\-Instruct\[[23](https://arxiv.org/html/2605.10991#bib.bib22)\]as the generator and produceN=30N\{=\}30candidates per query at temperatureT=1\.7T\{=\}1\.7\(LaMP\) orT=1\.5T\{=\}1\.5\(LongLaMP\)\. We use80%80\\%of each user’s data to train reward models and20%20\\%for evaluation\.

#### Reward Models\.

We compare five selection strategies:Random\(uniform pick\),Global RM\(single model on pooled data\),Deterministic User RM\(per\-user, MSE loss with high\-score contrastive loss\),Probabilistic User RM\(per\-user, Gaussian NLL with high\-score contrastive loss; ours\), andOracle\(selection by ground\-truth ROUGE\)\. All reward models share theQwen2\.5\-1\.5B\-Instruct\[[12](https://arxiv.org/html/2605.10991#bib.bib23)\]backbone with LoRA \(r=8r\{=\}8\); the probabilistic variant adds a softplus variance head\. For comparison with training\-based personalization, we also fine\-tune per\-user LoRA adapters on the policy following\[[21](https://arxiv.org/html/2605.10991#bib.bib4)\]\. Variance\-based inference strategies are explored in Appendix[B\.6](https://arxiv.org/html/2605.10991#A2.SS6)but offer no improvement over mean\-based selection\.

#### Evaluation\.

We sampleN∈\{1,5,10,15,20,30\}N\\in\\\{1,5,10,15,20,30\\\}candidates per validation query and report the average of ROUGE\-1 and ROUGE\-L across users\. Each experiment is repeated1010times; standard deviations remain below0\.0070\.007across methods andNN\. Robustness to ROUGE is assessed with BERTScore and an LLM\-as\-judge in Appendix[B\.2](https://arxiv.org/html/2605.10991#A2.SS2); sensitivity to per\-user data volume, LoRA rank, and statistical stability across repetitions are reported in Appendix[B\.5](https://arxiv.org/html/2605.10991#A2.SS5)\.

### 6\.2Main Results

Figure[5](https://arxiv.org/html/2605.10991#S6.F5)presents three views of evaluation with RAG policy: scaling curves with theory overlay, per\-user correlation distributions, and per\-query correlation distributions\. We highlight three findings\.

#### Probabilistic RM is the only method that scales reliably\.

Across all five tasks \(Figure[5](https://arxiv.org/html/2605.10991#S6.F5), top row\), Probabilistic User RM is the only method whose Best\-of\-NNcurve increases monotonically\. ByN=30N\{=\}30, Probabilistic RM closes between51%51\\%and84%84\\%of the oracle gap depending on the task, and consistently surpasses the per\-user LoRA training\-based baseline byN=5N\{=\}5–1010on three of the five tasks \(per\-task numbers in Appendix[B\.1](https://arxiv.org/html/2605.10991#A2.SS1)\)\. The same selector ranking holds under two additional policy families \(Persona prompting and Persona\+RAG, Appendix[B\.1](https://arxiv.org/html/2605.10991#A2.SS1)\), confirming that TTP behaves as a policy\-orthogonal amplifier whose effect is determined by the reward model rather than the underlying generator\.

#### Theory predictions match observations and the correlation distributions show why\.

The dash\-dot theory curves in the top row are computed by substituting each reward model’s four measured quantities\(α,β,ρ¯\+,ρ¯−\)\(\\alpha,\\beta,\\bar\{\\rho\}\_\{\+\},\\bar\{\\rho\}\_\{\-\}\)into Proposition[4\.4](https://arxiv.org/html/2605.10991#S4.Thmtheorem4), with no per\-method fitting; they yield effective correlationsρeff=0\.20/0\.19/0\.61\\rho\_\{\\text\{eff\}\}=0\.20/0\.19/0\.61for Global / Det / Prob RM and predict the observed curves at relMAE4\.4%/2\.7%/1\.7%4\.4\\%/2\.7\\%/1\.7\\%averaged across the five tasks \(per\-task table in Appendix[C\.1](https://arxiv.org/html/2605.10991#A3.SS1)\)\. The middle and bottom rows of Figure[5](https://arxiv.org/html/2605.10991#S6.F5)explain why theseρeff\\rho\_\{\\text\{eff\}\}values arise: probabilistic modeling shifts both per\-user and per\-query correlation distributions rightward and tightens them, dropping the user\-level collapse rateα\\alphafrom0\.150\.15\(Det\) to0and the query\-level hacking rateβ\\betafrom0\.250\.25to0\.060\.06averaged across all five tasks\. These distributional shifts are exactly the conditions Proposition[4\.4](https://arxiv.org/html/2605.10991#S4.Thmtheorem4)requires for monotonic improvement, completing a tight loop between theory, diagnosis, and method\.

#### Task characteristics shape TTP headroom\.

Long\-form tasks \(Abstract, Product, Topic\) yield larger oracle gaps and correspondingly larger TTP gains than short\-form tasks \(Headlines, Titles\), where small token differences produce large ROUGE swings and reduce learnable signal\. The diagnostic framework explains this directly: short\-form tasks exhibit smaller Oracleσ¯\\bar\{\\sigma\}, hence a smaller logarithmic\-growth coefficientσ¯​c​ln⁡N\\bar\{\\sigma\}c\\sqrt\{\\ln N\}\(Theorem[3\.1](https://arxiv.org/html/2605.10991#S3.Thmtheorem1)\)\.

### 6\.3Failure\-Mode Analysis

To understand*when*and*how*the two failure modes manifest in practice, we conduct a deeper analysis on LaMP\-4 \(Figure[6](https://arxiv.org/html/2605.10991#S6.F6)\); cross\-task results appear in Appendix[C\.3](https://arxiv.org/html/2605.10991#A3.SS3)\.

![Refer to caption](https://arxiv.org/html/2605.10991v1/x6.png)Figure 6:Failure\-mode analysis on LaMP\-4\.\(a\)Ground\-truth ROUGE standard deviation distinguishes collapsed users \(ρu<0\.1\\rho\_\{u\}\\\!<\\\!0\.1under Det RM\) from normal users \(ρu\>0\.5\\rho\_\{u\}\\\!\>\\\!0\.5\)\.\(b\)Score histogram of a representative collapsed user: GT scores cluster near zero, Det RM degenerates to near\-constant predictions \(ρ=−0\.10\\rho\\\!=\\\!\{\-\}0\.10\), while Prob RM preserves meaningful variation \(ρ=0\.78\\rho\\\!=\\\!0\.78\)\.\(c\)Per\-query correlation scatter \(Det vs Prob\): green points in the upper\-left quadrant are queries where Prob “rescues” Det’s negative correlation\.\(d\)Learned varianceσ2\\sigma^\{2\}correlates strongly with prediction error \(Spearmanr=0\.73r\\\!=\\\!0\.73\), evidence that Prob RM is well\-calibrated\.#### When does user\-level collapse occur?

Figure[6](https://arxiv.org/html/2605.10991#S6.F6)\(a\) groups users by their ground\-truth ROUGE variance: collapsed users \(under Det RM\) have significantly lower label variance than normal users \(p<0\.001p\\\!<\\\!0\.001\)\. A representative collapsed user \(Figure[6](https://arxiv.org/html/2605.10991#S6.F6)\(b\)\) illustrates the mechanism: when GT scores cluster narrowly, Det RM emits near\-constant predictions \(ρ=−0\.10\\rho\\\!=\\\!\{\-\}0\.10\), while Prob RM preserves meaningful variation \(ρ=0\.78\\rho\\\!=\\\!0\.78\)\. This aligns with Lemma[5\.1](https://arxiv.org/html/2605.10991#S5.Thmtheorem1): NLL training increasesσ2\\sigma^\{2\}on weak\-signal users, attenuating the gradient onμ\\muand preventing the regression\-to\-the\-mean that traps deterministic models\. The collapse rate drops fromαdet=0\.20\\alpha^\{\\text\{det\}\}=0\.20toαprob=0\\alpha^\{\\text\{prob\}\}=0on LaMP\-4, and to0on every other task as well\.

#### When does query\-level hacking occur?

Figure[6](https://arxiv.org/html/2605.10991#S6.F6)\(c\) plots per\-query Det vs Prob correlations\. The green points \(upper\-left quadrant\) are queries where Det produces negative correlation but Prob produces positive correlation, accounting for28%28\\%of all LaMP\-4 queries; combined with cases where both are positive \(blue\), Prob’s negative\-correlation rate drops fromβdet=0\.33\\beta^\{\\text\{det\}\}=0\.33toβprob=0\.08\\beta^\{\\text\{prob\}\}=0\.08\. Figure[6](https://arxiv.org/html/2605.10991#S6.F6)\(d\) explains why: Prob RM’s learnedσ2\\sigma^\{2\}rises with prediction error \(Spearmanr=0\.73r\\\!=\\\!0\.73\), so during training the loss naturally down\-weights queries it cannot fit\. This implicit regularization \(Lemma[5\.2](https://arxiv.org/html/2605.10991#S5.Thmtheorem2)\) prevents Det RM’s overfitting to spurious patterns on hard queries, which is what produces negative correlation in the first place\.

## 7Related Work

LLM Personalization\.Existing approaches to LLM personalization fall into three families\.*Prompting\-based methods*augment the input context with user information: LaMP\[[17](https://arxiv.org/html/2605.10991#bib.bib2)\]retrieves user history via dense retrieval, while PAG\[[15](https://arxiv.org/html/2605.10991#bib.bib3)\]first generates user profiles with LLMs to enrich personalization\.*Adaptation\-based methods*fine\-tune model parameters per user: OPPU\[[21](https://arxiv.org/html/2605.10991#bib.bib4)\]introduces per\-user LoRA adapters, and FDLoRA\[[10](https://arxiv.org/html/2605.10991#bib.bib13)\]improves efficiency through federated learning\.*Alignment\-based methods*formulate personalization as multi\-objective optimization: Rewarded Soup\[[13](https://arxiv.org/html/2605.10991#bib.bib14)\]interpolates separately trained policy weights to balance preference dimensions\. All three modify what the model sees or how it is trained; we instead scale the policy output space at inference time and study when selection\-based scaling is reliable\.

Uncertainty in Reward Modeling\.Uncertainty quantification has been used to mitigate reward hacking in general alignment\. Ensemble approaches train multiple reward models and use disagreement as the uncertainty signal\[[9](https://arxiv.org/html/2605.10991#bib.bib15)\]\. UP\-RLHF\[[24](https://arxiv.org/html/2605.10991#bib.bib17)\]trains diverse LoRA ensembles with uncertainty\-penalized optimization, while PURM\[[20](https://arxiv.org/html/2605.10991#bib.bib16)\]extends Bradley–Terry to a distributional reward\. A common feature of these works is that uncertainty is consumed at*inference time*, e\.g\., as a PPO penalty or a selection adjustment\. Our use of the variance head is structurally different:σ2\\sigma^\{2\}acts purely at*training time*via gradient buffering and implicit regularization \(Lemmas[5\.1](https://arxiv.org/html/2605.10991#S5.Thmtheorem1),[5\.2](https://arxiv.org/html/2605.10991#S5.Thmtheorem2)\); inference uses only the predicted mean\. This shift exposes a personalization\-specific role for probabilistic modeling that prior alignment\-focused work does not address\.

Test\-Time Scaling for Personalization\.Recent work has shown that scaling test\-time compute can improve LLM alignment, but the available compute can be allocated to several distinct targets\. Three are particularly relevant for personalization:*\(i\) scaling user interactions*: T\-POP\[[11](https://arxiv.org/html/2605.10991#bib.bib28)\]learns per\-user preferences online via dueling bandits, requiring repeated preference queries;*\(ii\) scaling reward\-model reasoning*: generative reward models unroll chain\-of\-thought tokens inside the reward model itself, trading inference cost for ranking accuracy; and*\(iii\) scaling policy outputs*: samplingNNcandidates from a personalized policy and selecting the best with a per\-user reward model, the axis we study here\. The three axes are complementary rather than competing; our diagnostic framework \(Section[4](https://arxiv.org/html/2605.10991#S4)\) applies to any reward model used in axis \(iii\) and could be combined with \(i\) or \(ii\)\. Training\-free LLM\-as\-judge methods, such as persona\-conditioned pairwise tournaments, instantiate axis \(ii\) with a general\-purpose LLM as the personalized reward model, but achieve only coin\-flip accuracy on per\-user style matching in our setting \(Appendix[B\.4](https://arxiv.org/html/2605.10991#A2.SS4)\), confirming that per\-user trained reward models remain necessary for stylistic personalization\. TPO\[[6](https://arxiv.org/html/2605.10991#bib.bib29)\]iteratively refines outputs through textual critiques but does not target personalization\.

## 8Conclusion and Discussion

Summary\.We introducedTest\-Time Personalization\(TTP\), scaling policy outputs at inference for personalized text generation\. Our central contribution is a*predictive scaling law*that links four measurable reward\-model properties to the Best\-of\-NNutility curve, validated empirically at relMAE<<3%\. The law identifies two failure modes \(user\-level collapse and query\-level reward hacking\), and we prove that probabilistic reward modeling overcomes both via training\-time gradient buffering and implicit regularization\. The resulting probabilistic User RM is the only method whose Best\-of\-NNcurve scales monotonically across all1515\(task, policy\) cells we tested, confirming TTP as a policy\-orthogonal amplifier\.

Limitations\.Our evaluation focuses on*personalized text generation*with publicly available benchmarks \(LaMP, LongLaMP\) using ROUGE\-derived ground\-truth rewards\. While we show robustness to alternative metrics \(BERTScore and an LLM\-as\-judge, Appendix[B\.2](https://arxiv.org/html/2605.10991#A2.SS2)\), generalization beyond text generation \(e\.g\., dialogue, agentic task\) remains untested\. The framework also assumes per\-user data sufficient to train a small probabilistic reward model; cold\-start scenarios fall outside the regime we study and are better addressed by orthogonal axes such as scaling user interactions \(Section[7](https://arxiv.org/html/2605.10991#S7)\)\. Finally, training separate reward models per user may pose deployment challenges at scale, motivating future work on parameter\-efficient cross\-user sharing\.

Future Directions\.The diagnostic framework opens several avenues\. First, the three test\-time axes \(user interactions, reward\-model reasoning, policy outputs\) are complementary; combining probabilistic per\-user RMs with generative reward modeling is a natural next step\. Second, extending probabilistic reward modeling to non\-text modalities would test whether the gradient\-buffering and implicit\-regularization mechanisms transfer to settings with richer label structure\.

## Acknowledgments and Disclosure of Funding

## References

- \[1\]S\. Aroca\-Ouellette, N\. Mackraz, B\. Theobald, and K\. Metcalf\(2025\)Aligning LLMs by predicting preferences from user writing samples\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=eUMGCipgtE)Cited by:[§1](https://arxiv.org/html/2605.10991#S1.p1.1)\.
- \[2\]T\. Dao\(2024\)FlashAttention\-2: faster attention with better parallelism and work partitioning\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=mZn2Xyh9Ec)Cited by:[§D\.2](https://arxiv.org/html/2605.10991#A4.SS2.SSS0.Px7.p1.1)\.
- \[3\]J\. Jang, S\. Kim, B\. Y\. Lin, Y\. Wang, J\. Hessel, L\. Zettlemoyer, H\. Hajishirzi, Y\. Choi, and P\. Ammanabrolu\(2024\)Personalized soups: personalized large language model alignment via post\-hoc parameter merging\.InAdaptive Foundation Models: Evolving AI for Personalized and Efficient Learning,External Links:[Link](https://openreview.net/forum?id=EMrnoPRvxe)Cited by:[§1](https://arxiv.org/html/2605.10991#S1.p1.1)\.
- \[4\]I\. Kumar, S\. Viswanathan, S\. Yerra, A\. Salemi, R\. A\. Rossi, F\. Dernoncourt, H\. Deilamsalehy, X\. Chen, R\. Zhang, S\. Agarwal, N\. Lipka, C\. V\. Nguyen, T\. H\. Nguyen, and H\. Zamani\(2024\)LongLaMP: a benchmark for personalized long\-form text generation\.External Links:2407\.11016,[Link](https://arxiv.org/abs/2407.11016)Cited by:[§D\.1](https://arxiv.org/html/2605.10991#A4.SS1.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.10991#S1.p3.2),[§2\.2](https://arxiv.org/html/2605.10991#S2.SS2.p1.3),[§6\.1](https://arxiv.org/html/2605.10991#S6.SS1.SSS0.Px1.p1.2)\.
- \[5\]W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. Gonzalez, H\. Zhang, and I\. Stoica\(2023\)Efficient memory management for large language model serving with pagedattention\.InProceedings of the 29th Symposium on Operating Systems Principles,SOSP ’23,New York, NY, USA,pp\. 611–626\.External Links:ISBN 9798400702297,[Link](https://doi.org/10.1145/3600006.3613165),[Document](https://dx.doi.org/10.1145/3600006.3613165)Cited by:[§D\.1](https://arxiv.org/html/2605.10991#A4.SS1.SSS0.Px2.p2.1)\.
- \[6\]Y\. Li, X\. Hu, X\. Qu, L\. Li, and Y\. Cheng\(2025\)Test\-time preference optimization: on\-the\-fly alignment via iterative textual feedback\.External Links:2501\.12895,[Link](https://arxiv.org/abs/2501.12895)Cited by:[§7](https://arxiv.org/html/2605.10991#S7.p3.1)\.
- \[7\]J\. Liu, Z\. Qiu, Z\. Li, Q\. Dai, W\. Yu, J\. Zhu, M\. Hu, M\. Yang, T\. Chua, and I\. King\(2025\)A survey of personalized large language models: progress and future directions\.External Links:2502\.11528,[Link](https://arxiv.org/abs/2502.11528)Cited by:[§1](https://arxiv.org/html/2605.10991#S1.p1.1)\.
- \[8\]I\. Loshchilov and F\. Hutter\(2019\)Decoupled weight decay regularization\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Bkg6RiCqY7)Cited by:[§D\.2](https://arxiv.org/html/2605.10991#A4.SS2.SSS0.Px7.p1.1)\.
- \[9\]X\. Lou, D\. Yan, W\. Shen, Y\. Yan, J\. Xie, and J\. Zhang\(2025\)Uncertainty\-aware reward model: teaching reward models to know what is unknown\.External Links:2410\.00847,[Link](https://arxiv.org/abs/2410.00847)Cited by:[§7](https://arxiv.org/html/2605.10991#S7.p2.1)\.
- \[10\]J\. Qi, Z\. Luan, S\. Huang, C\. Fung, H\. Yang, and D\. Qian\(2024\)Fdlora: personalized federated learning of large language model via dual lora tuning\.arXiv preprint arXiv:2406\.07925\.Cited by:[§7](https://arxiv.org/html/2605.10991#S7.p1.1)\.
- \[11\]Z\. Qu, M\. Zhang, M\. Kong, X\. Li, Z\. Shang, Z\. Wang, Y\. Ban, S\. Qiu, Y\. Shu, and Z\. Dai\(2025\)T\-pop: test\-time personalization with online preference feedback\.External Links:2509\.24696,[Link](https://arxiv.org/abs/2509.24696)Cited by:[§1](https://arxiv.org/html/2605.10991#S1.p2.1),[§7](https://arxiv.org/html/2605.10991#S7.p3.1)\.
- \[12\]Qwen, :, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu\(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§D\.2](https://arxiv.org/html/2605.10991#A4.SS2.SSS0.Px1.p1.2),[§6\.1](https://arxiv.org/html/2605.10991#S6.SS1.SSS0.Px3.p1.1)\.
- \[13\]A\. Rame, G\. Couairon, C\. Dancette, J\. Gaya, M\. Shukor, L\. Soulier, and M\. Cord\(2023\)Rewarded soups: towards pareto\-optimal alignment by interpolating weights fine\-tuned on diverse rewards\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=lSbbC2VyCu)Cited by:[§7](https://arxiv.org/html/2605.10991#S7.p1.1)\.
- \[14\]N\. Reimers and I\. Gurevych\(2019\-11\)Sentence\-BERT: sentence embeddings using Siamese BERT\-networks\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\),Hong Kong, China,pp\. 3982–3992\.External Links:[Link](https://aclanthology.org/D19-1410/),[Document](https://dx.doi.org/10.18653/v1/D19-1410)Cited by:[§D\.1](https://arxiv.org/html/2605.10991#A4.SS1.SSS0.Px2.p1.1)\.
- \[15\]C\. Richardson, Y\. Zhang, K\. Gillespie, S\. Kar, A\. Singh, Z\. Raeesy, O\. Z\. Khan, and A\. Sethy\(2023\)Integrating summarization and retrieval for enhanced personalization via large language models\.External Links:2310\.20081,[Link](https://arxiv.org/abs/2310.20081)Cited by:[§1](https://arxiv.org/html/2605.10991#S1.p1.1),[§7](https://arxiv.org/html/2605.10991#S7.p1.1)\.
- \[16\]M\. J\. Ryan, O\. Shaikh, A\. Bhagirath, D\. Frees, W\. Held, and D\. Yang\(2025\-07\)SynthesizeMe\! inducing persona\-guided prompts for personalized reward models in LLMs\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 8045–8078\.External Links:[Link](https://aclanthology.org/2025.acl-long.397/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.397),ISBN 979\-8\-89176\-251\-0Cited by:[§6\.1](https://arxiv.org/html/2605.10991#S6.SS1.SSS0.Px2.p1.6)\.
- \[17\]A\. Salemi, S\. Mysore, M\. Bendersky, and H\. Zamani\(2024\-08\)LaMP: when large language models meet personalization\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 7370–7392\.External Links:[Link](https://aclanthology.org/2024.acl-long.399/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.399)Cited by:[§D\.1](https://arxiv.org/html/2605.10991#A4.SS1.SSS0.Px1.p1.1),[§D\.1](https://arxiv.org/html/2605.10991#A4.SS1.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.10991#S1.p1.1),[§1](https://arxiv.org/html/2605.10991#S1.p3.2),[§2\.2](https://arxiv.org/html/2605.10991#S2.SS2.p1.3),[§6\.1](https://arxiv.org/html/2605.10991#S6.SS1.SSS0.Px1.p1.2),[§6\.1](https://arxiv.org/html/2605.10991#S6.SS1.SSS0.Px2.p1.6),[§7](https://arxiv.org/html/2605.10991#S7.p1.1)\.
- \[18\]J\. Shen, H\. Bai, L\. Zhang, Y\. Zhou, A\. Setlur, S\. Tong, D\. Caples, N\. Jiang, T\. Zhang, A\. Talwalkar, and A\. Kumar\(2025\)Thinking vs\. doing: agents that reason by scaling test\-time interaction\.InWorkshop on Scaling Environments for Agents,External Links:[Link](https://openreview.net/forum?id=uhigrPHBm5)Cited by:[§1](https://arxiv.org/html/2605.10991#S1.p2.1)\.
- \[19\]C\. V\. Snell, J\. Lee, K\. Xu, and A\. Kumar\(2025\)Scaling LLM test\-time compute optimally can be more effective than scaling parameters for reasoning\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=4FWAwZtd2n)Cited by:[§1](https://arxiv.org/html/2605.10991#S1.p2.1)\.
- \[20\]W\. Sun, X\. Cheng, X\. Yu, H\. Xu, Z\. Yang, S\. He, J\. Zhao, and K\. Liu\(2025\)Probabilistic uncertain reward model\.External Links:2503\.22480,[Link](https://arxiv.org/abs/2503.22480)Cited by:[§7](https://arxiv.org/html/2605.10991#S7.p2.1)\.
- \[21\]Z\. Tan, Q\. Zeng, Y\. Tian, Z\. Liu, B\. Yin, and M\. Jiang\(2024\-11\)Democratizing large language models via personalized parameter\-efficient fine\-tuning\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 6476–6491\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.372/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.372)Cited by:[§B\.1](https://arxiv.org/html/2605.10991#A2.SS1.SSS0.Px1.p1.4),[§D\.1](https://arxiv.org/html/2605.10991#A4.SS1.SSS0.Px1.p2.1),[§1](https://arxiv.org/html/2605.10991#S1.p1.1),[§3](https://arxiv.org/html/2605.10991#S3.p1.1),[§6\.1](https://arxiv.org/html/2605.10991#S6.SS1.SSS0.Px1.p1.2),[§6\.1](https://arxiv.org/html/2605.10991#S6.SS1.SSS0.Px3.p1.1),[§7](https://arxiv.org/html/2605.10991#S7.p1.1)\.
- \[22\]Y\. Wu, Z\. Sun, S\. Li, S\. Welleck, and Y\. Yang\(2025\)Inference scaling laws: an empirical analysis of compute\-optimal inference for LLM problem\-solving\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=VNckp7JEHn)Cited by:[§1](https://arxiv.org/html/2605.10991#S1.p2.1)\.
- \[23\]A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu\(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§D\.1](https://arxiv.org/html/2605.10991#A4.SS1.SSS0.Px2.p1.1),[§6\.1](https://arxiv.org/html/2605.10991#S6.SS1.SSS0.Px2.p1.6)\.
- \[24\]Y\. Zhai, Y\. Lei, H\. Zhang, Y\. Yu, K\. Xu, D\. Feng, B\. Ding, and H\. Wang\(2026\)Uncertainty\-penalized reinforcement learning from human feedback with diversified reward lora ensembles\.Information Processing and Management63\(3\),pp\. 104548\.External Links:ISSN 0306\-4573,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.ipm.2025.104548),[Link](https://www.sciencedirect.com/science/article/pii/S0306457325004893)Cited by:[§7](https://arxiv.org/html/2605.10991#S7.p2.1)\.
- \[25\]L\. Zhang, J\. Wu, D\. Zhou, and Y\. He\(2025\-07\)PROPER: a progressive learning framework for personalized large language models with group\-level adaptation\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 16399–16411\.External Links:[Link](https://aclanthology.org/2025.acl-long.800/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.800),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2605.10991#S1.p1.1)\.
- \[26\]P\. Zhang, T\. Lin, Y\. Wu, J\. Chen, Z\. Wang, H\. Yang, X\. Ze, F\. Huang, Y\. Li, and K\. Zhang\(2026\)P\-genRM: personalized generative reward model with test\-time user\-based scaling\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=hXNApWLBZG)Cited by:[§1](https://arxiv.org/html/2605.10991#S1.p2.1)\.

## Appendix ATheoretical Proofs

In this section, we provide formal proofs or theoretical analysis for all theoretical results presented in the main text\. Our theoretical framework rests on a chain of results that progressively characterizes test\-time personalization:

1. 1\.Theorem[3\.1](https://arxiv.org/html/2605.10991#S3.Thmtheorem1)\(Oracle Scaling Law, Section[A\.1](https://arxiv.org/html/2605.10991#A1.SS1)\): Establishes the theoretical ceiling for TTP, expected utility grows asO​\(ln⁡N\)O\(\\sqrt\{\\ln N\}\)with oracle selection\.
2. 2\.Lemma[4\.2](https://arxiv.org/html/2605.10991#S4.Thmtheorem2)\(Correlation\-Scaling Relationship, Section[A\.2](https://arxiv.org/html/2605.10991#A1.SS2)\): Shows that reward model correlation directly determines scaling behavior, providing a diagnostic tool for analyzing RM quality\.
3. 3\.Proposition[4\.4](https://arxiv.org/html/2605.10991#S4.Thmtheorem4)\(Unified Scaling Law, Section[A\.3](https://arxiv.org/html/2605.10991#A1.SS3)\): Derives how the two failure modes, collapse rateα\\alphaand hacking rateβ\\beta, jointly determine population\-level scaling\.
4. 4\.Lemmas[5\.1](https://arxiv.org/html/2605.10991#S5.Thmtheorem1)and[5\.2](https://arxiv.org/html/2605.10991#S5.Thmtheorem2)\(Gradient Buffering & Implicit Regularization, Section[A\.4](https://arxiv.org/html/2605.10991#A1.SS4)\): Explains the mechanisms by which probabilistic reward modeling reduces both failure modes\.

Throughout the proofs, we introduce necessary assumptions and provide remarks connecting theoretical insights to empirical observations\. Table[A1](https://arxiv.org/html/2605.10991#A1.T1)summarizes the key assumptions used in our analysis\.

Table A1:Summary of assumptions used in theoretical analysis\.### A\.1Proof of Theorem 3\.1

We first introduce the sub\-Gaussian property and a key lemma on the maximum of random variables, then prove Theorem 3\.1\.

###### Definition A\.1\(Sub\-Gaussian Random Variable\)\.

A random variableXXwith meanμ=𝔼​\[X\]\\mu=\\mathbb\{E\}\[X\]isσ\\sigma\-sub\-Gaussian if:

𝔼​\[exp⁡\(λ​\(X−μ\)\)\]≤exp⁡\(λ2​σ22\),∀λ∈ℝ\\mathbb\{E\}\\left\[\\exp\\left\(\\lambda\(X\-\\mu\)\\right\)\\right\]\\leq\\exp\\left\(\\frac\{\\lambda^\{2\}\\sigma^\{2\}\}\{2\}\\right\),\\quad\\forall\\lambda\\in\\mathbb\{R\}\(9\)whereσ\>0\\sigma\>0is the sub\-Gaussian parameter\. This assumption covers a wide range of bounded or light\-tailed distributions, including typical LLM quality scores\.

###### Lemma A\.2\(Upper Bound on Expected Maximum\)\.

LetX1,…,XNX\_\{1\},\\ldots,X\_\{N\}be i\.i\.d\.σ\\sigma\-sub\-Gaussian random variables with meanμ\\mu\. Then the expected maximum is bounded by:

𝔼​\[maxi∈\[N\]⁡Xi\]≤μ\+σ​2​ln⁡N\\mathbb\{E\}\\left\[\\max\_\{i\\in\[N\]\}X\_\{i\}\\right\]\\leq\\mu\+\\sigma\\sqrt\{2\\ln N\}\(10\)

###### Proof\.

For anyλ\>0\\lambda\>0, using Jensen’s inequality and the monotonicity of the exponential function:

exp⁡\(λ​𝔼​\[maxi⁡\(Xi−μ\)\]\)≤𝔼​\[exp⁡\(λ​maxi⁡\(Xi−μ\)\)\]=𝔼​\[maxi⁡exp⁡\(λ​\(Xi−μ\)\)\]\\exp\\left\(\\lambda\\mathbb\{E\}\\left\[\\max\_\{i\}\(X\_\{i\}\-\\mu\)\\right\]\\right\)\\leq\\mathbb\{E\}\\left\[\\exp\\left\(\\lambda\\max\_\{i\}\(X\_\{i\}\-\\mu\)\\right\)\\right\]=\\mathbb\{E\}\\left\[\\max\_\{i\}\\exp\(\\lambda\(X\_\{i\}\-\\mu\)\)\\right\]\(11\)Bounding the maximum by the sum:

𝔼​\[maxi⁡exp⁡\(λ​\(Xi−μ\)\)\]≤∑i=1N𝔼​\[exp⁡\(λ​\(Xi−μ\)\)\]\\mathbb\{E\}\\left\[\\max\_\{i\}\\exp\(\\lambda\(X\_\{i\}\-\\mu\)\)\\right\]\\leq\\sum\_\{i=1\}^\{N\}\\mathbb\{E\}\\left\[\\exp\(\\lambda\(X\_\{i\}\-\\mu\)\)\\right\]\(12\)Applying the sub\-Gaussian property \(Definition[A\.1](https://arxiv.org/html/2605.10991#A1.Thmtheorem1)\):

∑i=1N𝔼​\[exp⁡\(λ​\(Xi−μ\)\)\]≤N​exp⁡\(λ2​σ22\)\\sum\_\{i=1\}^\{N\}\\mathbb\{E\}\\left\[\\exp\(\\lambda\(X\_\{i\}\-\\mu\)\)\\right\]\\leq N\\exp\\left\(\\frac\{\\lambda^\{2\}\\sigma^\{2\}\}\{2\}\\right\)\(13\)Taking logarithms on both sides:

λ​𝔼​\[maxi⁡\(Xi−μ\)\]≤ln⁡N\+λ2​σ22⟹𝔼​\[maxi⁡Xi\]≤μ\+ln⁡Nλ\+λ​σ22\\lambda\\mathbb\{E\}\\left\[\\max\_\{i\}\(X\_\{i\}\-\\mu\)\\right\]\\leq\\ln N\+\\frac\{\\lambda^\{2\}\\sigma^\{2\}\}\{2\}\\implies\\mathbb\{E\}\\left\[\\max\_\{i\}X\_\{i\}\\right\]\\leq\\mu\+\\frac\{\\ln N\}\{\\lambda\}\+\\frac\{\\lambda\\sigma^\{2\}\}\{2\}\(14\)The right\-hand side is minimized whenλ=2​ln⁡N/σ\\lambda=\\sqrt\{2\\ln N\}/\\sigma\. Substituting this value yields the boundμ\+σ​2​ln⁡N\\mu\+\\sigma\\sqrt\{2\\ln N\}\. ∎

###### Proof\.

For a useruuwith queryqq, assuming the true rewards\{ru∗​\(q,xi\)\}i=1N\\\{r^\{\*\}\_\{u\}\(q,x\_\{i\}\)\\\}\_\{i=1\}^\{N\}are i\.i\.d\.σu\\sigma\_\{u\}\-sub\-Gaussian with meanμu\\mu\_\{u\}, we apply Lemma[A\.2](https://arxiv.org/html/2605.10991#A1.Thmtheorem2):

Uoracle,u​\(N\)=𝔼​\[maxi∈\[N\]⁡ru∗​\(q,xi\)\]≤μu\+c⋅σu⋅ln⁡NU\_\{\\text\{oracle\},u\}\(N\)=\\mathbb\{E\}\\left\[\\max\_\{i\\in\[N\]\}r^\{\*\}\_\{u\}\(q,x\_\{i\}\)\\right\]\\leq\\mu\_\{u\}\+c\\cdot\\sigma\_\{u\}\\cdot\\sqrt\{\\ln N\}\(15\)wherec=2c=\\sqrt\{2\}is the theoretical scaling coefficient derived from the sub\-Gaussian bound\.

Taking the expectation over the population of users:

U¯oracle​\(N\)=𝔼u​\[Uoracle,u​\(N\)\]≤𝔼u​\[μu\]\+c​ln⁡N⋅𝔼u​\[σu\]=μ¯\+c​σ¯​ln⁡N\\bar\{U\}\_\{\\text\{oracle\}\}\(N\)=\\mathbb\{E\}\_\{u\}\[U\_\{\\text\{oracle\},u\}\(N\)\]\\leq\\mathbb\{E\}\_\{u\}\[\\mu\_\{u\}\]\+c\\sqrt\{\\ln N\}\\cdot\\mathbb\{E\}\_\{u\}\[\\sigma\_\{u\}\]=\\bar\{\\mu\}\+c\\bar\{\\sigma\}\\sqrt\{\\ln N\}\(16\)Thus, the oracle utility is upper bounded by a logarithmic growth term scaled by the average reward noiseσ¯\\bar\{\\sigma\}\. ∎

#### Remark on Bound Tightness\.

While Theorem[3\.1](https://arxiv.org/html/2605.10991#S3.Thmtheorem1)provides an upper bound withc=2c=\\sqrt\{2\}, our empirical observations suggest that the scaling curve tightly follows this boundary\. This indicates that the reward distributions in practice are not merely sub\-Gaussian but exhibit tail behaviors close to the Gaussian limit \(where the bound becomes an asymptotic equality\)\. Consequently, we treatccas an effective constant in our analysis to absorb minor distributional deviations, recognizing thatc≈2c\\approx\\sqrt\{2\}represents the fully saturated theoretical potential\.

### A\.2Proof of Lemma 4\.2

This lemma establishes how reward model correlation determines TTP scaling behavior\. We first present the idealized result under a uniformity assumption, then discuss when this assumption breaks down in practice\.

#### Setup and Notation

For a useruu, letr∗=ru∗​\(q,x\)r^\{\*\}=r^\{\*\}\_\{u\}\(q,x\)denote the true reward andr^=r^u​\(q,x\)\\hat\{r\}=\\hat\{r\}\_\{u\}\(q,x\)denote the predicted reward for a query\-response pair\. We assume both are standardized to have meanμ\\muand consider their correlation:

ρu=Corr​\(r^,r∗\)=𝔼​\[\(r^−μ\)​\(r∗−μ\)\]σr^​σr∗\\rho\_\{u\}=\\text\{Corr\}\(\\hat\{r\},r^\{\*\}\)=\\frac\{\\mathbb\{E\}\[\(\\hat\{r\}\-\\mu\)\(r^\{\*\}\-\\mu\)\]\}\{\\sigma\_\{\\hat\{r\}\}\\sigma\_\{r^\{\*\}\}\}\(17\)
###### Assumption A\.3\(Correlation Uniformity\)\.

The correlationρu\\rho\_\{u\}between predicted and true rewards is approximately uniform across the score distribution\. Formally, for any subset𝒮⊆\[0,1\]\\mathcal\{S\}\\subseteq\[0,1\]of the score range:

ρu\(𝒮\):=Corr​\(r^,r∗∣r∗∈𝒮\)≈ρu\\rho\_\{u\}^\{\(\\mathcal\{S\}\)\}:=\\text\{Corr\}\(\\hat\{r\},r^\{\*\}\\mid r^\{\*\}\\in\\mathcal\{S\}\)\\approx\\rho\_\{u\}\(18\)

This assumption is analogous to homoscedasticity in regression: the predictive relationship is stable across different regions of the target distribution\.

###### Lemma A\.4\(Correlation\-Scaling Relationship, Restated\)\.

Under Assumption[A\.3](https://arxiv.org/html/2605.10991#A1.Thmtheorem3), for a reward model with correlationρu\\rho\_\{u\}, the Best\-of\-N utility satisfies:

Uu​\(N\)≈μu\+ρu⋅σu⋅c​ln⁡NU\_\{u\}\(N\)\\approx\\mu\_\{u\}\+\\rho\_\{u\}\\cdot\\sigma\_\{u\}\\cdot c\\sqrt\{\\ln N\}\(19\)wherec\>0c\>0is the same constant as in Theorem[3\.1](https://arxiv.org/html/2605.10991#S3.Thmtheorem1)\.

###### Proof\.

We assume the joint distribution of predicted and true rewards follows a Bivariate Normal distribution \(or satisfies the property of linear conditional expectation\)\. Under joint normality \(or linear regression structure\), the conditional expectation of true reward given predicted reward is:

𝔼​\[r∗∣r^\]=μ\+ρu⋅σr∗σr^​\(r^−μ\)\\mathbb\{E\}\[r^\{\*\}\\mid\\hat\{r\}\]=\\mu\+\\rho\_\{u\}\\cdot\\frac\{\\sigma\_\{r^\{\*\}\}\}\{\\sigma\_\{\\hat\{r\}\}\}\(\\hat\{r\}\-\\mu\)\(20\)
Letx^∗=arg⁡maxi∈\[N\]⁡r^​\(xi\)\\hat\{x\}^\{\*\}=\\arg\\max\_\{i\\in\[N\]\}\\hat\{r\}\(x\_\{i\}\)be the candidate selected by the reward model\. The expected true reward of this selection is:

𝔼​\[r∗​\(x^∗\)\]=𝔼​\[𝔼​\[r∗∣r^​\(x^∗\)\]\]=μ\+ρu⋅σr∗σr^⋅𝔼​\[r^​\(x^∗\)−μ\]\\mathbb\{E\}\[r^\{\*\}\(\\hat\{x\}^\{\*\}\)\]=\\mathbb\{E\}\\left\[\\mathbb\{E\}\[r^\{\*\}\\mid\\hat\{r\}\(\\hat\{x\}^\{\*\}\)\]\\right\]=\\mu\+\\rho\_\{u\}\\cdot\\frac\{\\sigma\_\{r^\{\*\}\}\}\{\\sigma\_\{\\hat\{r\}\}\}\\cdot\\mathbb\{E\}\[\\hat\{r\}\(\\hat\{x\}^\{\*\}\)\-\\mu\]\(21\)
Sincer^​\(xi\)\\hat\{r\}\(x\_\{i\}\)are i\.i\.d\. sub\-Gaussian, by Lemma[A\.2](https://arxiv.org/html/2605.10991#A1.Thmtheorem2):

𝔼​\[r^​\(x^∗\)\]=𝔼​\[maxi∈\[N\]⁡r^​\(xi\)\]=μ\+c⋅σr^⋅ln⁡N\\mathbb\{E\}\[\\hat\{r\}\(\\hat\{x\}^\{\*\}\)\]=\\mathbb\{E\}\\left\[\\max\_\{i\\in\[N\]\}\\hat\{r\}\(x\_\{i\}\)\\right\]=\\mu\+c\\cdot\\sigma\_\{\\hat\{r\}\}\\cdot\\sqrt\{\\ln N\}\(22\)
Substituting into the previous equation:

Uu​\(N\)=𝔼​\[r∗​\(x^∗\)\]=μ\+ρu⋅σr∗σr^⋅c⋅σr^⋅ln⁡N=μu\+ρu⋅σu⋅c​ln⁡NU\_\{u\}\(N\)=\\mathbb\{E\}\[r^\{\*\}\(\\hat\{x\}^\{\*\}\)\]=\\mu\+\\rho\_\{u\}\\cdot\\frac\{\\sigma\_\{r^\{\*\}\}\}\{\\sigma\_\{\\hat\{r\}\}\}\\cdot c\\cdot\\sigma\_\{\\hat\{r\}\}\\cdot\\sqrt\{\\ln N\}=\\mu\_\{u\}\+\\rho\_\{u\}\\cdot\\sigma\_\{u\}\\cdot c\\sqrt\{\\ln N\}\(23\)whereσu:=σr∗\\sigma\_\{u\}:=\\sigma\_\{r^\{\*\}\}for notational consistency\. ∎

#### Remark\.

In practice, Assumption[A\.3](https://arxiv.org/html/2605.10991#A1.Thmtheorem3)often does not hold exactly\. High\-score regions are typically harder to model due to data sparsity, subtle quality distinctions, and regression to the mean, leading toρu\(high\)<ρu\(overall\)\\rho\_\{u\}^\{\(\\text\{high\}\)\}<\\rho\_\{u\}^\{\(\\text\{overall\}\)\}\. Since Best\-of\-N selection operates primarily in the right tail of the score distribution, the effective correlation for TTP is lower than the overall correlation\. This explains why in our main experiments \(Figure[5](https://arxiv.org/html/2605.10991#S6.F5)\), even when probabilistic RM achieves high user\-level correlation on LongLaMP tasks, a gap remains between TTP performance and the oracle upper bound\.

### A\.3Proof of Proposition 4\.4

This proposition establishes how the two failure modes, user\-level collapse and query\-level reward hacking, jointly determine TTP scaling behavior\.

#### Definitions and Assumptions\.

We first formalize the failure rates introduced in Definition[4\.3](https://arxiv.org/html/2605.10991#S4.Thmtheorem3):

###### Definition A\.5\(Failure Rates, Restated\)\.

For a reward modeling approach:

- •Collapse rateα\\alpha: Fraction of users with per\-user correlationρu<τc\\rho\_\{u\}<\\tau\_\{c\}, whereτc=0\.1\\tau\_\{c\}=0\.1is the collapse threshold\.
- •Reward hacking rateβ\\beta: Fraction of queries \(among non\-collapsed users\) with per\-query correlationρq<0\\rho\_\{q\}<0\.

###### Assumption A\.6\(Collapse Absorbs Hacking\)\.

For collapsed users \(ρu<τc\\rho\_\{u\}<\\tau\_\{c\}\), the reward model outputs near\-constant predictions, rendering query\-level correlation undefined or negligible\. Thus, the hacking rateβ\\betais measured only among non\-collapsed users\.

This assumption reflects the intuition that collapse represents a more severe failure mode: when a reward model cannot discriminate between candidates at all, the notion of “selecting worse candidates” \(hacking\) becomes moot\.

###### Assumption A\.7\(Variance Homogeneity\)\.

The varianceσu\\sigma\_\{u\}of true rewards is approximately constant across users and query types\. Variations inσ\\sigmaacross collapsed/non\-collapsed users or hacked/non\-hacked queries are treated as second\-order effects\.

###### Proposition A\.8\(Unified Scaling Law, Restated\)\.

Under Assumptions[A\.3](https://arxiv.org/html/2605.10991#A1.Thmtheorem3),[A\.6](https://arxiv.org/html/2605.10991#A1.Thmtheorem6), and[A\.7](https://arxiv.org/html/2605.10991#A1.Thmtheorem7), the population\-level Best\-of\-N utility can be approximated as:

U¯​\(N\)≈μ¯\+\(1−α\)⋅\[\(1−β\)​ρ¯\+−β​\|ρ¯−\|\]⋅σ¯⋅c​ln⁡N\\bar\{U\}\(N\)\\approx\\bar\{\\mu\}\+\(1\-\\alpha\)\\cdot\\left\[\(1\-\\beta\)\\bar\{\\rho\}\_\{\+\}\-\\beta\|\\bar\{\\rho\}\_\{\-\}\|\\right\]\\cdot\\bar\{\\sigma\}\\cdot c\\sqrt\{\\ln N\}\(24\)whereρ¯\+\>0\\bar\{\\rho\}\_\{\+\}\>0is the mean correlation on non\-hacked queries andρ¯−<0\\bar\{\\rho\}\_\{\-\}<0is the mean correlation on hacked queries \(both measured among non\-collapsed users\)\.

###### Proof\.

We derive the result through a two\-level decomposition\. Partition users into collapsed \(fractionα\\alpha\) and non\-collapsed \(fraction1−α1\-\\alpha\):

U¯​\(N\)=α⋅U¯collapsed​\(N\)\+\(1−α\)⋅U¯non\-collapsed​\(N\)\\bar\{U\}\(N\)=\\alpha\\cdot\\bar\{U\}\_\{\\text\{collapsed\}\}\(N\)\+\(1\-\\alpha\)\\cdot\\bar\{U\}\_\{\\text\{non\-collapsed\}\}\(N\)\(25\)
For collapsed users,ρu≈0\\rho\_\{u\}\\approx 0, so by Lemma[4\.2](https://arxiv.org/html/2605.10991#S4.Thmtheorem2),U¯collapsed​\(N\)≈μ¯\\bar\{U\}\_\{\\text\{collapsed\}\}\(N\)\\approx\\bar\{\\mu\}\.

Among non\-collapsed users, partition queries into hacked \(fractionβ\\beta\) and non\-hacked \(fraction1−β1\-\\beta\)\. Applying Lemma[4\.2](https://arxiv.org/html/2605.10991#S4.Thmtheorem2)to each partition:

U¯non\-hacked​\(N\)\\displaystyle\\bar\{U\}\_\{\\text\{non\-hacked\}\}\(N\)≈μ¯\+ρ¯\+⋅σ¯⋅c​ln⁡N\\displaystyle\\approx\\bar\{\\mu\}\+\\bar\{\\rho\}\_\{\+\}\\cdot\\bar\{\\sigma\}\\cdot c\\sqrt\{\\ln N\}\(26\)U¯hacked​\(N\)\\displaystyle\\bar\{U\}\_\{\\text\{hacked\}\}\(N\)≈μ¯\+ρ¯−⋅σ¯⋅c​ln⁡N\\displaystyle\\approx\\bar\{\\mu\}\+\\bar\{\\rho\}\_\{\-\}\\cdot\\bar\{\\sigma\}\\cdot c\\sqrt\{\\ln N\}\(27\)
Combining the query\-level results:

U¯non\-collapsed​\(N\)=μ¯\+\[\(1−β\)​ρ¯\+−β​\|ρ¯−\|\]⋅σ¯⋅c​ln⁡N\\bar\{U\}\_\{\\text\{non\-collapsed\}\}\(N\)=\\bar\{\\mu\}\+\\left\[\(1\-\\beta\)\\bar\{\\rho\}\_\{\+\}\-\\beta\|\\bar\{\\rho\}\_\{\-\}\|\\right\]\\cdot\\bar\{\\sigma\}\\cdot c\\sqrt\{\\ln N\}\(28\)
Substituting back into the user\-level decomposition:

U¯​\(N\)=μ¯\+\(1−α\)⋅\[\(1−β\)​ρ¯\+−β​\|ρ¯−\|\]⋅σ¯⋅c​ln⁡N\\bar\{U\}\(N\)=\\bar\{\\mu\}\+\(1\-\\alpha\)\\cdot\\left\[\(1\-\\beta\)\\bar\{\\rho\}\_\{\+\}\-\\beta\|\\bar\{\\rho\}\_\{\-\}\|\\right\]\\cdot\\bar\{\\sigma\}\\cdot c\\sqrt\{\\ln N\}\(29\)∎

#### Remark on the trade\-off\.

Proposition[A\.8](https://arxiv.org/html/2605.10991#A1.Thmtheorem8)reveals a fundamental trade\-off: Global RM achieves stability \(lowα\\alpha, lowβ\\beta\) but weak personalization \(lowρ¯\+\\bar\{\\rho\}\_\{\+\}\); deterministic User RM achieves highρ¯\+\\bar\{\\rho\}\_\{\+\}but suffers from instability \(highα\\alphaorβ\\beta\); probabilistic User RM achieves both high correlation and stability through gradient buffering and implicit regularization \(Lemmas[5\.1](https://arxiv.org/html/2605.10991#S5.Thmtheorem1)and[5\.2](https://arxiv.org/html/2605.10991#S5.Thmtheorem2)\), enabling effective TTP\. The unified scaling law thus provides a diagnostic framework for predicting TTP performance by measuringα\\alpha,β\\beta,ρ¯\+\\bar\{\\rho\}\_\{\+\}, andρ¯−\\bar\{\\rho\}\_\{\-\}\.

#### Remark on the collapsed\-user assumption\.

Assumption[A\.6](https://arxiv.org/html/2605.10991#A1.Thmtheorem6)treats the contribution of collapsed users as negligible because their per\-user correlation is approximately zero\. In our data, however, collapsed users tend to have a small but non\-zero negative correlation \(e\.g\.,ρ¯collapsed≈−0\.027\\bar\{\\rho\}\_\{\\text\{collapsed\}\}\\approx\-0\.027on LaMP\-4 and−0\.030\-0\.030on LongLaMP\-Product Review\)\. Replacing theρu≈0\\rho\_\{u\}\\approx 0assumption with the empirically measuredρ¯collapsed\\bar\{\\rho\}\_\{\\text\{collapsed\}\}yields a refined formula

U¯​\(N\)≈μ¯\+\[\(1−α\)​ρeff\+α​ρ¯collapsed\]⋅σ¯⋅c​ln⁡N\.\\bar\{U\}\(N\)\\approx\\bar\{\\mu\}\+\\big\[\(1\-\\alpha\)\\rho\_\{\\text\{eff\}\}\+\\alpha\\bar\{\\rho\}\_\{\\text\{collapsed\}\}\\big\]\\cdot\\bar\{\\sigma\}\\cdot c\\sqrt\{\\ln N\}\.\(30\)The refinement is small in magnitude \(improving relMAE on LaMP\-4 Det RM from9\.1%9\.1\\%to8\.8%8\.8\\%, similarly on the other tasks where collapsed users are most prevalent\) and*strengthens*the main conclusion: collapsed users contribute slightly*negative*scaling, so probabilistic modeling, which eliminates collapse altogether \(αprob=0\\alpha^\{\\text\{prob\}\}\{=\}0\), is even more important than the original analysis suggests\.

### A\.4Theoretical Analysis of Lemma 5\.1 and Lemma 5\.2

In this section, we provide theoretical analysis of how probabilistic reward modeling addresses the two failure modes identified in Section[4](https://arxiv.org/html/2605.10991#S4)\. Both mechanisms operate during training by leveraging the learned varianceσ2\\sigma^\{2\}to modulate gradient flow and sample weighting\.

#### Preliminaries: NLL vs MSE Training\.

Consider a reward model that predicts meanμ​\(x\)\\mu\(x\)and varianceσ2​\(x\)\\sigma^\{2\}\(x\)for inputxxwith ground\-truth labelyy\. The two training objectives are:

MSE Loss:

ℒMSE=\(y−μ\)2\\mathcal\{L\}\_\{\\text\{MSE\}\}=\(y\-\\mu\)^\{2\}\(31\)
Gaussian NLL Loss:

ℒNLL=12​log⁡σ2\+\(y−μ\)22​σ2\\mathcal\{L\}\_\{\\text\{NLL\}\}=\\frac\{1\}\{2\}\\log\\sigma^\{2\}\+\\frac\{\(y\-\\mu\)^\{2\}\}\{2\\sigma^\{2\}\}\(32\)
The key difference lies in their gradients with respect toμ\\mu:

∂ℒMSE∂μ\\displaystyle\\frac\{\\partial\\mathcal\{L\}\_\{\\text\{MSE\}\}\}\{\\partial\\mu\}=−2​\(y−μ\)\\displaystyle=\-2\(y\-\\mu\)\(33\)∂ℒNLL∂μ\\displaystyle\\frac\{\\partial\\mathcal\{L\}\_\{\\text\{NLL\}\}\}\{\\partial\\mu\}=μ−yσ2\\displaystyle=\\frac\{\\mu\-y\}\{\\sigma^\{2\}\}\(34\)
Under NLL training, the gradient magnitude is scaled by1/σ21/\\sigma^\{2\}, allowing the model to modulate learning signals through the variance prediction\.

#### Theoretical Analysis of Lemma 5\.1\.

###### Lemma A\.9\(Gradient Buffering, Restated\)\.

Under NLL training, the effective gradient onμ\\muscales asO​\(1/σ2\)O\(1/\\sigma^\{2\}\)\. For users with narrow score distributions where training signals provide weak discriminative information, the model learns to increaseσ2\\sigma^\{2\}, thereby attenuating gradient magnitudes and preventing the collapse that occurs in deterministic models trained with MSE\.

###### Proof\.

Consider a useruuwhose ground\-truth reward distribution has low variance \(narrow distribution with smallΔy:=ymax−ymin\\Delta\_\{y\}:=y\_\{\\max\}\-y\_\{\\min\}\)\.

Under MSE training, whenΔy\\Delta\_\{y\}is small and inputsxix\_\{i\}are diverse, the model faces conflicting gradients: different inputs with similar labels pullμ\\mutoward the same value\. The equilibrium is a near\-constant predictionμMSE​\(x\)≈y¯\\mu\_\{\\text\{MSE\}\}\(x\)\\approx\\bar\{y\}, constituting user\-level collapse\.

Under NLL training, the model jointly optimizesμ\\muandσ2\\sigma^\{2\}\. The gradient with respect toσ2\\sigma^\{2\}is:

∂ℒNLL∂σ2=12​σ2−\(y−μ\)22​σ4\\frac\{\\partial\\mathcal\{L\}\_\{\\text\{NLL\}\}\}\{\\partial\\sigma^\{2\}\}=\\frac\{1\}\{2\\sigma^\{2\}\}\-\\frac\{\(y\-\\mu\)^\{2\}\}\{2\\sigma^\{4\}\}\(35\)
Setting this to zero yields the per\-sample equilibriumσ∗2=\(y−μ\)2\\sigma^\{\*2\}=\(y\-\\mu\)^\{2\}\. When the model cannot accurately predictμ\\mu, the residual\(y−μ\)2\(y\-\\mu\)^\{2\}remains large, andσ2\\sigma^\{2\}increases to match this residual\. The gradient onμ\\muthen becomes:

∂ℒNLL∂μ=μ−yσ2=O​\(1/\|y−μ\|\)\\frac\{\\partial\\mathcal\{L\}\_\{\\text\{NLL\}\}\}\{\\partial\\mu\}=\\frac\{\\mu\-y\}\{\\sigma^\{2\}\}=O\(1/\|y\-\\mu\|\)\(36\)
Asσ2\\sigma^\{2\}increases, gradient magnitudes onμ\\mudecrease, preventing the model from being forced to fit contradictory signals and allowingμ\\muto maintain meaningful variation\. ∎

#### Theoretical Analysis of Lemma 5\.2\.

###### Lemma A\.10\(Implicit Regularization, Restated\)\.

Under NLL training, the learned varianceσ2\\sigma^\{2\}acts as an adaptive sample weight, effectively down\-weighting samples with unreliable or inconsistent labels\. This implicit regularization prevents overfitting to spurious patterns, resulting in mean predictionsμ\\muthat exhibit higher correlation with true rewards and reduced reward hacking\.

###### Justification\.

The NLL loss can be decomposed as:

ℒNLL=12​log⁡σ2\+12​σ2​\(y−μ\)2\\mathcal\{L\}\_\{\\text\{NLL\}\}=\\frac\{1\}\{2\}\\log\\sigma^\{2\}\+\\frac\{1\}\{2\\sigma^\{2\}\}\(y\-\\mu\)^\{2\}\(37\)
The second term is a weighted MSE with sample\-dependent weightw=1/σ2w=1/\\sigma^\{2\}\. For samples where the input\-output relationship is inconsistent or noisy, the model cannot achieve low residuals throughμ\\mualone\. The optimal response is to increaseσ2\\sigma^\{2\}for these samples, which reduces their contribution to the MSE term while incurring a modest penalty from thelog⁡σ2\\log\\sigma^\{2\}term \(preventingσ2→∞\\sigma^\{2\}\\to\\infty\)\.

Spurious patterns arise when the model memorizes incidental correlations that do not generalize\. Under NLL training, precisely these noisy samples receive highσ2\\sigma^\{2\}predictions, reducing their influence onμ\\mu\. This prevents the model from confidently learning spurious correlations\. By down\-weighting noisy samples during training, the resultingμ\\mupredictions are more aligned with genuine quality signals, yielding higher correlationρq\\rho\_\{q\}and lower hacking rateβ\\beta\. ∎

#### Remark\.

A key insight is that the variance headσ2\\sigma^\{2\}functions as an auxiliary task that improves learning ofμ\\muwithout directly participating in inference\-time decisions\. Both gradient buffering and implicit regularization operate entirely during training\. At inference time, we select candidates using onlyμ\\mu, but thisμ\\muis more robust than one trained with MSE alone\. Notably, usingσ2\\sigma^\{2\}at inference time \(e\.g\., filtering out high\-variance predictions\) does not further improve performance in our experiments, reinforcing that the primary value of probabilistic modeling lies in training\-time regularization\.

## Appendix BAdditional Experimental Results

### B\.1Generalization Across Policy Families

The main\-text results use a RAG policy\. To probe whether TTP gains depend on the choice of generator, we additionally evaluate two other policy families:*Persona prompting*\(a user\-style description synthesised from1515history samples by an external LLM\) and*Persona\+RAG*\(concatenating both\)\. Figure[A1](https://arxiv.org/html/2605.10991#A2.F1)shows the corresponding scaling curves; Tables[A2](https://arxiv.org/html/2605.10991#A2.T2)and[A3](https://arxiv.org/html/2605.10991#A2.T3)list the numerical Best\-of\-N=30N\{=\}30results\.

![Refer to caption](https://arxiv.org/html/2605.10991v1/x7.png)Figure A1:TTP scaling under two additional policy families across all five tasks\. Top row: Persona prompting \(weakN=1N\{=\}1baseline\)\. Bottom row: Persona\+RAG \(strongN=1N\{=\}1baseline\)\. The selector ranking Probabilistic\>\>Deterministic\>\>Global≈\\approxRandom is preserved on every \(task, policy\) pair, confirming that TTP acts as a policy\-orthogonal amplifier whose effect is determined by the reward model rather than the underlying generator\.“Capture” below denotes\(Prob−Random\)/\(Oracle−Random\)\(\\text\{Prob\}\-\\text\{Random\}\)/\(\\text\{Oracle\}\-\\text\{Random\}\), the fraction of the oracle gap recovered by the probabilistic reward model\.

Table A2:Persona prompting policy: ROUGE atN=30N\{=\}30\.Table A3:Persona\+RAG policy: ROUGE atN=30N\{=\}30\.The probabilistic User RM is the best non\-oracle selector on every \(task, policy\) pair\. Oracle capture exceeds50%50\\%in all1010cells and reaches84%84\\%on Persona\+RAG Product Review\. The capture rate is highest for the strongest policy \(Persona\+RAG\), confirming that stronger generators provide more high\-quality candidates the reward model can exploit\. Crucially, even on Persona\+RAG withN=1N\{=\}1baselines as high as0\.3760\.376, Global RM still hacks \(LaMP\-4:0\.376→0\.3160\.376\\to 0\.316\), confirming that hacking is a property of the reward model rather than of the candidate quality range\.

#### TTP with a per\-user LoRA training\-based policy\.

We further evaluate TTP on top of a strong training\-based personalization method: per\-user LoRA fine\-tuning of the policy\[[21](https://arxiv.org/html/2605.10991#bib.bib4)\], which substantially raises theN=1N\{=\}1baseline\. Figure[A2](https://arxiv.org/html/2605.10991#A2.F2)compares RAG\-based and LoRA\-based generation on LaMP\-4 under the same probabilistic User RM\. The LoRA policy reaches a higher oracle ceiling but itsN=1N\{=\}1baseline is also much higher; the probabilistic User RM continues to scale, with reduced magnitude reflecting distribution shift between the RAG\-trained RM and LoRA\-generated candidates\. This is consistent with Theorem[3\.1](https://arxiv.org/html/2605.10991#S3.Thmtheorem1), which bounds TTP gain byσ¯​ln⁡N\\bar\{\\sigma\}\\sqrt\{\\ln N\}: a stronger policy reducesσ¯\\bar\{\\sigma\}and therefore the achievable headroom, but does not invalidate the scaling mechanism\.

![Refer to caption](https://arxiv.org/html/2605.10991v1/x8.png)Figure A2:TTP scaling curves comparing RAG\-based \(Base, dashed\) and LoRA\-based \(solid\) policy models on LaMP\-4\. The LoRA policy reaches a higher oracle ceiling but its baseline is also much higher; the probabilistic User RM continues to scale, with reduced magnitude reflecting distribution shift between the RAG\-trained RM and LoRA\-generated candidates\.

### B\.2Robustness to Evaluation Metric

ROUGE provides a convenient ground\-truth proxy but may not fully capture semantic or stylistic preferences\. We replicate the central findings under two alternative metrics\.

#### BERTScore correlates with ROUGE and replicates the trend\.

We compute BERTScore \(all\-MiniLM\-L6\-v2embeddings, cosine similarity\) for all generated candidates and recompute the per\-task Pearson correlation with ROUGE\. Table[A4](https://arxiv.org/html/2605.10991#A2.T4)shows moderate\-to\-high correlation across all five tasks\. On LaMP\-4 atN=30N\{=\}30, the BERTScore\-rated outputs of selectors are: Random0\.3720\.372, Global0\.3480\.348\(−5\.7%\-5\.7\\%\), Deterministic0\.3620\.362\(−2\.1%\-2\.1\\%\),Probabilistic0\.3920\.392\(\+6\.0%\+6\.0\\%\), ROUGE\-Oracle0\.4100\.410\. The ranking Prob\>\>Random\>\>Det\>\>Global is preserved, and Global RM’s negative scaling under BERTScore independently confirms reward hacking is not a ROUGE artefact\.

Table A4:BERTScore–ROUGE correlation on candidates pooled across all users and queries\.
#### LLM\-as\-judge fails for stylistic personalization\.

We additionally evaluated a strong general\-purpose LLM \(GPT\-class, given user\-history context\) as a pointwise scoring judge \(11–1010scale\) on LaMP\-4\. The Pearson correlation between judge scores and ROUGE isr=0\.099r=0\.099, essentially random\. The ground\-truth answer ranks at average position12\.7/3012\.7/30\(random would be15\.515\.5\) and lands in the top\-55only34\.7%34\.7\\%of the time\. Under Best\-of\-N=30N\{=\}30judge selection, ROUGE is0\.1440\.144, compared to Random0\.1340\.134and Probabilistic0\.1790\.179\. General\-purpose LLMs cannot capture user\-level stylistic preferences from limited history, motivating per\-user trained reward models\.

### B\.3Ablation Studies

#### Architectural components\.

We conduct ablation studies on LaMP\-4 to understand the contribution of each component in our probabilistic reward model\. Figure[A3](https://arxiv.org/html/2605.10991#A2.F3)illustrates how each variant scales with the number of candidatesNN\. The full model consistently outperforms all ablated variants across all values ofNN, with the performance gap widening asNNincreases\. Notably, the last\-pooling variant shows*negative*scaling—its performance decreases with largerNN—indicating that it learns a reward function that is inversely correlated with true preferences for many queries, a manifestation of the reward hacking phenomenon discussed in Section[4](https://arxiv.org/html/2605.10991#S4)\.

![Refer to caption](https://arxiv.org/html/2605.10991v1/x9.png)Figure A3:Ablation study: TTS performance vs number of candidatesNNon LaMP\-4\. The full model \(green\) consistently outperforms all ablated variants\. Last\-token pooling \(purple\) performs worse than random selection, demonstrating catastrophic failure when holistic sequence understanding is lost\.Pooling Strategy is Critical\.The most striking finding is that replacing mean pooling with last\-token pooling causes catastrophic failure, with performance dropping*below*the random baseline \(12\.46% vs 13\.58%\)\. This demonstrates that for reward modeling in personalization tasks, aggregating information across all tokens is essential\. Last\-token pooling, commonly used in decoder\-only language models, fails to capture the holistic semantic relationship between query and candidate, which is crucial for accurate preference prediction\.

LoRA Adaptation Enables Personalization\.Removing the LoRA adapter results in a substantial 3\.43% drop in ROUGE\-1, reducing oracle capture from 49\.1% to 17\.3%\. This confirms that adapting the backbone representations to individual users is necessary for learning personalized preferences\. Without LoRA, the model relies solely on the pre\-trained representations, which cannot distinguish between different users’ preference patterns\.

MLP Head Improves Expressiveness\.Replacing the MLP head with a simpler architecture causes a similar 3\.36% performance drop\. The two\-layer MLP with GELU activation provides the necessary nonlinearity to map pooled representations to accurate reward predictions\. This is particularly important for modeling the complex, user\-specific scoring functions required for personalization\.

Contrastive Learning Provides Additional Gains\.Removing the high\-score contrastive loss results in a 2\.83% drop, the smallest among all ablations\. While contrastive learning helps the model better distinguish between high\-quality candidates, the primary gains come from the architectural choices \(pooling, LoRA, MLP\) rather than the auxiliary loss\. Nevertheless, the contrastive objective provides meaningful improvements by encouraging the model to learn finer\-grained preference distinctions\.

#### Loss decomposition: NLL vs\. contrastive\.

The architectural ablations in Appendix[B\.3](https://arxiv.org/html/2605.10991#A2.SS3)fix the loss as NLL\+contrastive and vary the architecture\. Here we hold the architecture fixed and vary the loss to decompose the contributions of NLL and the contrastive term\. Results on LaMP\-4 are reported in Table[A5](https://arxiv.org/html/2605.10991#A2.T5)\.

Table A5:Loss\-decomposition ablation on LaMP\-4 \(Probabilistic User RM architecture,r=8r\{=\}8\)\.Two findings emerge\. First, NLL is the primary contributor: replacing MSE with NLL alone closes most of the gap from0\.1390\.139to0\.1560\.156\(Pearson lifts from0\.7900\.790to0\.9160\.916\), consistent with the prediction that NLL training prevents collapse via gradient buffering \(Lemma[5\.1](https://arxiv.org/html/2605.10991#S5.Thmtheorem1)\)\. Second, contrastive loss is complementary but only effective on top of NLL: the same contrastive term added to MSE produces no improvement, while added to NLL it lifts ROUGE from0\.1560\.156to0\.1790\.179\. NLL provides the variance\-aware foundation; contrastive refines ranking in the high\-quality region critical for Best\-of\-NN\.

### B\.4General LLMs as Personalized Reward Models

To probe whether a strong general\-purpose LLM with explicit user\-persona context can replace per\-user trained reward models, we evaluated a SynthesizeMe\-style pairwise judge on LaMP\-4: the judge LLM \(Qwen3\-4Bwith user\-persona prompt\) compares two candidates at a time, and a single\-elimination knockout tournament selects the best ofN=30N\{=\}30candidates per query \(2929LLM calls per query\)\.

Table A6:Persona\-conditioned LLM tournament vs\. trained reward models on LaMP\-4 \(Best\-of\-N=30N\{=\}30\)\.The LLM tournament’s pairwise accuracy against ROUGE\-determined preferences is0\.4990\.499, indistinguishable from chance\. The corresponding ROUGE gain is negligible \(\+1\.3%\+1\.3\\%\)\. The pointwise GPT\-class judge performs even worse, with Pearsonr=0\.099r=0\.099to ROUGE \(Appendix[B\.2](https://arxiv.org/html/2605.10991#A2.SS2)\)\. Both findings indicate that general\-purpose LLMs cannot capture per\-user stylistic preferences from limited history\. The trained probabilistic reward model is also33×33\\timesfaster per query\. Our diagnostic framework predicts these failures: withρ≈0\\rho\\approx 0between LLM scores and true rewards, Proposition[4\.4](https://arxiv.org/html/2605.10991#S4.Thmtheorem4)yields flat scaling regardless ofNN\.

### B\.5Sensitivity Analyses and Statistical Stability

#### Sensitivity to LoRA rank\.

A natural question is whether reducing LoRA rank could mitigate Deterministic User RM’s collapse and hacking by limiting overfitting capacity\. Table[A7](https://arxiv.org/html/2605.10991#A2.T7)sweeps the LoRA rankr∈\{2,4,8,16\}r\\in\\\{2,4,8,16\\\}for Deterministic User RM on LaMP\-4 and contrasts with Probabilistic User RM at the samer=8r\{=\}8baseline\.

Table A7:LoRA rank sweep on Deterministic User RM \(LaMP\-4\)\.Reducing rank does*not*fix the underlying problem: even atr=2r\{=\}2\(four\-fold fewer parameters thanr=8r\{=\}8\), Deterministic RM achieves only\+6\.0%\+6\.0\\%over random, withα=0\.20\\alpha=0\.20andβ=0\.32\\beta=0\.32unchanged\. At the samer=8r\{=\}8, Probabilistic RM achieves\+39\.3%\+39\.3\\%\. The result indicates that the limitation of Deterministic User RM is not a capacity\-driven overfitting issue but a property of the training dynamics under MSE loss with low\-variance labels \(Lemma[5\.1](https://arxiv.org/html/2605.10991#S5.Thmtheorem1)\)\.

#### Sensitivity to per\-user data volume\.

We test whether TTP requires extensive per\-user data by downsampling each user’s reward\-model training set on LaMP\-4 to\{10%,25%,50%,100%\}\\\{10\\%,25\\%,50\\%,100\\%\\\}, leaving the validation split unchanged\. Table[A8](https://arxiv.org/html/2605.10991#A2.T8)reports the resulting Pearson correlation of the trained Probabilistic User RM against ROUGE and the Best\-of\-NNROUGE at several values ofNN\.

Table A8:Probabilistic User RM trained on a fraction of each user’s history, evaluated on a fixed LaMP\-4 validation set\. The Random baseline is approximately0\.1280\.128across allNN\.TTP delivers gains over Random even at10%10\\%data \(\+3\.9%\+3\.9\\%atN=15N\{=\}15\), with super\-linear scaling in data volume: doubling from25%25\\%to50%50\\%nearly doubles the ROUGE gain over Random \(7\.0%→13\.4%7\.0\\%\\to 13\.4\\%atN=15N\{=\}15\)\. This pattern is consistent with our predictive law: more training data raisesρ¯\+\\bar\{\\rho\}\_\{\+\}and reducesα\\alpha, both of which liftρeff\\rho\_\{\\text\{eff\}\}and hence the scaling slope\.

#### Statistical stability across repetitions\.

The main results of Section[6\.2](https://arxiv.org/html/2605.10991#S6.SS2)are averages over1010independent repetitions, each samplingNNcandidates from a pre\-generated pool of size4040\. Figure[A4](https://arxiv.org/html/2605.10991#A2.F4)and Table[A9](https://arxiv.org/html/2605.10991#A2.T9)report mean±\\pmstd at representativeNNon LaMP\-4 \(RAG policy\)\. All standard deviations are below0\.0070\.007, far smaller than the inter\-method differences; pairwise differences between Probabilistic and the other methods are significant atp<0\.01p<0\.01across allN≥5N\\geq 5\. The selector ranking is consistent across all repetitions and allNN, confirming the stability of the conclusions reported in the main text\.

![Refer to caption](https://arxiv.org/html/2605.10991v1/x10.png)Figure A4:Best\-of\-NNROUGE on LaMP\-4 \(RAG policy\) with1​σ1\\sigmaerror bars over1010repetitions\. Error bars are smaller than the marker for most points, indicating tight stability across runs\.Table A9:Mean±\\pmstd over1010repetitions on LaMP\-4 \(RAG policy, candidate pool size4040\)\. Absolute values use a different pool size from Section[6\.2](https://arxiv.org/html/2605.10991#S6.SS2); the relative ranking and scaling pattern are identical\.

### B\.6Variance\-based Selection Strategies

A natural question is whether the predicted variance from probabilistic User RM can improve candidate selection at inference time\. We investigate several variance\-based selection strategies and find that none outperform simple mean\-based selection\.

#### Motivation\.

The probabilistic User RM predicts both meanμ​\(x,y\)\\mu\(x,y\)and varianceσ2​\(x,y\)\\sigma^\{2\}\(x,y\)for each candidate\. Intuitively, the variance captures the model’s uncertainty about its prediction\. One might expect that incorporating uncertainty could lead to better selection decisions, similar to exploration\-exploitation trade\-offs in bandit problems\.

#### Strategies Evaluated\.

We evaluate the following variance\-based selection strategies:

- •Mean Only: Selectarg⁡maxy⁡μ​\(x,y\)\\arg\\max\_\{y\}\\mu\(x,y\)\(baseline\)\.
- •LCB\-β\\beta\(Lower Confidence Bound\): Selectarg⁡maxy⁡\[μ​\(x,y\)−β⋅σ​\(x,y\)\]\\arg\\max\_\{y\}\[\\mu\(x,y\)\-\\beta\\cdot\\sigma\(x,y\)\], preferring candidates with high mean and low uncertainty\.
- •UCB\-β\\beta\(Upper Confidence Bound\): Selectarg⁡maxy⁡\[μ​\(x,y\)\+β⋅σ​\(x,y\)\]\\arg\\max\_\{y\}\[\\mu\(x,y\)\+\\beta\\cdot\\sigma\(x,y\)\], allowing exploration of uncertain candidates\.
- •Var\-Filter\-pp: Filter out candidates in the toppp% variance, then select by mean among remaining candidates\.
- •SNR\(Signal\-to\-Noise Ratio\): Selectarg⁡maxy⁡\[μ​\(x,y\)/σ​\(x,y\)\]\\arg\\max\_\{y\}\[\\mu\(x,y\)/\\sigma\(x,y\)\]\.

#### Results\.

Table[A10](https://arxiv.org/html/2605.10991#A2.T10)shows the results on LaMP\-4\. All variance\-based strategies perform worse than or equal to mean\-only selection\.

Table A10:Comparison of variance\-based selection strategies on LaMP\-4\. All strategies underperform simple mean\-based selection, indicating that variance does not provide useful signal for candidate selection at inference time\.
#### Discussion\.

The failure of variance\-based strategies suggests that while variance prediction is beneficial during training \(as part of the NLL loss\), the learned variance lacks proper calibration for use at inference time\. Specifically:

- •LCB strategiesassume that low variance indicates reliable predictions, but this relationship may not hold if variance is not well\-calibrated\.
- •UCB strategiesassume high variance candidates have potential upside, but without proper calibration, high variance may simply indicate poor predictions\.
- •Variance filteringremoves potentially good candidates that happen to have high predicted variance\.
- •SNRis particularly sensitive to variance scale, performing worst among all strategies\.

As our probabilistic User RM is not trained with variance\-aware ranking objectives, there is no guarantee that predicted uncertainty aligns with selection quality\. For high\-quality candidates \(i\.e\., those with largeμ​\(x,y\)\\mu\(x,y\)\), the variance is often already low or relatively uniform\. In this regime, ranking is dominated by differences in the mean, leaving little room for variance\-aware strategies to meaningfully reorder candidates\. This may explain why LCB/UCB with smallβ\\betavalues closely match mean\-only selection, while largerβ\\betavalues increasingly hurt performance\.

### B\.7Computational Efficiency Analysis

We measure the computational costs of training\-based personalization and TTP to understand their efficiency characteristics\.

#### Experimental Setup\.

We benchmark both approaches on LaMP\-4 using an NVIDIA H100 80GB GPU\. For training\-based personalization, we fine\-tune per\-user Qwen3\-4B policy models with LoRA adapters \(rank 16, sequence length 256\)\. For TTP, we train per\-user probabilistic reward models based on Qwen2\.5\-1\.5B \(sequence length 128\)\. We measure per\-sample training time and per\-query inference time across 10 users\. Table[A11](https://arxiv.org/html/2605.10991#A2.T11)reports the measured unit costs\.

Table A11:Measured computational costs for training and inference\.
#### Key Findings\.

Our measurements reveal two important characteristics:

1. 1\.Training efficiency: The reward model trains3\.1×3\.1\\timesfaster per sample than the policy model \(19 ms vs\. 59 ms\), owing to the smaller backbone \(1\.5B vs\. 4B parameters\) and shorter sequence length\.
2. 2\.Inference cost structure: The RM scoring cost is negligible compared to generation cost\. Scoring a single candidate takes only 2\.3 ms, while generating one response takes 1498 ms—a650×650\\timesdifference\. Even withN=30N=30candidates, the total scoring time \(69 ms\) remains less than 5% of a single generation\.

These findings suggest that the computational bottleneck of TTP lies entirely in candidate generation, not in reward model scoring\. In deployment scenarios where candidates can be pre\-generated, cached, or shared across users, TTP’s inference overhead reduces to just the scoring component, making it highly efficient for high\-throughput personalization\.

## Appendix CTheoretical Validation

### C\.1Quantitative Validation of the Predictive Scaling Law

We provide the per\-task evidence supporting the aggregate result reported in Section[6\.2](https://arxiv.org/html/2605.10991#S6.SS2)\. For each task, we calibrate the scale parameterc​σ¯c\\bar\{\\sigma\}once from the Oracle scaling curve, measure\(α,β,ρ¯\+,ρ¯−\)\(\\alpha,\\beta,\\bar\{\\rho\}\_\{\+\},\\bar\{\\rho\}\_\{\-\}\)from each reward model on the validation split, and substitute the values into Proposition[4\.4](https://arxiv.org/html/2605.10991#S4.Thmtheorem4)to predict the full Best\-of\-NNcurve\. Table[A12](https://arxiv.org/html/2605.10991#A3.T12)reports the four diagnostic quantities, the effective correlationρeff=\(1−β\)​ρ¯\+−β​\|ρ¯−\|\\rho\_\{\\text\{eff\}\}=\(1\-\\beta\)\\bar\{\\rho\}\_\{\+\}\-\\beta\|\\bar\{\\rho\}\_\{\-\}\|, and the relative mean absolute error \(relMAE\) between predicted and observed ROUGE forN∈\{1,5,10,15,20,30\}N\\in\\\{1,5,10,15,20,30\\\}\.

Table A12:Per\-task validation of Proposition[4\.4](https://arxiv.org/html/2605.10991#S4.Thmtheorem4)\. The four diagnostic quantities are measured directly from each reward model; relMAE compares predicted vs\. observed Best\-of\-NNROUGE\.Across all1515\(task, RM\) configurations, the formula achieves average relMAE of2\.9%2\.9\\%withR2\>0\.9R^\{2\}\>0\.9in1313of1515cases\. The two cases withR2<0\.9R^\{2\}<0\.9are LaMP\-4 Global RM and LaMP\-5 Global RM, where extreme reward hacking \(ρeff≈0\\rho\_\{\\text\{eff\}\}\\approx 0\) yields nearly flat curves whose remaining variance is dominated by sampling noise rather than systematic scaling\.

![Refer to caption](https://arxiv.org/html/2605.10991v1/x11.png)Figure A5:Predicted \(dashed\) vs\. observed \(solid\) Best\-of\-NNROUGE for all1515\(task, RM\) configurations\. Predictions use the four diagnostic quantities from Table[A12](https://arxiv.org/html/2605.10991#A3.T12)substituted into Proposition[4\.4](https://arxiv.org/html/2605.10991#S4.Thmtheorem4); the scale parameterc​σ¯c\\bar\{\\sigma\}is calibrated once per task from Oracle and shared across all reward models\.
### C\.2Validating the Theoretical Assumptions

Our derivations rely on two analytical assumptions: \(i\) the per\-user true\-reward distribution is sub\-Gaussian \(used in Theorem[3\.1](https://arxiv.org/html/2605.10991#S3.Thmtheorem1)and inherited by Lemma[4\.2](https://arxiv.org/html/2605.10991#S4.Thmtheorem2)\), and \(ii\) the joint distribution of predicted and true rewards has approximately linear conditional expectation \(used in the Bivariate\-Gaussian\-style derivation of Lemma[4\.2](https://arxiv.org/html/2605.10991#S4.Thmtheorem2)\)\. Below we empirically verify that both assumptions hold to a useful degree on our data\.

#### Sub\-Gaussian property of true rewards\.

For each user we fit the smallest sub\-Gaussian parameterσ\\sigmathat satisfies the moment\-generating\-function bound and compare the resulting tail behavior against the empirical reward distribution\. We illustrate this on LaMP\-4 in Figure[A6](https://arxiv.org/html/2605.10991#A3.F6): the empirical CDF lies close to the Gaussian envelope \(small deviations only in the extreme tails\), consistent with bounded\-but\-light\-tailed ROUGE\-derived rewards; the same pattern holds on the other four tasks\. Although the bound becomes loose at the very tail, this only inflates the constant in front ofln⁡N\\sqrt\{\\ln N\}and does not change the scaling order\.

![Refer to caption](https://arxiv.org/html/2605.10991v1/sub_gaussian_verification.png)Figure A6:Sub\-Gaussian property check on a representative LaMP\-4 user\. The empirical reward CDF \(blue\) is overlaid on the best\-fitting Gaussian envelope \(orange\)\. Deviations are small except in the extreme tails, supporting the sub\-Gaussian assumption used in Theorem[3\.1](https://arxiv.org/html/2605.10991#S3.Thmtheorem1)\.
#### Linear conditional expectation of true reward given predicted reward\.

Lemma[4\.2](https://arxiv.org/html/2605.10991#S4.Thmtheorem2)relies on𝔼​\[r∗∣r^\]\\mathbb\{E\}\[r^\{\*\}\\mid\\hat\{r\}\]being approximately linear inr^\\hat\{r\}\(this is exact under joint Normality and weakly required under any second\-order linear regression structure\)\. We illustrate this on LaMP\-4 in Figure[A7](https://arxiv.org/html/2605.10991#A3.F7), binning predictionsr^\\hat\{r\}and plotting the conditional mean ofr∗r^\{\*\}against the bin centre for each of the three reward\-model variants\. The binned conditional means lie close to a straight line whose slope matches the empirical correlationρ\\rhoused in our analysis\. Departures from linearity are small and concentrated at the rare extremes, consistent with the remark in Appendix[A\.2](https://arxiv.org/html/2605.10991#A1.SS2)that high\-score regions slightly under\-perform the global correlation\.

![Refer to caption](https://arxiv.org/html/2605.10991v1/linear_conditional_expectation.png)Figure A7:Bivariate linearity check on LaMP\-4: empirical𝔼​\[r∗∣r^\]\\mathbb\{E\}\[r^\{\*\}\\mid\\hat\{r\}\]vs\. predicted rewardr^\\hat\{r\}, computed by binning the predictions of each reward model \(Global, Det, Prob\)\. The binned conditional means track a straight line whose slope matches the empirical correlationρ\\rho, supporting the linear\-conditional\-expectation assumption in Lemma[4\.2](https://arxiv.org/html/2605.10991#S4.Thmtheorem2)\.Together, these checks ensure that the formal derivations in Appendices[A\.1](https://arxiv.org/html/2605.10991#A1.SS1)and[A\.2](https://arxiv.org/html/2605.10991#A1.SS2)are not artefacts of overly idealised assumptions\.

### C\.3Detailed Failure\-Mode Analysis

#### Cross\-task user\-level collapse\.

![Refer to caption](https://arxiv.org/html/2605.10991v1/x12.png)Figure A8:User\-level collapse analysis across four tasks: \(a\) LaMP\-5, \(b\) LongLaMP\-Abstract, \(c\) LongLaMP\-Product Review, \(d\) LongLaMP\-Topic Writing\. Top row: score standard deviation separating collapsed vs normal users\. Middle/Bottom rows: score distributions for collapsed and normal users respectively, with Pearson correlations for deterministic \(Det\) and probabilistic \(Prob\) RMs\.Figure[A8](https://arxiv.org/html/2605.10991#A3.F8)extends our user\-level collapse analysis to four additional tasks\. The results consistently support our theoretical findings across diverse personalization scenarios\.

Collapse Identification\.Across all tasks, we observe a clear separation in score standard deviation \(σ\\sigma\) between collapsed and normal users\. LongLaMP\-Product Review \(c\) exhibits the most pronounced separation, with collapsed users showing extremely low variance \(σ≈0\.13\\sigma\\approx 0\.13\) compared to normal users \(σ≈0\.17\\sigma\\approx 0\.17\)\. This aligns with the task’s nature where deterministic models tend to predict similar review scores regardless of product differences\.

Deterministic RM Failure on Collapsed Users\.The middle row demonstrates that deterministic RMs consistently fail on collapsed users, producing near\-degenerate predictions concentrated at a single value\. This is most severe in LongLaMP\-Product Review \(c\), where the deterministic RM achieves a*negative*correlation \(ρ=−0\.31\\rho=\-0\.31\) on collapsed users, indicating predictions that are worse than random\. In contrast, the probabilistic RM maintains strong correlations even for collapsed users:ρ=0\.83\\rho=0\.83\(LaMP\-5\),ρ=0\.96\\rho=0\.96\(Abstract\),ρ=0\.98\\rho=0\.98\(Product Review\), andρ=0\.89\\rho=0\.89\(Topic Writing\)\.

Robust Performance on Normal Users\.For normal users \(bottom row\), both RMs perform reasonably well, though the probabilistic RM consistently achieves higher correlations\. Notably, LongLaMP\-Abstract \(b\) shows the smallest gap between methods on normal users \(ρ=0\.75\\rho=0\.75vs0\.950\.95\), suggesting that academic writing styles may be more predictable, reducing the advantage of uncertainty modeling for well\-behaved users\.

#### Cross\-task query\-level reward hacking\.

![Refer to caption](https://arxiv.org/html/2605.10991v1/x13.png)Figure A9:Query\-level hacking analysis across four tasks: \(a\) LaMP\-5, \(b\) LongLaMP\-Abstract, \(c\) LongLaMP\-Product Review, \(d\) LongLaMP\-Topic Writing\. Row 1: Per\-query correlation comparison between Det and Prob RMs\. Row 2: Learned variance vs prediction error\. Rows 3–4: Distribution of per\-query correlations for Det and Prob RMs, with percentage of negative correlations indicated\.Figure[A9](https://arxiv.org/html/2605.10991#A3.F9)presents query\-level analysis across four additional tasks, validating the generality of our uncertainty\-aware selection mechanism\.

Correlation Between Uncertainty and Error\.The second row shows strong positive correlations between learned varianceσ2\\sigma^\{2\}and prediction error across all tasks\. LongLaMP\-Product Review \(c\) achieves the highest correlation \(r=0\.81r=0\.81\), indicating that the probabilistic RM learns particularly well\-calibrated uncertainty estimates for this task\. The other tasks show moderate correlations \(r=0\.45r=0\.45–0\.510\.51\), confirming that the model successfully learns to express higher uncertainty for difficult queries\.

Dramatic Reduction in Negative Correlations\.The bottom two rows compare the distribution of per\-query correlations\. The deterministic RM exhibits substantial negative correlation rates: 38\.3% \(LaMP\-5\), 18\.1% \(Abstract\), 28\.6% \(Product Review\), and 26\.5% \(Topic Writing\)\. These represent queries where the RM’s ranking is*inversely*related to true quality, a severe failure mode for Best\-of\-N selection\.

The probabilistic RM dramatically reduces these failure cases: negative correlations drop to 25\.4% \(LaMP\-5\), 0\.4% \(Abstract\), 0\.1% \(Product Review\), and 1\.5% \(Topic Writing\)\. The improvement is most striking for the LongLaMP tasks, where negative correlations are nearly eliminated \(<2%<2\\%\)\.

Task\-Specific Observations\.LaMP\-5 \(a\) shows the smallest improvement \(38\.3%→\\to25\.4%\), likely because title generation from abstracts has inherent ambiguity that even uncertainty\-aware methods cannot fully resolve\. In contrast, LongLaMP tasks benefit more substantially, possibly because longer outputs provide richer signals for uncertainty estimation\.

Rescue Analysis\.The scatter plots \(top row\) reveal that the probabilistic RM “rescues” many queries from the hacking region \(blue points above the diagonal\), while rarely degrading performance \(orange points below diagonal are sparse\)\. This asymmetry confirms that uncertainty\-aware selection provides a safety mechanism against reward hacking without sacrificing performance on well\-predicted queries\.

#### A priori predictability\.

The analyses in Appendices[C\.3](https://arxiv.org/html/2605.10991#A3.SS3.SSS0.Px1)and[C\.3](https://arxiv.org/html/2605.10991#A3.SS3.SSS0.Px2)document*which*users collapse and*which*queries get hacked\. This subsection asks the stronger question: can these failures be predicted*a priori*from training\-data statistics, before running any Best\-of\-NNexperiment?

#### Predicting user\-level collapse\.

Users with low ground\-truth ROUGE variance are predicted to collapse more often, since narrow label distributions provide weak discriminative signal under MSE training \(Lemma[5\.1](https://arxiv.org/html/2605.10991#S5.Thmtheorem1)\)\. Table[A13](https://arxiv.org/html/2605.10991#A3.T13)verifies this on all five tasks: low\-variance users collapse1×1\\times–13×13\\timesmore often than high\-variance users under Deterministic RM, while Probabilistic RM eliminates collapse entirely on every task regardless of label variance\.

Table A13:User\-level collapse rates \(fraction of users withρu<0\.1\\rho\_\{u\}<0\.1\), split by ground\-truth ROUGE variance \(low vs\. high std halves\)\. Probabilistic User RM achieves0%0\\%collapse on every task\.
#### Predicting query\-level hacking\.

Queries with low candidate\-quality spread are predicted to hack more often, since limited training signal drives the model toward spurious patterns \(Lemma[5\.2](https://arxiv.org/html/2605.10991#S5.Thmtheorem2)\)\. Table[A14](https://arxiv.org/html/2605.10991#A3.T14)verifies this prediction\. Across the population of queries, Probabilistic RM reduces hacking44–8×8\\timesrelative to Global RM, with the largest gains precisely on the predicted high\-risk queries \(low candidate\-quality spread\)\.

Table A14:Query\-level hacking rates \(fraction of queries withρq<0\\rho\_\{q\}<0\), split by candidate\-quality spread \(low vs\. high std halves\)\. Selected \(task, RM\) pairs\.These two tables transform the diagnostic framework into an actionable tool: practitioners can identify high\-risk users and queries by computing simple per\-user and per\-query label statistics on the training set, before any Best\-of\-NNexperiment is run\.

## Appendix DPipeline and Implementation Details

Our experimental pipeline consists of three stages, as illustrated in Figure[A10](https://arxiv.org/html/2605.10991#A4.F10): \(1\)Personalized Prompting, where we construct personalized policy models using retrieval\-augmented generation; \(2\)Reward Model Training, where we train user\-specific reward models to capture individual preferences; and \(3\)Inference, where we apply test\-time personalization by sampling multiple candidates and selecting the best one via the reward model\. We describe each stage in detail below\.

![Refer to caption](https://arxiv.org/html/2605.10991v1/x14.png)Figure A10:Overview of the experimental pipeline\.### D\.1Stage 1: Personalized Prompting

The goal of personalized prompting is to generate candidate responses that serve as training data for reward models\. This stage involves dataset preparation and policy model configuration\.

#### Datasets\.

We evaluate on five personalized text generation tasks:News Headline Generation\(LaMP\-4\) andScholarly Title Generation\(LaMP\-5\) from LaMP\[[17](https://arxiv.org/html/2605.10991#bib.bib2)\], andAbstract Generation,Product Review Writing, andTopic Writingfrom LongLaMP\[[4](https://arxiv.org/html/2605.10991#bib.bib12)\]\. We exclude email\-related tasks from both benchmarks as they require additional access to private email datasets\. For all tasks, we adopt the*temporal\-based*data split provided by the original benchmarks, which ensures that test queries are chronologically after training examples to simulate realistic deployment scenarios\.

FollowingTanet al\.\[[21](https://arxiv.org/html/2605.10991#bib.bib4)\], we select the top\-30 users with the longest profiles for LaMP tasks and top\-20 users for LongLaMP tasks\. Representative input\-output examples for each task are shown as follows\.

LaMP\-4: News Headline GenerationInput:Generate a headline for the following article: According to AP, Harry’s days as a pilot are over\. The prince, who spent over three years training as an Apache helicopter…Output:Prince Harry Scores A New Job, Says Goodbye To His Pilot Uniform

LaMP\-5: Scholarly Title GenerationInput:Generate a title for the following abstract of a paper: We propose a novel evolutionary approach to the problem of timing\-driven FPGA placement\. The method used is evolutionary programming \(EP\) with incremental position encoded in the population…Output:An evolutionary approach to timing driven FPGA placement

LongLaMP: Abstract GenerationInput:Generate an abstract for the title “Learned Token Pruning for Transformers” using the following items: 1\. Transformer models 2\. Learned Token Pruning \(LTP\) 3\. Performance improvement 4\. FLOPs Reduction 5\. RobustnessOutput:Efficient deployment of transformer models in practice is challenging due to their inference cost including memory footprint, latency, and power consumption, which scales quadratically with input sequence length\. To address this, we present a novel token reduction method dubbed Learned Token Pruning \(LTP\)…

LongLaMP: Product Review WritingInput:Generate the review text for a product with rating “5\.0” and description “Kate Aster is the bestselling author of the Special Ops: Homefront series…”\. Summary: “Contract with a SEAL”\.Output:It is impossible not to fall in love with the characters in this book, which brings in strong, supportive relationships with family and friends\. Aster has done a remarkable job of making the plot and characters engaging…

LongLaMP: Topic WritingInput:Generate the content for a reddit post titled “good riddance to bad rubbish”\.Output:As an Eagle who worked very hard to encourage BSA to change it’s policy, I have to say this is good news\. This is symbolic of the total defeat of the Christianist/anti\-gay lobby in Scouting…

#### Policy Model Configuration\.

We useQwen3\-4B\-Instruct\[[23](https://arxiv.org/html/2605.10991#bib.bib22)\]as the policy model backbone\. To achieve personalization, we adopt a retrieval\-augmented generation \(RAG\) approach with dense retrieval\[[17](https://arxiv.org/html/2605.10991#bib.bib2)\]\. Specifically, we useall\-MiniLM\-L6\-v2\[[14](https://arxiv.org/html/2605.10991#bib.bib24)\]as the text encoder for embedding\-based retrieval\. Given an input query, we encode it and retrieve the top\-kkmost similar examples from the user’s profile based on cosine similarity\. The retrieved examples are concatenated as context in the prompt\.

For decoding, we usevLLM\[[5](https://arxiv.org/html/2605.10991#bib.bib25)\]to accelerate generation\. For each query, we sampleN=30N=30candidate responses to construct reward model training data\. Task\-specific prompt templates are provided in Appendix[E\.1](https://arxiv.org/html/2605.10991#A5.SS1)\.

#### Generation Hyperparameters\.

Table[A15](https://arxiv.org/html/2605.10991#A4.T15)summarizes the generation hyperparameters and dataset statistics for each task\.

Table A15:Dataset statistics and generation hyperparameters\. Avg Len denotes the average candidate length in characters\.We set the number of candidate generations toN=30N=30for all tasks\. However, for short\-form generation tasks \(LaMP\-4 and LaMP\-5\), the policy model occasionally produces duplicate outputs despite temperature sampling\. We remove exact duplicates from the candidate set, resulting in an average of fewer than 30 unique candidates per query for these tasks\. Long\-form generation tasks \(LongLaMP\) exhibit minimal duplication due to their greater output diversity\.

### D\.2Stage 2: Reward Model Training

This stage describes the training procedure for three reward model variants: Global RM, Deterministic User RM, and Probabilistic User RM\.

#### Model Architecture\.

All reward models share the same backbone architecture:Qwen2\.5\-1\.5B\-Instruct\[[12](https://arxiv.org/html/2605.10991#bib.bib23)\]with masked mean pooling over hidden states\. For probabilistic User RM, an additional variance head predictslog⁡σ2\\log\\sigma^\{2\}with output passed through softplus and clamped to\[10−4,0\.5\]\[10^\{\-4\},0\.5\]\.

#### Reward Label Construction\.

For each candidate response, we compute the ROUGE score against the ground\-truth output as the reward label:r=\(ROUGE\-1\+ROUGE\-L\)/2r=\(\\text\{ROUGE\-1\}\+\\text\{ROUGE\-L\}\)/2\. where ROUGE\-1 measures unigram overlap and ROUGE\-L measures longest common subsequence\. Scores are clipped to\[0,1\]\[0,1\]\.

#### Training Configurations\.

Table[A16](https://arxiv.org/html/2605.10991#A4.T16)summarizes the training hyperparameters for each reward model variant\.

Table A16:Reward model training configurations\.
#### LoRA Configuration\.

For models using LoRA fine\-tuning, we apply low\-rank adaptation with rankr=8r=8, scaling factorα=16\\alpha=16, and dropout0\.050\.05\. Target modules areq\_projandv\_proj\. This results in approximately 4\.6M trainable parameters per user model\.

#### Loss Functions\.

Deterministic User RMis trained with MSE loss\.Probabilistic User RMis trained with Gaussian NLL loss \(Equation[6](https://arxiv.org/html/2605.10991#S5.E6)\)\. Two variants of user RM are also trained with a contrastive term for high\-score discrimination, which the loss weighted parameter isλ=1\.0\\lambda=1\.0\. The contrastive loss operates on sample pairs where both have ground\-truth scores above 0\.5, with marginm=0\.02m=0\.02\. We also apply linear sample weighting with weights in\[0\.3,1\.0\]\[0\.3,1\.0\]proportional to ground\-truth scores, emphasizing high\-quality samples\.Global RMis trained with BCE loss with sample reweighting for extreme scores\.

#### User Selection and Data Split\.

For LaMP tasks, we select the top\-30 users with the most data, using up to 3,000 samples per user\. For LongLaMP tasks, we select the top\-20 users with up to 5,000 samples per user\. All data is split 80%/20% for training/validation per user\.

#### Implementation Details\.

All experiments are conducted on a single NVIDIA H100 PCIe GPU\. We use BF16 mixed precision training with Flash Attention 2\[[2](https://arxiv.org/html/2605.10991#bib.bib26)\]for efficiency\. The optimizer is AdamW\[[8](https://arxiv.org/html/2605.10991#bib.bib27)\]for all models\. All RMs share the same prompt templates, which are provided in Appendix[E\.2](https://arxiv.org/html/2605.10991#A5.SS2)\. No retrieval\-based method is used for RM training\.

### D\.3Stage 3: Inference

This stage describes the Test\-Time Personalization \(TTP\) inference procedure and evaluation protocol\.

#### TTP Inference Procedure\.

Given a test queryxxfor useruu, TTP operates as follows:

1. 1\.SampleNNcandidates from the pre\-generated candidate pool \(up to 30 candidates per query\)\.
2. 2\.Score each candidate using the user\-specific reward modelRuR\_\{u\}\.
3. 3\.Select the candidate with the highest predicted reward:y^=arg⁡maxy∈𝒴N⁡Ru​\(x,y\)\\hat\{y\}=\\arg\\max\_\{y\\in\\mathcal\{Y\}\_\{N\}\}R\_\{u\}\(x,y\)\.

For probabilistic User RM, we use only the predicted meanμ​\(x,y\)\\mu\(x,y\)for selection\.

#### Evaluation Protocol\.

Table[A17](https://arxiv.org/html/2605.10991#A4.T17)summarizes the inference configuration\. For each \(user, query,NN\) combination, we repeat the random candidate sampling 3 times and report the average to reduce variance from candidate selection\.

Table A17:Inference and evaluation configuration\.

## Appendix EPrompt Templates

### E\.1Policy Model Templates

We adopt a retrieval\-augmented generation \(RAG\) approach for personalized generation\. For each query, we retrieve relevant examples from the user’s history and construct the prompt using task\-specific templates\. Below we present the prompt templates for each task\.

LaMP\-4: News Headline Generation``` {history} ### Instruction: Generate a headline for the following article. Match the user’s headline style. ### Article: {input} ### Headline: ```

LaMP\-5: Scholarly Title Generation``` {history} ### Instruction: Generate a title for the following abstract. Match the user’s academic titling style. Output ONLY the title. ### Abstract: {input} ### Title: ```

LongLaMP: Abstract Generation``` {history} ### Instruction: Write a scientific abstract for the following title, matching the user’s academic writing style found in the history. ### Title: {input} ### Abstract: ```

LongLaMP: Product Review Writing``` {history} ### Instruction: You are the user described in the [User History] above. The User Request below specifies the product description, the rating you gave, and the summary you wrote. Write the full Review Text that corresponds to these details. Mimic your detailed, narrative writing style found in the history. Write a full body review (around 50-100 words), telling the story of your experience. ### User Request: {input} ### Review Content: ```

LongLaMP: Topic Writing``` {history} ### Instruction: You are the user described in the [User History] above. Write a Reddit post about the topic below. Mimic the user’s vocabulary and tone, but keep the response relatively short and punchy (around 100-200 words). Do not write a wall of text. ### Topic: {input} ### Post Content: ```

In all templates,\{history\}is replaced with retrieved user history examples, and\{input\}is replaced with the current query\. The model generates text following the final prompt marker \(e\.g\.,\#\#\# Headline:\)\.

### E\.2Reward Model Prompt

The reward model receives a structured prompt consisting of a task\-specific instruction, the input query, and the candidate response:

Reward Model Prompt Template``` [Instruction]: {task_instruction} [Input]: {query} [Document]: {candidate} ```

where\{task\_instruction\}is one of the following task\-specific instructions:

LaMP\-4: News Headline GenerationJudge how good the Document is as a headline for the Article\. Higher score = better\.

LaMP\-5: Scholarly Title GenerationJudge how good the Document is as a title for the Abstract\. Higher score = better\.

LongLaMP: Abstract GenerationJudge how good the Document is as an abstract for the Title\. Higher score = better\.

LongLaMP: Product Review WritingJudge how good the Document is as a review for the Product\. Higher score = better\.

LongLaMP: Topic WritingJudge how good the Document is as a post for the Topic\. Higher score = better\.

Similar Articles

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

Hugging Face Daily Papers

This paper introduces AutoTTS, an environment-driven framework that automates the discovery of test-time scaling strategies for LLMs by formulating it as controller synthesis. It demonstrates improved accuracy-cost tradeoffs on mathematical reasoning benchmarks with minimal computational overhead.

Agentic Test-Time Scaling (GitHub Repo)

TLDR AI

AutoTTS is an open-source tool that uses agentic discovery to automatically find optimal test-time scaling strategies for LLMs, significantly reducing token usage and cost through replay-based evaluation.

Test-Time Training Undermines Safety Guardrails

arXiv cs.LG

This paper identifies three threat models for test-time training (TTT) that adversaries can exploit to bypass safety filters in LLMs, achieving high attack success rates. The findings reveal that TTT introduces new vulnerabilities that undermine existing safety guardrails.