PAFO: Pareto Fairness Optimization for Personalized Reward Modeling

arXiv cs.AI Papers

Summary

This paper proposes PAFO, a Pareto fairness optimization framework to mitigate personalized reward bias in reward models for LLMs, improving accuracy for minority user groups without harming majority groups.

arXiv:2606.07988v1 Announce Type: new Abstract: Large language models (LLMs) increasingly rely on reward models to align their outputs with diverse user preferences. While personalized reward models aim to capture such heterogeneity, they are often trained on imbalanced user preference data and may therefore favor users whose preferences are more common in the training population. In this paper, we identify this failure mode as personalized reward bias, where reward modeling quality varies systematically with preference support rate. We formulate its mitigation as a Pareto fairness problem over group utilities, aiming to improve under-served users without degrading other user groups. To this end, we propose PAFO, a Pareto fairness optimization framework for personalized reward modeling. PAFO first trains group-specialized reward models for majority and minority preference groups, then constructs conditional margin-level supervision to distill their heterogeneous preference boundaries into a single unified model. The resulting model uses group information only during training and requires no explicit group labels at inference time. Experiments on Personal-LLM and DSP show that PAFO improves both minority-group and majority-group accuracy while reducing user-level unfairness across multiple metrics, demonstrating its effectiveness for fairer LLM personalization.

Cached at: 06/09/26, 08:54 AM

# Pareto Fairness Optimization for Personalized Reward Modeling
Source: [https://arxiv.org/html/2606.07988](https://arxiv.org/html/2606.07988)
Xiaoyan Zhao¹, Haoting Ni², Yang Zhang¹, Chunyuan Zheng³, Haoxuan Li³, Fuli Feng². ¹National University of Singapore, ²University of Science and Technology of China, ³Peking University. xiaoyanzhao.ai@gmail.com

###### Abstract

Large language models (LLMs) increasingly rely on reward models to align their outputs with diverse user preferences. While personalized reward models aim to capture such heterogeneity, they are often trained on imbalanced user preference data and may therefore favor users whose preferences are more common in the training population. In this paper, we identify this failure mode as personalized reward bias, where reward modeling quality varies systematically with preference support rate. We formulate its mitigation as a Pareto fairness problem over group utilities, aiming to improve under-served users without degrading other user groups. To this end, we propose PAFO, a Pareto fairness optimization framework for personalized reward modeling. PAFO first trains group-specialized reward models for majority and minority preference groups, then constructs conditional margin-level supervision to distill their heterogeneous preference boundaries into a single unified model. The resulting model uses group information only during training and requires no explicit group labels at inference time. Experiments on Personal-LLM and DSP show that PAFO improves both minority-group and majority-group accuracy while reducing user-level unfairness across multiple metrics, demonstrating its effectiveness for fairer LLM personalization.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.07988v1/x1.png)

Figure 1: Minority users suffer substantially lower reward prediction accuracy than majority users, even under personalized RM.

Reward models are central to the alignment of large language models, providing the scalar signals used to optimize policies in RLHF \[[1](https://arxiv.org/html/2606.07988#bib.bib1),[2](https://arxiv.org/html/2606.07988#bib.bib2),[3](https://arxiv.org/html/2606.07988#bib.bib3)\] and to select responses at test time \[[4](https://arxiv.org/html/2606.07988#bib.bib4),[5](https://arxiv.org/html/2606.07988#bib.bib5)\]. However, reward models are not neutral estimators of human judgment. Existing studies have shown that they can encode systematic biases \[[6](https://arxiv.org/html/2606.07988#bib.bib6),[7](https://arxiv.org/html/2606.07988#bib.bib7),[8](https://arxiv.org/html/2606.07988#bib.bib8),[9](https://arxiv.org/html/2606.07988#bib.bib9)\], such as length bias, sycophancy, and social bias. These findings suggest that the reward signal used for alignment may reflect artifacts of the training or annotation process rather than the preferences that users actually intend to express \[[10](https://arxiv.org/html/2606.07988#bib.bib10),[11](https://arxiv.org/html/2606.07988#bib.bib11)\].

Most existing analyses study reward model bias in the population-level setting \[[10](https://arxiv.org/html/2606.07988#bib.bib10),[12](https://arxiv.org/html/2606.07988#bib.bib12)\], where a single reward function is shared across all users \[[13](https://arxiv.org/html/2606.07988#bib.bib13)\] and bias is examined as a shared distortion or along predefined demographic axes. This perspective is insufficient for personalized alignment, where different users may naturally prefer different response styles, reasoning patterns, or interaction norms \[[14](https://arxiv.org/html/2606.07988#bib.bib14),[15](https://arxiv.org/html/2606.07988#bib.bib15),[16](https://arxiv.org/html/2606.07988#bib.bib16),[17](https://arxiv.org/html/2606.07988#bib.bib17)\]. However, personalization addresses preference heterogeneity without addressing preference-support imbalance. When preference data are imbalanced, the training objective can still be dominated by users whose preferences are more common, leaving rare or less-supported preference patterns insufficiently represented \[[18](https://arxiv.org/html/2606.07988#bib.bib18)\].

We refer to this failure mode as *personalization unfairness* in reward modeling: a systematic disparity in preference accuracy across users, structured by how well a user's preference pattern is supported in the training distribution. Unlike conventional reward model biases, which manifest as a shared distortion across the population (e.g., length bias) \[[6](https://arxiv.org/html/2606.07988#bib.bib6)\] or along pre-specified demographic axes \[[13](https://arxiv.org/html/2606.07988#bib.bib13),[19](https://arxiv.org/html/2606.07988#bib.bib19)\], personalization unfairness is *support-structured*: its incidence on a given user is determined by the rarity of that user's preference pattern, a quantity that is neither annotated nor necessarily tied to explicit demographic attributes. Figure [1](https://arxiv.org/html/2606.07988#S1.F1) illustrates this phenomenon empirically. On the Personal-LLM benchmark \[[20](https://arxiv.org/html/2606.07988#bib.bib20)\], representative personalized reward modeling methods exhibit a strong dependence between per-user accuracy and per-user support rate; on the DSP benchmark \[[21](https://arxiv.org/html/2606.07988#bib.bib21)\], accuracy on minority preference styles is substantially lower than on majority styles. Therefore, personalization may inherit preference-support imbalance rather than eliminate it.

Mitigating personalization unfairness is challenging. Users with minority preferences provide few supervision signals, so their reward boundaries are hard to learn from data alone. Training a separate model per group would help, but separate models are impractical at inference time, where group membership is typically unknown. Single-model fixes such as reweighting or fairness regularization avoid this problem but tend to lift minority accuracy at the expense of majority accuracy, redistributing error rather than reducing it \[[22](https://arxiv.org/html/2606.07988#bib.bib22)\]. We therefore formulate the objective as *Pareto* improvement over preference groups: only minority gains that come at no cost to any other group count as solutions.

We propose PAFO (PAreto Fairness Optimization), a framework for personalized reward modeling. The core idea is to let a reward model first specialize on each preference group and then consolidate these group-specific abilities back into itself through self-distillation, so that group information is consumed only during training. Concretely, PAFO partitions data into minority-to-majority preference groups and finetunes a group-specialized reward model for each group, so that each minority's preference structure is captured from within-group signal rather than diluted across the full population. A conditional teacher then routes each preference pair to its group-specialized model and uses the resulting reward margin as the supervision target for self-distillation. The same model, now serving as the student, is trained to match these group-specific margins while taking only the user-conditioning signals available at inference as input. This design realizes Pareto improvement: minority groups benefit from dedicated specialization, and majority groups are equally served by their own specialist, so the student inherits an undiluted majority signal instead of one reweighted against minority groups. Group information, used only as a training-time scaffold, is no longer needed at deployment.

Our main contributions can be summarized as follows:

- Problem. We identify and formalize *personalization unfairness* in reward modeling, where per-user reward quality is governed by preference-support imbalance, distinct from prior reward model biases that affect the population uniformly or along demographic axes.
- Method. We frame mitigation as Pareto improvement over preference groups and propose PAFO, which specializes a reward model on each group and consolidates these abilities back via conditional margin-level self-distillation, requiring group information only at training time.
- Results. On Personal-LLM and DSP, PAFO improves minority-group accuracy, preserves majority-group accuracy, and reduces user-level unfairness, showing that Pareto improvement is achievable in personalized reward modeling.

## 2 Preliminary Analysis

#### Problem Formulation\.

We study the personalized reward modeling task. Given a query $x$, user historical information $h$, and a candidate response $y$, a personalized reward model $r_{\theta}(x,h,y)$ assigns a scalar score to the response conditioned on both the query context and the user-specific information. The preference-labeled dataset is $\mathcal{D}=\{(s_{i}^{A},s_{i}^{B},\mathbbm{1}\{s_{i}^{A}\succ s_{i}^{B}\})\}_{i=1}^{N}$, where $s^{A}=[x,y^{A}]$ and $s^{B}=[x,y^{B}]$ denote two candidate responses under the same prompt $x$, and $\mathbbm{1}\{s_{i}^{A}\succ s_{i}^{B}\}$ denotes the user's preference. The standard training objective is the Bradley–Terry–Luce (BTL) loss \[[23](https://arxiv.org/html/2606.07988#bib.bib23)\]:

$$\mathcal{L}_{\mathrm{BTL}}(\theta) = -\,\mathbb{E}_{(s^{A},s^{B})\sim\mathcal{D}}\left[\log\sigma\left(r_{\theta}(s^{A})-r_{\theta}(s^{B})\right)\right], \tag{1}$$

where we abbreviate $r_{\theta}(s)=r_{\theta}(x,h,y)$ for compactness and $\sigma(\cdot)$ is the sigmoid. Unlike the conventional setting that assumes a single latent reward function shared across all users, our central question is whether a personalized reward model can leverage user-specific information to improve modeling fidelity while *simultaneously* avoiding systematic disparities across user groups when preferences are heterogeneous.
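As a minimal sketch of Equation 1, the snippet below evaluates the BTL loss on a batch of scalar rewards. It assumes each pair is ordered so that response A is the preferred one; the function names `log_sigmoid` and `btl_loss` are illustrative and do not come from the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # log sigmoid(z) = -log(1 + exp(-z)); branch on the sign to avoid overflow.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def btl_loss(rewards_a, rewards_b):
    """Mean BTL loss, assuming response A is the preferred one in every pair."""
    margins = [ra - rb for ra, rb in zip(rewards_a, rewards_b)]
    return -sum(log_sigmoid(m) for m in margins) / len(margins)


# Hypothetical rewards: a model that cannot separate the pair pays log(2).
print(btl_loss([0.0], [0.0]))
```

A larger margin in favor of the preferred response lowers the loss, which is why pairs with frequent and consistent preference patterns dominate the averaged gradient.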

![Refer to caption](https://arxiv.org/html/2606.07988v1/x2.png) (a) Personal-LLM: Accuracy increases with user support rate.
![Refer to caption](https://arxiv.org/html/2606.07988v1/x3.png) (b) DSP: Minority preference styles receive substantially lower accuracy than majority styles.

Figure 2: Preliminary analysis on the Personal-LLM and DSP datasets across P-RLHF, VPL, and GPO methods, showing user-level unfairness in existing personalized RMs.

#### Analysis.

We first examine whether existing personalized RMs provide comparable reward modeling quality across users with different levels of preference support. Figure [2](https://arxiv.org/html/2606.07988#S2.F2) reports user-level accuracy of three representative personalized reward modeling methods, P-RLHF, VPL, and GPO, on Personal-LLM and DSP. On Personal-LLM, accuracy increases strongly with user support rate, indicating that users whose preferences are less supported in the population receive lower accuracy. On DSP, minority preference styles obtain substantially lower accuracy than majority styles across all three methods. These results reveal a support-structured disparity in personalized reward modeling.

This phenomenon is notable because personalized RMs are designed to handle preference heterogeneity. A natural expectation is that conditioning the reward model on user-specific information $h$ should allow $r_{\theta}(x,h,y)$ to express different preferences for different users. However, Figure [2](https://arxiv.org/html/2606.07988#S2.F2) shows that personalization capacity alone does not guarantee equitable reward modeling quality: users with low-support preference patterns still receive systematically lower accuracy.

We attribute this phenomenon to preference\-support imbalance in the training data and the aggregate nature of the BTL objective\. First, user\-specific signals differ in both quantity and quality\. Some users provide stable preference pairs that align with widely shared patterns, while others provide fewer or noisier pairs that reflect less\-supported preferences\. Second, the BTL objective aggregates over the full dataset, so preferences that are better supported in the population contribute more frequent and more consistent gradient signals\. As a result, even when the model architecture supports personalization, training can still be dominated by majority\-aligned preference patterns, leaving low\-support users under\-supervised\. Thus, personalization may inherit preference\-support imbalance rather than eliminate it\.

## 3 Method

![Refer to caption](https://arxiv.org/html/2606.07988v1/x4.png)

Figure 3: Overview of PAFO. (A) Imbalanced preference support leads to user-level unfairness in personalized reward modeling. (B) PAFO targets Pareto improvement, improving minority users without degrading majority users. (C) It trains group-specialized reward models and distills group-conditioned margin supervision into a single deployable personalized RM.

In this section, to address the imbalance problem, we first introduce the concept of Pareto fairness, then propose a principled distillation method and a lightweight model structure.

### 3.1 Pareto Fairness

We assume the users can be divided into $n$ groups, denoted as $\mathcal{G}=\{1,2,\ldots,n\}$, with each group showing a different preference. Without loss of generality, let the first group ($g=1$) be the majority group, and let all remaining groups be minorities. Let $\mathcal{D}_{g}$ denote the data distribution of group $g\in\mathcal{G}$, and let $U_{\theta}(g)$ denote the expected utility of group $g$ under model parameters $\theta$. For example, when we train the reward model, we can take the negative BTL loss, i.e., $U_{\theta}(g)=\mathbb{E}_{(s^{A},s^{B})\sim\mathcal{D}_{g}}\left[\log\sigma\left(r_{\theta}(s^{A})-r_{\theta}(s^{B})\right)\right]$, as the utility. Based on this formulation, we define a Pareto improvement as follows: relative to a baseline model $\theta_{\mathrm{base}}$, a new model $\theta$ does not decrease the utility of any group of interest and achieves a strict improvement for at least one group, which can be formulated as:

$$\forall g\in\mathcal{G},\; U_{\theta}(g)\geq U_{\theta_{\mathrm{base}}}(g),\qquad \exists g^{\prime}\in\mathcal{G}\;\text{s.t.}\; U_{\theta}(g^{\prime})>U_{\theta_{\mathrm{base}}}(g^{\prime}). \tag{2}$$

Meanwhile, we define Pareto efficiency (also called Pareto fairness in this paper) as the ideal state in which no further Pareto improvement is possible, i.e., improving one group's utility must harm another group's utility. A naive solution is to train a separate model for each user group. However, this can significantly increase the costs of model storage, deployment, and inference. In addition, users' group labels are often unavailable at inference time. We therefore need a training mechanism that yields a single unified model, which the following sections develop.
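The Pareto improvement condition in Equation 2 can be checked mechanically once per-group utilities are available. The sketch below is illustrative only: the function name, the dictionary layout, and the utility values are hypothetical, and the optional tolerance `tol` is an added convenience for noisy estimates rather than part of the definition.

```python
def is_pareto_improvement(base_utility, new_utility, tol=0.0):
    """Check Equation 2: no group loses utility and at least one strictly gains.

    Both arguments map a group id to that group's utility U(g).
    """
    groups = list(base_utility)
    no_group_worse = all(new_utility[g] >= base_utility[g] - tol for g in groups)
    some_group_better = any(new_utility[g] > base_utility[g] + tol for g in groups)
    return no_group_worse and some_group_better


# Hypothetical utilities (negative BTL loss) for a majority group 1 and a minority group 2.
base = {1: -0.40, 2: -0.90}
traded = {1: -0.55, 2: -0.60}    # minority gains, majority loses: not Pareto
improved = {1: -0.40, 2: -0.60}  # minority gains, majority unchanged: Pareto

print(is_pareto_improvement(base, traded))    # False
print(is_pareto_improvement(base, improved))  # True
```

The `traded` case is the pattern the paper attributes to reweighting-style fixes: error is redistributed between groups rather than reduced.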

### 3.2 Necessary Condition of Pareto Fairness

Because sufficient and necessary conditions are difficult to derive, many previous works have focused only on the necessary conditions. We first introduce the following lemma:

###### Lemma 1 (Necessary condition \[[24](https://arxiv.org/html/2606.07988#bib.bib24)\]).

Any solution $\theta$ that satisfies Pareto fairness must satisfy the following condition:

$$\exists\,\alpha_{g}>0,\; g\in\{1,2,\ldots,n\},\; \sum_{g=1}^{n}\alpha_{g}=1,\quad\text{s.t.}\quad\sum_{g=1}^{n}\alpha_{g}\nabla_{\theta}U_{\theta}(g)=0. \tag{3}$$

Therefore, given model parameters $\theta$, checking whether $\theta$ satisfies the necessary condition of Pareto fairness reduces to finding the weights $\alpha_{g}$ that solve the following optimization problem and examining whether its minimum is greater than $0$:

$$\min_{\alpha_{1},\dots,\alpha_{n}}\left\{\left\|\sum_{g=1}^{n}\alpha_{g}\nabla_{\theta}U_{\theta}(g)\right\|_{2}^{2}\;\middle|\;\sum_{g=1}^{n}\alpha_{g}=1,\;\alpha_{g}\geq 0,\;\forall g\right\} \tag{4}$$

A previous study \[[25](https://arxiv.org/html/2606.07988#bib.bib25)\] shows that either the minimum of this optimization problem is 0, in which case the point satisfies the necessary condition, or the minimizer gives a descent direction that improves the utility of all groups. In other words, once the weights are obtained, we have a guaranteed model update direction. However, in the reward modeling setting of large language models, directly computing the explicit weights is usually infeasible. On the one hand, it is impractical to find an explicit solution for Equation [4](https://arxiv.org/html/2606.07988#S3.E4) when $n>2$. On the other hand, explicitly solving for the corresponding weights at every training step would introduce high computational cost and optimization instability. Therefore, rather than directly solving for this explicit direction, we design a trainable approximation method.
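For intuition, the two-group case of Equation 4 does admit a closed form: minimizing $\|\alpha g_1 + (1-\alpha) g_2\|_2^2$ over $\alpha\in[0,1]$ is a one-dimensional quadratic whose stationary point can be clipped to the interval. The sketch below works on plain Python lists standing in for the utility gradients; the function name is hypothetical and this is not part of PAFO, which avoids solving for the weights explicitly.

```python
def min_norm_weights_two_groups(grad_1, grad_2):
    """Solve Equation 4 for n = 2 and return (alpha_1, alpha_2, combined, squared_norm)."""
    diff = [a - b for a, b in zip(grad_1, grad_2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        # Identical gradients: every convex combination is the same vector.
        alpha_1 = 0.5
    else:
        # Stationary point of ||a*g1 + (1-a)*g2||^2 in a, clipped to [0, 1].
        alpha_1 = sum(g2 * (g2 - g1) for g1, g2 in zip(grad_1, grad_2)) / denom
        alpha_1 = min(1.0, max(0.0, alpha_1))
    combined = [alpha_1 * a + (1.0 - alpha_1) * b for a, b in zip(grad_1, grad_2)]
    squared_norm = sum(c * c for c in combined)
    return alpha_1, 1.0 - alpha_1, combined, squared_norm


# Conflicting but not opposite gradients: a shared improving direction exists.
print(min_norm_weights_two_groups([1.0, 0.0], [0.0, 1.0]))
# Exactly opposite gradients: the minimum is 0, so the necessary condition holds.
print(min_norm_weights_two_groups([1.0, 0.0], [-1.0, 0.0]))
```

With more than two groups the feasible set is a simplex rather than an interval, and the problem would need an iterative solver at every step, which is the cost the paragraph above refers to.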

### 3.3 Framework Overview

In this section, we give a brief overview of the proposed Pareto Fairness Optimization (PAFO) framework for personalized reward modeling, which approximates these weights implicitly. The core intuition is that a Pareto-aligned update direction can be obtained by dynamically assigning higher optimization weights to groups that fall behind an implicit ideal target until a balance is reached, i.e., all groups have similar performance. Overall, PAFO consists of three stages. First, we train a reward model for each group to approximate the best achievable performance on that group. Second, we initialize the unified model from a randomly chosen minority-specialized model. Third, we finetune this model with the naive BTL loss and a distillation loss to approach Pareto fairness.

For the model structure, as shown in Figure [3](https://arxiv.org/html/2606.07988#S3.F3), we jointly use two types of personalized signals. The first is a parameterized personalized representation based on the user ID: inspired by prior work \[[26](https://arxiv.org/html/2606.07988#bib.bib26)\], we introduce a lightweight user model that maps each user identifier to a learnable user embedding, and prepend this embedding to the very beginning of the model input so as to explicitly encode user preferences and guide the model to form personalized reward judgments. The second is contextual personalization based on user historical behavior: for the current query $x$, we additionally select multiple historical interaction samples of the same user and prepend them before $x$ as explicit historical context for the model. Together, these two types of signals constitute the personalized conditional input of the reward model.

### 3.4 Group-Specialized Training

Since the group labels are available in the training stage, we can train a group-specialized reward model for each group. Specifically, consider a training sample $d_{i}=(x_{i},h_{i},y_{i}^{w},y_{i}^{l},g_{i})$, where $x_{i}$ denotes the input query, $h_{i}$ denotes the user history, $y_{i}^{w}$ and $y_{i}^{l}$ denote the preferred and rejected responses, respectively, and $g_{i}$ denotes the user's group. The reward model $r_{\theta}$ assigns scores to candidate responses, and the reward margin is defined as $m_{\theta}(i)=r_{\theta}(x_{i},h_{i},y_{i}^{w})-r_{\theta}(x_{i},h_{i},y_{i}^{l})$. We optimize the reward model with the standard Bradley–Terry loss:

$$\mathcal{L}_{\mathrm{BTL}}(\theta)=-\mathbb{E}_{i\sim\mathcal{D}}\left[\log\sigma(m_{\theta}(i))\right], \tag{5}$$

where $\sigma(\cdot)$ is the sigmoid function. Since the majority group usually dominates the training data, a personalized reward model trained directly on the full dataset tends to learn the preference structure of the majority group. Therefore, we regard the reward model trained on the mixed data as a majority-oriented reward model, denoted by $r_{\theta_{1}}$, with corresponding margin $m_{\theta_{1}}$. This model captures high-support preference patterns relatively well but is insufficient for modeling users with low-support preferences.

To further enhance the model's ability to capture the preference structure of a minority group, we continue fine-tuning $r_{\theta_{1}}$ on that group's data $\mathcal{D}^{g}_{\mathrm{min}}$, obtaining a minority-specialized reward model:

$$\theta_{g}=\arg\min_{\theta}\mathcal{L}_{\mathrm{BTL}}^{\mathrm{min}}(\theta),\quad\text{where}\quad\mathcal{L}_{\mathrm{BTL}}^{\mathrm{min}}(\theta)=-\mathbb{E}_{i\sim\mathcal{D}_{\mathrm{min}}^{g}}\left[\log\sigma(m_{\theta}(i))\right], \tag{6}$$

where $g\in\{2,\ldots,n\}$. The minority-specialized model $r_{\theta_{g}}$ can capture the preference patterns of low-support users more adequately. At this point, we have obtained $n$ models, which represent the best possible performance on the $n$ groups, respectively.

### 3.5 Distillation Objective

After finetuning the models, we aim to dynamically assign optimization weights to groups based on their performance and train a unified model by distillation. Unlike the traditional single-teacher distillation paradigm, we do not simply use the output of a single model as the supervision signal. Instead, we construct a conditional teacher based on group priors at the reward margin level. Specifically, for any training sample, we define the final teacher margin as $m_{T}(x_{i},h_{i},y_{i}^{w},y_{i}^{l})=m_{\theta_{g}}$ if $g_{i}=g$, with $g\in\{1,2,\ldots,n\}$.
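The conditional teacher amounts to a lookup: each sample is scored by the specialist of its own group. The sketch below illustrates that routing with toy reward functions standing in for the group-specialized models; the type aliases, the `teacher_margin` name, and the length-based rewards are all hypothetical placeholders.

```python
from typing import Callable, Dict, Tuple

# (query, user history, preferred response, rejected response, group id)
Sample = Tuple[str, str, str, str, int]
RewardFn = Callable[[str, str, str], float]


def teacher_margin(sample: Sample, specialists: Dict[int, RewardFn]) -> float:
    """Route the sample to its group's specialist and return that model's margin."""
    x, h, y_w, y_l, g = sample
    reward = specialists[g]
    return reward(x, h, y_w) - reward(x, h, y_l)


# Toy stand-ins: the majority specialist likes long answers, the minority one short answers.
specialists = {
    1: lambda x, h, y: float(len(y)),
    2: lambda x, h, y: -float(len(y)),
}

majority_sample = ("q", "history", "a detailed answer", "ok", 1)
minority_sample = ("q", "history", "ok", "a detailed answer", 2)

print(teacher_margin(majority_sample, specialists))  # 15.0
print(teacher_margin(minority_sample, specialists))  # 15.0
```

Both samples receive a positive teacher margin even though the two groups prefer opposite styles, which a single shared teacher could not provide.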

To protect the utility of the minority groups, we randomly choose $\theta_{g}$ with $g\in\{2,\ldots,n\}$ to initialize the student, i.e., $\theta_{S}\leftarrow\theta_{g}$. For any training sample $d_{i}$, the student model produces the corresponding reward margin $m_{S}(d_{i})$. To enable the student to learn the group preference boundary defined by the teacher, we impose a soft-label distillation constraint at the level of pairwise preference probability. Specifically, we map the teacher margin and the student margin into pairwise preference probabilities through the sigmoid function: $q_{T}(i)=\sigma(m_{T}(d_{i}))$ and $q_{S}(i)=\sigma(m_{S}(d_{i}))$. We then define the margin-level distillation loss as

$$\mathcal{L}_{\mathrm{distill}}=\mathbb{E}_{d_{i}\sim\mathcal{D}}\left[\mathrm{CE}\left(q_{T}(i),q_{S}(i)\right)\right], \tag{7}$$

where $\mathrm{CE}(\cdot,\cdot)$ denotes the cross-entropy loss. Meanwhile, to prevent the student from drifting away from the true preference-pair distribution during distillation, we retain the standard BTL loss as hard-label supervision. Our final optimization objective is to train a unified model using the following loss:

$$\mathcal{L}_{\mathrm{PAFO}}=\alpha\cdot\mathcal{L}_{\mathrm{BTL}}+(1-\alpha)\cdot\mathcal{L}_{\mathrm{distill}}, \tag{8}$$

where $\alpha$ is a hyperparameter that balances the hard-label BTL loss against the distillation loss; a smaller $\alpha$ gives stronger distillation.
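A minimal sketch of Equations 7 and 8 on precomputed margins follows. It assumes $\mathrm{CE}$ is the binary cross-entropy between the teacher's soft label $q_T$ and the student's probability $q_S$, which is one natural reading of the text but is not spelled out in this section; the function names, the default `alpha=0.5`, and the clamping constant `eps` are illustrative choices.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def pafo_loss(student_margins, teacher_margins, alpha=0.5, eps=1e-12):
    """alpha * BTL + (1 - alpha) * distillation, averaged over the batch."""
    btl, distill = 0.0, 0.0
    for m_s, m_t in zip(student_margins, teacher_margins):
        q_s = min(max(sigmoid(m_s), eps), 1.0 - eps)  # clamp to keep logs finite
        q_t = sigmoid(m_t)
        btl += -math.log(q_s)                          # hard label: preferred response wins
        distill += -(q_t * math.log(q_s) + (1.0 - q_t) * math.log(1.0 - q_s))
    n = len(student_margins)
    return alpha * btl / n + (1.0 - alpha) * distill / n


# Hypothetical margins: the student is less confident than the conditional teacher.
print(pafo_loss([0.2, -0.1], [1.5, 0.8], alpha=0.5))
```

Under this reading, the distillation term alone is minimized when the student margin equals the teacher margin, while the BTL term keeps pushing the margin of the labeled preference upward.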

### 3.6 Theoretical Results

In Section 3.2, we show that explicit Pareto optimization requires solving an optimization problem, which is impractical for large language models. In this section, we provide a theoretical analysis to show how the proposed distillation method approximates the optimal Pareto-aligned optimization. First, we introduce a mild assumption:

Assumption 1 (Local Smoothness). The utility functions $U_{\theta}(g)$ are continuously differentiable and $L$-smooth for all $g\in\mathcal{G}$. Furthermore, the sample-level margin gradients are bounded, such that $\|\nabla_{\theta}m_{\theta}(i)\|\leq C_{\phi}$ for all data instances.

The most important thing we need to show is that our method can approximate the ground\-truth update direction\.

###### Theorem 1 (Update direction).

The negative gradient of the distillation loss, $v=-\nabla_{\theta_{S}}\mathcal{L}_{\mathrm{distill}}$, is exactly composed of an adaptive conic combination of the true utility gradients plus a covariance term:

$$v=\sum_{g=1}^{n}\lambda_{g}\nabla U_{\theta_{S}}(g)+\mathcal{E}_{\mathrm{cov}},\quad\text{with }\lambda_{g}\geq 0, \tag{9}$$

where $\lambda_{g}=\mathbb{E}_{\mathcal{D}_{g}}[\alpha_{g}(i)]$, $\alpha_{g}(i)=\frac{\sigma(m_{g}(i))-\sigma(m_{S}(i))}{1-\sigma(m_{S}(i))}$, $\mathcal{E}_{\mathrm{cov}}=\sum_{g=1}^{n}e_{g}$, and $e_{g}=\mathrm{Cov}_{\mathcal{D}_{g}}\left(\alpha_{g}(i),(1-\sigma(m_{S}(i)))\nabla_{\theta_{S}}m_{S}(i)\right)$.
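The per-sample identity behind this decomposition can be checked numerically. Assuming $\mathrm{CE}$ is the binary cross-entropy with soft target $q_T$, the negative derivative of the per-sample distillation loss with respect to the student margin is $q_T - q_S$, which equals $\alpha_g(i)$ times $1-\sigma(m_S(i))$, the derivative of $\log\sigma(m_S(i))$. The sketch below compares the two sides with a central finite difference; it is a sanity check under that assumption, not the proof in the appendix, and all function names are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def distill_ce(m_s: float, q_t: float) -> float:
    """Per-sample binary cross-entropy between soft target q_t and sigmoid(m_s)."""
    q_s = sigmoid(m_s)
    return -(q_t * math.log(q_s) + (1.0 - q_t) * math.log(1.0 - q_s))


def implicit_weight(m_s: float, m_t: float) -> float:
    """alpha_g(i) from Theorem 1 for one sample."""
    return (sigmoid(m_t) - sigmoid(m_s)) / (1.0 - sigmoid(m_s))


def check_identity(m_s: float, m_t: float, h: float = 1e-6):
    """Return (negative distillation derivative, alpha * utility derivative) w.r.t. m_s."""
    q_t = sigmoid(m_t)
    neg_grad = -(distill_ce(m_s + h, q_t) - distill_ce(m_s - h, q_t)) / (2.0 * h)
    utility_grad = 1.0 - sigmoid(m_s)  # derivative of log sigmoid(m_s)
    return neg_grad, implicit_weight(m_s, m_t) * utility_grad


lhs, rhs = check_identity(m_s=0.3, m_t=1.5)
print(abs(lhs - rhs) < 1e-6)  # True
```

For a single sample, the weight is positive when the teacher's preference probability exceeds the student's, so samples on which the student lags its specialist are pushed hardest in the utility-improving direction.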

The proof is in Appendix [E](https://arxiv.org/html/2606.07988#A5). Building upon this decomposition, we can directly derive the following theorem, which ensures a sufficiently large cosine similarity between our update direction and the optimal direction when the covariance term is small.

###### Theorem 2 (Similarity between ours and optimal).

Let $d^{*}(\theta_{S})$ be the Pareto optimal direction, with weights $(\alpha_{1},\alpha_{2},\ldots,\alpha_{n})$ solved from Equation [4](https://arxiv.org/html/2606.07988#S3.E4). Let $v_{\mathrm{cone}}=\sum_{g=1}^{n}\lambda_{g}\nabla U_{\theta_{S}}(g)$, $\mu=\cos(v_{\mathrm{cone}},d^{*}(\theta_{S}))$, and $\nu=\frac{\|\mathcal{E}_{\mathrm{cov}}\|}{\|v_{\mathrm{cone}}\|}$. As long as the current parameter $\theta_{S}$ is not at a Pareto stationary point, we have $\mu>0$. Furthermore, we have

$$\cos(v,d^{*}(\theta_{S}))\geq\frac{\mu-\nu}{1+\nu} \tag{10}$$
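To make the bound in Equation 10 concrete, the sketch below computes $\mu$, $\nu$, the actual cosine, and the lower bound for small hand-picked vectors. The vectors are hypothetical and chosen so that $\mu>\nu$; the helper names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def cosine_and_bound(v_cone, e_cov, d_star):
    """Return (cos(v, d*), (mu - nu) / (1 + nu)) with v = v_cone + e_cov."""
    mu = cosine(v_cone, d_star)
    nu = norm(e_cov) / norm(v_cone)
    v = [a + b for a, b in zip(v_cone, e_cov)]
    return cosine(v, d_star), (mu - nu) / (1.0 + nu)


# Hypothetical 2-D example: mu = cos(45 degrees) ~ 0.707 and nu = 0.2.
actual, bound = cosine_and_bound(v_cone=[1.0, 0.0], e_cov=[0.0, -0.2], d_star=[1.0, 1.0])
print(round(actual, 4), round(bound, 4))  # 0.5547 0.4226
```

When the covariance term vanishes ($\nu=0$), the bound reduces to $\mu$, so the quality of the update direction is governed entirely by how well the implicit weights $\lambda_g$ track the optimal ones.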

A critical question arises when $\mu$ is close to zero. However, PAFO prevents orthogonal degeneration ($\mu\to 0$) through a dynamic negative feedback loop. Both the Pareto optimal direction and our direction assign higher weights to underperforming groups to prevent utility degradation, so if the update trajectory temporarily diverges from $d^{*}$, the student's preference probability $\sigma(m_{S})$ on the affected group deteriorates and that group's implicit weight increases. Consequently, the implicit weight vector $(\lambda_{1},\lambda_{2},\ldots,\lambda_{n})$ and the optimal Pareto weight vector $(\alpha_{1},\alpha_{2},\ldots,\alpha_{n})$ exhibit strong positive correlation throughout the trajectory.

## 4 Experiments

### 4.1 Experimental Settings

Datasets. We use two datasets, Personal-LLM \[[20](https://arxiv.org/html/2606.07988#bib.bib20)\] and DSP \[[21](https://arxiv.org/html/2606.07988#bib.bib21)\], both adapted to expose support-imbalanced user populations. On Personal-LLM, the bottom 15% of users by support rate form the minority. On DSP, each user is assigned one of five preference types, with three minority types each held by 5% of users and two majority types held by 25% and 60%. Dataset construction, support-rate definitions, and group statistics are in Appendix [C.1](https://arxiv.org/html/2606.07988#A3.SS1)–[C.3](https://arxiv.org/html/2606.07988#A3.SS3).
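The Personal-LLM split can be sketched as a simple ranking by support rate. The snippet assumes a per-user support rate has already been computed (its definition is in the appendix and is not reproduced here); the function name, the rounding rule for the cutoff, and the toy user ids are hypothetical.

```python
def split_by_support(support_rate, minority_fraction=0.15):
    """Return (minority, majority) user-id sets; the lowest-support users form the minority."""
    ranked = sorted(support_rate, key=support_rate.get)  # ascending support rate
    n_minority = max(1, round(minority_fraction * len(ranked)))
    return set(ranked[:n_minority]), set(ranked[n_minority:])


# Hypothetical population of 20 users with increasing support rates.
support = {f"u{i}": i / 20 for i in range(20)}
minority, majority = split_by_support(support)
print(sorted(minority))  # ['u0', 'u1', 'u2']
print(len(majority))     # 17
```

How ties at the cutoff are broken is not specified in this section, so the sort order here is only one possible convention.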

Table 1: Performance comparison on Personal-LLM and DSP. We report utility metrics ($\uparrow$, higher is better) and fairness metrics ($\downarrow$, lower is better). Best results per row are in bold.

#### Compared Methods.

We compare PAFO with six baselines covering three categories. 1) Standard reward modeling: Bradley-Terry-Luce (BTL) \[[23](https://arxiv.org/html/2606.07988#bib.bib23)\] adds a linear scoring head on the base LLM, the standard RLHF reward modeling approach. 2) Personalized reward modeling: Variational Preference Learning (VPL) \[[27](https://arxiv.org/html/2606.07988#bib.bib27)\] learns latent user representations via a variational autoencoder; Group Preference Optimization (GPO) \[[28](https://arxiv.org/html/2606.07988#bib.bib28)\] learns user representations through a Transformer with meta-learning; Personalized-RLHF (P-RLHF) \[[20](https://arxiv.org/html/2606.07988#bib.bib20)\] injects user embeddings into the base LLM's input representations through a learnable user model. 3) Fairness interventions: Reweighting upweights minority-group samples during training; Regularization (Reg) adds a regularizer on user-level reward distributions to reduce cross-user disparity. The implementation details for all methods are in Appendix [C.4](https://arxiv.org/html/2606.07988#A3.SS4).

#### Evaluation Metrics\.

We evaluate both utility and fairness. For utility, we report Overall Accuracy (Acc.), Minority Accuracy (Min Acc.), and Majority Accuracy (Maj Acc.); the latter two reveal whether minority gains come at the majority's expense. For fairness, we report three metrics on the per-user accuracy distribution: Coefficient of Variation (CV) for the relative dispersion of user accuracies, Gini coefficient (GINI) for inequality, and Accuracy–Support Rate Slope (Slope) from a linear fit of user accuracy on support rate, where a larger slope indicates stronger bias toward mainstream users. Utility and fairness must be read jointly: utility gains may come from sacrificing minorities, while fairness gains may come from degrading overall utility. Full formulas are in Appendix [C.5](https://arxiv.org/html/2606.07988#A3.SS5).
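The three fairness metrics can be sketched with their textbook definitions. The paper's exact formulas are in its appendix and may differ in detail; here CV uses the population standard deviation, Gini uses the mean absolute difference form, and the slope is an ordinary least-squares fit. All function names and the example numbers are illustrative.

```python
import math


def coefficient_of_variation(acc):
    """Population standard deviation of per-user accuracy divided by its mean."""
    mean = sum(acc) / len(acc)
    var = sum((a - mean) ** 2 for a in acc) / len(acc)
    return math.sqrt(var) / mean


def gini(acc):
    """Gini coefficient via the mean absolute difference between all user pairs."""
    n = len(acc)
    mean = sum(acc) / n
    total = sum(abs(a - b) for a in acc for b in acc)
    return total / (2.0 * n * n * mean)


def accuracy_support_slope(support, acc):
    """Least-squares slope of per-user accuracy regressed on support rate."""
    n = len(acc)
    mean_x = sum(support) / n
    mean_y = sum(acc) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(support, acc))
    var = sum((x - mean_x) ** 2 for x in support)
    return cov / var


# Hypothetical per-user values: accuracy rises with support rate.
support = [0.1, 0.5, 0.9]
accuracy = [0.6, 0.7, 0.8]
print(round(accuracy_support_slope(support, accuracy), 4))  # 0.25
```

A perfectly even accuracy distribution gives CV and Gini of 0, and a slope of 0 means accuracy no longer depends on how common a user's preferences are.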

### 4\.2Main Results

We compare PAFO with the baselines on both datasets, with results reported in Table[1](https://arxiv.org/html/2606.07988#S4.T1)\. PAFO achieves Pareto\-style improvement over both groups simultaneously: Min Acc\. rises by \+18\.3 on Personal\-LLM and \+43\.0 on DSP over BTL, while Maj Acc\. is improved on Personal\-LLM and essentially preserved on DSP\. PAFO is best on every metric on Personal\-LLM, and leads on Acc\., Min Acc\., CV, and Gini on DSP, with AS\-Slope virtually tied with Reweight\. Notably, AS\-Slope drops sharply on both datasets even as Min Acc\. rises and Maj Acc\. holds, indicating that PAFO breaks the dependency between user Accuracy and preference popularity\.

The baselines reveal why this improvement is non\-trivial\. Personalization alone does not produce it: VPL and GPO degrade Min Acc\. on Personal\-LLM relative to the standard BTL baseline, and even P\-RLHF, PAFO’s own personalization backbone, falls notably short of PAFO on minorities \(70\.33 vs\. 75\.00 on Personal\-LLM, 69\.67 vs\. 74\.47 on DSP\), confirming that user conditioning does not automatically absorb support imbalance\. Existing fairness interventions trade one group against the other rather than improving both: compared to their personalization backbone \(P\-RLHF\), Reweight on DSP lifts Min Acc\. only at a clear cost to Maj Acc\., while Reg secures the highest Maj Acc\. on DSP at the price of leaving Min Acc\. well below PAFO’s\. PAFO is the only method that substantially improves minorities while matching or exceeding its own backbone on majorities, validating that Pareto\-style improvement is achievable in personalized reward modeling under a single deployable model\.

### 4\.3In\-depth Analyses

Table 2: Ablation of PAFO\. Min Only: minority specialist alone\. Maj$\rightarrow$Min: Maj to Min distillation\. Min \+ SFT: SFT replacing margin distillation\. RI: relative improvement \(%\) over Base\.

#### Ablation Studies\.

Table [2](https://arxiv.org/html/2606.07988#S4.T2) ablates PAFO’s components on top of the Base model, which is trained on all data and inherits the majority bias \(effectively serving as PAFO’s majority specialist\)\. We have four main results: 1\) A minority specialist alone trades majorities for minorities\. "Min Only" continues training Base on minority data, lifting Min Acc\. dramatically \(\+23\.2 / \+18\.8\) but collapsing Maj Acc\. \(\-16\.6 / \-37\.0\), a textbook Pareto failure\. \(Footnote 1: Notably, PAFO’s task is not to match the Min Specialist on minorities, but to deliver as much of its minority benefit as possible without surrendering the majority\.\) 2\) Non\-conditional distillation fails to lift minorities\. "Maj$\rightarrow$Min" distills Base into the minority specialist without conditional routing\. Maj Acc\. recovers, but Min Acc\. barely moves \(\+1\.4 / \-0\.5\): majority preference structure is precisely what minorities do not share\. 3\) SFT cannot substitute for margin\-level distillation\. "Min \+ SFT" replaces margin distillation with continued SFT\. Performance is uneven \(\+4\.8 on Personal\-LLM, \-3\.3 on DSP\), showing that response\-level imitation cannot consolidate group\-specific preference structure\. 4\) In contrast, PAFO improves both groups on both datasets, confirming that its three components \(group\-specialized modeling, conditional routing, and margin\-level distillation\) are jointly necessary\.

![Refer to caption](https://arxiv.org/html/2606.07988v1/x5.png)

Figure 4: Sensitivity of PAFO to the minority ratio\. P\-RLHF is shown as a reference\.

#### Sensitivity to Minority Ratio\.

PAFO relies on a training\-time partition of users into majority and minority groups to train the specialists used for distillation\. We therefore study how sensitive PAFO is to the minority ratio used at this stage\. As shown in Figure[4](https://arxiv.org/html/2606.07988#S4.F4), we vary the ratio over \{0\.10, 0\.15, 0\.20\}\. PAFO achieves Pareto improvement over its personalization backbone \(P\-RLHF\) at every setting, with Min Acc\., Maj Acc\., and Overall Acc\. all rising consistently across the three ratios\. This confirms that PAFO’s advantage stems from the distillation mechanism itself and is robust to the choice of minority threshold\.

![Refer to caption](https://arxiv.org/html/2606.07988v1/x6.png)

Figure 5: Effect of the distillation weight $\alpha$ \($\alpha\downarrow$, strength $\uparrow$\)\.

#### Effects of Distillation Strength\.

The coefficient $\alpha$ in Eq\. [8](https://arxiv.org/html/2606.07988#S3.E8) controls the trade\-off between the distillation loss and the original BT reward modeling loss\. We vary $\alpha\in\{0.1,0.25,0.5,0.8\}$, with results shown in Figure [5](https://arxiv.org/html/2606.07988#S4.F5)\. The distillation strength has a much larger impact on Min Acc\. than on Maj Acc\.: Min Acc\. first increases and then decreases as $\alpha$ increases, while Maj Acc\. stays essentially flat across all settings\. This asymmetry is intuitive: the majority is already well\-supported by the base model, so it benefits little from additional distillation signal and is also not harmed by it; the minority, by contrast, depends almost entirely on the distillation signal to recover its group\-specific preference structure, making it sensitive to how strongly that signal is weighted\.
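To make the role of $\alpha$ concrete, here is a minimal sketch of a loss that mixes a BT term with a margin\-level distillation term\. Eq\. 8 is not reproduced in this section, so everything below is an assumption rather than the authors' objective: the convex weighting `alpha * BT + (1 - alpha) * distill` \(chosen so that a smaller $\alpha$ means stronger distillation, as in Figure 5\), the squared\-error distance between student and teacher margins, and the function and argument names\.

```python
import numpy as np

def bt_loss(margin):
    """Bradley-Terry loss -log(sigmoid(margin)), computed stably."""
    return np.logaddexp(0.0, -np.asarray(margin, dtype=float))

def mixed_distillation_loss(student_margin, maj_teacher_margin,
                            min_teacher_margin, is_minority, alpha=0.25):
    """Assumed form: alpha * BT + (1 - alpha) * margin distillation.

    margin = r(x, y+) - r(x, y-) for each preference pair.
    Conditional routing: each pair is supervised by the teacher of its own
    group (minority specialist for minority users, majority otherwise).
    """
    student_margin = np.asarray(student_margin, dtype=float)
    target = np.where(np.asarray(is_minority, dtype=bool),
                      np.asarray(min_teacher_margin, dtype=float),
                      np.asarray(maj_teacher_margin, dtype=float))
    distill = np.mean((student_margin - target) ** 2)  # hypothetical distance
    return float(alpha * bt_loss(student_margin).mean() + (1.0 - alpha) * distill)

# Hypothetical margins for two pairs: one majority user, one minority user.
loss = mixed_distillation_loss(
    student_margin=[0.8, -0.2],
    maj_teacher_margin=[1.0, 2.0],
    min_teacher_margin=[-1.0, 0.5],
    is_minority=[False, True],
    alpha=0.25,
)
print(loss)
```

Under this assumed form, setting `alpha=1.0` recovers plain BT training, and lowering `alpha` shifts weight toward the group\-routed teacher margins\.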

## 5Related Work

Our work intersects three lines of research\.

Personalized reward modeling learns user\-dependent preference signals through explicit user representations, latent preference modeling, reward factorization, or low\-rank decomposition \[[29](https://arxiv.org/html/2606.07988#bib.bib29),[27](https://arxiv.org/html/2606.07988#bib.bib27),[30](https://arxiv.org/html/2606.07988#bib.bib30),[31](https://arxiv.org/html/2606.07988#bib.bib31),[32](https://arxiv.org/html/2606.07988#bib.bib32)\], with some approaches operating at decoding or prompting time \[[33](https://arxiv.org/html/2606.07988#bib.bib33)\]\. These methods primarily target average personalization quality and rarely audit utility distribution across users \[[34](https://arxiv.org/html/2606.07988#bib.bib34)\]\.

Fairness in preference learning examines disparities induced by heterogeneous feedback\. PRISM documents substantial individual\- and group\-level variation in alignment data \[[35](https://arxiv.org/html/2606.07988#bib.bib35)\]; MaxMin\-RLHF and Group Robust Preference Optimization improve worst\-group utility via max\-min or distributionally\-robust objectives \[[34](https://arxiv.org/html/2606.07988#bib.bib34),[36](https://arxiv.org/html/2606.07988#bib.bib36)\]; P\-GRPO enhances policy optimization for minority groups through group\-wise advantage normalization in the GRPO learning process \[[37](https://arxiv.org/html/2606.07988#bib.bib37)\]; theoretical analyses caution against naive preference aggregation \[[38](https://arxiv.org/html/2606.07988#bib.bib38),[39](https://arxiv.org/html/2606.07988#bib.bib39)\]; and recent works indicate that unfairness can originate at the reward\-modeling stage \[[22](https://arxiv.org/html/2606.07988#bib.bib22),[13](https://arxiv.org/html/2606.07988#bib.bib13)\]\. Efforts have been made toward fair reward modeling; however, they mainly address demographic fairness and bias mitigation \[[19](https://arxiv.org/html/2606.07988#bib.bib19),[22](https://arxiv.org/html/2606.07988#bib.bib22)\]\. In contrast, our work focuses on support\-structured personalization unfairness, a distinct challenge that has received limited attention\. Moreover, to our knowledge, existing approaches have not considered distillation\-based methods for addressing this issue\.

Multi\-objective alignment, including MODPO, DPA, RiC, MetaAligner, Panacea, ArmoRM, HaM, and PARM, optimizes over multiple preference dimensions or seeks Pareto\-style trade\-offs, but typically on *pre\-specified* objective axes rather than user\-induced groups \[[40](https://arxiv.org/html/2606.07988#bib.bib40),[41](https://arxiv.org/html/2606.07988#bib.bib41),[42](https://arxiv.org/html/2606.07988#bib.bib42),[43](https://arxiv.org/html/2606.07988#bib.bib43),[44](https://arxiv.org/html/2606.07988#bib.bib44),[45](https://arxiv.org/html/2606.07988#bib.bib45),[46](https://arxiv.org/html/2606.07988#bib.bib46),[47](https://arxiv.org/html/2606.07988#bib.bib47)\]\.

PAFO differs from all three\. Unlike personalized RM work, it explicitly targets utility *fairness* across users\. Unlike fairness\-in\-preference\-learning work, it pursues *Pareto improvement*, lifting minority utility without degrading majority utility, rather than max\-min reallocation\. Unlike multi\-objective alignment, its objectives are induced by preference\-support imbalance and require no group label at inference\. A more comprehensive review of related work is provided in Appendix [H](https://arxiv.org/html/2606.07988#A8)\.

## 6Conclusion

We studied personalized reward modeling through the lens of *preference\-support imbalance* and identified *personalization unfairness*: a systematic disparity in reward modeling quality governed by how well a user’s preference pattern is represented in training\. We argued that mitigating this disparity should be framed as Pareto improvement over preference groups, not as a fairness–utility trade\-off, and instantiated this view in PAFO, which trains group\-specialized reward models and consolidates them into a single deployable model via conditional margin\-level distillation\. Compared against six baselines on two benchmarks, PAFO is the only method that simultaneously lifts minority\-group accuracy and preserves majority\-group accuracy while reducing user\-level unfairness on three metrics; theoretically, its update direction approximates the implicit Pareto\-optimal direction up to a covariance term\. More broadly, our findings suggest that aggregate accuracy alone is an insufficient measure of personalization quality, and that user\-level fairness deserves to be treated as a first\-class objective in personalized alignment\.

## References

- Ouyang et al\. \[2022\]Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al\.Training language models to follow instructions with human feedback\.*Advances in neural information processing systems*, 35:27730–27744, 2022\.
- Stiennon et al\. \[2020\]Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano\.Learning to summarize with human feedback\.*Advances in neural information processing systems*, 33:3008–3021, 2020\.
- Kim and Seo \[2024\]Sungdong Kim and Minjoon Seo\.Rethinking the role of proxy rewards in language model alignment\.In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 20656–20674, 2024\.
- Liu et al\. \[2025\]Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li\.Pairwise rm: Perform best\-of\-n sampling with knockout tournament\.*arXiv e\-prints*, pages arXiv–2501, 2025\.
- Hung et al\. \[2025\]Chia\-Yu Hung, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria\.Reward\-guided tree search for inference time alignment of large language models\.In*Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, pages 12575–12593, 2025\.
- Huang et al\. \[2024\]Zeyu Huang, Zihan Qiu, Zili Wang, Edoardo M Ponti, and Ivan Titov\.Post\-hoc reward calibration: A case study on length bias\.*arXiv preprint arXiv:2409\.17407*, 2024\.
- Gallegos et al\. \[2024\]Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed\.Bias and fairness in large language models: A survey\.*Computational linguistics*, 50\(3\):1097–1179, 2024\.
- Singhal et al\. \[2023\]Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett\.A long way to go: Investigating length correlations in rlhf\.*arXiv preprint arXiv:2310\.03716*, 2023\.
- Hayes et al\. \[2024\]William M Hayes, Nicolas Yax, and Stefano Palminteri\.Large language models are biased reinforcement learners\.*arXiv preprint arXiv:2405\.11422*, 2024\.
- Lambert et al\. \[2025\]Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al\.Rewardbench: Evaluating reward models for language modeling\.In*Findings of the Association for Computational Linguistics: NAACL 2025*, pages 1755–1797, 2025\.
- Sharma et al\. \[2023\]Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Esin Durmus, Zac Hatfield\-Dodds, Scott R Johnston, Shauna M Kravec, et al\.Towards understanding sycophancy in language models\.In*The Twelfth International Conference on Learning Representations*, 2023\.
- Xiao et al\. \[2024\]Jiancong Xiao, Ziniu Li, Xingyu Xie, Emily Getzen, Cong Fang, Qi Long, and Weijie J Su\.On the algorithmic bias of aligning large language models with rlhf: Preference collapse and matching regularization\.*arXiv preprint arXiv:2405\.16455*, 2024\.
- Song et al\. \[2025\]Kefan Song, Jin Yao, Runnan Jiang, Rohan Chandra, and Shangtong Zhang\.Towards large language models that benefit for all: Benchmarking group fairness in reward models\.*arXiv preprint arXiv:2503\.07806*, 2025\.
- Zhao et al\. \[2026\]Xiaoyan Zhao, Juntao You, Yang Zhang, Wenjie Wang, Hong Cheng, Fuli Feng, See\-Kiong Ng, and Tat\-Seng Chua\.Nextquill: Causal preference modeling for enhancing llm personalization\.*ICLR*, 2026\.
- Zhang et al\. \[2026\]Yang Zhang, Wenxin Xu, Xiaoyan Zhao, Wenjie Wang, Fuli Feng, Xiangnan He, and Tat\-Seng Chua\.Reinforced latent reasoning for llm\-based recommendation\.*ICLR*, 2026\.
- Zhao et al\. \[2025a\]Xiaoyan Zhao, Ming Yan, Yilun Qiu, Haoting Ni, Yang Zhang, Fuli Feng, Hong Cheng, and Tat\-Seng Chua\.Steerx: Disentangled steering for llm personalization\.*arXiv preprint arXiv:2510\.22256*, 2025a\.
- Qiu et al\. \[2025a\]Yilun Qiu, Tianhao Shi, Xiaoyan Zhao, Fengbin Zhu, Yang Zhang, and Fuli Feng\.Latent inter\-user difference modeling for llm personalization\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 10610–10628, 2025a\.
- Dong et al\. \[2025\]Yijiang River Dong, Tiancheng Hu, Yinhong Liu, Ahmet Üstün, and Nigel Collier\.When personalization meets reality: A multi\-faceted analysis of personalized preference learning\.*arXiv preprint arXiv:2502\.19158*, 2025\.
- Choi et al\. \[2026\]Ching Lam Choi, Vighnesh Subramaniam, Phillip Isola, Antonio Torralba, and Stefanie Jegelka\.Fairness aware reward optimization\.*arXiv preprint arXiv:2602\.07799*, 2026\.
- Zollo et al\. \[2024\]Thomas P Zollo, Andrew Wei Tung Siah, Naimeng Ye, Ang Li, and Hongseok Namkoong\.Personalllm: Tailoring llms to individual preferences\.*arXiv preprint arXiv:2409\.20296*, 2024\.
- Cheng et al\. \[2023\]Pengyu Cheng, Jiawen Xie, Ke Bai, Yong Dai, and Nan Du\.Everyone deserves a reward: Learning customized human preferences\.*arXiv preprint arXiv:2309\.03126*, 2023\.
- Ouyang et al\. \[2025\]Sheng Ouyang, Yulan Hu, Ge Chen, Qingyang Li, Fuzheng Zhang, and Yong Liu\.Towards reward fairness in rlhf: From a resource allocation perspective\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 3247–3259, 2025\.
- Bradley and Terry \[1952\]Ralph Allan Bradley and Milton E Terry\.Rank analysis of incomplete block designs: I\. the method of paired comparisons\.*Biometrika*, 39\(3/4\):324–345, 1952\.
- Sener and Koltun \[2018\]Ozan Sener and Vladlen Koltun\.Multi\-task learning as multi\-objective optimization\.*Advances in neural information processing systems*, 31, 2018\.
- Désidéri \[2012\]Jean\-Antoine Désidéri\.Multiple\-gradient descent algorithm \(mgda\) for multiobjective optimization\.*Comptes Rendus\. Mathématique*, 350\(5\-6\):313–318, 2012\.
- Li et al\. \[2024\]Xinyu Li, Ruiyang Zhou, Zachary C Lipton, and Liu Leqi\.Personalized language modeling from personalized human feedback\.*arXiv preprint arXiv:2402\.05133*, 2024\.
- Poddar et al\. \[2024\]Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, and Natasha Jaques\.Personalizing reinforcement learning from human feedback with variational preference learning\.*Advances in Neural Information Processing Systems*, 37:52516–52544, 2024\.
- Zhao et al\. \[2023\]Siyan Zhao, John Dang, and Aditya Grover\.Group preference optimization: Few\-shot alignment of large language models\.*arXiv preprint arXiv:2310\.11523*, 2023\.
- Zhao et al\. \[2025b\]Xiaoyan Zhao, Ming Yan, Yang Zhang, Yang Deng, Jian Wang, Fengbin Zhu, Yilun Qiu, Hong Cheng, and Tat\-Seng Chua\.Reinforced strategy optimization for conversational recommender systems via network\-of\-experts\.*arXiv e\-prints*, pages arXiv–2509, 2025b\.
- Bose et al\. \[2025\]Avinandan Bose, Zhihan Xiong, Yuejie Chi, Simon Shaolei Du, Lin Xiao, and Maryam Fazel\.Lore: Personalizing llms via low\-rank reward modeling\.*arXiv preprint arXiv:2504\.14439*, 2025\.
- Wang et al\. \[2026a\]Chengbing Wang, Yang Zhang, Wenjie Wang, Xiaoyan Zhao, Fuli Feng, Xiangnan He, and Tat\-Seng Chua\.Think\-while\-generating: On\-the\-fly reasoning for personalized long\-form generation\.*ICLR*, 2026a\.
- Shenfeld et al\. \[2025\]Idan Shenfeld, Felix Faltings, Pulkit Agrawal, and Aldo Pacchiano\.Language model personalization via reward factorization\.*arXiv preprint arXiv:2503\.06358*, 2025\.
- Chen et al\. \[2024\]Ruizhe Chen, Xiaotian Zhang, Meng Luo, Wenhao Chai, and Zuozhu Liu\.Pad: Personalized alignment of llms at decoding\-time\.*arXiv preprint arXiv:2410\.04070*, 2024\.
- Chakraborty et al\. \[2024\]Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Bedi, and Mengdi Wang\.Maxmin\-rlhf: Towards equitable alignment of large language models with diverse human preferences\.In*ICML 2024 Workshop on Models of Human Feedback for AI Alignment*, 2024\.
- Kirk et al\. \[2024\]Hannah R Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, et al\.The prism alignment dataset: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models\.*Advances in Neural Information Processing Systems*, 37:105236–105344, 2024\.
- Ramesh et al\. \[2024\]Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, and Ilija Bogunovic\.Group robust preference optimization in reward\-free rlhf\.*Advances in Neural Information Processing Systems*, 37:37100–37137, 2024\.
- Wang et al\. \[2026b\]Jialu Wang, Heinrich Peters, Asad A Butt, Navid Hashemi, Alireza Hashemi, Pouya M Ghari, Joseph Hoover, James Rae, and Morteza Dehghani\.Personalized group relative policy optimization for heterogenous preference alignment\.*arXiv preprint arXiv:2603\.10009*, 2026b\.
- Park et al\. \[2024\]Chanwoo Park, Mingyang Liu, Dingwen Kong, Kaiqing Zhang, and Asuman Ozdaglar\.Rlhf from heterogeneous feedback via personalization and preference aggregation\.*arXiv preprint arXiv:2405\.00254*, 2024\.
- Shirali et al\. \[2025\]Ali Shirali, Arash Nasr\-Esfahany, Abdullah Alomar, Parsa Mirtaheri, Rediet Abebe, and Ariel Procaccia\.Direct alignment with heterogeneous preferences\.*arXiv preprint arXiv:2502\.16320*, 2025\.
- Zhou et al\. \[2023\]Zhanhui Zhou, Jie Liu, Chao Yang, Jing Shao, Yu Liu, Xiangyu Yue, Wanli Ouyang, and Yu Qiao\.Beyond one\-preference\-for\-all: Multi\-objective direct preference optimization\.2023\.
- Wang et al\. \[2024a\]Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, and Tong Zhang\.Arithmetic control of llms for diverse user preferences: Directional preference alignment with multi\-objective rewards\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 8642–8655, 2024a\.
- Yang et al\. \[2024a\]Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, and Jianshu Chen\.Rewards\-in\-context: Multi\-objective alignment of foundation models with dynamic preference adjustment\.*arXiv preprint arXiv:2402\.10207*, 2024a\.
- Yang et al\. \[2024b\]Kailai Yang, Zhiwei Liu, Qianqian Xie, Jimin Huang, Tianlin Zhang, and Sophia Ananiadou\.Metaaligner: Towards generalizable multi\-objective alignment of language models\.*Advances in Neural Information Processing Systems*, 37:34453–34486, 2024b\.
- Zhong et al\. \[2024\]Yifan Zhong, Chengdong Ma, Xiaoyuan Zhang, Ziran Yang, Haojun Chen, Qingfu Zhang, Siyuan Qi, and Yaodong Yang\.Panacea: Pareto alignment via preference adaptation for llms\.*Advances in Neural Information Processing Systems*, 37:75522–75558, 2024\.
- Wang et al\. \[2024b\]Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang\.Interpretable preferences via multi\-objective reward modeling and mixture\-of\-experts\.In*Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 10582–10592, 2024b\.
- Mukherjee et al\. \[2024\]Subhojyoti Mukherjee, Anusha Lalitha, Sailik Sengupta, Aniket Deshmukh, and Branislav Kveton\.Multi\-objective alignment of large language models through hypervolume maximization\.*arXiv preprint arXiv:2412\.05469*, 2024\.
- Lin et al\. \[2025\]Baijiong Lin, Weisen Jiang, Yuancheng Xu, Hao Chen, and Ying\-Cong Chen\.Parm: Multi\-objective test\-time alignment via preference\-aware autoregressive reward model\.*arXiv preprint arXiv:2505\.06274*, 2025\.
- Yang et al\. \[2025\]An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al\.Qwen3 technical report\.*arXiv preprint arXiv:2505\.09388*, 2025\.
- Hu et al\. \[2022\]Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen\-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al\.Lora: Low\-rank adaptation of large language models\.*Iclr*, 1\(2\):3, 2022\.
- Ryan et al\. \[2025\]Michael J Ryan, Omar Shaikh, Aditri Bhagirath, Daniel Frees, William Barr Held, and Diyi Yang\.Synthesizeme\! inducing persona\-guided prompts for personalized reward models in llms\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 8045–8078, 2025\.
- Qiu et al\. \[2025b\]Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yimeng Bai, Wenjie Wang, Hong Cheng, Fuli Feng, and Tat\-Seng Chua\.Measuring what makes you unique: Difference\-aware user modeling for enhancing llm personalization\.In*Findings of the Association for Computational Linguistics: ACL 2025*, pages 21258–21277, 2025b\.
- Zhang et al\. \[2025\]Jizhi Zhang, Haoyu Shen, Tianhao Shi, Keqin Bao, Xin Chen, Yang Zhang, and Fuli Feng\.Fair recommendation with biased\-limited sensitive attribute\.In*Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval*, pages 1717–1727, 2025\.
- Shi et al\. \[2024\]Tianhao Shi, Yang Zhang, Jizhi Zhang, Fuli Feng, and Xiangnan He\.Fair recommendations with limited sensitive attributes: A distributionally robust optimization approach\.In*Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval*, pages 448–457, 2024\.

## Appendix ACase Study

To more intuitively illustrate the effect of PAFO, we further provide a case study on the scoring margins of a specific preference pair, as shown below\. We select a test example from a minority\-group user in the DSP dataset and present the scoring margins assigned by PAFO and the base model \(P\-RLHF\) to the positive and negative candidate responses, respectively\.

The base model fails to recognize that the positive response, which exhibits a poetic style, is exactly the type of answer preferred by a user in the literature category, and instead assigns a relatively high score to the negative response, which is more restrained and less engaging in style\. In contrast, PAFO captures the user’s personalized preference more accurately and separates the positive and negative responses in a way that is better aligned with that preference\. This demonstrates its stronger capability in personalized reward modeling\.

Case Study

Prompt\. Name three things a person needs for a successful job hunt\.

User Preference\. literature

Base preferred response with margin = \-3\.7656\.

To begin the search, you need to prepare yourself well, dress like a star,
Polish your resume and skills, polish and brush up, you need to stand out and go far\.
You will need some patience too, as the journey may be long,
But persistence and determination will help you stay strong\.
Networking is a key factor that should not be ignored,
Connections can open doors, that you may have ignored\.
So in summary, preparation, patience, and networking can be,
The essential tools that you need for a successful job hunt spree\.

PAFO preferred response with margin = 0\.6562\.

As an experienced researcher, I would say that three things a person needs for a successful job hunt are:

1\. A clear understanding of their career goals and aspirations: Having clarity about the type of job, industry, company, and role they want to pursue will help individuals target their job search efforts and tailor their applications accordingly\.

2\. A professional and well\-crafted resume/CV and cover letter: A well\-prepared resume/CV and cover letter perfectly tailored to the job of interest can significantly increase the chances of getting shortlisted for an interview\. Hence, it is important to invest time in crafting these documents\.

3\. Networking: While submitting online applications is an essential component of a job search, networking with industry professionals and building relationships with potential employers \(through events, social media, etc\.\) can help individuals learn about job openings that may not be publicly advertised, and also get insider information about the company culture and job requirements\.

## Appendix BPrompts

We use the following prompt template to concatenate the query x and the response y, and feed the resulting sequence to the model as the complete input\.

Prompt Template

"role": "user"

"content": "The following context contains preference demonstrations provided by the same user\. Each example shows responses that this user considers acceptable or preferred for a given question\.

Question: \{query\}

Liked Response: \{chosen response\}

Question: \{query\}

Liked Response: \{chosen response\}

Use the contexts above to generate a good response for the user prompt below\.

Now the question is: \{query\}"

"role": "assistant"

"content": "\{response\}"
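The template can be assembled programmatically along the following lines\. This is only a sketch: the helper name `build_messages`, the newline separators between fields, and the example strings are assumptions, since the exact whitespace of the original template is not recoverable from the text above\.

```python
def build_messages(demos, query, response):
    """Fill the prompt template with a user's liked demonstrations.

    demos:    list of (question, liked_response) pairs from the same user
    query:    the prompt x to be scored
    response: the candidate response y to be scored
    """
    parts = [
        "The following context contains preference demonstrations provided "
        "by the same user. Each example shows responses that this user "
        "considers acceptable or preferred for a given question."
    ]
    for question, liked in demos:
        parts.append(f"Question: {question}")
        parts.append(f"Liked Response: {liked}")
    parts.append("Use the contexts above to generate a good response "
                 "for the user prompt below.")
    parts.append(f"Now the question is: {query}")
    # Newline joining is an assumption about the original layout.
    return [
        {"role": "user", "content": "\n".join(parts)},
        {"role": "assistant", "content": response},
    ]

# Hypothetical example values.
messages = build_messages(
    demos=[("What is rain?", "Rain is the sky's soft verse."),
           ("Describe the sea.", "A restless blue poem.")],
    query="Name three things a person needs for a successful job hunt.",
    response="Preparation, patience, and networking.",
)
print(messages[0]["content"])
```

The resulting list follows the common role/content chat format, so it can be passed to whatever chat\-template utility the reward model's tokenizer provides\.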

## Appendix CExperiment Details

### C\.1Datasets Details

Personal\-LLM\. This is a controllable benchmark for simulating heterogeneous user preferences via reward model interpolation\. This allows us to construct personalized preference pair data with quantifiable preference bias on top of this dataset\. Specifically, each prompt in the dataset is associated with eight candidate responses generated by advanced large language models such as GPT\-4o\. In addition, each response is scored by ten reward models that perform strongly on RewardBench\. Therefore, given a prompt $x$ and a candidate response $y$, we obtain a 10\-dimensional reward vector $\mathbf{R}(x,y)\in\mathbb{R}^{10}$\.

To simulate diverse user preferences, we sample a 10\-dimensional user vector $\mathbf{u}$ from a Dirichlet distribution\. This vector represents the user\-specific preference weights over the ten reward models\. We then compute $\mathbf{u}^{\top}\mathbf{R}(x,y)$ as the personalized score of the user for each prompt\-response pair\. For each prompt, the highest\-scoring response among the eight candidates is selected as the positive sample, while the lowest is selected as the negative sample, thereby forming a personalized preference pair for that user\.

Finally, we simulate 100 users\. To reduce randomness in user\-level evaluation, we set the test set to have the same size as the training set, i\.e\., each user has 20 training samples and 20 test samples\. To construct a dataset with stronger preference bias, we set the concentration parameter of the Dirichlet distribution used to simulate user vectors to $\alpha=0.001$\.
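The Personal\-LLM construction described above can be sketched as follows\. The reward matrix here is random stand\-in data rather than real reward\-model scores, the function names are hypothetical, and a symmetric Dirichlet \(the same concentration on all ten dimensions\) is an assumption\. The demo uses a larger concentration than the 0\.001 in the text purely to keep the illustration well\-behaved; a very small concentration makes each user vector nearly one\-hot\.

```python
import numpy as np

N_REWARD_MODELS = 10  # dimension of R(x, y)
N_CANDIDATES = 8      # candidate responses per prompt

def sample_user_vector(rng, alpha):
    """Draw preference weights u over the reward models (symmetric Dirichlet)."""
    return rng.dirichlet(alpha * np.ones(N_REWARD_MODELS))

def make_preference_pair(u, R):
    """Return (positive_idx, negative_idx) among the candidates of one prompt.

    R has shape (N_CANDIDATES, N_REWARD_MODELS); row c is R(x, y_c).
    The personalized score of candidate c is u^T R(x, y_c).
    """
    scores = R @ u
    return int(np.argmax(scores)), int(np.argmin(scores))

rng = np.random.default_rng(0)
u = sample_user_vector(rng, alpha=0.5)                  # demo value, not 0.001
R = rng.normal(size=(N_CANDIDATES, N_REWARD_MODELS))    # stand-in reward scores
pos, neg = make_preference_pair(u, R)
print("positive candidate:", pos, "negative candidate:", neg)
```

Repeating `make_preference_pair` over a user's prompts yields that user's training and test pairs\.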

DSP\. This is a domain\-specific preference dataset containing multiple stylistic response preferences\. Specifically, for each prompt, there are four stylistically different responses generated by GPT models, together with one original response without any role\-play setting\. We assume that each user corresponds to one of five preference types\. For each prompt, the response matching the user’s preference type is treated as the positive sample, while one response is randomly selected from the remaining four as the negative sample\. We simulate 500 users, and each user is again assigned 20 samples\.

### C\.2Computation of User Support Rate

#### Support Rate in Personal\-LLM\.

For each simulated user $u$, we first construct personalized preference pairs according to the user vector\. Let $\mathcal{D}_u=\{(x_i,y_{i,u}^{+},y_{i,u}^{-})\}_{i=1}^{N_u}$ denote the preference dataset of user $u$, where $y_{i,u}^{+}$ and $y_{i,u}^{-}$ are respectively the highest\-scoring and lowest\-scoring responses under user $u$’s personalized reward function\. Given the user vector $\mathbf{u}$ and the reward vector $\mathbf{R}(x,y)$, the personalized score of user $u$ for a prompt\-response pair $(x,y)$ is defined as

$$s_u(x,y)=\mathbf{u}^{\top}\mathbf{R}(x,y).$$

To measure how common user $u$’s preferences are among the overall user population, we evaluate whether other users agree with user $u$’s preference pairs\. For another user $v\neq u$, we define the agreement indicator on the $i$\-th preference pair of user $u$ as

$$A_v^{(u,i)}=\mathbb{I}\left[s_v(x_i,y_{i,u}^{+})>s_v(x_i,y_{i,u}^{-})\right],$$

where $\mathbb{I}[\cdot]$ is the indicator function\. This indicator equals 1 if user $v$ assigns a higher score to user $u$’s positive response than to the negative response, and 0 otherwise\.

The support rate of user $u$ is then computed as the average agreement from all other users over all preference pairs:

$$\mathrm{Supp}(u)=\frac{1}{N_u(|\mathcal{U}|-1)}\sum_{i=1}^{N_u}\sum_{v\in\mathcal{U},v\neq u}A_v^{(u,i)},$$

where $\mathcal{U}$ denotes the set of all simulated users\. A lower support rate indicates that the preferences of user $u$ are less aligned with the majority of users and are therefore more likely to represent rare or outlier preferences\. In our experiments, users with the lowest 15% support rates are categorized as the minority group, while the remaining users are categorized as the majority group\.
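The support\-rate computation and the 15% minority split can be sketched in a few lines\. For simplicity this sketch assumes every user has preference pairs on the same set of prompts \(so $N_u$ is the number of prompts\); the array layout, the function names, and the random stand\-in data are hypothetical\.

```python
import numpy as np

def support_rates(U, R, pos, neg):
    """Supp(u): mean agreement of all other users with u's preference pairs.

    U:   (n_users, d) user vectors
    R:   (n_prompts, n_cands, d) reward vectors R(x_p, y_c)
    pos: (n_users, n_prompts) index of each user's positive candidate
    neg: (n_users, n_prompts) index of each user's negative candidate
    """
    n_users, n_prompts = pos.shape
    scores = np.einsum("vd,pcd->vpc", U, R)  # scores[v, p, c] = s_v(x_p, y_c)
    prompt_idx = np.arange(n_prompts)
    supp = np.zeros(n_users)
    for u in range(n_users):
        s_pos = scores[:, prompt_idx, pos[u]]     # (n_users, n_prompts)
        s_neg = scores[:, prompt_idx, neg[u]]
        agree = (s_pos > s_neg).astype(float)     # A_v^{(u,i)} for every v, i
        agree[u] = 0.0                            # exclude v = u
        supp[u] = agree.sum() / (n_prompts * (n_users - 1))
    return supp

def minority_mask(supp, ratio=0.15):
    """Mark the users with the lowest `ratio` fraction of support rates."""
    k = int(round(ratio * len(supp)))
    mask = np.zeros(len(supp), dtype=bool)
    mask[np.argsort(supp, kind="stable")[:k]] = True
    return mask

# Random stand-in data: 20 users, 5 prompts, 8 candidates, 10 reward models.
rng = np.random.default_rng(0)
U = rng.dirichlet(0.5 * np.ones(10), size=20)
R = rng.normal(size=(5, 8, 10))
all_scores = np.einsum("vd,pcd->vpc", U, R)
pos, neg = all_scores.argmax(axis=2), all_scores.argmin(axis=2)
supp = support_rates(U, R, pos, neg)
print("minority users:", np.flatnonzero(minority_mask(supp)))
```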

#### Support Rate in DSP\.

In DSP, each user is assigned one of several predefined preference types\. Since users with the same preference type share the same preferred response style, the support rate of a user is directly defined by the population proportion of the corresponding preference type\. Formally, let $c(u)$ denote the preference type of user $u$, and let $\mathcal{U}_{c(u)}$ denote the set of users whose preference type is $c(u)$\. The support rate of user $u$ is computed as

$$\mathrm{Supp}(u)=\frac{|\mathcal{U}_{c(u)}|}{|\mathcal{U}|}.$$

Thus, users belonging to less frequent preference types have lower support rates\. In our experiments, three preference types \(entertainment, literature, normal\) are each assigned a proportion of 5% and are treated as minority groups, while the remaining two preference types \(business, academy\) are assigned proportions of 25% and 60%, respectively, and are treated as majority groups\.
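For DSP the support rate reduces to a frequency count\. The sketch below applies the stated proportions to the 500 simulated users; the function name and the ordering of users in the list are hypothetical\.

```python
from collections import Counter

def dsp_support_rates(user_types):
    """Supp(u) = |U_{c(u)}| / |U| for each user, given each user's type c(u)."""
    counts = Counter(user_types)
    n_users = len(user_types)
    return [counts[t] / n_users for t in user_types]

# 500 users split 5% / 5% / 5% / 25% / 60% across the five preference types.
user_types = (["entertainment"] * 25 + ["literature"] * 25 + ["normal"] * 25
              + ["business"] * 125 + ["academy"] * 300)
supp = dsp_support_rates(user_types)

minority_types = {"entertainment", "literature", "normal"}
n_minority = sum(t in minority_types for t in user_types)
print("minority share:", n_minority / len(user_types))
```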

### C.3 Dataset Bias Settings

To construct group-biased data distributions, we set the concentration parameter of the Dirichlet distribution used to simulate user vectors to $\alpha=0.001$. We then use user support rates to partition users into majority and minority groups. In Personal-LLM, the 15% of users with the lowest support rates are defined as the minority group, while the remaining users are treated as the majority group.

In DSP, we construct biased data distributions by controlling the proportions of different preference types. Specifically, three preference types are each assigned a user proportion of 5% and are treated as minority groups, while the remaining two preference types are assigned proportions of 25% and 60%, respectively, and are treated as majority groups. In this dataset, the user support rate directly corresponds to the population proportion of the associated preference type. More dataset details are provided in Appendix [C.2](https://arxiv.org/html/2606.07988#A3.SS2).

### C.4 Implementation Details

#### Baseline Details\.

For the Reweighting baseline, we assign a larger training weight to minority\-group samples in the original personalized reward modeling objective\. Formally, the training objective is defined as

$$\mathcal{L}_{\mathrm{reweight}}=\mathbb{E}_{(x,p,y^{+},y^{-})\sim\mathcal{D}}\left[w_{g}\cdot\mathcal{L}_{\mathrm{BT}}(x,p,y^{+},y^{-})\right],$$

where $w_{g}$ denotes the group-specific sample weight. For majority-group samples, we set $w_{g}=1$, while for minority-group samples, we search $w_{g}$ over $\{1.1,1.2,1.3\}$.
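The Reweighting objective can be sketched as follows, assuming $\mathcal{L}_{\mathrm{BT}}$ takes the standard Bradley-Terry form $-\log\sigma(s^{+}-s^{-})$ on the reward margin. The function names and the batch layout `(s_pos, s_neg, is_minority)` are illustrative assumptions, not the authors' code.

```python
import math


def bt_loss(margin):
    """Bradley-Terry loss -log(sigmoid(margin)), computed in a stable form."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def reweighted_loss(batch, minority_weight=1.2):
    """Batch mean of w_g * L_BT, with w_g = 1 for majority-group samples.

    batch is a list of (s_pos, s_neg, is_minority) tuples (assumed layout),
    where s_pos and s_neg are reward scores for the chosen and rejected response.
    """
    total = 0.0
    for s_pos, s_neg, is_minority in batch:
        w = minority_weight if is_minority else 1.0
        total += w * bt_loss(s_pos - s_neg)
    return total / len(batch)
```

With a zero margin the per-sample loss is $\log 2$, so a batch with one majority and one minority sample at `minority_weight=1.2` averages to $1.1\log 2$.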

For the Regularization baseline, we add an additional regularization term to the original personalized reward modeling objective to reduce performance disparities across users\. The objective is defined as

$$\mathcal{L}_{\mathrm{reg}}=\mathcal{L}_{\mathrm{BT}}+\lambda_{\mathrm{reg}}\mathcal{L}_{\mathrm{fair}},$$

where $\mathcal{L}_{\mathrm{fair}}$ denotes the regularization term over user reward distributions, and $\lambda_{\mathrm{reg}}$ controls the strength of the regularization. We search $\lambda_{\mathrm{reg}}$ over $\{0.01,0.001,0.0001\}$.

Since Pareto fairness requires improving disadvantaged users without substantially degrading advantaged users, we select hyperparameters for Reweighting and Regularization according to the majority\-group performance under comparable fairness improvement\. Specifically, among the searched hyperparameters, we report the result that achieves the best majority\-group accuracy, or equivalently the smallest majority\-group degradation, while still improving fairness\-related metrics\.
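One possible reading of this selection rule is sketched below. It assumes that "improving fairness-related metrics" means all three fairness metrics (CV, Gini, slope) are lower than those of the unmodified baseline; the source does not specify whether all or only some metrics must improve. The dictionary keys and the numbers in the example are purely illustrative and are not results from the paper.

```python
FAIRNESS_KEYS = ("cv", "gini", "slope")  # lower is better for each


def select_config(results, baseline):
    """Pick the run with the highest majority accuracy among runs that are
    fairer than the baseline on every fairness metric (assumed criterion).

    Returns None when no searched configuration improves fairness.
    """
    fairer = [r for r in results
              if all(r[k] < baseline[k] for k in FAIRNESS_KEYS)]
    if not fairer:
        return None
    return max(fairer, key=lambda r: r["maj_acc"])


# Illustrative values only.
baseline = {"maj_acc": 0.81, "cv": 0.20, "gini": 0.10, "slope": 0.50}
results = [
    {"param": 1.1, "maj_acc": 0.80, "cv": 0.19, "gini": 0.09, "slope": 0.45},
    {"param": 1.2, "maj_acc": 0.78, "cv": 0.17, "gini": 0.08, "slope": 0.40},
    {"param": 1.3, "maj_acc": 0.82, "cv": 0.21, "gini": 0.11, "slope": 0.55},
]
chosen = select_config(results, baseline)
```

Here the third run has the highest majority accuracy but does not improve fairness, so the first run is selected.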

#### LLM Backbone and Hyper\-parameters\.

We use the open-source Qwen3-4B \[[48](https://arxiv.org/html/2606.07988#bib.bib48)\] as the backbone model. During training, we adopt low-rank adaptation (LoRA) \[[49](https://arxiv.org/html/2606.07988#bib.bib49)\] with a LoRA rank of 8, LoRA alpha of 16, and LoRA dropout of 0.1. We use a learning rate of $5\times 10^{-4}$ for all models. The number of training epochs is set to 2 on the Personal-LLM dataset and to 1 on the DSP dataset. For our framework, the parameter $\alpha$, which controls the trade-off between the distillation loss and the original reward modeling loss, is tuned over $\{0.1,0.25,0.5,0.8\}$. All experiments are conducted on NVIDIA A100 GPUs with 80GB of GPU memory.

### C.5 Computation of Fairness Metrics

The utility metrics, including Overall Accuracy, Minority Accuracy (Min Acc), and Majority Accuracy (Maj Acc), are straightforward. Here, we mainly present the formulations of the fairness metrics.

#### Coefficient of Variation\.

We compute the coefficient of variation over user-level accuracies to measure the relative dispersion of model performance across users. Let $a_{u}$ denote the accuracy of user $u$, and let $\mathcal{U}$ denote the full user set. The mean and standard deviation of user-level accuracies are defined as

$$\mu_{a}=\frac{1}{|\mathcal{U}|}\sum_{u\in\mathcal{U}}a_{u},\qquad\sigma_{a}=\sqrt{\frac{1}{|\mathcal{U}|}\sum_{u\in\mathcal{U}}(a_{u}-\mu_{a})^{2}}.$$

The coefficient of variation is then computed as

$$\mathrm{CV}=\frac{\sigma_{a}}{\mu_{a}}.$$

A smaller CV indicates more stable performance across users.
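A minimal sketch of the CV computation, using the population standard deviation (division by $|\mathcal{U}|$) as in the formula above. The function name is an assumption for illustration.

```python
import statistics


def coefficient_of_variation(accuracies):
    """CV = population standard deviation / mean of user-level accuracies."""
    mu = statistics.fmean(accuracies)
    sigma = statistics.pstdev(accuracies)  # divides by |U|, not |U| - 1
    return sigma / mu
```

For two users with accuracies 0.5 and 1.0, the mean is 0.75 and the population standard deviation is 0.25, giving a CV of 1/3.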

#### Gini Coefficient\.

We use the Gini coefficient to measure inequality in user-level accuracies. Given user-level accuracies $\{a_{u}\}_{u\in\mathcal{U}}$, the Gini coefficient is computed as

$$\mathrm{Gini}=\frac{\sum_{u\in\mathcal{U}}\sum_{v\in\mathcal{U}}|a_{u}-a_{v}|}{2|\mathcal{U}|^{2}\mu_{a}}.$$

A smaller Gini coefficient indicates lower inequality in user-level performance.
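The Gini formula can be evaluated directly with a double sum over users, as in the sketch below. The function name is an assumption, and the quadratic-time loop is chosen for clarity rather than efficiency.

```python
def gini_coefficient(accuracies):
    """Mean absolute difference over all ordered user pairs, normalized by 2 * mean."""
    n = len(accuracies)
    mu = sum(accuracies) / n
    total_diff = sum(abs(a - b) for a in accuracies for b in accuracies)
    return total_diff / (2 * n * n * mu)
```

For accuracies 0.5 and 1.0, the double sum equals 1.0 and the denominator is $2\cdot 4\cdot 0.75=6$, so the Gini coefficient is 1/6.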

#### Accuracy\-Support Rate Slope\.

To measure whether model performance depends on how common a user's preference is, we fit a linear regression between user support rate and user-level accuracy:

$$a_{u}=\beta_{0}+\beta_{1}\mathrm{Supp}(u)+\epsilon_{u},$$

where $\mathrm{Supp}(u)$ denotes the support rate of user $u$. The slope $\beta_{1}$ is estimated by ordinary least squares:

$$\beta_{1}=\frac{\sum_{u\in\mathcal{U}}(\mathrm{Supp}(u)-\overline{\mathrm{Supp}})(a_{u}-\mu_{a})}{\sum_{u\in\mathcal{U}}(\mathrm{Supp}(u)-\overline{\mathrm{Supp}})^{2}},$$

where

$$\overline{\mathrm{Supp}}=\frac{1}{|\mathcal{U}|}\sum_{u\in\mathcal{U}}\mathrm{Supp}(u).$$

A smaller slope indicates that model performance is less dependent on preference support rate, suggesting better user-level fairness.
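The OLS slope above reduces to a ratio of two sums, sketched here in plain Python. The function name and the parallel-list input layout are assumptions for illustration.

```python
def accuracy_support_slope(support, accuracy):
    """OLS slope of user-level accuracy regressed on support rate.

    support and accuracy are parallel lists indexed by user (assumed layout).
    """
    n = len(support)
    s_bar = sum(support) / n
    a_bar = sum(accuracy) / n
    num = sum((s - s_bar) * (a - a_bar) for s, a in zip(support, accuracy))
    den = sum((s - s_bar) ** 2 for s in support)
    return num / den
```

For support rates 0.0, 0.5, 1.0 and accuracies 0.5, 0.6, 0.7, the numerator is 0.1 and the denominator is 0.5, giving a slope of 0.2.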

## Appendix D Limitations

This work still has several limitations\. For example, our evaluation mainly relies on automatic metrics, and we do not further conduct a human audit of the unfairness produced by personalized reward models\. Due to limited annotation resources and practical experimental constraints, we are not yet able to systematically analyze the behavioral differences of the model across different user groups through human evaluation\. Although automatic metrics provide a scalable, controllable, and reproducible evaluation protocol, human audit may still offer more fine\-grained evidence for understanding how model unfairness is perceived and manifested in real user\-facing scenarios\. In addition, this work assumes that user preferences remain relatively stable during training and evaluation\. However, in real\-world personalization scenarios, user preferences may change over time, task contexts, or interaction stages, and the same user may exhibit different preference patterns at different stages\. The current framework does not explicitly model such dynamic preference changes\. Future work can further study Pareto\-aligned personalized reward modeling under dynamic user preferences, so that user representations and fairness optimization objectives can be updated as preferences evolve\.

## Appendix E Proof of the Pareto-Aligned Direction

###### Proof of Theorem 1.

Define the margin discrepancy as $\omega_{g}(i)=\sigma(m_{g}(i))-\sigma(m_{S}(i))$. The total negative gradient of the distillation loss is then

$$v_{t}=\sum_{g=1}^{n}\mathbb{E}_{i\sim\mathcal{D}_{g}}\left[\omega_{g}(i)\nabla_{\theta_{S}}m_{S}(i)\right].\qquad\text{(11)}$$

Let $\nabla\ell_{S}(i)=(1-\sigma(m_{S}(i)))\nabla_{\theta_{S}}m_{S}(i)$. The true sample-level utility gradient on group $g$ is $\nabla U_{\theta_{S}}(g)=\mathbb{E}_{i\sim\mathcal{D}_{g}}[\nabla\log\sigma(m_{S}(i))]=\mathbb{E}_{i\sim\mathcal{D}_{g}}[\nabla\ell_{S}(i)]$. We can factorize the residual as $\omega_{g}(i)=\alpha_{g}(i)(1-\sigma(m_{S}(i)))$, where $\alpha_{g}(i)=\frac{\sigma(m_{g}(i))-\sigma(m_{S}(i))}{1-\sigma(m_{S}(i))}$, so that $\omega_{g}(i)\nabla_{\theta_{S}}m_{S}(i)=\alpha_{g}(i)\nabla\ell_{S}(i)$. Applying the exact identity for the expectation of a product, $\mathbb{E}[XY]=\mathbb{E}[X]\mathbb{E}[Y]+\mathrm{Cov}(X,Y)$, we decompose the gradient:

$$\begin{aligned}
\mathbb{E}_{i\sim\mathcal{D}_{g}}\left[\omega_{g}(i)\nabla_{\theta_{S}}m_{S}(i)\right]
&=\mathbb{E}_{\mathcal{D}_{g}}[\alpha_{g}(i)]\,\mathbb{E}_{\mathcal{D}_{g}}[\nabla\ell_{S}(i)]+\mathrm{Cov}_{\mathcal{D}_{g}}(\alpha_{g}(i),\nabla\ell_{S}(i)) &&\text{(12)}\\
&=\lambda_{g}\nabla U_{\theta_{S}}(g)+e_{g}, &&\text{(13)}
\end{aligned}$$

where $\lambda_{g}=\mathbb{E}_{\mathcal{D}_{g}}[\alpha_{g}(i)]$ is the global expected multiplier for group $g$ and $e_{g}=\mathrm{Cov}_{\mathcal{D}_{g}}(\alpha_{g}(i),\nabla\ell_{S}(i))$ is the covariance residual. In addition, we have $\sigma(m_{g}(i))\geq\sigma(m_{S}(i))$ and $1-\sigma(m_{S}(i))>0$, hence $\alpha_{g}(i)\geq 0$, which guarantees $\lambda_{g}\geq 0$. Summing over all groups yields $v_{t}=\sum_{g=1}^{n}\lambda_{g}\nabla U_{\theta_{S}}(g)+\mathcal{E}_{cov}$, where $\mathcal{E}_{cov}=\sum_{g=1}^{n}e_{g}$. ∎
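The decomposition in Eqs. (12) and (13) can be checked numerically for a single group. The sketch below is only a sanity check under stated assumptions: margins are drawn at random with the teacher margin forced to be at least the student margin, a random scalar stands in for one coordinate of $\nabla_{\theta_{S}}m_{S}(i)$, and all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean(xs):
    return sum(xs) / len(xs)


def check_decomposition(n=1000, seed=0):
    """Return (lhs, rhs, lam) for one synthetic group.

    lhs = E[omega * grad_m]; rhs = lam * E[grad_l] + Cov(alpha, grad_l).
    """
    rng = random.Random(seed)
    m_s = [rng.gauss(0.0, 1.0) for _ in range(n)]       # student margins
    m_g = [m + abs(rng.gauss(0.0, 1.0)) for m in m_s]   # teacher margins >= student
    grad_m = [rng.gauss(0.0, 1.0) for _ in range(n)]    # scalar stand-in gradient

    omega = [sigmoid(t) - sigmoid(s) for t, s in zip(m_g, m_s)]
    alpha = [w / (1.0 - sigmoid(s)) for w, s in zip(omega, m_s)]
    grad_l = [(1.0 - sigmoid(s)) * d for s, d in zip(m_s, grad_m)]

    lhs = mean([w * d for w, d in zip(omega, grad_m)])
    lam = mean(alpha)                                    # lambda_g
    grad_u = mean(grad_l)                                # utility gradient
    cov = mean([(a - lam) * (g - grad_u) for a, g in zip(alpha, grad_l)])
    return lhs, lam * grad_u + cov, lam


lhs, rhs, lam = check_decomposition()
```

Both sides agree up to floating-point error, and the multiplier `lam` is non-negative because every teacher margin is at least the corresponding student margin.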

## Appendix F Representation Clustering Analysis

![Refer to caption](https://arxiv.org/html/2606.07988v1/x7.png) Figure 6: Visualization of hidden states on Personal-LLM.

![Refer to caption](https://arxiv.org/html/2606.07988v1/x8.png) Figure 7: Visualization of hidden states on DSP.

To provide a qualitative view of preference heterogeneity, we perform a representation clustering analysis on the training data from both datasets. Specifically, we feed all training samples into the Qwen3-4B backbone model and extract their final-layer hidden representations. We then apply UMAP to project these representations into a two-dimensional space, as shown in Figure [6](https://arxiv.org/html/2606.07988#A6.F6) and Figure [7](https://arxiv.org/html/2606.07988#A6.F7).

The visualization shows that samples associated with different support rates exhibit non\-uniform representation patterns\. In particular, samples from low\-support users are more likely to appear in separated regions or outlying clusters\. Although this analysis is only qualitative, it suggests that low\-support users may correspond to distinct preference patterns at the representation level\. This observation supports our motivation for introducing group\-specialized modeling and conditional distillation in PAFO, which aim to better capture preference structures that may be underrepresented in the mixed training distribution\.

![Refer to caption](https://arxiv.org/html/2606.07988v1/x9.png) Figure 8: PAFO's performance with different numbers of historical examples in the input.

## Appendix G The Performance with Different History Length

To verify the role of historical interaction data in personalized reward modeling, we vary the number of historical examples included in the input and observe the resulting changes in model performance, as shown in Figure[8](https://arxiv.org/html/2606.07988#A6.F8)\. The results show that as the number of historical examples increases, the overall performance of PAFO continues to improve, confirming the importance of historical behavior information for user preference modeling\. Further analysis shows that the performance gain from increasing the number of historical examples from 0 to 1 is significantly larger than the gain from increasing it from 1 to 2\. These results show that user history provides an effective personalization signal, and PAFO can exploit even limited historical information to better model heterogeneous preferences\.

## Appendix H Additional Related Work

#### Personalized Reward Model\.

Personalized reward modeling aims to learn user\-dependent preference signals from heterogeneous user feedback, rather than fitting a single reward function to all users\[[29](https://arxiv.org/html/2606.07988#bib.bib29),[27](https://arxiv.org/html/2606.07988#bib.bib27)\]\. Existing studies incorporate user information into the RLHF pipeline through explicit user representations or latent preference modeling, enabling individualized alignment\[[14](https://arxiv.org/html/2606.07988#bib.bib14),[27](https://arxiv.org/html/2606.07988#bib.bib27)\]\. Other methods further exploit shared structures across users to improve scalability and few\-shot adaptation, such as personalized reward spaces, reward factorization, and low\-rank reward modeling\[[31](https://arxiv.org/html/2606.07988#bib.bib31),[30](https://arxiv.org/html/2606.07988#bib.bib30),[32](https://arxiv.org/html/2606.07988#bib.bib32)\]\. In addition, some approaches perform personalized reward estimation at the inference or prompting stage through decoding\-time reward guidance or persona\-guided prompting\[[33](https://arxiv.org/html/2606.07988#bib.bib33),[50](https://arxiv.org/html/2606.07988#bib.bib50)\]\. These studies demonstrate the importance of modeling preference heterogeneity for LLM personalization\[[51](https://arxiv.org/html/2606.07988#bib.bib51)\]\. However, they primarily focus on average personalization quality or adaptation efficiency\[[34](https://arxiv.org/html/2606.07988#bib.bib34)\], while paying limited attention to whether utility is fairly distributed across different users or user groups\.

#### Fairness in Preference Learning\.

When user preferences are highly heterogeneous, preference learning based on aggregate objectives may favor mainstream preferences, leading to utility disparities across user groups\[[34](https://arxiv.org/html/2606.07988#bib.bib34),[36](https://arxiv.org/html/2606.07988#bib.bib36),[38](https://arxiv.org/html/2606.07988#bib.bib38),[39](https://arxiv.org/html/2606.07988#bib.bib39),[52](https://arxiv.org/html/2606.07988#bib.bib52),[53](https://arxiv.org/html/2606.07988#bib.bib53)\]\. Datasets such as PRISM reveal substantial individual\- and group\-level variation in alignment feedback, highlighting the importance of whose preferences are represented during alignment\[[35](https://arxiv.org/html/2606.07988#bib.bib35)\]\. From an optimization perspective, MaxMin\-RLHF and Group Robust Preference Optimization improve the performance of under\-served or worst\-performing groups through max\-min or robust objectives\[[34](https://arxiv.org/html/2606.07988#bib.bib34),[36](https://arxiv.org/html/2606.07988#bib.bib36)\]\. Other work studies heterogeneous feedback from a theoretical perspective, discussing the relationship between personalization and preference aggregation and showing that naively averaging user preferences can be problematic\[[38](https://arxiv.org/html/2606.07988#bib.bib38),[39](https://arxiv.org/html/2606.07988#bib.bib39)\]\. More directly related to reward learning, reward fairness regularization and reward\-model fairness benchmarks further suggest that unfairness may already arise during reward modeling, before being amplified by subsequent policy optimization\[[22](https://arxiv.org/html/2606.07988#bib.bib22),[13](https://arxiv.org/html/2606.07988#bib.bib13)\]\. In contrast, we introduce Pareto fairness into personalized reward modeling, with the goal of improving the utility of minority or under\-served preference patterns without degrading the performance of other user groups\.

#### Multi\-Objective Alignment\.

Multi\-objective alignment provides relevant methodological background for modeling complex and potentially conflicting human preferences\[[40](https://arxiv.org/html/2606.07988#bib.bib40),[41](https://arxiv.org/html/2606.07988#bib.bib41),[44](https://arxiv.org/html/2606.07988#bib.bib44),[45](https://arxiv.org/html/2606.07988#bib.bib45)\]\. Prior work formulates LLM alignment as optimization over multiple preference dimensions\[[40](https://arxiv.org/html/2606.07988#bib.bib40),[41](https://arxiv.org/html/2606.07988#bib.bib41)\]\. For example, MODPO, DPA, RiC, MetaAligner, and Panacea study multi\-objective preference modeling from the perspectives of multi\-objective DPO, reward\-space control, dynamic preference adjustment, and Pareto\-style adaptation\[[40](https://arxiv.org/html/2606.07988#bib.bib40),[41](https://arxiv.org/html/2606.07988#bib.bib41),[42](https://arxiv.org/html/2606.07988#bib.bib42),[43](https://arxiv.org/html/2606.07988#bib.bib43),[44](https://arxiv.org/html/2606.07988#bib.bib44)\]\. On the reward\-modeling side, methods such as ArmoRM, HaM, and PARM further investigate multi\-dimensional reward decomposition, hypervolume maximization, and test\-time preference\-aware reward guidance\[[45](https://arxiv.org/html/2606.07988#bib.bib45),[46](https://arxiv.org/html/2606.07988#bib.bib46),[47](https://arxiv.org/html/2606.07988#bib.bib47)\]\. These works are conceptually related to ours, but most of them focus on trade\-offs among predefined objective dimensions\. Our work instead applies the Pareto perspective to group\-level utility fairness in personalized reward modeling, aiming to improve reward model fairness under heterogeneous user preferences without requiring explicit group labels at inference time\.

## Appendix I Broader Impacts

This work aims to improve the fairness and reliability of personalized reward modeling for large language models\. Its potential positive societal impact lies in reducing the performance gap between users with common preference patterns and users whose preferences are less represented in the training distribution\. By encouraging Pareto\-style improvement, PAFO seeks to improve the utility of under\-served preference groups without degrading the performance of other groups, which may contribute to more inclusive and equitable personalized alignment systems\. At the same time, this work also has potential negative societal implications\. If the training data contain biased, noisy, or harmful preference signals, a personalized reward model may learn and preserve such biases more effectively\. In addition, inaccurate group construction during training may lead to misleading fairness conclusions or insufficient protection for truly under\-served users\. More broadly, improved personalization techniques could be misused to optimize model behavior toward manipulative, polarizing, or otherwise harmful user\-specific objectives\. These risks suggest that personalized reward models should be deployed with careful data curation, privacy protection, auditing of group\-level performance, and monitoring for unintended or malicious uses\.
