FSPO: Few-Shot Optimization of Synthetic Preferences Personalizes to Real Users
Summary
FSPO proposes a few-shot preference optimization algorithm for LLM personalization that reframes reward modeling as meta-learning, enabling models to quickly infer a personalized reward function from a small number of labeled user preferences. Through careful construction of a synthetic preference dataset, the method achieves an 87% Alpaca Eval win rate over an unpersonalized baseline with synthetic users and a 70% win rate with real human users.
# FSPO: Few-Shot Optimization of Synthetic Preferences Personalizes to Real Users

Source: https://arxiv.org/html/2502.19312

Sheryl Hsu (Stanford University), Kyle Hsu (Stanford University), Eric Mitchell (Stanford University, OpenAI), Stefano Ermon (Stanford University), Tatsunori Hashimoto (Stanford University), Archit Sharma (Stanford University, Google DeepMind), Chelsea Finn (Stanford University)

###### Abstract

Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), an algorithm for LLM personalization that reframes reward modeling as a meta-learning problem. Under FSPO, an LLM learns to quickly infer a personalized reward function for a user via a few labeled preferences. FSPO also utilizes user description rationalization (RAT) to encourage better reward modeling and instruction following, recovering performance with the oracle user description. Since real-world preference data is challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. To successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, education, and open-ended question answering. We also run a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval win rate in generating responses that are personalized to synthetic users and a 70% win rate with real human users in open-ended question answering.

> "Every story I create, creates me. I write to create myself." — Octavia E. Butler

## 1 Introduction

As large language models (LLMs) increasingly interact with a diverse user base, it becomes important for models to generate responses that align with individual user preferences. People exhibit a wide range of preferences and beliefs shaped by their cultural background, personal experience, and individual values. These diverse preferences are present in human-annotated preference datasets; however, current preference optimization techniques like reinforcement learning from human feedback (RLHF) largely focus on optimizing a *single* model based on preferences aggregated over the entire population. This approach may neglect minority viewpoints, embed systematic biases into the model, and ultimately lead to worse performance compared to personalized models. Can we create language models that adaptively align with the personal preferences of each user instead of the aggregated preferences of all users?

Addressing this challenge requires a shift from modeling a single aggregate reward function to modeling a distribution of reward functions that captures the diversity of human preferences (sorensen2024roadmappluralisticalignment; jang2023personalizedsoupspersonalizedlarge). By doing so, we can enable personalization in language models, allowing them to generate a wide range of responses tailored to individual subpopulations. This approach not only enhances user satisfaction but also promotes inclusivity by acknowledging and respecting the varied perspectives that exist within any user base.
Despite this problem's importance, to our knowledge LLM personalization has yet to be achieved for open-ended question answering with real users.

Figure 1: Overview of FSPO. N previously collected preferences are fed into the LLM along with the current query, allowing the LLM to personalize its response to the query using the past preferences. Furthermore, user description rationalization (e.g., "Synthetic user is family-oriented") is used to predict details about a user from their preferences in natural language, aiding reward modeling and text generation.

In this paper, we introduce Few-Shot Preference Optimization (FSPO), a novel framework designed to model diverse subpopulations in preference datasets to elicit personalization in language models for open-ended question answering. At a high level, FSPO leverages in-context learning to adapt to new subpopulations. This adaptability is crucial for practical applications, where user preferences can be dynamic and multifaceted. Inspired by past work on black-box meta-learning for language modeling (chen2022metalearninglanguagemodelincontext; min2022metaicllearninglearncontext; yu2024metamathbootstrapmathematicalquestions), we fine-tune the model in a meta-learning setup using preference-learning objectives such as IPO (2023arXiv231012036G). To further improve personalized generation, we additionally propose user description rationalization (RAT), which allows the model to leverage additional inference-time compute for better reward modeling and instruction following.

Learning a model that effectively personalizes to real people requires training on a realistic, user-stratified preference dataset. One natural approach is to curate such data from humans, but this is difficult and time-consuming. Instead, we propose instantiating this dataset synthetically, and present careful design decisions inspired by the meta-learning literature (hsu2019unsupervised; yin2019meta) to generate a dataset that is both diverse and structured. To evaluate the efficacy of our approach, we construct a set of three semi-realistic domains to study personalization: (1) Reviews, studying the ability of models to generate reviews of movies, TV shows, and books that are consistent with a user's writing style; (2) Explain Like I'm X (ELIX), studying the ability of models to generate responses that are consistent with a user's education level; and (3) Roleplay, studying the ability of models to generate responses that are consistent with a user's description, with effective transfer to a real human study. Here we find that FSPO achieves an average win rate of 87% over an unpersonalized model. We additionally perform a controlled human study, showing a 70% win rate of FSPO over unpersonalized models. By addressing limitations of existing reward modeling techniques, our work paves the way for more inclusive and personalized LLMs. We believe that FSPO represents a significant step toward models that better serve the needs of all users, respecting the rich diversity of human preferences.

## 2 Related Work

**Personalized learning of preferences.** Prior research has explored personalization through various methods. One approach is distributional alignment, which focuses on matching model outputs to broad target distributions rather than tailoring them to individual user preferences.
For example, some prior work has concentrated on aligning model-generated distributions with desired statistical properties (siththaranjan2024distributionalpreferencelearningunderstanding; meister2024benchmarkingdistributionalalignmentlarge; melnyk2024distributionalpreferencealignmentllms), yet it does not explicitly optimize for individual preference adaptation. Another strategy involves explicitly modeling a distribution of rewards (lee2024testtimealignmenthypothesisreweighting; poddar2024personalizingreinforcementlearninghuman). However, these methods suffer from sample inefficiency during both training and inference (rafailov2023direct; 2023arXiv231012036G). Additionally, these approaches have limited evaluations: lee2024testtimealignmenthypothesisreweighting focuses solely on reward modeling, while poddar2024personalizingreinforcementlearninghuman tests with a very limited number of artificial users (e.g., a helpfulness user and an honesty user). Other works have investigated personalization in multiple-choice questions, such as GPO (zhao2024grouppreferenceoptimizationfewshot). Although effective in structured survey settings, these methods have not been validated for open-ended personalization tasks. Similarly, shaikh2024showdonttellaligning explores personalization via explicit human corrections, but relying on such corrections is expensive and often impractical to scale. Finally, several datasets exist for personalization, such as Prism (kirk2024prismalignmentdatasetparticipatory) and Persona Bench (castricato2024personareproducibletestbedpluralistic). Neither of these datasets demonstrates that policies trained on these benchmarks lead to effective personalization. Unlike these prior works, which study personalization based on human values and controversial questions, we instead study more general questions that a user may ask.

**Algorithms for preference learning.** LLMs are typically fine-tuned via supervised next-token prediction on high-quality responses and later refined with human preference data (casper2023open; ouyang2022training). This process can use on-policy reinforcement learning methods like REINFORCE (NIPS1999_464d828b) or PPO (2017arXiv170706347S), which optimize a reward model under a KL constraint. Alternatively, supervised fine-tuning may be applied to a curated subset of preferred responses (dubois2024alpacafarm) or iteratively to preferred completions as in ReST (gulcehre2023reinforced). Other methods, such as DPO (rafailov2023direct), IPO (2023arXiv231012036G), and KTO (HALOs2024), learn directly from human preferences without an explicit reward model, with recent work exploring iterative preference modeling applications (2024arXiv240110020Y).

**Black-box meta-learning.** FSPO is an instance of black-box meta-learning, which has been studied in a wide range of domains spanning image classification (santoro2016oneshotlearningmemoryaugmentedneural; mishra2018simpleneuralattentivemetalearner), language modeling (chen2022metalearninglanguagemodelincontext; min2022metaicllearninglearncontext; yu2024metamathbootstrapmathematicalquestions), and reinforcement learning (duan2016rl; wang2016learning). Black-box meta-learning is characterized by processing task contexts and queries with generic sequence operations like recurrence or self-attention, rather than specifically designed adaptation mechanisms.
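To make the few-shot conditioning of Figure 1 concrete, the sketch below shows one way the N previously collected preferences and the current query could be serialized into a single chat context. The chat format, the dataclass, and the `build_fspo_messages` helper are illustrative assumptions, not the authors' prompt template.

```python
# Minimal sketch of FSPO-style few-shot conditioning (format is an assumption).
from dataclasses import dataclass

@dataclass
class LabeledPreference:
    prompt: str    # x: a past query from this user
    chosen: str    # y_w: the response the user preferred
    rejected: str  # y_l: the response the user did not prefer

def build_fspo_messages(preferences, query):
    """Pack N labeled preferences and the current query into one chat context."""
    context_lines = []
    for i, p in enumerate(preferences, start=1):
        context_lines.append(
            f"Preference {i}:\n"
            f"  Question: {p.prompt}\n"
            f"  Preferred answer: {p.chosen}\n"
            f"  Dispreferred answer: {p.rejected}"
        )
    system = (
        "You will see a few of this user's past preferences. "
        "Infer what they like and answer the new question in a way they would prefer.\n\n"
        + "\n\n".join(context_lines)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": query},
    ]

# Example usage with a single few-shot preference and a new query.
messages = build_fspo_messages(
    [LabeledPreference("Explain gravity.",
                       "Gravity pulls things toward each other...",
                       "F = GmM/r^2, where G is the gravitational constant...")],
    query="Explain how rainbows form.",
)
```

Presumably the same few-shot context is used both at training time, where a preference objective scores the preferred and dispreferred responses, and at inference time, where the model generates a personalized response.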
## 3 Preliminaries and Notation

Preference fine-tuning algorithms, such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), typically involve two main stages (ouyang2022training; 2022arXiv220302155O): supervised fine-tuning (SFT) and preference optimization (DPO/RLHF). First, a pre-trained model is fine-tuned on high-quality data from the target task using SFT. This process produces a reference model, denoted as $\pi_{\text{ref}}$. The purpose of this stage is to bring the responses from a particular domain in distribution with supervised learning. To further refine $\pi_{\text{ref}}$ according to human preferences, a preference dataset $\mathcal{D}_{\text{pref}}=\{(\mathbf{x}^{(i)},\mathbf{y}_{w}^{(i)},\mathbf{y}_{l}^{(i)})\}$ is collected. In this dataset, $\mathbf{x}^{(i)}$ represents a prompt or input context, $\mathbf{y}_{w}^{(i)}$ is the preferred response, and $\mathbf{y}_{l}^{(i)}$ is the less preferred response. These responses are typically sampled from the output distribution of $\pi_{\text{ref}}$ and are labeled based on human feedback.

Most fine-tuning pipelines assume the existence of an underlying reward function $r^{*}(\mathbf{x},\cdot)$ that quantifies the quality of responses. A common approach to modeling human preferences is the Bradley-Terry (BT) model (bradleyterry1952preferences), which expresses the probability of preferring response $\mathbf{y}_{1}$ over $\mathbf{y}_{2}$, given a prompt $\mathbf{x}$, as:

$$p^{*}(\mathbf{y}_{1}\succ\mathbf{y}_{2}\mid\mathbf{x})=\frac{e^{r^{*}(\mathbf{x},\mathbf{y}_{1})}}{e^{r^{*}(\mathbf{x},\mathbf{y}_{1})}+e^{r^{*}(\mathbf{x},\mathbf{y}_{2})}} \tag{1}$$

Here, $p^{*}(\mathbf{y}_{1}\succ\mathbf{y}_{2}\mid\mathbf{x})$ denotes the probability that $\mathbf{y}_{1}$ is preferred over $\mathbf{y}_{2}$ given $\mathbf{x}$.

The objective of preference fine-tuning is to optimize the policy $\pi_{\theta}$ to maximize the expected reward $r^{*}$. However, directly optimizing $r^{*}$ is often impractical due to model limitations or noise in reward estimation. Therefore, a reward model $r_{\phi}$ is trained to approximate $r^{*}$. To prevent the fine-tuned policy $\pi_{\theta}$ from deviating excessively from the reference model $\pi_{\text{ref}}$, a Kullback-Leibler (KL) divergence constraint is imposed. This leads to the following fine-tuning objective:

$$\max_{\pi}\;\mathbb{E}\left[r^{*}(\mathbf{x},\mathbf{y})\right]-\beta\,D_{\text{KL}}\left(\pi\parallel\pi_{\text{ref}}\right) \tag{2}$$

In this objective, the regularization term weighted by $\beta$ controls how much $\pi_{\theta}$ diverges from $\pi_{\text{ref}}$, based on the reverse KL divergence constraint. This constraint ensures that the updated policy remains close to the reference model while improving according to the reward function.
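A standard identity in the preference-learning literature, included here only to connect Eq. (1) to the reward-modeling objective below, is that the Bradley-Terry probability is a logistic function of the reward margin:

```latex
p^{*}(\mathbf{y}_{1}\succ\mathbf{y}_{2}\mid\mathbf{x})
  = \frac{e^{r^{*}(\mathbf{x},\mathbf{y}_{1})}}{e^{r^{*}(\mathbf{x},\mathbf{y}_{1})}+e^{r^{*}(\mathbf{x},\mathbf{y}_{2})}}
  = \frac{1}{1+e^{-\left(r^{*}(\mathbf{x},\mathbf{y}_{1})-r^{*}(\mathbf{x},\mathbf{y}_{2})\right)}}
  = \sigma\!\left(r^{*}(\mathbf{x},\mathbf{y}_{1})-r^{*}(\mathbf{x},\mathbf{y}_{2})\right)
```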
**Reward model training.** To fine-tune the large language model (LLM) policy $\pi_{\theta}(\mathbf{y}\mid\mathbf{x})$, the Bradley-Terry framework allows for either explicitly learning a reward model $r_{\phi}(\mathbf{x},\mathbf{y})$ or directly optimizing preferences. Explicit reward models are trained using the following classification objective:

$$\max_{\phi}\;\mathbb{E}_{\mathcal{D}_{\text{pref}}}\left[\log\sigma\left(r_{\phi}(\mathbf{x},\mathbf{y}_{w})-r_{\phi}(\mathbf{x},\mathbf{y}_{l})\right)\right] \tag{3}$$

where $\sigma$ is the logistic function, used to map the difference in rewards to a probability. Alternatively, contrastive learning objectives such as Direct Preference Optimization (rafailov2023direct) and Implicit Preference Optimization (2023arXiv231012036G) utilize the policy's log-likelihood $\log\pi_{\theta}(\mathbf{y}\mid\mathbf{x})$ as an implicit reward:

$$r_{\theta}(\mathbf{x},\mathbf{y})=\beta\log\big(\pi_{\theta}(\mathbf{y}\mid\mathbf{x})/\pi_{\text{ref}}(\mathbf{y}\mid\mathbf{x})\big) \tag{4}$$

This approach leverages the policy's log probabilities to represent rewards, thereby simplifying the reward learning process.

## 4 The Few-Shot Preference Optimization (FSPO) Framework

Figure 2: User Description Rationalization (RAT). Prediction is a two-stage process: first predicting a (synthetic) user description from the few-shot preferences and next predicting the response. The model is fine-tuned with a reward of how close the generated user description is to the gold user description.
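To make Eq. (4) concrete in the few-shot setting described above, here is a rough sketch of a preference-learning update on the implicit reward, where the context already contains a user's few-shot preferences plus the query. The sequence log-probability helper, the Hugging-Face-style `.logits` interface, and the choice of a DPO-style log-sigmoid loss (the paper reports fine-tuning with objectives such as IPO) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, context_ids, response_ids):
    """Sum of token log-probs of `response_ids` given `context_ids`.
    Assumes 1-D LongTensors and an HF-style model returning `.logits`."""
    input_ids = torch.cat([context_ids, response_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits[0, :-1]          # logits at position i predict token i+1
    logps = torch.log_softmax(logits, dim=-1)
    start = context_ids.shape[-1] - 1                 # positions that predict the response tokens
    rows = logps[start:start + response_ids.shape[-1]]
    return rows.gather(-1, response_ids.unsqueeze(-1)).sum()

def preference_loss(policy, reference, context_ids, chosen_ids, rejected_ids, beta=0.1):
    """DPO-style loss on the implicit reward r_theta = beta * log(pi_theta / pi_ref).

    Because `context_ids` holds the user's N few-shot preferences and the query,
    minimizing this loss meta-trains the model to infer a per-user reward in context.
    (IPO would instead regress the same log-ratio margin toward a fixed target.)
    """
    with torch.no_grad():
        ref_chosen = sequence_logprob(reference, context_ids, chosen_ids)
        ref_rejected = sequence_logprob(reference, context_ids, rejected_ids)
    pol_chosen = sequence_logprob(policy, context_ids, chosen_ids)
    pol_rejected = sequence_logprob(policy, context_ids, rejected_ids)

    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin)
```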
## Similar Articles

YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical Reasoning
This paper introduces YFPO, a neuron-guided preference optimization framework that uses internal activation signals to improve mathematical reasoning in large language models.
$\xi$-DPO: Direct Preference Optimization via Ratio Reward Margin
This paper introduces xi-DPO, a novel preference optimization method that reformulates the objective to minimize distance to optimal ratio reward margins, addressing hyperparameter tuning challenges in SimPO. Experimental results show that xi-DPO outperforms existing methods on open benchmarks.
Self-Supervised Prompt Optimization
This paper introduces Self-Supervised Prompt Optimization (SPO), a framework that optimizes prompts for LLMs without external references by using output comparisons, significantly reducing costs and data requirements.
Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures
This paper introduces Test-Time Personalization (TTP), a framework that improves LLM personalization by scaling inference-time computation through candidate sampling and reward-based selection. It diagnoses failure modes in standard reward models and proposes a probabilistic personalized reward model to mitigate them.
Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training
This paper analyzes spurious correlation learning in preference optimization methods like DPO, identifying mechanisms such as mean spurious bias and causal-spurious leakage. It proposes 'tie training' using equal-utility preference pairs as a mitigation strategy to reduce reliance on spurious features without degrading causal learning.