Spectral Souping: A Unified Framework for Online Preference Alignment
Summary
This paper introduces Spectral Souping, a framework for efficiently aligning LLMs with individual user preferences by discovering a universal spectral representation that enables merging of specialized policies at inference time without costly retraining.
# Spectral Souping: A Unified Framework for Online Preference Alignment
Source: [https://arxiv.org/html/2605.20408](https://arxiv.org/html/2605.20408)
Corresponding author: yinlamchow@google.com
Guy Tennenholtz (Google Research), Ted Yun (Google DeepMind), James Harrison (Google DeepMind), Arthur Gretton (Google DeepMind), Andre Barreto (Google DeepMind), Bo Dai (Google DeepMind)
###### Abstract
Reinforcement Learning from Human Feedback (RLHF) effectively aligns Large Language Models (LLMs) with aggregate human preferences but often fails to address the diverse and conflicting needs of individual users. To overcome this issue, we introduce *Spectral Souping*, a unified framework for efficient, online preference alignment. Our core contribution is the discovery of a universal spectral representation within LLMs, which is proven to be highly amenable to model merging. This theoretical insight enables a two-phase methodology: we first learn a basis of specialized policies offline, each focused on a distinct, fine-grained preference dimension. An online adaptation algorithm then efficiently "soups" these policies at inference time, by merging either their outputs or their parameters, enabling rapid model adaptation without costly online retraining w.r.t. tailored preference rewards. Experiments on online preference alignment benchmarks demonstrate that our method achieves significant performance improvements over existing state-of-the-art approaches, presenting a scalable and computationally efficient solution for dynamically adapting LLMs to individual user preferences.
###### keywords:
personalization, preference alignment, model souping, LLM online adaptation, RLHF
## 1 Introduction
Recent advancements in LLMs have demonstrated remarkable success in aligning with human preferences through techniques like RLHF [ouyang2022training] and Direct Preference Optimization (DPO) [rafailov2023direct]. However, these methods, which rely on a unified reward from aggregated feedback, face significant limitations. The core issue is that a one-size-fits-all approach fails to account for the diverse and often conflicting needs of individual users, which stem from differences in backgrounds and contexts. This gap between generalized and specialized preferences highlights a critical challenge: aligning LLMs with individual preferences without incurring the substantial data collection and computational costs of fine-tuning a separate model for each user.
Our work introduces *Spectral Souping*, a novel framework for online personalized preference alignment of LLMs that overcomes these limitations. Unlike traditional methods, which require separate and costly fine-tuning, our method handles diverse and varying user preferences efficiently. Our core contribution lies in the discovery of a (universal) spectral representation in the context of the language Markov Decision Process (MDP), where the LLM policy maximizes the user's preference-driven reward. This observation shows that the logits of various personalized LLM policies do not exist in an arbitrary space but rather in a structured latent space defined by the MDP's spectral features. It implies that these logits can be represented as a linear combination of a small number of basis logit functions, each corresponding to a policy that aligns with a distinct preference dimension.
This theoretical insight underpins our two-phase LLM-adaptation methodology. An offline phase trains the aforementioned basis of specialized policies. The online phase then dynamically combines these basis policies at inference time to generate responses tailored to user preferences, thereby obviating the need for costly per-user fine-tuning. The resulting framework is highly scalable and can achieve state-of-the-art performance on online preference alignment benchmarks. Critically, our discovery of a unified spectral representation enables the derivation of provable sub-optimality bounds for this policy merging approach. This is a significant advancement over prior techniques, which were largely heuristic and lacked such formal guarantees. In particular, we show that our spectral souping method achieves performance arbitrarily close to that of a fully fine-tuned "tailored" policy, thus complementing its empirical efficacy with a rigorous theoretical foundation.
The rest of this paper is organized as follows. Section [2](https://arxiv.org/html/2605.20408#S2) provides background on the language MDP, RLHF, and the online preference alignment problem. Section [3](https://arxiv.org/html/2605.20408#S3) details our theoretical discovery of the spectral representation and its properties related to LLM preference alignment. Section [4](https://arxiv.org/html/2605.20408#S4) presents our two-phase methodology, including the offline training of specialized policies and the online souping algorithm. Section [5](https://arxiv.org/html/2605.20408#S5) describes our experimental setup and presents our results, demonstrating the efficacy of spectral souping. Finally, Section [6](https://arxiv.org/html/2605.20408#S6) reviews related work on LLM adaptation, and Section [7](https://arxiv.org/html/2605.20408#S7) concludes our work and discusses future directions.
## 2 Preliminaries
We first introduce the basic MDP terminology of language modeling and then formulate the problem of online preference alignment.
### 2.1 The Language MDP and RLHF w.r.t. an Individual Preference
The auto-regressive generation of a token sequence by an LLM can be modeled as an MDP. The state $s_t$ at timestep $t$ is the sequence of tokens generated so far, $s_t=(a_0,a_1,\dots,a_{t-1})$. The action $a_t$ at timestep $t$ is the next token to be generated, chosen from a finite vocabulary $\mathcal{A}$. The state transition is deterministic, i.e., given state $s_t$ and action $a_t$, the next state is simply their concatenation: $s_{t+1}=\text{concat}(s_t,a_t)$. The policy $\pi(a_t|s_t)$ is a conditional LLM that we aim to optimize; it gives the probability of generating token $a_t$ given the preceding sequence $s_t$. The goal is to find an optimal policy $\pi$ that solves the following maximum entropy (soft) RL problem:

$$\max_{\pi}\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}r(s_t,a_t)-\beta D_{KL}(\pi(\cdot|s_t)\,\|\,\pi_{\text{ref}}(\cdot|s_t))\right],\tag{1}$$

where $r(s_t,a_t)$ is a reward function that scores the quality of generating token $a_t$ after sequence $s_t$ (this can be based on human preference feedback or other quality metrics), $\pi_{\text{ref}}$ is a reference LLM, and $D_{KL}(\pi\,\|\,\pi_{\text{ref}})=\sum_{a\in\mathcal{A}}\pi(a|s)\log\frac{\pi(a|s)}{\pi_{\text{ref}}(a|s)}$ is the Kullback-Leibler (KL) divergence that penalizes the policy $\pi$ for deviating too far from the reference model $\pi_{\text{ref}}$. The temperature parameter $\beta>0$ controls the strength of this regularization.

Given the deterministic transitions $s'=(s,a)$, and the fact that the trajectory terminates at the final time step $T$, the unique fixed-point solution to this maximum entropy RL problem defines the optimal policy $\pi^*$ in terms of the optimal Q-value function, $Q(s,a)$, that satisfies the soft Bellman backup, and the optimal value function, $V(s)$, that serves as normalization [nachum2017bridging]:

$$\pi^*(a|s)=\pi_{\text{ref}}(a|s)\exp\left(\frac{Q(s,a)-V(s)}{\beta}\right),\quad\forall s,a,\tag{2}$$

$$Q(s,a)=r(s,a)+V(s'),\tag{3}$$

$$V(s)=\beta\log\sum_{a\in\mathcal{A}}\pi_{\text{ref}}(a|s)\exp\left(\frac{Q(s,a)}{\beta}\right),\tag{4}$$

where the Bellman backup in ([3](https://arxiv.org/html/2605.20408#S2.E3)) corresponds to deterministic transitions. This shows that the logits of the optimal policy are obtained by adding the optimal Q-values, scaled by $1/\beta$, to the logits of the reference policy.
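To make Equations (2)-(4) concrete, the following is a minimal sketch of the backward soft Bellman recursion on a tiny deterministic token MDP. The function name `soft_backup`, the two-token vocabulary, the uniform reference policy, and the toy reward are all illustrative assumptions, not part of the paper; the sketch also assumes $V=0$ at the terminal step.

```python
import math
from itertools import product

def soft_backup(vocab, horizon, reward, ref_prob, beta):
    """Backward soft Bellman recursion on a deterministic token MDP.

    States are tuples of tokens. V is assumed to be zero at the terminal
    step `horizon`. Returns dicts Q[(s, a)], V[s], and pi[(s, a)].
    """
    V, Q, pi = {}, {}, {}
    # Terminal states collect no further reward.
    for s in product(vocab, repeat=horizon):
        V[s] = 0.0
    for t in range(horizon - 1, -1, -1):
        for s in product(vocab, repeat=t):
            for a in vocab:
                # Eq. (3): deterministic transition s' = concat(s, a).
                Q[(s, a)] = reward(s, a) + V[s + (a,)]
            # Eq. (4): soft maximum under the reference policy.
            z = sum(ref_prob(s, a) * math.exp(Q[(s, a)] / beta) for a in vocab)
            V[s] = beta * math.log(z)
            for a in vocab:
                # Eq. (2): reference policy reweighted by the soft advantage.
                pi[(s, a)] = ref_prob(s, a) * math.exp((Q[(s, a)] - V[s]) / beta)
    return Q, V, pi

# Hypothetical toy problem: two tokens, uniform reference, reward 1 for token "b".
VOCAB = ("a", "b")
Q, V, pi = soft_backup(
    VOCAB,
    horizon=2,
    reward=lambda s, a: 1.0 if a == "b" else 0.0,
    ref_prob=lambda s, a: 0.5,
    beta=1.0,
)
```

In this toy problem the resulting policy at the empty prefix is a proper distribution that shifts mass toward the rewarded token, which is the behavior Equation (2) describes. Exhaustive enumeration is only feasible for toy vocabularies and horizons.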
### 2.2 Online Preference Alignment as a Multi-objective MDP
Adapting LLMs to satisfy diverse and often conflicting user preferences presents a complex multi-objective optimization challenge. Standard RLHF treats the LLM's generation process as an MDP with a single reward. In online preference alignment, by contrast, distinct user preferences such as conciseness or factual accuracy are represented with a *multi-objective MDP* that involves a preference reward vector, $\mathbf{r}(s,a)=(r_1(s,a),\ldots,r_K(s,a))\in\mathbb{R}^K$, where each of the $K$ components corresponds to a preference dimension. Suppose that the preference of any new user can be modeled as a linear combination of these multifaceted preference attributes, i.e., $r_{\mathbf{w}}=\sum_{k=1}^{K}w_k r_k$, w.r.t. a user-specific preference vector $\mathbf{w}=(w_1,\ldots,w_K)\in\Delta^K$ that lies in the $K$-dimensional simplex and characterizes the importance the user places on the corresponding base rewards. A typical approach for learning a tailored LLM is then via RLHF [kirk2023personalisation,das2024active], i.e., $\max_{\pi}\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}r_{\mathbf{w}}(s_t,a_t)-\beta D_{KL}(\pi(\cdot|s_t)\,\|\,\pi_{\text{ref}}(\cdot|s_t))\right]$, which optimizes a policy $\pi^*_{\mathbf{w}}$ from the feedback signal of a specific reward model $r_{\mathbf{w}}$. However, in general this preference vector $\mathbf{w}$ is not revealed to the agent prior to online interactions.
Estimating this vector when concurrently training a corresponding tailored policy via RLHF can be challenging \(e\.g\., beyond policy optimization one may require advanced exploration strategies to uncover such user preferences during RL\), especially when this procedure is only run for a limited number of steps \(e\.g\., during online adaptation\)\. Alternatively, training a contextual agent for every possible preference vector𝐰\\mathbf\{w\}can also be computationally expensive and impractical for most real\-world applications\.
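As a small illustration of the linear preference model $r_{\mathbf{w}}=\sum_k w_k r_k$, the sketch below combines two base rewards under two different simplex weight vectors. The helper `personalized_reward`, the attribute names, and all numeric scores are hypothetical and chosen only to show how two users can rank the same candidates differently.

```python
import numpy as np

def personalized_reward(base_rewards, w):
    """Combine K base rewards (columns) with a simplex weight vector w."""
    w = np.asarray(w, dtype=float)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("w must lie on the probability simplex")
    return np.asarray(base_rewards, dtype=float) @ w

# Hypothetical scores of two candidate responses on K = 2 attributes
# (columns: conciseness, factual accuracy).
base = np.array([[0.9, 0.2],
                 [0.3, 0.8]])
r_user_a = personalized_reward(base, [0.8, 0.2])  # user who values conciseness
r_user_b = personalized_reward(base, [0.2, 0.8])  # user who values accuracy
```

The two users prefer different candidates even though the base rewards are shared, which is why a single aggregate reward cannot serve both.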
## 3 A Spectral Representation for Online Preference Alignment
This section investigates the parameterization of the optimal value function within the class of language MDPs introduced in Section [2.1](https://arxiv.org/html/2605.20408#S2.SS1). Consider the reference LLM representation $\psi(s)\in\mathbb{R}^d$, under which the reference policy can be expressed as $\pi_{\text{ref}}(a|s)=\exp(\psi(s)^{\top}\nu_{\text{ref}}(a))/\int_{b\in\mathcal{A}}\exp(\psi(s)^{\top}\nu_{\text{ref}}(b))\,db$ with the corresponding action token embedding $\nu_{\text{ref}}(a)\in\mathbb{R}^d$. Our primary goal is to identify the conditions under which this reference LLM feature is also a *spectral representation* that permits a linear parameterization of the optimal Q-function defined in Equation ([3](https://arxiv.org/html/2605.20408#S2.E3)). To facilitate this analysis, we introduce two technical assumptions:
###### Assumption 1 (Linear Reward Representation).
Given sufficiently expressive features $\psi$ derived from the reference LLM, any reward function of the language MDP can be linearly represented by these features:
$$r(s,a)=\psi((s,a))^{\top}\nu_{\text{r}},\quad\text{for some weight vector }\nu_{\text{r}}.\tag{5}$$
###### Assumption 2 ($L$-step Decodability).
The language MDP induced by the reference LLM is $L$-step decodable for some integer $L>0$, meaning that its trajectory distribution depends only on its most recent $L$-step history, i.e., the distribution of the $h$-step trajectory $\tau_h$ is conditioned only on the sub-sequence $\tau_{h-L+1:h}=(s_{h-L+1},a_{h-L+1},\dots,s_h)$.
The linear reward assumption is justified by the powerful representational capacity of the reference LLM. While the true reward dynamics may be arbitrarily complex, the LLM-generated features $\psi(s)$ are rich enough to represent the underlying semantics, allowing the reward itself to be modeled as a simple linear function. The $L$-step decodability assumption is motivated by the architecture of the transformer-based reference policy. These models operate on a fixed-length context window, meaning their outputs are conditioned only on the most recent $L$ tokens. Both conditions align our model with the realistic computational constraints of the LLM. Under them, we first state our main technical result, which characterizes the optimal Q-function of any reward function that satisfies Assumption [1](https://arxiv.org/html/2605.20408#Thmassumption1).
###### Lemma 1.
For any language MDP that satisfies Assumption [1](https://arxiv.org/html/2605.20408#Thmassumption1) and Assumption [2](https://arxiv.org/html/2605.20408#Thmassumption2), its optimal Q-function from Equation ([3](https://arxiv.org/html/2605.20408#S2.E3)) can be linearly parameterized with the reference LLM logit feature $\psi$, i.e., there exists a vector $\nu_{\beta,r,\text{ref}}\in\mathbb{R}^d$ that depends on the temperature $\beta$, reward $r$, and reference LLM $\pi_{\text{ref}}$ such that
$$Q^*(s,a)=\psi((s,a))^{\top}\nu_{\beta,r,\text{ref}},\quad\forall s,a.\tag{6}$$
Lemma [1](https://arxiv.org/html/2605.20408#Thmlemma1) reveals a non-trivial, crucial property of the language MDP: the reference LLM's logit feature, $\psi$, acts as a universal *spectral representation*. This representation allows the optimal Q-function for any preference-driven reward to be linearly parameterized. Consequently, for a set of $K$ distinct preference attributes $\{r_1(s,a),\ldots,r_K(s,a)\}$, their corresponding optimal Q-functions $\{Q^*_1(s,a),\ldots,Q^*_K(s,a)\}$ can all be expressed as $Q^*_k(s,a)=\psi((s,a))^{\top}\nu_k$, where each $\nu_k\in\mathbb{R}^d$ is a vector in the spectral space. This insight directly motivates our *Spectral Soup* policy architecture. The model employs a shared LLM feature extractor that produces $\psi$, while integrating multiple lightweight adapters at the output logit layer, each specialized to learn a basis Q-function $Q^*_k$. A souped policy, $\tilde{\pi}_{\lambda}$, is then constructed by linearly combining these Q-functions with a mixture vector $\lambda=(\lambda_1,\ldots,\lambda_K)\in\mathbb{R}^K$:

$$\tilde{\pi}_{\lambda}(a|s)\propto\pi_{\text{ref}}(a|s)\cdot\exp\left(\sum_{k=1}^{K}\lambda_k Q^*_k(s,a)/\beta'\right),\quad\beta\sum_k|\lambda_k|\leq\beta'.\tag{7}$$

Here, $\beta'>0$ acts as a temperature parameter, normalizing the logit mixture's magnitude, and is constrained to maintain temperature consistency with the optimal policies. Mathematically, when the number of basis functions is comparable to the spectral dimension ($K\approx d$), parameterizing the personalized policy via the mixture vector $\lambda\in\mathbb{R}^K$ is equivalent to learning a single spectral feature vector $\nu\in\mathbb{R}^d$. However, our approach of mixing Q-functions offers two advantages. First, the basis of Q-functions is interpretable, as each $Q^*_k$ corresponds to a tangible preference attribute, whereas the underlying spectral representation often is not. This allows us to select a small, relevant subset of basis functions ($K\ll d$) that spans most preference-alignment needs, reducing the learning problem's dimensionality in practice. Second, this framework is flexible: one can easily modify the basis by adding or removing specialized Q-functions to accommodate new preference attributes. Altering the spectral representation, however, is difficult as it is an intrinsic property of the reference LLM.
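The souped policy in Equation (7) can be sketched for a single state as follows. This is an illustrative sketch only: the function name `souped_policy`, the array layout (one row of `basis_q` per basis Q-function), and the toy numbers are assumptions, and `ref_logits` stands for $\log\pi_{\text{ref}}(\cdot|s)$ up to an additive constant.

```python
import numpy as np

def souped_policy(ref_logits, basis_q, lam, beta, beta_prime):
    """Eq. (7): pi(a|s) is proportional to pi_ref(a|s) * exp(sum_k lam_k Q_k(s,a) / beta_prime)."""
    lam = np.asarray(lam, dtype=float)
    # Temperature-consistency constraint from Eq. (7).
    if beta * np.abs(lam).sum() > beta_prime + 1e-12:
        raise ValueError("constraint beta * sum|lam_k| <= beta_prime violated")
    mixed_q = lam @ np.asarray(basis_q, dtype=float)           # shape: (num_tokens,)
    logits = np.asarray(ref_logits, dtype=float) + mixed_q / beta_prime
    logits = logits - logits.max()                             # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical toy state with 3 tokens and K = 2 basis Q-functions.
ref_logits = np.zeros(3)                    # uniform reference policy
basis_q = np.array([[1.0, 0.0, 0.0],        # Q_1 favors token 0
                    [0.0, 1.0, 0.0]])       # Q_2 favors token 1
probs = souped_policy(ref_logits, basis_q, lam=[0.5, 0.5], beta=1.0, beta_prime=1.0)
```

With equal mixture weights the two favored tokens receive equal probability, and setting all weights to zero recovers the reference policy.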
The central goal is to derive a performance sub-optimality bound for the computationally efficient spectral soup policy. To establish this bound, for any given user preference vector $\mathbf{w}$ we first define the optimal target. Recall that a user's preference $\mathbf{w}$ defines a personalized reward, $r_{\mathbf{w}}(s,a)=\sum_{k=1}^{K}w_k r_k(s,a)$. The optimal policy, $\pi^*_{\mathbf{w}}$, that maximizes the KL-regularized return for this reward has the following closed-form solution: $\pi^*_{\mathbf{w}}(a|s)\propto\pi_{\text{ref}}(a|s)\cdot\exp\left(Q^*_{\mathbf{w}}(s,a)/\beta\right)$, where $Q^*_{\mathbf{w}}$ is the optimal personalized Q-function. The corresponding optimal value function, which represents the maximum performance utility that can be achieved, is $V^*_{\mathbf{w}}(s)=\beta\cdot\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\left[\exp\left(Q^*_{\mathbf{w}}(s,a)/\beta\right)\right]$. Our approach approximates this optimal policy with the spectral soup policy $\tilde{\pi}_{\lambda}$ in Equation ([7](https://arxiv.org/html/2605.20408#S3.E7)), whose logit-mixture weights $\lambda^*$ are a solution of the following constrained optimization problem:

$$V^{\lambda^*}_{\mathbf{w},\beta'}(s):=\max_{\lambda\in\mathbb{R}^K}\left\{\mathbb{E}_{\tilde{\pi}_{\lambda}}\left[\sum_{t=0}^{T-1}r_{\mathbf{w}}(s_t,a_t)-\frac{\beta'}{\sum_k|\lambda_k|}D_{KL}(\tilde{\pi}_{\lambda}(\cdot|s_t)\,\|\,\pi_{\text{ref}}(\cdot|s_t))\,\bigg|\,s_0=s\right]\quad\text{s.t.}\quad\beta\sum_k|\lambda_k|\leq\beta'\right\}.\tag{8}$$

This problem maximizes the same preference reward $r_{\mathbf{w}}$ as in RLHF, balanced by a KL penalty whose regularization strength adapts to the magnitude of the policy mixture. The constraint ensures that the resulting policy's temperature remains within a reasonable range, which is also a necessary condition for our performance guarantees to hold. The following theorem presents our main technical result, establishing the formal guarantees for this approximation.
###### Theorem 1 (Sub-optimality Performance Bounds).
Under Assumptions [1](https://arxiv.org/html/2605.20408#Thmassumption1) and [2](https://arxiv.org/html/2605.20408#Thmassumption2), the spectral soup policy $\tilde{\pi}_{\lambda^*}$ in Equation ([7](https://arxiv.org/html/2605.20408#S3.E7)), whose weights $\lambda^*$ solve Equation ([8](https://arxiv.org/html/2605.20408#S3.E8)), achieves the following guarantees.
1. KL Divergence Bound: The divergence from the true optimal policy $\pi^*_{\mathbf{w}}$ is bounded by:
$$D_{\text{KL}}(\pi^*_{\mathbf{w}}(\cdot|s)\,\|\,\tilde{\pi}_{\lambda^*}(\cdot|s))\leq\frac{1}{\beta'}\left(\mathbb{E}_{\pi^*_{\mathbf{w}}}\|\psi((s,a))\|_2+\mathbb{E}_{\pi_{\text{ref}}}\|\psi((s,a))\|_2\right)\cdot\left\|\frac{\beta'}{\beta}\nu_{\beta,r_{\mathbf{w}},\text{ref}}-\sum_k\lambda^*_k\nu_{\beta,r_k,\text{ref}}\right\|_2.\tag{9}$$
2. Performance Sub-optimality Bound: The gap between the optimal value $V^*_{\mathbf{w}}(s)$ and the value achieved by the spectral soup policy, $V^{\lambda^*}_{\mathbf{w},\beta'}(s)$, is bounded as:
$$\begin{split}0\leq V^*_{\mathbf{w}}(s)&-V^{\lambda^*}_{\mathbf{w},\beta'}(s)\leq\mathbb{E}_{\underline{\pi}}\left[\sum_{t=0}^{T-1}\|\psi((s_t,a_t))\|_2\,\Big|\,s_0=s\right]\cdot\left\|\sum_k\nu_{r_k}\left(w_k-\frac{|\lambda^*_k|}{\sum_k|\lambda^*_k|}\right)\right\|_2\\&+\frac{\beta^2}{\beta'}\sum_k\Delta_k(s)\max\{0,-\lambda^*_k\}+\mathbb{E}_{\pi_{\text{ref}}}\left[\|\psi((s,a))\|_2\right]\cdot\left\|\nu_{\beta,r_{\mathbf{w}},\text{ref}}-\frac{\beta}{\beta'}\sum_k\lambda^*_k\nu_{\beta,r_k,\text{ref}}\right\|_2.\end{split}\tag{10}$$
Here, $\underline{\pi}$ is an auxiliary policy that minimizes the weighted difference of cumulative rewards, i.e., $\underline{\pi}\in\arg\min_{\pi}\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}\psi((s_t,a_t))^{\top}\sum_k|\lambda^*_k|(\nu_{r_{\mathbf{w}}}-\nu_{r_k})\right]$, and $\Delta_k(s):=\log(\overline{M}_k(s)+\underline{M}_k(s)-1)-\log(\overline{M}_k(s)\cdot\underline{M}_k(s))$, where $\overline{M}_k(s)$ and $\underline{M}_k(s)$ are the respective upper and lower bounds of the policy ratio $\pi_k(a|s)/\pi_{\text{ref}}(a|s)$.
This technical result marks a significant advancement for policy "souping" methods, which have largely remained empirically validated heuristics. Our work establishes one of the first formal sub-optimality bounds for such a technique, providing a rigorous theoretical foundation to complement its practical efficacy. Theorem [1](https://arxiv.org/html/2605.20408#Thmtheorem1) provides this guarantee by first bounding the policy approximation error (measured by the KL divergence) between our spectral soup policy and the personalized optimal policy with the *logit approximation error*, $\|\nu_{\beta,r_{\mathbf{w}},\text{ref}}-(\beta/\beta')\sum_k\lambda^*_k\,\nu_{\beta,r_k,\text{ref}}\|_2$, which captures how well the personalized logit-vector can be represented by a linear combination of the basis vectors (normalized by the magnitude of the logit-mixture weights). Furthermore, it decomposes the performance gap into three intuitive error sources: (i) this same logit approximation error; (ii) a *reward approximation error*, $\|\sum_k\nu_{r_k}(w_k-|\lambda^*_k|/\sum_k|\lambda^*_k|)\|_2$, capturing the mismatch between the user's true preference weights $\{w_k\}_{k=1}^{K}$ (which form a simplex vector) and the normalized magnitudes of the learned soup weights $\{\lambda_k\}_{k=1}^{K}$; and (iii) a *penalty for negative logit-mixture weights*. This decomposition reveals a critical insight: if the set of basis logit-vectors is rich enough to span the spectral representation space of all possible personalized logits, the policy approximation error vanishes entirely.
However, the theorem also highlights a limitation, as the overall performance bound is not *tight*; it does not go to zero even when this primary error term is eliminated, due to the remaining reward approximation and penalty terms. This provides a clear direction for future work on tightening these theoretical guarantees.
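The role of the logit approximation error can be illustrated numerically. The sketch below is an assumption-laden toy: the basis vectors $\nu_k$ are random, $\beta=\beta'=1$, and the mixture weights are obtained by unconstrained least squares rather than by solving Equation (8), so it only illustrates the spanning argument, not the theorem's $\lambda^*$. The helper name `logit_approx_error` is hypothetical.

```python
import numpy as np

def logit_approx_error(nu_w, nu_basis, lam, beta, beta_prime):
    """|| nu_w - (beta / beta_prime) * sum_k lam_k nu_k ||_2, with nu_k as rows of nu_basis."""
    return float(np.linalg.norm(nu_w - (beta / beta_prime) * (lam @ nu_basis)))

rng = np.random.default_rng(0)
d, K = 6, 3
nu_basis = rng.normal(size=(K, d))             # hypothetical basis vectors nu_k
lam_true = np.array([0.5, 0.3, 0.2])
nu_in_span = lam_true @ nu_basis               # target inside the span of the basis
nu_off_span = nu_in_span + rng.normal(size=d)  # target perturbed out of the span

# Unconstrained least-squares fit of the mixture weights (assumption: beta = beta_prime = 1).
lam_in, *_ = np.linalg.lstsq(nu_basis.T, nu_in_span, rcond=None)
lam_off, *_ = np.linalg.lstsq(nu_basis.T, nu_off_span, rcond=None)
err_in = logit_approx_error(nu_in_span, nu_basis, lam_in, 1.0, 1.0)
err_off = logit_approx_error(nu_off_span, nu_basis, lam_off, 1.0, 1.0)
```

When the target vector lies in the span of the basis, the error is zero up to floating-point precision; otherwise a residual remains no matter how the weights are chosen.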
## 4 The Spectral Souping Algorithm for Online Preference Alignment
Motivated by the theoretical guarantees in Theorem [1](https://arxiv.org/html/2605.20408#Thmtheorem1), we propose a spectral souping algorithm designed for the rapid and efficient preference alignment of LLMs. The algorithm operates in two phases: an offline training stage followed by an online adaptation stage. In the first phase, we train a set of $K$ specialized policies, $\{\pi_1^*,\dots,\pi_K^*\}$, where each policy is independently optimized for a single reward attribute $r_k$. In the second phase, the logit-mixture vector $\lambda$ of the spectral soup policy $\tilde{\pi}_{\lambda}$ in Equation ([7](https://arxiv.org/html/2605.20408#S3.E7)) is learned online to combine the specialized policies, tailoring the model's behavior to a specific user's preferences. This structured composition enables zero-shot generalization to new users by learning only the low-dimensional mixture vector $\lambda$ at inference time, rather than training a bespoke policy for each user's underlying preference vector $\mathbf{w}$.
The initial phase involves learning a set of $K$ base policies $\{\pi^*_k\}_{k=1}^{K}$. Within the language MDP framework (Section [2.1](https://arxiv.org/html/2605.20408#S2.SS1)), each specialized policy is given by $\pi^*_{\theta_k}(a|s)\propto\pi_{\text{ref}}(a|s)\exp(Q_{\theta_k}(s,a)/\beta)$, implying that, up to the temperature scaling $1/\beta$, its output logits can be expressed as $\text{logit}_{\theta_k}(s,a)=\text{logit}_{\text{ref}}(s,a)+Q_{\theta_k}(s,a)$. Following spectral representation theory, the Q-function $Q_{\theta_k}(s,a)$ can be optimally parameterized using a LoRA module applied to the output logit layer of the reference policy $\pi_{\text{ref}}$. A practical challenge is the lack of granular, token-level rewards. We address this by learning from $K$ offline datasets, $\{\mathcal{B}_k\}_{k=1}^{K}$, where each dataset contains trajectories annotated with binary labels (e.g., "good" vs. "bad") for a specific preference attribute. This reframes the task of learning each specialized policy's parameters, $\theta_k$, as a well-studied offline preference alignment problem. Let $R_{\theta_k}(\tau):=\sum_{t=0}^{T-1}r(s_t,a_t)=\sum_{t=0}^{T-1}Q_{\theta_k}(s_t,a_t)-\sum_{t=1}^{T}V_{\theta_k}(s_t)$ be the cumulative reward over trajectory $\tau$, derived according to Equation ([3](https://arxiv.org/html/2605.20408#S2.E3)). One method, developed from cui2025process, explicitly learns the Q-function, $Q_{\theta_k}(s,a)$, by minimizing a composite objective.
This objective combines two terms. The first is a binary loss that matches trajectory labels, $L_{\text{Binary}}=-\mathbb{E}_{\tau\sim\mathcal{B}_k}[l_k(\tau)\log\sigma(R_{\theta_k}(\tau))+(1-l_k(\tau))\log(1-\sigma(R_{\theta_k}(\tau)))]$. The second is a Gumbel loss [garg2023extreme] that ensures soft Bellman consistency, $L_{\text{Gumbel}}=\mathbb{E}_{(s,a)\sim\mathcal{B}_k}\left[\exp(A_{\theta_k}(s,a)/\beta)-A_{\theta_k}(s,a)/\beta-1\right]$, where $A_{\theta_k}(s,a)=Q_{\theta_k}(s,a)-V_{\theta_k}(s)$. While direct, this approach is computationally expensive because it requires training an auxiliary value model, $V_{\theta_k}$. An alternative method, inspired by rafailov2024r, operates on preference pairs $(w,l)$ derived from $\mathcal{B}_k$, where trajectory $w$ is preferred over trajectory $l$. It optimizes the policy by minimizing the Bradley-Terry logistic loss: $\mathcal{L}_{BT}(\theta_k;\mathcal{B}_k)=-\mathbb{E}_{(w,l)\sim\mathcal{B}_k}[\log\sigma(R_{\theta_k}(w)-R_{\theta_k}(l))]$.
The key advantage of this formulation is that the difference in cumulative rewards simplifies to a sum of log-policy ratios, i.e., $R_{\theta_k}(w)-R_{\theta_k}(l)=\beta\sum_t\log\frac{\pi^*_{\theta_k}(a_{w,t}|s_{w,t})}{\pi_{\text{ref}}(a_{w,t}|s_{w,t})}-\beta\sum_t\log\frac{\pi^*_{\theta_k}(a_{l,t}|s_{l,t})}{\pi_{\text{ref}}(a_{l,t}|s_{l,t})}$, thereby obviating the need for an explicit value function. Although this determines the reward function only up to a state-dependent potential, this ambiguity does not affect the optimal policy. For its computational efficiency, we adopt this second method in our work.
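The reduction of the reward difference to log-policy ratios can be spelled out from Equations (2) and (3). The short derivation below is a sketch under two assumptions that the text does not state explicitly: the value function vanishes at termination, $V_{\theta_k}(s_T)=0$, and both trajectories in a pair start from the same prompt $s_0$.

```latex
% From Eq. (2), per step:
\beta \log \frac{\pi^*_{\theta_k}(a_t|s_t)}{\pi_{\text{ref}}(a_t|s_t)}
  = Q_{\theta_k}(s_t,a_t) - V_{\theta_k}(s_t).

% Summing over t = 0, ..., T-1 and using the definition of R_{\theta_k}(\tau):
\beta \sum_{t=0}^{T-1} \log \frac{\pi^*_{\theta_k}(a_t|s_t)}{\pi_{\text{ref}}(a_t|s_t)}
  = \sum_{t=0}^{T-1} Q_{\theta_k}(s_t,a_t) - \sum_{t=0}^{T-1} V_{\theta_k}(s_t)
  = R_{\theta_k}(\tau) + V_{\theta_k}(s_T) - V_{\theta_k}(s_0).

% With V_{\theta_k}(s_T) = 0 (assumed):
R_{\theta_k}(\tau)
  = \beta \sum_{t=0}^{T-1} \log \frac{\pi^*_{\theta_k}(a_t|s_t)}{\pi_{\text{ref}}(a_t|s_t)}
  + V_{\theta_k}(s_0).
```

Because $V_{\theta_k}(s_0)$ depends only on the shared prompt, it cancels in $R_{\theta_k}(w)-R_{\theta_k}(l)$, leaving only the two sums of log-policy ratios.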
The second phase is online adaptation, via spectral souping, where the policy can be realized through two approaches\. The*explicit*approach directly constructs the policy’s logit function as a linear combination of the pre\-trained specialized and reference logits:logitπ~λ\(s,a\)=\(1−∑kλk\)logitref\(s,a\)\+∑kλklogitθk\(s,a\)\.\\text\{logit\}\_\{\\tilde\{\\pi\}\_\{\\lambda\}\}\(s,a\)=\\left\(1\-\\sum\_\{k\}\\lambda\_\{k\}\\right\)\\text\{logit\}\_\{\\text\{ref\}\}\(s,a\)\+\\sum\_\{k\}\\lambda\_\{k\}\\text\{logit\}\_\{\\theta\_\{k\}\}\(s,a\)\.This follows directly from the definition of the specialized Q\-functions and the spectral soup policy in Equation \([7](https://arxiv.org/html/2605.20408#S3.E7)\)\. Alternatively, the*implicit*approach avoids instantiating a new model by applying rejection sampling to the reference policyπref\\pi\_\{\\text\{ref\}\}\. An action is first sampled from the reference policy,a∼πref\(a\|s\)a\\sim\\pi\_\{\\text\{ref\}\}\(a\|s\), and then accepted ifu≤exp\(∑k=1Kλkβ\(logπθk∗\(a\|s\)−logπref\(a\|s\)\)\)u\\leq\\exp\(\\sum\_\{k=1\}^\{K\}\\frac\{\\lambda\_\{k\}\}\{\\beta\}\(\\log\\pi^\{\*\}\_\{\\theta\_\{k\}\}\(a\|s\)\-\\log\\pi\_\{\\text\{ref\}\}\(a\|s\)\)\), whereuuis drawn from a uniform distribution𝒰\[0,M\(s\)\]\\mathcal\{U\}\[0,M\(s\)\], and the upper bound can be simply set toM\(s\)=exp\(−∑kλkDKL\(πref\(⋅\|s\)\|\|πθk∗\(⋅\|s\)\)/β\)M\(s\)=\\exp\(\-\\sum\_\{k\}\\lambda\_\{k\}\\,D\_\{KL\}\(\\pi\_\{\\text\{ref\}\}\(\\cdot\|s\)\|\|\\pi^\{\*\}\_\{\\theta\_\{k\}\}\(\\cdot\|s\)\)/\\beta\)\. While this implicit souping method can be highly efficient, its performance degrades if the personalized policy diverges significantly from the reference policy, as this leads to a high rejection rate\. For both approaches, the spectral soup weightsλ∈ℝK\\lambda\\in\\mathbb\{R\}^\{K\}, which tailor the policy to a specific user, are learned efficiently from online user feedback\. 
Mirroring the offline phase, we can frame the learning problem using preference optimization, where we minimize a Bradley-Terry loss over weighted preference pairs, $\mathcal{L}_{\text{BT}}(\lambda;\mathcal{B})=-\mathbb{E}_{\mathcal{B}}\left[\log\sigma\left(\sum_{k}\lambda_{k}\left(R_{\theta_{k}}(w)-R_{\theta_{k}}(l)\right)\right)\right]$. Crucially, the learning problem for $\lambda$ reduces to linear logistic regression, allowing for efficient updates via convex optimization methods [hazan2016introduction].
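Because the Bradley-Terry loss above is a logistic regression in $\lambda$ with the per-policy reward margins as features, it can be minimized with plain gradient descent. The sketch below is illustrative only: the function names, learning rate, step count, and toy margins are assumptions, not values from the paper.

```python
import math


def bt_loss_and_grad(lam, margins):
    """Mean Bradley-Terry loss and its gradient with respect to lam.

    margins[i][k] holds R_k(w_i) - R_k(l_i) for preference pair i and basis policy k.
    """
    n = len(margins)
    loss, grad = 0.0, [0.0] * len(lam)
    for x in margins:
        z = sum(l * xk for l, xk in zip(lam, x))
        e = math.exp(-abs(z))
        loss += max(-z, 0.0) + math.log1p(e)  # -log sigmoid(z), computed stably
        sig_neg = e / (1.0 + e) if z >= 0 else 1.0 / (1.0 + e)  # sigmoid(-z)
        for k, xk in enumerate(x):
            grad[k] -= sig_neg * xk
    return loss / n, [g / n for g in grad]


def fit_soup_weights(margins, num_basis, lr=0.5, steps=200):
    """Gradient descent on the convex BT loss, starting from lam = 0."""
    lam = [0.0] * num_basis
    for _ in range(steps):
        _, grad = bt_loss_and_grad(lam, margins)
        lam = [l - lr * g for l, g in zip(lam, grad)]
    return lam


# Toy reward margins (hypothetical) for 4 preference pairs and K = 2 basis policies.
margins = [[1.0, -0.5], [0.8, 0.2], [-0.3, 1.0], [1.2, -1.0]]
lam_hat = fit_soup_weights(margins, num_basis=2)
initial_loss, _ = bt_loss_and_grad([0.0, 0.0], margins)
final_loss, _ = bt_loss_and_grad(lam_hat, margins)
```

Only the $K$ soup weights are updated; the specialized policies enter solely through their precomputed margins.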
## 5 Experiments
To assess our approach’s effectiveness, we conduct empirical evaluations on three realistic LLM online preference alignment benchmarks\. Each experiment involves an offline phase for learning specialized policies and an online phase for personalization and adaptation\.
Our first experiment is on online LLM preference alignment w.r.t. the UltraFeedback dataset [cui2023ultrafeedback], which contains prompts with response pairs annotated with a 4D feature vector (helpfulness, honesty, instruction-following, and truthfulness). We synthesize diverse preferences by ranking responses based on the dot product of their feature vector with a unique weight vector $\mathbf{w}\in\mathbb{R}^{4}$. For the offline phase, we build $K=30$ specialized datasets by creating 30 unique weight vectors, each randomly sampled around the basis attributes. In the online phase, we test generalization against 5 held-out users proxied by unseen public reward models. To increase difficulty, the dataset is filtered to only contentious examples (23,614 train, 401 test) where preferences conflict.

The second experiment focuses on optimizing an LLM’s personalized prompt expansion for text-to-image (T2I) generation in a 5-turn interactive process [nabati2024preference], where an agent generates new textual prompts at each turn and passes them to the environment, which returns an updated 4x4 image slate. The system uses Stable Diffusion XL [podell2023sdxl] for image generation, Gemini 1.5 Flash [team2024gemini] for prompt expansion, and Gemma 2B [team2024gemma] to model utilities. In the offline phase, we generate $K=32$ datasets from over 30,000 simulated rollouts guided by user models with myopic, turn-by-turn preferences; a user rates the best overall image in each column and chooses the highest-scoring column. During the online phase, adaptability is tested against 5 held-out users, each simulated by a pre-trained LTV utility function that models session-level preferences based on criteria like aesthetics or prompt consistency.

Our third experiment is on LLM sleep coaching. Each synthetic user is instantiated by an LLM grounded in a detailed sleep profile obtained from one of 68 real individuals in the LifeSnaps dataset [Yfantidou2022-ay], following the experimental setup in [yun-etal-2025-sleepless]. For the offline phase, we generate three (high, medium, low) preference datasets for each of the Big Five personality dimensions [goldberg1992development] (the International Personality Item Pool (IPIP) version; extraversion, agreeableness, conscientiousness, stability, intellect), resulting in $K=15$ personality types. Each dataset contains 1,000 10-turn conversation pairs, ranked by a reward function specific to that personality. In the online phase, we evaluate performance on 5 simulated users (512 samples each), each represented by a Gemini 1.5 Flash auto-rater that scores conversations using specialized rubrics on coaching tone, user understanding, and intervention quality, mimicking users of different backgrounds and personalities.
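The UltraFeedback preference synthesis described above (ranking a response pair by the dot product of its attribute scores with a user weight vector) can be sketched as follows. The attribute scores, the two weight vectors, and the `is_contentious` criterion are illustrative assumptions; the text does not spell out the exact filtering rule.

```python
ATTRIBUTES = ("helpfulness", "honesty", "instruction_following", "truthfulness")


def score(features, w):
    """Dot product of a response's attribute scores with a user weight vector."""
    return sum(f * wi for f, wi in zip(features, w))


def rank_pair(features_a, features_b, w):
    """Return (winner, loser) labels for one synthetic user."""
    return ("a", "b") if score(features_a, w) >= score(features_b, w) else ("b", "a")


def is_contentious(features_a, features_b, users):
    """Assumed criterion: at least two synthetic users prefer different responses."""
    winners = {rank_pair(features_a, features_b, w)[0] for w in users}
    return len(winners) > 1


# Toy attribute scores (hypothetical), ordered as in ATTRIBUTES.
response_a = [5.0, 2.0, 4.0, 2.0]
response_b = [3.0, 5.0, 3.0, 5.0]
user_helpful = [1.0, 0.0, 0.0, 0.0]    # cares only about helpfulness
user_truthful = [0.0, 0.5, 0.0, 0.5]   # splits weight between honesty and truthfulness
```

Here the two users rank the same pair in opposite orders, which is the kind of conflict the contentious-example filter is meant to retain.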
To evaluate the performance of our spectral souping method, we compare it against a comprehensive suite of baselines. The first is the “bespoke” RLHF [ouyang2022training] agent, which is trained directly on a specific user’s feedback. This approach serves as a practical upper bound for performance but is computationally expensive and does not generalize across different users. To tackle the multi-objective nature of the task, Personalized Soups (P-SOUPS) [jang2023personalized] trains multiple specialized policies, each optimized for a single objective. At inference time, it creates a personalized policy by mixing the model parameters of these specialized policies via a heuristic weighted average. This technique is analogous to our explicit spectral souping approach. We also compare against two decoding-time alignment frameworks, which are conceptually similar to our implicit spectral souping approach. The Personalized Alignment at Decoding-time (PAD) framework [chen2024pad] guides the decoding process by mixing the model’s output distribution based on a preference reward vector. Inspired by successor features, the PAD-SF variant [barreto2018transfer] re-weights the output distribution using a softmax distribution derived from a learnable $K$-dimensional scorer that operates on a spectral representation and mimics the specialized advantage functions. For our experiments, the default souping models are based on the Gemma-V3 [team2025gemma] 4B, 1B, and 270M architectures, implemented with rank-100 LoRA modules on all intermediate layers.
Our experimental results, summarized in Figures [1](https://arxiv.org/html/2605.20408#S5.F1) & [2](https://arxiv.org/html/2605.20408#S5.F2) for Gemma-V3 4B and Figures [4](https://arxiv.org/html/2605.20408#A2.F4) & [5](https://arxiv.org/html/2605.20408#A2.F5) for Gemma-V3 1B (in Appendix [B.3](https://arxiv.org/html/2605.20408#A2.SS3)), validate the efficacy of spectral souping across different domains. A key observation is that implicit and explicit spectral souping methods achieve comparable performance, with the explicit approach holding a slight edge in several settings. This aligns with our hypothesis that both methods stem from the same theoretical foundation described in Section [3](https://arxiv.org/html/2605.20408#S3), but explicit souping is more precise as it avoids the sampling approximations inherent to the implicit method. Crucially, when averaged over all online users, our methods consistently approximate the performance of the computationally expensive, tailored RLHF upper bound, achieving $83\%$ of the optimal performance on UltraFeedback, $88\%$ on T2I, and $72\%$ on sleep coaching, which is consistent with our theoretical sub-optimality bounds. When compared against other baselines, spectral souping demonstrates significant advantages. It surpasses P-SOUPS, even after tuning the hyper-parameters (the P-SOUPS weights) used for averaging, underscoring the necessity of merging policies within the structured spectral representation space rather than an arbitrary parameter space. Similarly, implicit spectral souping is more stable and performs better than the PAD and PAD-SF baselines. This suggests that guiding the policy mixture with optimal Q-function weights in the spectral representation space is more effective than using reward weights (as in PAD) or souping weights heuristically developed via advantage approximations with successor features (as in PAD-SF).
Finally, across all held-out users in both T2I and sleep coaching, spectral souping exhibits faster and more robust online adaptation and is empirically more data-efficient than all baselines, highlighting its capability in real-world scenarios.
Figure 1: Test-time Training Performance of Different Methods: Explicit & Implicit Spectral Souping (SS-Exp & SS-Imp), P-SOUPS, PAD, PAD-SF, RLHF, adapted to 5 different users in the UltraFeedback, T2I Generation, and Sleep Coaching domains. The SS methods (especially SS-Exp) consistently outperform P-SOUPS and the PAD baselines, demonstrating superior performance in online adaptation.

Figure 2: Evaluation Performance of Different Online Adaptation Methods: Explicit & Implicit Spectral Souping (SS-Exp & SS-Imp), P-SOUPS, PAD, PAD-SF, RLHF, adapted to 5 different users in the UltraFeedback, T2I Generation, and Sleep Coaching domains. The superior performance of the SS methods (over the P-SOUPS, PAD, and PAD-SF baselines) also carries over to online evaluation.

Our ablation study investigates the impact of reducing the number of offline specialized policies ($K$) on online adaptation performance, with results presented in the scaling-law curves in Figure [3](https://arxiv.org/html/2605.20408#S5.F3), with analysis conducted across our 3 domains using the Gemma-V3 4B, 1B, and 270M models respectively. To manage the combinatorial complexity of basis selection of specialized policies, we employed a leave-one-out elimination approach, randomly selecting and removing one specialized policy at a time, and averaging the results. The findings reveal several key insights. First, as illustrated in both figures, reducing the number of basis policies consistently degrades online learning performance across all domains, as expected. More importantly, we observe a significant performance drop below a certain threshold for each domain, specifically at $K=7$ for UltraFeedback, $K=5$ for T2I generation, and $K=13$ for sleep coaching. This suggests that a minimal set of specialized policies is required to form a basis that effectively spans the representation space for various preferences.
Second, the size of this minimal basis set appears correlated with the complexity of the problem’s underlying characteristics; domains like UltraFeedback \(4 core attributes\) and T2I \(5 preference categories\) require fewer basis policies than the real\-world sleep coaching domain, which involves more diverse and complex auto\-rater feedback\. Finally, the larger Gemma\-V3 4B model exhibits a more gradual decrease in performance as specialized policies are removed\. This suggests that larger pretrained models capture a more semantically comprehensive spectral representation, making each specialized policy more expressive and the overall basis more robust to reductions in its span\.
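The leave-one-out elimination procedure used in the ablation can be sketched as a loop that removes one randomly chosen specialized policy at a time and averages a performance score per basis size over several removal orders. The `coverage` score below is a hypothetical stand-in for online adaptation performance, and the number of trials is an arbitrary assumption.

```python
import random


def elimination_curve(basis, evaluate, n_trials, rng):
    """Average evaluate(subset) for every basis size K over random removal orders."""
    totals = {k: 0.0 for k in range(1, len(basis) + 1)}
    for _ in range(n_trials):
        remaining = list(basis)
        while remaining:
            totals[len(remaining)] += evaluate(remaining)
            remaining.pop(rng.randrange(len(remaining)))  # drop one policy at random
    return {k: total / n_trials for k, total in totals.items()}


def coverage(policies):
    """Toy score (hypothetical): fraction of 4 preference attributes still covered."""
    return len(set(policies)) / 4.0


# Each toy policy is labeled by the attribute index it specializes in.
toy_basis = [0, 0, 1, 1, 2, 3]
curve = elimination_curve(toy_basis, coverage, n_trials=50, rng=random.Random(0))
```

With a real evaluator in place of `coverage`, the resulting mapping from $K$ to average score is what a scaling curve of this kind plots.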
Figure 3:Scaling Laws of SS\-Exp to illustrate the effect of model size on performance with an increasing number of specialized policies across three domains\. While performance consistently improved with model size, the larger \(4B\) model architecture developed a more comprehensive spectral representation, making its online adaptation agents more robust to basis reductions\.
## 6 Related Work
Our work is situated at the intersection of LLM preference alignment, RL spectral representation, and online adaptation\. We review the key concepts in these areas that relate to our methods below\.
#### LLM Alignment
A dominant approach to aligning LLMs with human preferences is RLHF [christiano2017deep,bakker2022fine], which involves training a reward model to mimic human feedback and then using an algorithm like Proximal Policy Optimization (PPO) [schulman2017proximal] to fine-tune the LLM. Direct Preference Optimization (DPO) has also emerged as a promising alternative that simplifies the above process and reduces computational cost. It reformulates the RLHF problem as a preference-based classification task, optimizing a policy without the need for an explicit reward model. More recently, research has explored decoding-time alignment as a way to avoid the computational cost of model training. This includes methods like Controlled Decoding (CD) [mudgal2023controlled], which uses a value-based scorer to guide generation, ARGS [khanov2024args], which adjusts probabilistic predictions based on online feedback, and DeAL [huang2024deal], which focuses on heuristic-guided searches to meet diverse alignment goals. While these methods are powerful, they often address alignment w.r.t. a single, uniform preference. Our work extends these concepts by focusing on personalized alignment, which considers the diverse preferences of individual users.
#### Spectral Representations for RL
Representation learning is crucial in RL for abstracting complex state and action spaces to facilitate policy optimization. Existing methods for learning these representations utilize various techniques, including reconstruction [watter2015embed,hafner2019dream,fujimoto2023sale], successor features [gershman2012successor,kulkarni2016deep,barreto2017successor], and bisimulation [ferns2014bisimulation,gelada2019deepmdp,zhang2025revisiting]. A particularly effective approach involves using spectral decomposition, which has been explored in works like [mahadevan2007proto] and [ren2022spectral]. These methods often assume that the environment’s transition kernel has a low-rank spectral structure, enabling the use of linear representations for value functions and leading to provably sample-efficient algorithms [jin2019bayesian,yang2020reinforcement]. Leveraging the above theoretical underpinnings, our work develops a universal spectral representation in LLMs, whose existence within language MDPs has been proven. This finding justifies our online adaptation algorithm for personalized preference alignment, which is designed to only modify the projection vectors within this stable spectral representation, rather than fine-tuning the entire model. Our approach is distinct from prior work as it connects spectral representation principles to LLM personalization, a domain where such techniques have not been previously explored in this manner.
#### Online Preference Alignment
Aligning LLMs with online preferences is crucial to address the widely varying tastes of individual users [kirk2023past,feng2024modular,jiang2024can,zhang2024self,zhong2024panacea,wang2024arithmetic]. Existing personalized alignment methods can be grouped into three categories: (i) joint optimization of multi-dimensional preference rewards [zeng2023diversified,li2024dissecting,wang2024arithmetic,zhu2025structured,yang2024aligning,chakraborty2024maxmin,das2024active,zhong2024panacea]; (ii) merging model parameters (via linear interpolation) or outputs (via mixture-of-expert composition) w.r.t. multiple preference dimensions [jang2023personalized,rame2024warp,park2024learning,yang2024model,wan2024knowledge]; (iii) prompt-based methods that use diverse prompts to guide the model toward specific preferences [jafari2024morl,trivedi2025align,hwang2023aligning,min2025prompting,ravichandran2025align], but which require textual preference descriptions. Our work combines the first two approaches: it formulates personalized preference alignment as a multi-objective language MDP, utilizes its universal spectral representation for LLMs, and develops a theoretically grounded online adaptation recipe that unifies both model parameter merging and mixture-of-expert output sampling to effectively handle different trade-offs on conflicting preferences.
## 7 Conclusion
This work introduces spectral souping, a principled framework that addresses the critical challenge of LLM adaptation by discovering a universal spectral representation within the language MDP. This reveals that diverse policies inhabit a structured, low-dimensional latent space, allowing any specialized policy to be effectively represented as a linear combination of a few pre-trained basis policies. This discovery unlocks an efficient two-phase methodology: an offline phase to train a compact set of basis policies, followed by an online inference phase where they are dynamically combined to tailor responses to any user without costly retraining. The efficacy of this approach is validated across several realistic LLM preference-alignment domains where spectral souping consistently outperforms state-of-the-art baselines. For instance, it is significantly more data-efficient than two-stage methods that first infer a tailored reward before applying RLHF, and more stable than alternatives that estimate the occupancy measures (via successor features) of tailored policies. Crucially, unlike prior heuristic approaches, our framework is grounded in formal guarantees, providing sub-optimality bounds that ensure its performance approximates that of a fully fine-tuned LLM. Spectral souping thus offers a scalable, computationally efficient, and theoretically sound solution for online adaptation of LLMs.
This work opens several promising avenues for future research\. One key direction is to learn an optimal spectral representation directly during pre\-training, embedding a universal basis for diverse human preferences into the foundation model itself to make it inherently more adaptable\. Another avenue involves analyzing more sophisticated online adaptation algorithms, such as non\-linear souping techniques or meta\-learning approaches, to infer user needs more rapidly from sparse feedback\. Finally, extending the principles of spectral souping to other modalities \(beyond LLMs\), including personalized text\-to\-image generation, presents an exciting opportunity to generalize this framework, enabling robust personalization across a wide range of creative and practical domains\.
## References
## Appendix A Derivations of Results in Section [3](https://arxiv.org/html/2605.20408#S3)
### A.1 Proof of Lemma [1](https://arxiv.org/html/2605.20408#Thmlemma1)
Given the reference LLM representation $\psi(s)\in\mathbb{R}^{d}$ and an action token embedding $\nu_{\text{ref}}(a)\in\mathbb{R}^{d}$, the reference policy is expressed as a softmax function: $\pi_{\text{ref}}(a|s)=\exp(\psi(s)^{\top}\nu_{\text{ref}}(a))/\int_{b\in A}\exp(\psi(s)^{\top}\nu_{\text{ref}}(b))\,db$. We can also rewrite the inner product using the identity $\psi(s)^{\top}\nu_{\text{ref}}(a)=-\frac{1}{2}\left(\|\psi(s)-\nu_{\text{ref}}(a)\|^{2}-\|\psi(s)\|^{2}-\|\nu_{\text{ref}}(a)\|^{2}\right)$. Substituting this into the reference policy equation reveals a Gaussian kernel: $\pi_{\text{ref}}(a|s)\propto\exp\left(-\frac{1}{2}\|\psi(s)-\nu_{\text{ref}}(a)\|^{2}\right)$ (up to the norm factors $\exp(\frac{1}{2}\|\psi(s)\|^{2})$ and $\exp(\frac{1}{2}\|\nu_{\text{ref}}(a)\|^{2})$, which are absorbed into the feature maps below), which measures the similarity between the state and action embeddings. To create a more tractable linear representation, we approximate this Gaussian kernel using Random Fourier Features (RFF). RFF approximates a continuous shift-invariant kernel with an inner product of randomized feature maps. Applying this technique yields a spectral representation of the reference LLM policy:
$$\pi_{\text{ref}}(a|s)=\frac{\langle\phi_{\omega}(s),\mu_{\text{ref},\omega}(a)\rangle_{\mathcal{N}(\omega)}}{\langle\phi_{\omega}(s),\int_{b\in A}\mu_{\text{ref},\omega}(b)\,db\rangle_{\mathcal{N}(\omega)}}\tag{11}$$

Here, the inner product $\langle\cdot,\cdot\rangle_{\mathcal{N}(\omega)}$ denotes an expectation over a random frequency vector $\omega\sim\mathcal{N}(0,I)$, and the feature maps are defined as $\phi_{\omega}(s)=\exp(-i\omega^{\top}\psi(s))\exp(\frac{1}{2}\|\psi(s)\|^{2})$ and $\mu_{\text{ref},\omega}(a)=\exp(i\omega^{\top}\nu_{\text{ref}}(a))\exp(\frac{1}{2}\|\nu_{\text{ref}}(a)\|^{2})$. This spectral representation transforms the non-linear kernel into a linear inner product in a randomized feature space, providing a direct path toward understanding the conditions under which the corresponding optimal Q-function can also be represented linearly.
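The two facts this construction relies on can be checked numerically: the inner-product identity that exposes the Gaussian kernel, and the Random Fourier Feature property $\mathbb{E}_{\omega\sim\mathcal{N}(0,I)}[\cos(\omega^{\top}(x-y))]=\exp(-\frac{1}{2}\|x-y\|^{2})$, which is the real part of the complex feature product. The sketch below uses hypothetical 3-dimensional embeddings and a Monte Carlo sample size chosen for illustration.

```python
import math
import random


def gaussian_kernel(x, y):
    """exp(-0.5 * ||x - y||^2)."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))


def rff_estimate(x, y, n_features, rng):
    """Monte Carlo estimate of E_w[cos(w^T (x - y))] with w ~ N(0, I)."""
    diff = [a - b for a, b in zip(x, y)]
    total = 0.0
    for _ in range(n_features):
        proj = sum(rng.gauss(0.0, 1.0) * d for d in diff)
        total += math.cos(proj)
    return total / n_features


def sq_norm(x):
    return sum(a * a for a in x)


# Hypothetical state representation psi(s) and action embedding nu_ref(a).
psi = [0.3, -0.2, 0.5]
nu = [0.1, 0.4, -0.3]

exact = gaussian_kernel(psi, nu)
approx = rff_estimate(psi, nu, n_features=20000, rng=random.Random(0))

# Identity: exp(psi^T nu) = kernel * exp(||psi||^2 / 2) * exp(||nu||^2 / 2).
softmax_logit_exp = math.exp(sum(a * b for a, b in zip(psi, nu)))
kernel_form = exact * math.exp(0.5 * sq_norm(psi)) * math.exp(0.5 * sq_norm(nu))
```

The Monte Carlo estimate converges at the usual $1/\sqrt{N}$ rate, so the agreement with the exact kernel is approximate while the identity check is exact up to floating-point error.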
To prove this technical lemma, we start by understanding the $L$-step decodability condition in Assumption [2](https://arxiv.org/html/2605.20408#Thmassumption2). Consider the $L$-step form of the optimal Bellman equation for $Q^{*}(s_{t},a_{t})$ in Equation [3](https://arxiv.org/html/2605.20408#S2.E3), which can be derived by unrolling the one-step Bellman equation forward in time for $L$ steps, using the time-consistency property of the exponential risk measure [hau2023entropic] $\log\mathbb{E}_{\pi_{\text{ref}}}\exp(Q(s,a)/\beta)$ and the fact that the transition dynamics of the language MDP are deterministic, $s^{\prime}=(s,a)$:
$$Q^{*}(s_{t},a_{t})=r(s_{t},a_{t})+\beta\log\mathbb{E}_{\pi_{\text{ref},t+1:t+L-1}}\left[\exp\left(\frac{1}{\beta}\sum_{i=t+1}^{t+L-1}r(s_{i},a_{i})+Q^{*}(s_{t+L},a_{t+L})\right)\right],\,\,\forall s_{t},a_{t}.\tag{12}$$

According to the $L$-step decodability assumption, the function $Q^{*}(s_{t+L},a_{t+L})$ in the above formulation is independent of the earlier history $s_{t}=(s_{0},a_{0},a_{1},\ldots,a_{t-1})$ and $a_{t}$. However, the forward steps are conducted according to the policy $\pi_{\text{ref},t+1:t+L-1}=\{\pi_{\text{ref}}(\cdot|s_{t+1}),\ldots,\pi_{\text{ref}}(\cdot|s_{t+L-1})\}$, which still depends on parts of $s_{t}$; hence the distribution in the above expectation retains a dependence on $s_{t}$.
To further study the dependence on history of this optimal Q-function, we introduce a final observation: for the reference policy $\pi_{\text{ref}}$ in Equation ([11](https://arxiv.org/html/2605.20408#A1.E11)), under the $L$-step decodability assumption there exists a corresponding policy $\kappa_{\pi_{\text{ref}}}$, known as the *moment matching policy*, that conditions on a sufficient latent variable (the reference LLM representation) to generate the same expected observation dynamics while being independent of history older than $L$ steps [zhang2023provable]. Now, with this observation we consider the $L$-step auto-regressive distribution induced by the reference policy $\pi_{\text{ref}}$, i.e., $\mathbb{P}^{\pi_{\text{ref}}}(s_{t+1:t+L},a_{t+1:t+L}|s_{t},a_{t})$. Under $L$-step decodability, for any arbitrary state-action pair $(s_{t},a_{t})$ at any step $t$, this forward distribution admits the following spectral decomposition:
$$\mathbb{P}^{\pi_{\text{ref}}}(s_{t+1:t+L},a_{t+1:t+L}|s_{t},a_{t})=\frac{\langle\phi_{\omega}(s_{t},a_{t}),\mu_{\kappa_{\text{ref}},\omega}(s_{t+1:t+L},a_{t+1:t+L})\rangle_{\mathcal{N}(\omega)}}{\langle\phi_{\omega}(s_{t},a_{t}),\int\mu_{\kappa_{\text{ref}},\omega}(s_{t+1:t+L},a_{t+1:t+L})\,ds_{t+1:t+L}\,da_{t+1:t+L}\rangle_{\mathcal{N}(\omega)}},\tag{13}$$

When modeling the state-action pairs in the next $h\in\{1,\ldots,L\}$ steps, the above expression uses the *moment matching* trick, which uses the policy $\kappa_{\text{ref}}$ that only depends on the latent variable and is independent of $(s_{t},a_{t})$. Under this policy, $\mu_{\kappa_{\text{ref}},\omega}(s_{t+1:t+L},a_{t+1:t+L})$ is then a function that maps to the latent space by construction and is independent of history, leading to the above spectral representation. Utilizing this result, one can rewrite Equation ([12](https://arxiv.org/html/2605.20408#A1.E12)) as
$$\begin{split}&Q^{*}(s_{t},a_{t})-r(s_{t},a_{t})\\ =&\beta\log\int\mathbb{P}^{\pi_{\text{ref}}}(s_{t+1:t+L},a_{t+1:t+L}|s_{t},a_{t})\exp\left(\frac{1}{\beta}\sum_{i=t+1}^{t+L-1}r(s_{i},a_{i})+Q^{*}(s_{t+L},a_{t+L})\right)ds_{t+1:t+L}\,da_{t+1:t+L}\\ =&\beta\log\frac{\int\langle\phi_{\omega}(s_{t},a_{t}),\mu_{\kappa_{\text{ref}},\omega}(s_{t+1:t+L},a_{t+1:t+L})\rangle_{\mathcal{N}(\omega)}\exp\left(\frac{1}{\beta}\sum_{i=t+1}^{t+L-1}r(s_{i},a_{i})+Q^{*}(s_{t+L},a_{t+L})\right)ds_{t+1:t+L}\,da_{t+1:t+L}}{\langle\phi_{\omega}(s_{t},a_{t}),\int\mu_{\kappa_{\text{ref}},\omega}(s_{t+1:t+L},a_{t+1:t+L})\,ds_{t+1:t+L}\,da_{t+1:t+L}\rangle_{\mathcal{N}(\omega)}}\\ =&\beta\log\frac{\int\langle\phi_{\omega}(s_{t},a_{t}),\mu_{\kappa_{\text{ref}},\omega}(s_{t+1:t+L},a_{t+1:t+L})\exp\left(\frac{1}{\beta}\sum_{i=t+1}^{t+L-1}r(s_{i},a_{i})+Q^{*}(s_{t+L},a_{t+L})\right)\rangle_{\mathcal{N}(\omega)}\,ds_{t+1:t+L}\,da_{t+1:t+L}}{\langle\phi_{\omega}(s_{t},a_{t}),\int\mu_{\kappa_{\text{ref}},\omega}(s_{t+1:t+L},a_{t+1:t+L})\,ds_{t+1:t+L}\,da_{t+1:t+L}\rangle_{\mathcal{N}(\omega)}}\\ =&\beta\log\frac{\bigg\langle\phi_{\omega}(s_{t},a_{t}),\underbrace{\int\mu_{\kappa_{\text{ref}},\omega}(s_{t+1:t+L},a_{t+1:t+L})\exp\left(\frac{1}{\beta}\sum_{i=t+1}^{t+L-1}r(s_{i},a_{i})+Q^{*}(s_{t+L},a_{t+L})\right)ds_{t+1:t+L}\,da_{t+1:t+L}}_{\mu_{\beta,Q^{*},\omega}=\exp(-i\omega^{\top}\nu_{\beta,\sum r})\exp(\|\nu_{\beta,\sum r}\|^{2}/2),\text{ for some }\nu_{\beta,\sum r}}\bigg\rangle_{\mathcal{N}(\omega)}}{\bigg\langle\phi_{\omega}(s_{t},a_{t}),\underbrace{\int\mu_{\kappa_{\text{ref}},\omega}(s_{t+1:t+L},a_{t+1:t+L})\,ds_{t+1:t+L}\,da_{t+1:t+L}}_{\mu_{\overline{\kappa}_{\text{ref}},\omega}=\exp(-i\omega^{\top}\nu_{\text{ref}})\exp(\|\nu_{\text{ref}}\|^{2}/2),\text{ for some }\nu_{\text{ref}}}\bigg\rangle_{\mathcal{N}(\omega)}},\end{split}$$

where the first equality follows directly from the definitions of the optimal Q-function and the forward distribution in Equation ([13](https://arxiv.org/html/2605.20408#A1.E13)), the second equality follows from the linear property of the Fourier spectral representation, which is characterized by an inner product $\langle\cdot,\cdot\rangle_{\mathcal{N}(\omega)}$ in the frequency space, and the third equality follows from algebraic operations of the corresponding random Fourier feature maps. The above arguments further imply that
$$Q^{*}(s_{t},a_{t})-r(s_{t},a_{t})=\beta\cdot\log\frac{\exp(\psi((s_{t},a_{t}))^{\top}\nu_{\beta,\sum r})}{\exp(\psi((s_{t},a_{t}))^{\top}\nu_{\text{ref}})}=\psi((s_{t},a_{t}))^{\top}(\nu_{\beta,\sum r}-\nu_{\text{ref}})\cdot\beta,\,\,\forall s_{t},a_{t}.$$

This further implies that $Q^{*}(s,a)=\psi((s,a))^{\top}\nu_{\beta,r,\text{ref}}$ with $\nu_{\beta,r,\text{ref}}=\nu_{\text{r}}+\nu_{\beta,\sum r}-\nu_{\text{ref}}$, meaning that the optimal Q-function can be linearly parametrized with the reference LLM feature $\psi$ under these conditions, completing the proof of this lemma.
### A.2 Proof of Theorem [1](https://arxiv.org/html/2605.20408#Thmtheorem1)
For any base and spectral soup temperatures $\beta,\beta^{\prime}>0$ and logit-mixture vector $\lambda\in\mathbb{R}^{K}$, recall the spectral soup policy $\tilde{\pi}_{\lambda}(a|s)\propto\pi_{\text{ref}}(a|s)\exp(\sum_{k}\lambda_{k}Q^{*}_{k}(s,a)/\beta^{\prime})$, the personalized policy $\pi^{*}_{\mathbf{w}}(a|s)\propto\pi_{\text{ref}}(a|s)\exp(Q^{*}_{\mathbf{w}}(s,a)/\beta)$, their KL divergence $D_{\text{KL}}(\pi^{*}_{\mathbf{w}}(\cdot|s)\,\|\,\tilde{\pi}_{\lambda}(\cdot|s))$, and the personalized value function
$$V^{*}_{\mathbf{w}}(s)=\max_{\pi}\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}r_{\mathbf{w}}(s_{t},a_{t})-\beta D_{\text{KL}}(\pi(\cdot|s_{t})\,\|\,\pi_{\text{ref}}(\cdot|s_{t}))\,\Big|\,s_{0}=s\right]=\beta\cdot\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\left[\exp\frac{Q^{*}_{\mathbf{w}}(s,a)}{\beta}\right],$$

and notice that the spectral soup policy has the following optimal value function:
$$V^{\lambda^{*}}_{\mathbf{w},\beta^{\prime}}(s):=\max_{\lambda\in\mathbb{R}^{K}}\,\,\mathbb{E}_{\tilde{\pi}_{\lambda}}\left[\sum_{t=0}^{T-1}r_{\mathbf{w}}(s_{t},a_{t})-\frac{\beta^{\prime}}{\sum_{k}|\lambda_{k}|}\cdot D_{\text{KL}}(\tilde{\pi}_{\lambda}(\cdot|s_{t})\,\|\,\pi_{\text{ref}}(\cdot|s_{t}))\,\Big|\,s_{0}=s\right]\,\,\text{s.t.}\,\,\beta\sum_{k}|\lambda_{k}|\leq\beta^{\prime}.$$
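The closed form of the personalized value function stated above, $V^{*}_{\mathbf{w}}(s)=\beta\log\mathbb{E}_{a\sim\pi_{\text{ref}}}[\exp(Q^{*}_{\mathbf{w}}(s,a)/\beta)]$, can be sanity-checked at a single state: the tilted policy $\pi\propto\pi_{\text{ref}}\exp(Q/\beta)$ attains this value in the KL-regularized objective, and the reference policy itself does no better. The sketch below is a single-step illustration with hypothetical numbers, not a multi-step evaluation.

```python
import math


def soft_value(p_ref, q, beta):
    """beta * log E_{a ~ p_ref}[exp(q(a) / beta)], computed stably."""
    m = max(q)
    return m + beta * math.log(sum(p * math.exp((x - m) / beta) for p, x in zip(p_ref, q)))


def tilted_policy(p_ref, q, beta):
    """pi(a) proportional to p_ref(a) * exp(q(a) / beta)."""
    m = max(q)
    w = [p * math.exp((x - m) / beta) for p, x in zip(p_ref, q)]
    z = sum(w)
    return [x / z for x in w]


def regularized_objective(pi, p_ref, q, beta):
    """E_pi[q] - beta * KL(pi || p_ref)."""
    expected_q = sum(p * x for p, x in zip(pi, q))
    kl = sum(p * math.log(p / r) for p, r in zip(pi, p_ref) if p > 0.0)
    return expected_q - beta * kl


# Hypothetical reference distribution and Q-values over three actions.
p_ref = [0.5, 0.3, 0.2]
q = [1.0, 2.0, 0.5]
beta = 0.7

v_star = soft_value(p_ref, q, beta)
pi_star = tilted_policy(p_ref, q, beta)
value_at_pi_star = regularized_objective(pi_star, p_ref, q, beta)
value_at_ref = regularized_objective(p_ref, p_ref, q, beta)
```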
First, we aim to compute an upper bound for the KL divergence between the spectral soup policy and the personalized policy, i.e., $D_{\text{KL}}(\pi^{*}_{\mathbf{w}}(\cdot|s)\,\|\,\tilde{\pi}_{\lambda}(\cdot|s))$. Expanding the KL divergence term, we obtain the following inequality for any logit-mixture vector $\lambda\in\mathbb{R}^{K}$:
$$\begin{aligned}0&\leq D_{\text{KL}}(\pi^{*}_{\mathbf{w}}(\cdot|s)\,\|\,\tilde{\pi}_{\lambda}(\cdot|s))=\mathbb{E}_{a\sim\pi^{*}_{\mathbf{w}}(\cdot|s)}\left[\log\frac{\pi^{*}_{\mathbf{w}}(a|s)}{\tilde{\pi}_{\lambda}(a|s)}\right]\\&=\mathbb{E}_{a\sim\pi^{*}_{\mathbf{w}}(\cdot|s)}\left[\log\pi_{\text{ref}}(a|s)+\frac{Q^{*}_{\mathbf{w}}(s,a)-V^{*}_{\mathbf{w}}(s)}{\beta}-\left(\log\pi_{\text{ref}}(a|s)+\frac{\sum_{k}\lambda_{k}Q^{*}_{k}(s,a)}{\beta^{\prime}}-\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\exp\frac{\sum_{k}\lambda_{k}Q^{*}_{k}(s,a)}{\beta^{\prime}}\right)\right]\\&=\mathbb{E}_{a\sim\pi^{*}_{\mathbf{w}}}\left[\frac{Q^{*}_{\mathbf{w}}(s,a)}{\beta}-\sum_{k}\frac{\lambda_{k}Q^{*}_{k}(s,a)}{\beta^{\prime}}\right]+\left[\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\exp\frac{\sum_{k}\lambda_{k}Q^{*}_{k}(s,a)}{\beta^{\prime}}-\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\exp\frac{Q^{*}_{\mathbf{w}}(s,a)}{\beta}\right]\\&\leq\frac{1}{\beta^{\prime}}\bigg(\|\mathbb{E}_{a\sim\pi^{*}_{\mathbf{w}}}[\psi((s,a))]\|_{2}+\|\mathbb{E}_{a\sim\pi_{\text{ref}}}|\psi((s,a))|\|_{2}\bigg)\cdot\left\|\frac{\beta^{\prime}}{\beta}\nu_{\beta,r_{\mathbf{w}},\text{ref}}-\sum_{k}\lambda_{k}\,\nu_{\beta,r_{k},\text{ref}}\right\|_{2},\end{aligned}\tag{14}$$

where the second equality uses $V^{*}_{\mathbf{w}}(s)=\beta\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\exp(Q^{*}_{\mathbf{w}}(s,a)/\beta)$, and the last inequality follows from Lemma [1](https://arxiv.org/html/2605.20408#Thmlemma1) and the Lipschitz property of the $\log\mathbb{E}_{\pi_{\text{ref}}}\exp(X)$ function (with Lipschitz constant $1$). Second, we aim to derive the performance sub-optimality bound with respect to the personalized-soup value function $V^{\lambda^{*}}_{\mathbf{w},\beta^{\prime}}(s)$ and the personalized value function $V^{*}_{\mathbf{w}}(s)$. By the above definitions, we can easily argue that (i) $V^{*}_{\mathbf{w}}(s)\geq V^{\lambda^{*}}_{\mathbf{w},\beta^{\prime}}(s)$, as the personalized-soup optimization problem always has an objective value no higher than that of the personalized RLHF problem, and (ii) $V^{\lambda^{*}}_{\mathbf{w},\beta^{\prime}}(s)\geq V^{\pi^{*}_{k}}_{\mathbf{w},\beta^{\prime}}(s)$, $\forall k\in\{1,\ldots,K\}$, as the right-hand side of this inequality can be recovered by setting the logit-mixture vector $\lambda$ to the corresponding one-hot vector with magnitude $\beta^{\prime}/\beta>0$ at the $k$-th attribute. Furthermore, the assumption of non-negative rewards $\mathbf{r}$ implies that all optimal value functions ($V^{*}_{\mathbf{w}}$, $V^{\lambda^{*}}_{\mathbf{w}}$, $V^{*}_{k}$, but not necessarily $V^{\pi^{*}_{k}}_{\mathbf{w},\beta^{\prime}}$ since it is not an optimal value function) are also non-negative.
With $\beta'/(\beta\sum_{k}|\lambda^{*}_{k}|)\geq 1$, we can therefore express the sub-optimality performance bound of the personalized-soup policy at any state $s$ as

$$
\begin{aligned}
V^{*}_{\mathbf{w}}(s)\geq V^{\lambda^{*}}_{\mathbf{w},\beta'}(s)&\geq\sum_{k}\frac{|\lambda^{*}_{k}|}{\sum_{k}|\lambda^{*}_{k}|}V^{\pi^{*}_{k}}_{\mathbf{w},\beta'}(s)=\sum_{k}\frac{|\lambda^{*}_{k}|}{\sum_{k}|\lambda^{*}_{k}|}\left(V^{\pi^{*}_{k}}_{\mathbf{w},\beta'}(s)-V^{*}_{k}(s)+V^{*}_{k}(s)\right)\\
&\geq\underbrace{\sum_{k}\frac{|\lambda^{*}_{k}|}{\sum_{k}|\lambda^{*}_{k}|}\left(V^{\pi^{*}_{k}}_{\mathbf{w},\beta'}(s)-V^{*}_{k}(s)\right)}_{A(s)}+\frac{\beta}{\beta'}\Bigg(\underbrace{\sum_{k}\lambda^{*}_{k}V^{*}_{k}(s)-\beta\sum_{k}|\lambda^{*}_{k}|\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\exp\frac{\sum_{k}\lambda^{*}_{k}Q^{*}_{k}(s,a)}{\beta\sum_{k}|\lambda^{*}_{k}|}}_{B(s)}\\
&\qquad+\underbrace{\beta\sum_{k}|\lambda^{*}_{k}|\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\exp\frac{\sum_{k}\lambda^{*}_{k}Q^{*}_{k}(s,a)}{\beta\sum_{k}|\lambda^{*}_{k}|}-\beta'\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\exp\frac{Q^{*}_{\mathbf{w}}(s,a)}{\beta}}_{C(s)}\Bigg)+V^{*}_{\mathbf{w}}(s).
\end{aligned}
$$

This performance bound can be further simplified via the following sequence of inequalities:
$$
\begin{aligned}
V^{\lambda^{*}}_{\mathbf{w},\beta'}(s)&\geq V^{*}_{\mathbf{w}}(s)+A(s)+\frac{\beta}{\beta'}B(s)-\big\|\mathbb{E}_{a\sim\pi_{\text{ref}}}|\psi((s,a))|\big\|_{2}\,\Big\|\nu_{\beta,r_{\mathbf{w}},\text{ref}}-\frac{\beta}{\beta'}\sum_{k}\lambda^{*}_{k}\nu_{\beta,r_{k},\text{ref}}\Big\|_{2}\\
&\geq V^{*}_{\mathbf{w}}(s)-\sum_{t=0}^{T-1}\mathbb{E}_{\underline{\pi}}\left[\|\psi((s_{t},a_{t}))\|_{2}\mid s_{0}=s\right]\left\|\nu_{r_{\mathbf{w}}}-\frac{\sum_{k}|\lambda^{*}_{k}|\nu_{r_{k}}}{\sum_{k}|\lambda^{*}_{k}|}\right\|_{2}+\frac{\beta}{\beta'}B(s)-\big\|\mathbb{E}_{a\sim\pi_{\text{ref}}}|\psi((s,a))|\big\|_{2}\,\Big\|\nu_{\beta,r_{\mathbf{w}},\text{ref}}-\frac{\beta}{\beta'}\sum_{k}\lambda^{*}_{k}\nu_{\beta,r_{k},\text{ref}}\Big\|_{2}\\
&\geq V^{*}_{\mathbf{w}}(s)-\sum_{t=0}^{T-1}\mathbb{E}_{\underline{\pi}}\left[\|\psi((s_{t},a_{t}))\|_{2}\mid s_{0}=s\right]\cdot\left\|\nu_{r_{\mathbf{w}}}-\frac{\sum_{k}|\lambda^{*}_{k}|\nu_{r_{k}}}{\sum_{k}|\lambda^{*}_{k}|}\right\|_{2}+\frac{\beta^{2}}{\beta'}\cdot\sum_{k}\Delta_{k}(s)(\lambda^{*}_{k})_{-}\\
&\qquad-\big\|\mathbb{E}_{a\sim\pi_{\text{ref}}}|\psi((s,a))|\big\|_{2}\cdot\Big\|\nu_{\beta,r_{\mathbf{w}},\text{ref}}-\frac{\beta}{\beta'}\sum_{k}\lambda^{*}_{k}\,\nu_{\beta,r_{k},\text{ref}}\Big\|_{2},
\end{aligned}
\tag{15}
$$

where the first inequality follows from the derivations in Equation ([9](https://arxiv.org/html/2605.20408#S3.E9)), i.e., the Lipschitz continuity of $\log\mathbb{E}_{\pi_{\text{ref}}}\exp(X)$, and the convexity of the function $f(x)=x^{\beta'/(\beta\sum_{k}|\lambda^{*}_{k}|)}$ with $\beta'/(\beta\sum_{k}|\lambda^{*}_{k}|)\geq 1$, i.e.,

$$
\begin{aligned}
C(s)&=\beta\sum_{k}|\lambda^{*}_{k}|\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\exp\frac{\sum_{k}\lambda^{*}_{k}Q^{*}_{k}(s,a)}{\beta\sum_{k}|\lambda^{*}_{k}|}-\beta'\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\exp\frac{Q^{*}_{\mathbf{w}}(s,a)}{\beta}\\
&=\beta\sum_{k}|\lambda^{*}_{k}|\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\left(\exp\frac{\sum_{k}\lambda^{*}_{k}Q^{*}_{k}(s,a)}{\beta'}\right)^{\beta'/(\beta\sum_{k}|\lambda^{*}_{k}|)}-\beta'\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\exp\frac{Q^{*}_{\mathbf{w}}(s,a)}{\beta}\\
&\geq\frac{\beta'\cdot\beta\sum_{k}|\lambda^{*}_{k}|}{\beta\sum_{k}|\lambda^{*}_{k}|}\log\left(\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\exp\frac{\sum_{k}\lambda^{*}_{k}Q^{*}_{k}(s,a)}{\beta'}\right)-\beta'\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\exp\frac{Q^{*}_{\mathbf{w}}(s,a)}{\beta}\\
&\geq-\big\|\mathbb{E}_{a\sim\pi_{\text{ref}}}|\psi((s,a))|\big\|_{2}\cdot\Big\|\frac{\beta'}{\beta}\nu_{\beta,r_{\mathbf{w}},\text{ref}}-\sum_{k}\lambda^{*}_{k}\,\nu_{\beta,r_{k},\text{ref}}\Big\|_{2},
\end{aligned}
$$

for all $s$. The second inequality follows from the definitions of the value functions in the soft MDP:
$$
\begin{aligned}
A(s)&:=\sum_{k}\frac{|\lambda^{*}_{k}|}{\sum_{k}|\lambda^{*}_{k}|}\left(V^{\pi^{*}_{k}}_{\mathbf{w},\beta'}(s)-V^{*}_{k}(s)\right)=\frac{1}{\sum_{k}|\lambda^{*}_{k}|}\sum_{k}|\lambda^{*}_{k}|\,\mathbb{E}_{\pi^{*}_{k}}\left[\sum_{t=0}^{T-1}\psi((s_{t},a_{t}))\,\Big|\,s_{0}=s\right]^{\top}\left(\nu_{r_{\mathbf{w}}}-\nu_{r_{k}}\right)\\
&\geq\frac{1}{\sum_{k}|\lambda^{*}_{k}|}\min_{\pi}\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}\psi((s_{t},a_{t}))^{\top}\sum_{k}|\lambda^{*}_{k}|\left(\nu_{r_{\mathbf{w}}}-\nu_{r_{k}}\right)\,\Big|\,s_{0}=s\right]\\
&=\frac{1}{\sum_{k}|\lambda^{*}_{k}|}\mathbb{E}_{\underline{\pi}}\left[\sum_{t=0}^{T-1}\psi((s_{t},a_{t}))^{\top}\sum_{k}|\lambda^{*}_{k}|\left(\nu_{r_{\mathbf{w}}}-\nu_{r_{k}}\right)\,\Big|\,s_{0}=s\right]\\
&\geq-\sum_{t=0}^{T-1}\mathbb{E}_{\underline{\pi}}\left[\|\psi((s_{t},a_{t}))\|_{2}\mid s_{0}=s\right]\cdot\left\|\nu_{r_{\mathbf{w}}}-\frac{\sum_{k}|\lambda^{*}_{k}|\nu_{r_{k}}}{\sum_{k}|\lambda^{*}_{k}|}\right\|_{2},
\end{aligned}
$$

where we assume both the linearly parameterized rewards of Assumption [1](https://arxiv.org/html/2605.20408#Thmassumption1) and the existence of a minimizing policy over the mixture of reward differences,

$$
\underline{\pi}\in\operatorname*{arg\,min}_{\pi}\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}\psi((s_{t},a_{t}))^{\top}\sum_{k}|\lambda^{*}_{k}|\left(\nu_{r_{\mathbf{w}}}-\nu_{r_{k}}\right)\,\Big|\,s_{0}=s\right].
$$

The third inequality follows from the hypothesis that there exists $\Delta_{k}(s)\geq 0$ such that the following property holds (this technical result is derived in the remaining part of the proof):
$$
B(s):=\sum_{k}\lambda^{*}_{k}V^{*}_{k}(s)-\beta\sum_{k}|\lambda^{*}_{k}|\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\exp\frac{\sum_{k}\lambda^{*}_{k}Q^{*}_{k}(s,a)}{\beta\sum_{k}|\lambda^{*}_{k}|}\geq\beta\cdot\underbrace{\sum_{k}\Delta_{k}(s)(\lambda^{*}_{k})_{-}}_{\leq 0}.
\tag{16}
$$
To show that Equation ([16](https://arxiv.org/html/2605.20408#A1.E16)) holds, first consider the following upper bound on $1/x$, with $0\leq\underline{x}\leq x\leq\overline{x}$, with respect to an arbitrary anchor point $\underline{x}\leq a\leq\overline{x}$ such that $|x-a|\leq|a|$, $\forall x$:

$$
\begin{aligned}
&\frac{1}{x}=\frac{1}{a}-\frac{x-a}{a^{2}}+\frac{(a-x)^{2}}{a^{2}}\frac{1}{x}\leq\frac{1}{a}-\frac{x-a}{a^{2}}+\frac{(a-\underline{x})^{2}}{a^{2}}\frac{1}{x}\\
\implies&\frac{1}{x}\leq\sum_{k=0}^{\infty}\left(\frac{(a-\underline{x})^{2}}{a^{2}}\right)^{k}\left(\frac{1}{a}-\frac{x-a}{a^{2}}\right)=\frac{2a-x}{a^{2}}\cdot\frac{1}{1-\frac{(a-\underline{x})^{2}}{a^{2}}}=\frac{2a-x}{a^{2}}\cdot\frac{a^{2}}{\underline{x}(2a-\underline{x})}=\frac{2a-x}{\underline{x}(2a-\underline{x})}.
\end{aligned}
$$

Substituting into the above inequality

$$
\begin{aligned}
&x=\exp\frac{Q^{*}_{k}(s,a)}{\beta},\quad\underline{x}=\underline{M}(s)\cdot\mathbb{E}_{a\sim\pi_{\text{ref}}}\left[\exp\frac{Q^{*}_{k}(s,a)}{\beta}\right],\\
&\overline{x}=\overline{M}(s)\cdot\mathbb{E}_{a\sim\pi_{\text{ref}}}\left[\exp\frac{Q^{*}_{k}(s,a)}{\beta}\right],\quad a=M(s)\cdot\mathbb{E}_{a\sim\pi_{\text{ref}}}\left[\exp\frac{Q^{*}_{k}(s,a)}{\beta}\right],
\end{aligned}
$$

for the importance-sampling (IS) factor $M(s):=\pi^{*}_{k}(a|s)/\pi_{\text{ref}}(a|s)$, IS lower bound $\underline{M}(s)$, and IS upper bound $\overline{M}(s)$, where $0\leq\underline{M}(s)\leq 1$, $\overline{M}(s)\geq 1$, and $M(s)\in[\underline{M}(s),\overline{M}(s)]$, and taking the expectation $\mathbb{E}_{a\sim\pi_{\text{ref}}}$ on both sides, this expression becomes
$$
\begin{aligned}
&\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\left[\frac{1}{\exp\frac{Q^{*}_{k}(s,a)}{\beta}}\right]\leq\frac{2M(s)-1}{\underline{M}(s)(2M(s)-\underline{M}(s))}\cdot\frac{1}{\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\left[\exp\frac{Q^{*}_{k}(s,a)}{\beta}\right]}\qquad(17)\\
\implies&\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\left[\frac{1}{\exp\frac{Q^{*}_{k}(s,a)}{\beta}}\right]\leq\log\frac{2M(s)-1}{\underline{M}(s)(2M(s)-\underline{M}(s))}-\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\left[\exp\frac{Q^{*}_{k}(s,a)}{\beta}\right]\\
\implies&-\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\left[\exp\frac{-Q^{*}_{k}(s,a)}{\beta}\right]\geq\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\left[\exp\frac{Q^{*}_{k}(s,a)}{\beta}\right]-\underbrace{\log\frac{2M(s)-1}{\underline{M}(s)(2M(s)-\underline{M}(s))}}_{\Delta^{M}_{k}(s)}.
\end{aligned}
$$

The above bound is only valid when $\Delta^{M}_{k}(s)\geq 0$. We achieve this, and also minimize $\Delta^{M}_{k}(s)$, by choosing the right $M(s)$.

Notice that, since $0\leq\underline{M}(s)\leq 1$ and $1\leq\overline{M}(s)$, the expression $\exp\Delta^{M}_{k}(s)=\frac{2M(s)-1}{\underline{M}(s)(2M(s)-\underline{M}(s))}\geq 1$ only when $M(s)\geq\frac{1}{2}(1+\underline{M}(s))$, and this value increases monotonically with $M$ beyond that point. However, to satisfy the necessary condition $-|a|\leq x-a\leq|a|$, $\forall x$, which guarantees the convergence of the above geometric sum for $\underline{x}\leq x\leq\overline{x}$, the smallest admissible value of $a$ is $(\overline{x}+\underline{x})/2$. In other words, the smallest possible $M$ is $(\overline{M}(s)+\underline{M}(s))/2$, which is a valid choice as it is no smaller than $(1+\underline{M}(s))/2$. Substituting $M^{*}=(\overline{M}(s)+\underline{M}(s))/2$ into $\Delta^{M}_{k}(s)$ yields
$$
\Delta_{k}(s):=\Delta^{M^{*}}_{k}(s)=\log\big(\overline{M}(s)+\underline{M}(s)-1\big)-\log\big(\overline{M}(s)\cdot\underline{M}(s)\big)\geq 0.
\tag{18}
$$

If $\Delta_{k}(s)$ is constructed via Equation ([18](https://arxiv.org/html/2605.20408#A1.E18)), then the inequality in Equation ([16](https://arxiv.org/html/2605.20408#A1.E16)) can be proven by using convexity arguments and the definition of $V^{*}_{k}$ via the following inequalities:
$$
\begin{aligned}
\frac{1}{\beta\sum_{k}|\lambda^{*}_{k}|}\sum_{k}\lambda^{*}_{k}V^{*}_{k}(s)&=\frac{1}{\beta\sum_{k}|\lambda^{*}_{k}|}\sum_{k}\lambda^{*}_{k}\,\beta\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\left[\exp\frac{Q^{*}_{k}(s,a)}{\beta}\right]\\
&\geq\frac{1}{\beta\sum_{k}|\lambda^{*}_{k}|}\sum_{k}\left(\beta\lambda^{*}_{k}\,\text{sgn}(\lambda^{*}_{k})\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\left[\exp\frac{\text{sgn}(\lambda^{*}_{k})Q^{*}_{k}(s,a)}{\beta}\right]+\beta\Delta_{k}(s)(\lambda^{*}_{k})_{-}\right)\\
&=\sum_{k}\left(\frac{|\lambda^{*}_{k}|}{\sum_{k}|\lambda^{*}_{k}|}\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\left[\exp\frac{\text{sgn}(\lambda^{*}_{k})Q^{*}_{k}(s,a)}{\beta}\right]+\frac{\Delta_{k}(s)(\lambda^{*}_{k})_{-}}{\sum_{k}|\lambda^{*}_{k}|}\right)\\
&\geq\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\left[\exp\frac{\sum_{k}|\lambda^{*}_{k}|\,\text{sgn}(\lambda^{*}_{k})Q^{*}_{k}(s,a)}{\beta\sum_{k}|\lambda^{*}_{k}|}\right]+\frac{\sum_{k}\Delta_{k}(s)(\lambda^{*}_{k})_{-}}{\sum_{k}|\lambda^{*}_{k}|}\\
&=\log\mathbb{E}_{a\sim\pi_{\text{ref}}(\cdot|s)}\left[\exp\frac{\sum_{k}\lambda^{*}_{k}Q^{*}_{k}(s,a)}{\beta\sum_{k}|\lambda^{*}_{k}|}\right]+\frac{\sum_{k}\Delta_{k}(s)(\lambda^{*}_{k})_{-}}{\sum_{k}|\lambda^{*}_{k}|},
\end{aligned}
$$

where the first equality follows from the definition of $V^{*}_{k}(s)$, the first inequality follows from the arguments in Equation ([17](https://arxiv.org/html/2605.20408#A1.E17)) applied to the cases where $\lambda^{*}_{k}<0$ (with equality holding trivially when $\lambda^{*}_{k}\geq 0$), the second and third equalities follow from simple algebra, and the second inequality follows from the convexity of the function $\log\mathbb{E}_{\pi_{\text{ref}}}\exp(X)$.

Combining all these arguments completes the proof of this theorem.
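As a brief check of the sign claim in Equation (18), the following sketch uses only the stated ranges $\underline{M}(s)\leq 1\leq\overline{M}(s)$, together with the added assumption $\underline{M}(s)>0$ so that the logarithms are finite:

$$
\overline{M}(s)+\underline{M}(s)-1-\overline{M}(s)\,\underline{M}(s)=\big(\overline{M}(s)-1\big)\big(1-\underline{M}(s)\big)\geq 0
\;\;\Longrightarrow\;\;
\overline{M}(s)+\underline{M}(s)-1\geq\overline{M}(s)\,\underline{M}(s)>0 .
$$

Since $\log$ is increasing, this gives $\Delta_{k}(s)\geq 0$, with equality exactly when $\overline{M}(s)=1$ or $\underline{M}(s)=1$.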
## Appendix B Experiment Details
### B.1 Experimental Domains
To assess the effectiveness of our approach, we empirically evaluate spectral souping on three realistic LLM personalization experiments. Each experiment contains two phases: an offline phase that learns specialized policies and an online adaptation phase for LLM personalization.
The first experiment is built upon the UltraFeedback dataset, which initially contains a training set of 60,829 examples and a test set of 985 examples. Each data point consists of an input prompt $x$ and a pair of distinct agent responses, $(y_1,y_2)$, where each response $y_i$ is annotated with a four-dimensional feature vector, $\phi(x,y_i)$, quantifying its helpfulness, honesty, instruction-following, and truthfulness. We leverage these fine-grained scores to synthesize diverse preference labels, each with a unique weight vector $\mathbf{w}\in\mathbb{R}^{4}$. The pairwise preference ranking of two responses is determined by the dot product of their feature vectors with this weight vector; for instance, $y_1$ is preferred over $y_2$ if $\langle\phi(x,y_1),\mathbf{w}\rangle>\langle\phi(x,y_2),\mathbf{w}\rangle$. In the offline phase, we build a library of $K=30$ specialized policies. We generate 30 distinct datasets by creating 30 unique preference vectors, $\mathbf{w}_k$, each sampled from a distribution centered around a basis vector for one of the four attributes (e.g., $[1,0,0,0]$ for helpfulness). The online phase evaluates the algorithm's generalization to novel preferences under ambiguous conditions. On the same dataset of prompts and responses, we simulate responses of eight held-out "users," proxied by publicly available reward models. Crucially, the preference functions of these models were unseen during offline training, providing a rigorous test of generalization. To further amplify the task's difficulty, we filter the dataset to retain only the most contentious examples where preferences conflict, resulting in a benchmark of 23,614 training and 401 test examples, forcing the model to learn nuanced preference trade-offs.
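The label-synthesis rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the example scores, the tie-breaking rule, and the Gaussian noise used to sample weight vectors are all assumptions (the text only says each vector is drawn from a distribution centered on a basis vector).

```python
import random

# Attribute order assumed for illustration only.
ATTRIBUTES = ("helpfulness", "honesty", "instruction_following", "truthfulness")


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def preferred_response(phi_1, phi_2, w):
    """Return 1 if y1 is preferred under weight vector w, else 2.

    Ties are resolved in favor of y2; this is an arbitrary assumption.
    """
    return 1 if dot(phi_1, w) > dot(phi_2, w) else 2


def sample_weight_vector(attribute_index, noise_scale=0.1, rng=random):
    """Basis vector for one attribute plus Gaussian noise (assumed noise model)."""
    return [
        (1.0 if i == attribute_index else 0.0) + rng.gauss(0.0, noise_scale)
        for i in range(len(ATTRIBUTES))
    ]


# Made-up feature scores for two responses to the same prompt.
phi_y1 = [5.0, 3.0, 4.0, 3.0]
phi_y2 = [3.0, 5.0, 4.0, 5.0]

# The same pair receives opposite labels under different preference vectors.
label_helpfulness = preferred_response(phi_y1, phi_y2, [1.0, 0.0, 0.0, 0.0])
label_honesty = preferred_response(phi_y1, phi_y2, [0.0, 1.0, 0.0, 0.0])
```

The example pair is deliberately "contentious": a helpfulness-focused vector prefers the first response while an honesty-focused vector prefers the second, which is the kind of conflict the filtered benchmark retains.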
The second experimental setup focuses on personalized text-to-image (T2I) generation within the PASTA framework, which involves a 5-turn ($H=5$) interactive process. At each turn, the agent presents the user with a $4\times 4$ slate of 16 images, where each column corresponds to a unique prompt expansion. The core generation models include Stable Diffusion XL for images and Gemini 1.5 Flash for creating a candidate set of 25 prompts, from which four are selected for the slate. The utility functions are based on fine-tuned Gemma 2B models. For the offline learning phase, we generated $K=32$ specialized datasets from over 30,000 simulated user rollouts, totaling more than 2.5 million images. These rollouts were guided by 32 distinct user models designed to capture myopic, turn-by-turn preferences. In this simulation, a user provides an absolute satisfaction score (on a 5-point scale) for the best image in each column based on its relevance to the original prompt. The choice is then modeled as selecting the column that received the highest score. In the online phase, the algorithm's adaptability is tested against 5 held-out, simulated users. Each of these "auto-raters" is powered by a unique, pre-trained Q-function that models a holistic, session-level preference. In contrast to the myopic offline models, these Q-functions evaluate the entire 5-turn session based on salient user values, such as aesthetic quality, prompt-image consistency, or a specific artistic style. These user-mimicking Q-functions were trained on a large offline dataset using Implicit Q-Learning.
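The myopic choice model described above (score the best image per column, then pick the top column) might be sketched as follows. The function name, the per-image score grid, and the tie-breaking rule are hypothetical; the text does not specify how ties are resolved or how the best image in a column is identified.

```python
def choose_column(slate_scores):
    """Pick the slate column whose best image has the highest score.

    slate_scores[row][col] is a hypothetical satisfaction score (1 to 5)
    for the image at that position; each column is one prompt expansion.
    Ties go to the lowest column index, which is an assumption.
    """
    num_cols = len(slate_scores[0])
    # The user scores only the best image in each column.
    column_best = [max(row[c] for row in slate_scores) for c in range(num_cols)]
    # max() returns the first maximal element, so ties pick the lowest index.
    chosen = max(range(num_cols), key=lambda c: column_best[c])
    return chosen, column_best


# A made-up 4x4 slate of per-image scores.
slate = [
    [2, 4, 1, 3],
    [3, 2, 5, 3],
    [1, 4, 2, 2],
    [2, 3, 3, 4],
]
chosen_column, column_best = choose_column(slate)
```

Here the third column (index 2) is selected because it contains the single highest-scoring image, even though its other images score poorly, which reflects the turn-by-turn, best-image nature of the offline user models.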
Our third experiment is grounded in the healthcare domain of sleep coaching. We begin by obtaining detailed user profiles from 68 real individuals from the LifeSnaps dataset. Each profile is constructed using a rich set of attributes, including demographics (age, gender) and health metrics (BMI, average and variable sleep duration). From these attributes, we generate a 'sleep profile' for each user, detailing their primary sleep concern, goals, and barriers. To simulate conversations, each user is instantiated as an LLM whose prompt contains the user's entire backstory vignette, along with the full preceding conversation history, ensuring that the dialogue is consistently grounded in the user's profile. For the offline learning phase, we generate $K=15$ specialized preference datasets, each corresponding to one of the five personality dimensions. To create each dataset, we first generate 1,000 pairs of 10-turn conversations by having our synthetic users interact with the coaching agent, a 'Talker' and 'Reasoner' system powered by Gemini 1.5 Pro. The key step in specializing each dataset is persona-based ranking: each of the 1,000 conversation pairs is ranked by a reward function that specifically embodies one of the five target personalities. For the online adaptation phase, we evaluate the algorithm's performance on a test set of 512 samples for each of 5 distinct "users," who are simulated by an auto-rater system with generative feedback. This system employs Gemini 1.5 Flash to score conversations by systematically evaluating them against a detailed set of rubrics designed to holistically assess sleep-coaching quality.
These rubrics cover the agent’s tone and style \(friendliness, supportiveness, empowerment\), its ability to understand the user \(rapport establishment, capturing concerns, efficient information gathering\), and the efficacy of its personalized intervention \(collaborative goal\-setting, relevance, and quality of the structured plan\)\.
### B.2 Model and Training Details
### B.3 Additional Results
Figure 4: Test-time training performance of different methods using the Gemma3 1B model: Explicit and Implicit Spectral Souping (SS-Exp and SS-Imp), P-SOUPS, PAD, PAD-SF, and RLHF, adapted to 5 different users in the UltraFeedback, T2I Generation, and Sleep Coaching domains. The SS methods (especially SS-Exp) consistently outperform P-SOUPS and the PAD baselines, demonstrating superior performance in online adaptation.

Figure 5: Evaluation performance of different online adaptation methods using the Gemma3 1B model: Explicit and Implicit Spectral Souping (SS-Exp and SS-Imp), P-SOUPS, PAD, PAD-SF, and RLHF, adapted to 5 different users in the UltraFeedback, T2I Generation, and Sleep Coaching domains. The superior performance of the SS methods (over the P-SOUPS, PAD, and PAD-SF baselines) also carries over to online evaluations.
## Appendix C Sequential Estimation of $\lambda$
The loss functions in both cases (binary labels and preference labels) are convex in $\lambda$, so inference of these parameters is tractable using off-the-shelf convex optimization methods. However, evaluating the loss function is relatively expensive and must be performed for each user, ideally in real time. Thus, in this section we propose an online method for optimizing $\lambda$. The formulation of our method is based on sequential variational inference, but we will show that the resulting inference algorithm is a simple weighted least squares update.
Our variational method aims to compute a variational posterior $q(\lambda)$ via

$$
q^{*}(\lambda)=\operatorname*{arg\,min}_{q}\;\mathbb{E}_{q(\lambda)}[\mathcal{L}(\lambda;\mathcal{B})]+\mathrm{KL}\big(q(\lambda)\,\|\,p(\lambda)\big).
\tag{19}
$$

We fix $q(\lambda)=\mathcal{N}(\bar{\lambda},S)$ and a Gaussian prior $p(\lambda)$. Note that this objective is convex in the parameters of the variational posterior. In the case where we sequentially observe trajectories and labels $(\tau_{n},l_{n})$ at interaction $n$, we propose a sequential method for identifying $\lambda$ based on variational continual learning. In particular, at time index $n$, we compute the variational posterior $q_{n}$ via

$$
q_{n}(\lambda)=\operatorname*{arg\,min}_{q}\;\mathbb{E}_{q(\lambda)}[\mathcal{L}(\lambda;(\tau_{n},l_{n}))]+\mathrm{KL}\big(q(\lambda)\,\|\,q_{n-1}(\lambda)\big).
\tag{20}
$$

Briefly, this method sequentially infers the variational posterior by regularizing toward the posterior of the previous timestep, in a manner similar to sequential filtering.
We derive the least squares update for the preference learning case; the binary classification model is a straightforward extension. We approximate the loss in Equation ([20](https://arxiv.org/html/2605.20408#A3.E20)) with a second-order Taylor expansion, allowing exact computation of the expectation. We define $\Delta_{n-1}=\frac{R_{n-1}(w)-R_{n-1}(l)}{\beta}$ (where each $R$ term is vectorized over $k$). This term is computed from the log-likelihoods of the policy, as in the section on offline learning of specialized policies. We further define $\sigma_{n-1}=\sigma(\bar{\lambda}_{n-1}^{\top}\Delta_{n-1})$, the predictive preference likelihood after updating at step $n-1$.
Computing the analytical minimum of this second-order approximation, we get the updates

$$
S_{n}^{-1}=S_{n-1}^{-1}+\sigma_{n-1}(1-\sigma_{n-1})\,\Delta_{n-1}\Delta_{n-1}^{\top},
\tag{21}
$$

$$
\bar{\lambda}_{n}=\bar{\lambda}_{n-1}+(1-\sigma_{n-1})\,S_{n}\,\Delta_{n-1},
\tag{22}
$$

which are inexpensive to compute compared to the cost of the model evaluation. The updates may be made cheaper by exploiting rank-1 updates, but this is a relatively minor consideration. This update can naively be performed once per timestep (i.e., per user feedback), corresponding to a Newton-style step per interaction. It can also be performed multiple times (matching iteratively reweighted least squares) by setting $\sigma_{n-1}=\sigma(\bar{\lambda}_{n}^{\top}\Delta_{n-1})$.