From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning
Summary
This paper proposes a unified framework for personalized agentic reinforcement learning that decouples generic task rewards from personalized preference rewards, introducing PARPO and PSGM for preference-aligned policy optimization and skill retrieval.
# From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning
Source: [https://arxiv.org/html/2605.23382](https://arxiv.org/html/2605.23382)
Ranxu Zhang¹, Zeyang Li¹, Jiacheng Huang², Rui Zhang², Xiaozhou Xu², Zhe Sun², Yanyong Zhang¹, Chao Wang¹

¹University of Science and Technology of China, ²Alibaba Group
###### Abstract
Agentic reinforcement learning (Agentic RL) has achieved strong progress in tasks with clear success signals. However, many real-world agent applications require user-conditioned behavior: the same query may call for different planning strategies and tool-use decisions across users. This setting raises key challenges: generic rewards cannot capture heterogeneous user preferences, observed behaviors are entangled with conformity effects, and flat memories cannot support personalized skill retrieval. To this end, we propose a unified personalized Agentic RL framework that embeds personalization into training-time optimization. At its core is *Personalized Anchor Reward-Decoupled Policy Optimization* (PARPO), which decouples generic task-quality rewards from personalized preference rewards and uses user-specific anchors to stabilize learning under heterogeneous reward scales. We further introduce a two-stage preference-disentangled reward model and *Preference-Aligned Skill Evolution Graph Memory* (PSGM) for personalized supervision and preference-aligned skill retrieval. Together, they form a closed loop of preference identification, policy optimization, and structured skill accumulation. Experiments on ETAPP, ETAPP-Hard, and SJAgent show that our framework consistently outperforms strong memory and RL baselines. Code and data are included in the supplementary materials.
## 1 Introduction
Large language model (LLM)-based agentic reinforcement learning (Agentic RL) has emerged as a powerful paradigm for optimizing LLM agents, achieving strong performance in code generation (Gehring et al., [2025](https://arxiv.org/html/2605.23382#bib.bib1); Le et al., [2022](https://arxiv.org/html/2605.23382#bib.bib2); Yang et al., [2024](https://arxiv.org/html/2605.23382#bib.bib3)), web navigation (Nakano et al., [2021](https://arxiv.org/html/2605.23382#bib.bib4); Qi et al., [2025](https://arxiv.org/html/2605.23382#bib.bib5)), tool use (Qian et al., [2026](https://arxiv.org/html/2605.23382#bib.bib42); Feng et al., [2026a](https://arxiv.org/html/2605.23382#bib.bib7); Xia et al., [2026](https://arxiv.org/html/2605.23382#bib.bib45)), and long-horizon planning (Tang et al., [2026](https://arxiv.org/html/2605.23382#bib.bib10); Peng et al., [2026](https://arxiv.org/html/2605.23382#bib.bib9); Xi et al., [2026](https://arxiv.org/html/2605.23382#bib.bib8)). However, most of this progress has been made in *verifiable* settings (Qian et al., [2026](https://arxiv.org/html/2605.23382#bib.bib42); Jin et al., [2025](https://arxiv.org/html/2605.23382#bib.bib15); Shao et al., [2024](https://arxiv.org/html/2605.23382#bib.bib37)), where policy optimization can rely on a unique ground-truth answer. In contrast, in many real-world agent applications, such as e-commerce assistance, travel planning, and daily scheduling (Skarlinski et al., [2024](https://arxiv.org/html/2605.23382#bib.bib11); Schmidgall et al., [2024](https://arxiv.org/html/2605.23382#bib.bib12); Ning et al., [2025](https://arxiv.org/html/2605.23382#bib.bib14); Xie et al., [2024](https://arxiv.org/html/2605.23382#bib.bib13); Lyu et al., [2026](https://arxiv.org/html/2605.23382#bib.bib46)), this verifiability breaks down because optimal behavior is user-dependent: the same query may admit multiple plausible trajectories, with the preferred one determined by the user’s preferences, habits, and constraints.
Recent work has begun to extend LLM optimization beyond strictly verifiable tasks. Non-verifiable or open-ended optimization methods use LLM-based evaluation (Ye et al., [2025a](https://arxiv.org/html/2605.23382#bib.bib19); Chan et al., [2024](https://arxiv.org/html/2605.23382#bib.bib18); Liu et al., [2023](https://arxiv.org/html/2605.23382#bib.bib17); Zheng et al., [2023](https://arxiv.org/html/2605.23382#bib.bib16)), rubric-based rewards (Gunjal et al., [2025](https://arxiv.org/html/2605.23382#bib.bib20); Liu et al., [2025](https://arxiv.org/html/2605.23382#bib.bib21)), and other reward construction schemes (Ou et al., [2026](https://arxiv.org/html/2605.23382#bib.bib22); Ye et al., [2025b](https://arxiv.org/html/2605.23382#bib.bib23); Tan et al., [2025](https://arxiv.org/html/2605.23382#bib.bib25); Xu et al., [2025](https://arxiv.org/html/2605.23382#bib.bib24)) to provide supervision when exact answers are unavailable. However, these methods typically optimize generic objectives, such as overall quality, helpfulness, or rubric satisfaction, rather than user-conditioned preferences. In parallel, personalized agent methods incorporate user profiles, prompting strategies, agent designs, and memory retrieval over historical interactions (Zhang et al., [2025](https://arxiv.org/html/2605.23382#bib.bib26); Liang et al., [2026](https://arxiv.org/html/2605.23382#bib.bib28); Cai et al., [2025](https://arxiv.org/html/2605.23382#bib.bib27); Wang et al., [2025](https://arxiv.org/html/2605.23382#bib.bib30); Lyu et al., [2026](https://arxiv.org/html/2605.23382#bib.bib46); Su et al., [2026](https://arxiv.org/html/2605.23382#bib.bib31)). While effective for improving user alignment, they largely personalize behavior at inference time and do not directly optimize policies for user-contingent trajectories. Therefore, existing methods lack a native training-time optimization framework for personalized agent behavior.
Personalization fundamentally changes the optimization target of Agentic RL. For the same query, users may prefer different planning strategies and tool-use decisions; for example, “plan a one-day trip in Tokyo” may call for a museum-centered route for one user but an anime-themed route for another. Therefore, the agent must move beyond learning a single average-optimal policy and instead learn user-contingent tool-use trajectories. As illustrated in Figure [1](https://arxiv.org/html/2605.23382#S1.F1), this setting introduces three core challenges. (C1) Personalized reward ambiguity: generic rewards mainly capture correctness, task completion, or overall helpfulness, but cannot express how a specific user evaluates the same trajectory, nor can they handle heterogeneous reward scales across users. (C2) Personalized preference disentanglement: observed user behaviors are often shaped by both intrinsic interests and external conformity or contextual effects, making individualized preference signals noisy and difficult to identify accurately. (C3) User-aware memory and skill organization: existing agent memories are often flat and query-centric, and thus cannot explicitly model or retrieve the structured relations among users, intents, skills, tools, scenarios, and trajectories.
To bridge this gap, we propose a unified personalized Agentic RL framework that embeds personalization into the training-time optimization loop. At its core is *Personalized Anchor Reward-Decoupled Policy Optimization* (PARPO), which decouples generic task-quality rewards from personalized preference rewards and uses user-specific anchors to stabilize learning under heterogeneous reward scales. PARPO preserves general task competence while enabling user-contingent policy improvement. To provide cleaner personalized supervision, we develop a two-stage preference-disentangled reward model that separates intrinsic interests from conformity and contextual effects. To support personalized rollout context, we introduce *Preference-Aligned Skill Evolution Graph Memory* (PSGM), an evolving heterogeneous graph memory that organizes users, skills, tools, scenarios, and trajectories for preference-aligned skill retrieval. Together, these components form a closed loop of preference identification, personalized policy optimization, and structured skill accumulation. Our contributions are summarized as follows:
- We formulate personalized Agentic RL for user-conditioned agent tasks, where optimal behavior depends on individual preferences.
- We propose PARPO, an anchor-stabilized and reward-decoupled policy optimization method for learning personalized policies under heterogeneous user reward scales.
- We introduce a preference-disentangled reward model and PSGM to provide reliable personalized supervision and preference-aligned skill retrieval.
- We evaluate our framework on ETAPP (Hao et al., [2025](https://arxiv.org/html/2605.23382#bib.bib29)), ETAPP-Hard, and SJAgent (a real-world industrial agent training scenario from a large Chinese e-commerce platform), showing gains in personalization and procedural quality while maintaining factual and logical quality.

Figure 1: Personalization in Agentic RL changes the notion of optimal behavior: the same query may require different plans for different users.
## 2 Related Work
Agentic RL in Verifiable Settings. RL has proven effective for improving LLM agents on verifiable tasks, including Retool, ToolRL, AutoWebGLM, and Search-R1 (Feng et al., [2026a](https://arxiv.org/html/2605.23382#bib.bib7); Qian et al., [2026](https://arxiv.org/html/2605.23382#bib.bib42); Jin et al., [2025](https://arxiv.org/html/2605.23382#bib.bib15); Lai et al., [2024](https://arxiv.org/html/2605.23382#bib.bib6)). GRPO-style methods such as GRPO, DAPO, GSPO, GiGPO, and GDPO improve scalability, sequence-level optimization, and multi-reward stability (Shao et al., [2024](https://arxiv.org/html/2605.23382#bib.bib37); Yu et al., [2025](https://arxiv.org/html/2605.23382#bib.bib38); Zheng et al., [2025](https://arxiv.org/html/2605.23382#bib.bib39); Feng et al., [2026b](https://arxiv.org/html/2605.23382#bib.bib41); Liu et al., [2026b](https://arxiv.org/html/2605.23382#bib.bib40)). However, these advances remain centered on verifiable settings with correctness or task-success signals, rather than user-specific alignment (Jin et al., [2025](https://arxiv.org/html/2605.23382#bib.bib15); Xia et al., [2026](https://arxiv.org/html/2605.23382#bib.bib45); Feng et al., [2026b](https://arxiv.org/html/2605.23382#bib.bib41)).
Non-verifiable and Open-ended Optimization. RL has also been extended beyond verifiable tasks using LLM-based evaluation, rubric supervision, and learned rewards, e.g., OpenRubrics and Rubrics as Rewards (Liu et al., [2025](https://arxiv.org/html/2605.23382#bib.bib21); Gunjal et al., [2025](https://arxiv.org/html/2605.23382#bib.bib20)). While these methods support open-ended outputs, they still optimize generic objectives, such as quality or rubric satisfaction, rather than personalized behavior.
Personalization, Preference, and Memory. Prior work explores personalization through profiles, memory, and personalized agents, including PersonaAgent, O-Mem, Preference-Aware Memory Update, and Learning Personalized Agents from Human Feedback (Zhang et al., [2025](https://arxiv.org/html/2605.23382#bib.bib26); Wang et al., [2025](https://arxiv.org/html/2605.23382#bib.bib30); Sun et al., [2025](https://arxiv.org/html/2605.23382#bib.bib47); Liang et al., [2026](https://arxiv.org/html/2605.23382#bib.bib28)). CoPD studies true-interest/conformity entanglement in user behavior (Zheng et al., [2021](https://arxiv.org/html/2605.23382#bib.bib36)), while memory- and skill-based agents retrieve skills and user context (Zhang et al., [2026](https://arxiv.org/html/2605.23382#bib.bib43); Xia et al., [2026](https://arxiv.org/html/2605.23382#bib.bib45); Zhou et al., [2026](https://arxiv.org/html/2605.23382#bib.bib34); Liu et al., [2026a](https://arxiv.org/html/2605.23382#bib.bib35)). Most of these methods focus on inference-time personalization or omit personalized policy optimization in user-conditioned environments. Our work instead unifies training-time user-conditioned policy optimization, true-preference reward modeling, and personalized skill retrieval.
A more detailed discussion of relevant prior work is provided in Appendix [A](https://arxiv.org/html/2605.23382#A1).
## 3 Problem Definition
As discussed in the Introduction, personalization is important in real-world scenarios such as e-commerce and daily planning, since user satisfaction depends not only on whether the task is completed, but also on whether the result aligns with user preferences and whether the overall decision-making experience is satisfactory. Personalized agent behavior can be naturally described as a user-conditioned Markov decision process (MDP):

$$\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{Q},T,R,\gamma),$$

where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ denotes the action space, $\mathcal{P}$ denotes the user profile space, $\mathcal{Q}$ denotes the user query space, $T:\mathcal{S}\times\mathcal{A}\to\mathcal{S}$ is the transition function, $R$ is the reward function, and $\gamma$ is the discount factor. For each instance, the user profile $p_u\in\mathcal{P}$ and the user query $q\in\mathcal{Q}$ jointly define the task condition, and the initial state $s_0$ is determined by $(p_u,q)$. The agent then selects actions according to the policy $\pi_\theta(a_t\mid s_t,p_u,q)$, which induces a trajectory $\tau=(s_0,a_0,s_1,a_1,\dots,s_T,a_T)$.
The training objective is to maximize the expected trajectory reward:

$$\max_{\theta}\;\mathbb{E}_{(u,q)\sim\mathcal{D},\,\tau\sim\pi_{\theta}(\cdot\mid p_u,q)}\bigl[R(\tau,p_u,q)\bigr]. \tag{1}$$
The reward consists of two conceptually distinct components: a general\-quality reward and a personalized preference reward\. The general\-quality reward evaluates task completion, logical coherence, and procedural correctness according to the benchmark\-specific evaluation protocol, while the personalized preference reward measures how well a trajectory aligns with the target user’s preferences\. Therefore, the central challenge is that the same query may correspond to different optimal trajectories for different users\. Accordingly, a personalized agent needs to optimize both general task quality and preference alignment, while also handling differences in the scale and distribution of personalized rewards across users\.
## 4 Method
To address the challenge of personalized decision-making in user-conditioned tasks, we propose a unified Agentic RL framework. As illustrated in Figure [2](https://arxiv.org/html/2605.23382#S4.F2), the framework operates as a closed loop of personalized retrieval, generation, evaluation, and refinement. Given a user query and profile, the framework first retrieves relevant historical skills from a graph-based memory to form a personalized rollout context. Conditioned on this enriched context, the LLM-based policy interacts with the environment to generate decision-making trajectories. These trajectories are then evaluated by a dual-reward system: a general task-quality reward and a personalized preference reward. Finally, the policy is updated using these decoupled signals, and high-value trajectories are consolidated back into memory as reusable skills.
In the following, we first introduce the core of our framework—the policy optimization algorithm \(PARPO, §[4\.1](https://arxiv.org/html/2605.23382#S4.SS1)\)\. We then detail the two supporting modules that make this optimization possible: the personalized reward model that provides user\-conditioned signals \(§[4\.2](https://arxiv.org/html/2605.23382#S4.SS2)\), and the skill graph memory that structures the rollout context \(§[4\.3](https://arxiv.org/html/2605.23382#S4.SS3)\)\.
Figure 2: Overview of the proposed personalized Agentic RL framework.

### 4.1 PARPO: Personalized Anchor Reward-Decoupled Policy Optimization
We now introduce the policy optimizer in our framework. PARPO separates task quality from personalized alignment, allowing the policy to improve user-specific behavior without entangling it with shared supervision. In implementation, PARPO is instantiated as a dual-track GRPO-style advantage estimator built on the personalized reward model in Section [4.2](https://arxiv.org/html/2605.23382#S4.SS2) and the graph memory in Section [4.3](https://arxiv.org/html/2605.23382#S4.SS3).
##### Theoretical justification\.
Appendix [C](https://arxiv.org/html/2605.23382#A3) shows that under heterogeneous user preferences, personalized optimization is preferable to user-agnostic optimization, while standard GRPO incurs structural bias from pooled baselines and normalization. The analysis yields three conclusions. First, under heterogeneous user preferences, user-aware optimization is never worse than user-agnostic optimization: when different users prefer different trajectories for the same query, a single average-optimal policy necessarily compromises across users. Second, standard GRPO introduces structural bias in personalized settings because it uses pooled baselines and pooled normalization statistics. As a result, its advantage estimate can deviate from the true user-specific advantage due to both reward-center mismatch and reward-scale mismatch, with the dominant error controlled by cross-user preference heterogeneity. Third, PARPO reduces this bias by decoupling generic task-quality rewards from personalized preference rewards and calibrating the personalized branch with user-specific anchors.
In our implementation and experiments, PARPO reduces this bias primarily through reward decomposition and user-specific anchor calibration. In particular, for a fixed user $u$, its personalized advantage estimation error satisfies

$$\left|\bar{A}_{\mathrm{PARPO}}(\tau\mid u,q)-\bar{A}_{\mathrm{pers}}^{*}(\tau\mid u,q)\right|\leq\frac{\delta_u+\epsilon_u}{\sigma_u(q)+\epsilon}, \tag{2}$$

where $\delta_u$ measures the estimation error of the user-specific historical anchor and $\epsilon_u$ is a conservative margin term. In expectation over users, this yields

$$\mathbb{E}_u\!\left[\left|\bar{A}_{\mathrm{PARPO}}(\tau\mid u,q)-\bar{A}_{\mathrm{pers}}^{*}(\tau\mid u,q)\right|\right]\leq\frac{\bar{\delta}+\bar{\epsilon}}{\sigma_{\min}+\epsilon}. \tag{3}$$

This shows that the practical benefit of PARPO in our setting comes from individual-specific baseline calibration: when a user’s historical anchor is a better approximation to that user’s true preference center than the pooled baseline, PARPO yields a tighter personalized advantage estimate for that user. A more general extension that also incorporates local grouping is deferred to Appendix [C](https://arxiv.org/html/2605.23382#A3).
Guided by this analysis, PARPO uses two explicit optimization tracks: a base track for generic task quality and a personalized track for user\-contingent preference improvement\.
##### Base advantage\.
For a sampled group of trajectories $\{\tau_i\}_{i=1}^{G}$ under the same prompt group, the base advantage follows standard within-group relative normalization:

$$A^{\mathrm{base}}_i=\frac{R_{\mathrm{base}}(\tau_i)-\bar{R}_{\mathrm{base}}^{(g)}}{\mathrm{Std}\bigl(\{R_{\mathrm{base}}(\tau_j)\}_{j\in g}\bigr)+\epsilon}. \tag{4}$$
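The within-group normalization of Eq. (4) can be illustrated with a minimal Python sketch. This is not the authors' released code: the function name `base_advantages`, the use of the population standard deviation, and the default `eps` are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def base_advantages(rewards, eps=1e-6):
    """Group-relative base advantages in the style of Eq. (4).

    `rewards` holds R_base for every trajectory sampled in one prompt group.
    Using the population standard deviation is an assumption of this sketch.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical base rewards for a group of four rollouts.
group_rewards = [0.2, 0.5, 0.8, 0.5]
print(base_advantages(group_rewards))
```

Trajectories above the group mean receive positive advantages and those below receive negative ones; when all rewards in a group are identical, every advantage is zero.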
##### Personalized advantage with user\-anchor calibration\.
To stabilize optimization across heterogeneous users, PARPO maintains a persistent user-specific anchor for personalized rewards. For user $u$, let $m_u^{(t)}$ and $v_u^{(t)}$ denote the running mean and variance of personalized rewards. Given the current batch personalized rewards for that user, the anchor is updated by exponential moving average:

$$m_u^{(t+1)}=\rho\, m_u^{(t)}+(1-\rho)\,\bar{R}_{\mathrm{pers}}^{u,t},\qquad v_u^{(t+1)}=\rho\, v_u^{(t)}+(1-\rho)\,\mathrm{Var}\bigl(R_{\mathrm{pers}}^{u,t}\bigr). \tag{5}$$

The personalized branch then uses the following user-aware baseline:

$$b_{u,g}=\max\!\left(\bar{R}_{\mathrm{pers}}^{(g)},\;m_u-\gamma_p\sqrt{v_u}\right), \tag{6}$$

which prevents the personalized baseline from drifting too far above the user’s historical personalized reward center and provides a stable individual-specific calibration signal under heterogeneous reward scales. The personalized advantage is defined as

$$A^{\mathrm{pers}}_i=\frac{R_{\mathrm{pers}}(\tau_i)-b_{u_i,g}}{\sqrt{v_{u_i}}+\epsilon}. \tag{7}$$
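A minimal sketch of the anchor update and personalized advantage in Eqs. (5)-(7) is given below. The class name `UserAnchor`, the initial anchor values, the default `rho` and `gamma_p`, and the use of population variance are assumptions for illustration; the source does not specify them, nor whether the anchor is updated before or after the advantage is computed.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import fmean, pvariance


@dataclass
class UserAnchor:
    """Running mean/variance of one user's personalized rewards (hypothetical container)."""
    mean: float = 0.0  # m_u, initial value is an assumption
    var: float = 1.0   # v_u, initial value is an assumption

    def update(self, batch_rewards, rho=0.9):
        # Exponential moving average update, Eq. (5).
        self.mean = rho * self.mean + (1 - rho) * fmean(batch_rewards)
        self.var = rho * self.var + (1 - rho) * pvariance(batch_rewards)


def personalized_advantages(rewards, anchor, gamma_p=1.0, eps=1e-6):
    """Personalized advantages for one group of a single user's rollouts."""
    group_mean = fmean(rewards)
    # Eq. (6): with max, the anchor term m_u - gamma_p * sqrt(v_u)
    # acts as a floor on the baseline.
    baseline = max(group_mean, anchor.mean - gamma_p * sqrt(anchor.var))
    # Eq. (7): scale by the user's running standard deviation.
    scale = sqrt(anchor.var) + eps
    return [(r - baseline) / scale for r in rewards]


anchor = UserAnchor(mean=0.8, var=0.04)
print(personalized_advantages([0.3, 0.4, 0.5], anchor))
```

In this example the anchor term (0.8 - 0.2 = 0.6) exceeds the group mean (0.4), so the baseline is set by the user's history and all three rollouts receive negative personalized advantages.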
##### Advantage fusion and policy update\.
The trajectory advantage is the weighted sum of the two:

$$A^{\mathrm{total}}_i=w_{\mathrm{base}}A^{\mathrm{base}}_i+w_{\mathrm{pers}}A^{\mathrm{pers}}_i. \tag{8}$$

This fused advantage is then broadcast to the token level and used in a standard PPO-style clipped policy objective:

$$\mathcal{L}_{\mathrm{PARPO}}=\frac{1}{B}\sum_i\max\!\left(-r_i(\theta)A^{\mathrm{total}}_i,\;-\mathrm{clip}\bigl(r_i(\theta),1-\eta,1+\eta\bigr)A^{\mathrm{total}}_i\right), \tag{9}$$

where $r_i(\theta)$ is the token-level policy ratio and $\eta$ the clipping coefficient. KL regularization, when enabled, is handled separately in the actor update loop rather than absorbed into the advantage.
In this way, PARPO explicitly separates objective task quality from user\-contingent preference improvement, while using user\-specific running statistics to mitigate cross\-user reward\-scale mismatch and stabilize personalized policy learning through individual\-specific baseline calibration\.
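The fusion in Eq. (8) and the clipped loss in Eq. (9) can be sketched as follows. For brevity this sketch uses one policy ratio per trajectory, whereas the text applies the ratio at the token level; the function name `parpo_loss` and the default weights and clipping coefficient are assumptions.

```python
def parpo_loss(ratios, adv_base, adv_pers, w_base=1.0, w_pers=1.0, eta=0.2):
    """Clipped surrogate loss over fused advantages (simplified to one ratio per sample)."""
    losses = []
    for r, a_base, a_pers in zip(ratios, adv_base, adv_pers):
        a_total = w_base * a_base + w_pers * a_pers           # Eq. (8)
        r_clipped = min(max(r, 1.0 - eta), 1.0 + eta)         # clip(r, 1 - eta, 1 + eta)
        losses.append(max(-r * a_total, -r_clipped * a_total))  # Eq. (9), per sample
    return sum(losses) / len(losses)


# A ratio above 1 + eta with positive advantage is clipped: the loss is -1.2, not -1.5.
print(parpo_loss([1.5], [1.0], [0.0]))
```

Taking the maximum of the two negated terms is equivalent to the usual PPO form that takes the minimum of the unclipped and clipped surrogate objectives.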
### 4.2 Personalized Preference Reward Model with Two-Stage Preference Disentanglement
As introduced in Section 4.1, PARPO requires a reliable personalized preference reward $R_{\mathrm{pers}}$. To provide this signal, we build a personalized preference reward model that provides user-conditioned neural preference signals for policy optimization through learned user representations and action compatibility scores.
##### Stage 1: Multi\-view profile representation learning\.
To alleviate cold-start issues, we construct a profile representation from multiple semantic views. Given profile views $\{x_u^{(k)}\}_{k=1}^{K}$ for user $u$, we encode each view as $\mathbf{h}_u^{(k)}=E(x_u^{(k)})$ and compute attention weights

$$\alpha_u^{(k)}=\frac{\exp\!\left(\mathbf{w}^{\top}\tanh\bigl(\mathbf{W}_{\mathrm{attn}}\mathbf{h}_u^{(k)}+\mathbf{b}_{\mathrm{attn}}\bigr)\right)}{\sum_{k'=1}^{K}\exp\!\left(\mathbf{w}^{\top}\tanh\bigl(\mathbf{W}_{\mathrm{attn}}\mathbf{h}_u^{(k')}+\mathbf{b}_{\mathrm{attn}}\bigr)\right)}. \tag{10}$$

The fused profile representation is

$$\mathbf{u}_{\mathrm{profile}}=\mathrm{LayerNorm}\!\left(\mathbf{W}_{\mathrm{out}}\sum_{k=1}^{K}\alpha_u^{(k)}\mathbf{h}_u^{(k)}\right). \tag{11}$$

To preserve view-specific information, we reconstruct each view embedding by $\hat{\mathbf{h}}_u^{(k)}=\mathbf{W}_{\mathrm{rec}}^{(k)}\mathbf{u}_{\mathrm{profile}}+\mathbf{b}_{\mathrm{rec}}^{(k)}$ and minimize

$$\mathcal{L}_{\mathrm{recon}}=\sum_u\sum_{k=1}^{K}\left\|\hat{\mathbf{h}}_u^{(k)}-\mathbf{h}_u^{(k)}\right\|_2^2. \tag{12}$$
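The attention weights of Eq. (10) and the weighted sum inside Eq. (11) can be sketched in plain Python. The helper names (`matvec`, `view_attention`, `fuse_views`) and the toy parameters are hypothetical; the output projection $\mathbf{W}_{\mathrm{out}}$, the LayerNorm, and the reconstruction loss are omitted to keep the sketch short.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def view_attention(views, W_attn, b_attn, w):
    """Softmax attention weights over K view embeddings, following Eq. (10)."""
    scores = []
    for h in views:
        hidden = [math.tanh(z + b) for z, b in zip(matvec(W_attn, h), b_attn)]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(views, alphas):
    """Attention-weighted sum of view embeddings (inner term of Eq. (11))."""
    dim = len(views[0])
    return [sum(a * h[d] for a, h in zip(alphas, views)) for d in range(dim)]


# Two hypothetical 2-d view embeddings with toy attention parameters.
views = [[1.0, 0.0], [0.0, 1.0]]
alphas = view_attention(views, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 0.5])
print(alphas, fuse_views(views, alphas))
```

The weights always sum to one, so the fused vector is a convex combination of the view embeddings before any projection or normalization is applied.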
##### Stage 2: Collaborative preference disentanglement\.
We further incorporate collaborative preference signals from the user–item interaction graph. Specifically, LightGCN propagates embeddings as $\mathbf{E}^{(\ell+1)}=\hat{\mathbf{A}}\mathbf{E}^{(\ell)}$ and obtains the final collaborative representation by layer-wise averaging: $\mathbf{E}_{\mathrm{final}}=\frac{1}{L+1}\sum_{\ell=0}^{L}\mathbf{E}^{(\ell)}$. Given the collaborative user representation $\mathbf{u}_{\mathrm{cf}}$, we learn two branches to capture interest and conformity signals: $\mathbf{u}_{\mathrm{int}}=\mathrm{InterestEncoder}(\mathbf{u}_{\mathrm{cf}})$ and $\mathbf{u}_{\mathrm{conf}}=\mathrm{ConformityEncoder}(\mathbf{u}_{\mathrm{cf}})$. The two normalized branch embeddings are then fused as $\mathbf{u}_{\mathrm{fused}}=s\left(\alpha_{\mathrm{int}}\hat{\mathbf{u}}_{\mathrm{int}}+\alpha_{\mathrm{conf}}\hat{\mathbf{u}}_{\mathrm{conf}}\right)$, where $\hat{\mathbf{u}}_{\mathrm{int}}$ and $\hat{\mathbf{u}}_{\mathrm{conf}}$ denote the normalized interest and conformity embeddings, respectively.
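The LightGCN-style propagation and layer-wise averaging described above can be sketched with dense lists. The function name `lightgcn_propagate` is hypothetical, and the normalized adjacency $\hat{\mathbf{A}}$ is assumed to be precomputed and passed in as a dense matrix.

```python
def lightgcn_propagate(A_hat, E0, num_layers):
    """Propagate E^(l+1) = A_hat @ E^(l) and average layers 0..L."""
    layers = [E0]
    n, dim = len(E0), len(E0[0])
    for _ in range(num_layers):
        prev = layers[-1]
        nxt = [[sum(A_hat[i][k] * prev[k][d] for k in range(n)) for d in range(dim)]
               for i in range(n)]
        layers.append(nxt)
    # Layer-wise mean over L + 1 embedding tables.
    return [[sum(layer[i][d] for layer in layers) / len(layers) for d in range(dim)]
            for i in range(n)]


# Toy graph with two nodes joined by one edge (adjacency assumed already normalized).
A_hat = [[0.0, 1.0], [1.0, 0.0]]
E0 = [[1.0], [0.0]]
print(lightgcn_propagate(A_hat, E0, num_layers=1))  # each node averages itself and its neighbor
```

In this toy case one propagation step swaps the two embeddings, so the layer average places both nodes at 0.5.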
This branch structure alone does *not* guarantee true separation of intrinsic preference from popularity-, conformity-, or group-level bias. We therefore treat disentanglement operationally and impose it through branch-specific objectives. The interest branch upweights less popular items:

$$\mathcal{L}_{\mathrm{int}}=\frac{1}{B}\sum_{(u,i^{+})}\left[-\log\!\left(\omega_{i^{+}}^{\mathrm{int}}+\epsilon\right)-\frac{\mathbf{u}_{\mathrm{int}}^{\top}\mathbf{i}_{\mathrm{cf}}^{+}}{\tau}+\log\sum_{j}\exp\!\left(\frac{\mathbf{u}_{\mathrm{int}}^{\top}\mathbf{i}_{\mathrm{cf}}^{(j)}}{\tau}\right)\right],\qquad\omega_i^{\mathrm{int}}=\exp(1-\tilde{p}_i). \tag{13}$$

Here $\tilde{p}_i\in[0,1]$ denotes normalized popularity. The conformity branch uses the same objective with opposite weighting, i.e., $\omega_i^{\mathrm{conf}}=\exp(\tilde{p}_i)$. We further regularize both branches by

$$\mathcal{L}_{\mathrm{orth}}=\frac{1}{B}\sum_u\left(\hat{\mathbf{u}}_{\mathrm{int}}^{\top}\hat{\mathbf{u}}_{\mathrm{conf}}\right)^{2}, \tag{14}$$

which discourages overlap but is not, by itself, evidence of full causal disentanglement.
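The popularity weights, the per-pair term inside Eq. (13), and the orthogonality regularizer of Eq. (14) can be sketched as below. All function names are hypothetical, and two details are assumptions of this sketch rather than statements from the text: the candidate set over $j$ is passed in explicitly, and the positive item is expected to be included in it by the caller.

```python
import math


def interest_weight(p_norm):
    """omega_int = exp(1 - p), larger for less popular items."""
    return math.exp(1.0 - p_norm)


def conformity_weight(p_norm):
    """omega_conf = exp(p), larger for more popular items."""
    return math.exp(p_norm)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def branch_term(u, pos_item, candidates, weight, tau=0.1, eps=1e-8):
    """Bracketed per-pair term of Eq. (13) for one (user, positive item) pair."""
    logits = [dot(u, c) / tau for c in candidates]
    top = max(logits)  # stable log-sum-exp
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return -math.log(weight + eps) - dot(u, pos_item) / tau + log_sum_exp


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def orth_loss(int_embs, conf_embs):
    """Mean squared cosine similarity between the two branches, Eq. (14)."""
    sims = [dot(normalize(a), normalize(b)) ** 2 for a, b in zip(int_embs, conf_embs)]
    return sum(sims) / len(sims)


print(orth_loss([[1.0, 0.0]], [[0.0, 1.0]]))  # orthogonal branches give zero penalty
```

Batch averaging of `branch_term` over all sampled pairs then corresponds to the outer sum and the $1/B$ factor in Eq. (13).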
For an action text $a$, we encode it as $\mathbf{a}_{\mathrm{proj}}=\mathrm{ActionEncoder}(E(a))$ and compute the personalized score by $r_{\mathrm{fused}}(u,a)=\mathbf{u}_{\mathrm{fused}}^{\top}\mathbf{a}_{\mathrm{proj}}$. The resulting neural preference score combines collaborative structure, branch-specific supervision, and regularization, and is later calibrated and integrated with LLM-based evaluation in the environment-level reward pipeline.
### 4.3 Preference-Aligned Skill Evolution Graph Memory
Beyond the reward signal, personalized policy learning also requires a structured behavioral context during rollout. To this end, we maintain a heterogeneous graph memory $\mathcal{G}=(\mathcal{V},\mathcal{E})$ over users, skills, tools, scenarios, and trajectories, with typed edges encoding ownership, applicability, complementarity, conflict, execution history, and scenario triggers.
Rather than using a flat retrieval index, PSGM organizes skills with both node embeddings and graph structure\. Skill and user semantic embeddings are stored on nodes, while graph connectivity provides signals such as community membership, complementary neighbors, and conflicting neighbors\. To capture multi\-granularity structure, we perform hierarchical community detection with Leiden when available and Louvain otherwise\. At each level, communities are obtained by maximizing modularity:
$$Q=\frac{1}{2m}\sum_{i,j}\left(A_{ij}-\frac{k_i k_j}{2m}\right)\delta(c_i,c_j). \tag{15}$$
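As a small worked illustration of Eq. (15), the following sketch computes modularity directly from a dense adjacency matrix and a community assignment. It only evaluates $Q$ for a given partition and does not implement Leiden or Louvain; the function name `modularity` and the toy graph are assumptions.

```python
def modularity(adj, communities):
    """Modularity Q of Eq. (15) for an undirected graph given as a dense adjacency matrix."""
    n = len(adj)
    degrees = [sum(row) for row in adj]   # k_i
    two_m = sum(degrees)                  # 2m = total degree
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:  # delta(c_i, c_j)
                q += adj[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m


# Toy graph: two disjoint edges, 0-1 and 2-3.
adj = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
print(modularity(adj, [0, 0, 1, 1]))  # 0.5
print(modularity(adj, [0, 0, 0, 0]))  # 0.0
```

Splitting the toy graph along its two components yields a higher $Q$ than placing all nodes in one community, which is the behavior that modularity-maximizing community detection exploits.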
At inference, retrieval has two stages. We first retrieve semantic candidates from the skill set:

$$\mathcal{S}_{\mathrm{init}}(q)=\operatorname{TopM}_{s\in\mathcal{S}}\operatorname{sim}(q,s). \tag{16}$$

We then expand the candidates with a 2-hop traversal: from each retrieved skill to its owner user, and then to sibling skills of that user, injecting personalized local structure into the pool.
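The two retrieval stages can be sketched as follows. The dictionary-based graph representation, the use of cosine similarity for $\operatorname{sim}$, the function names, and the example skills are all assumptions for illustration; the source does not specify the similarity function or storage layout.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_candidates(query_vec, skill_vecs, owner_of, skills_of, top_m=2):
    """Stage 1: top-M semantic candidates. Stage 2: 2-hop expansion skill -> owner -> siblings."""
    ranked = sorted(skill_vecs, key=lambda s: cosine(query_vec, skill_vecs[s]), reverse=True)
    initial = ranked[:top_m]
    pool = list(initial)
    for skill in initial:
        for sibling in skills_of[owner_of[skill]]:
            if sibling not in pool:
                pool.append(sibling)
    return pool


# Hypothetical skills, embeddings, and ownership edges.
skill_vecs = {"book_museum": [1.0, 0.0], "find_anime_shop": [0.0, 1.0], "plan_transit": [0.6, 0.8]}
owner_of = {"book_museum": "user_a", "find_anime_shop": "user_b", "plan_transit": "user_a"}
skills_of = {"user_a": ["book_museum", "plan_transit"], "user_b": ["find_anime_shop"]}

print(retrieve_candidates([1.0, 0.1], skill_vecs, owner_of, skills_of, top_m=1))
# ['book_museum', 'plan_transit']
```

Here the single semantic hit pulls in a sibling skill owned by the same user, even though that sibling would not have been selected by semantic similarity alone under `top_m=1`.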
Each candidate is ranked by the following graph-aware score:

$$\operatorname{score}(q,s,p_u)=f_{\mathrm{sem}}(q,s)\,\bigl(\alpha+\beta f_{\mathrm{user}}(p_u,s)\bigr)\,\bigl(1+\gamma f_{\mathrm{comm}}(p_u,s)\bigr)\,f_{\mathrm{comp}}(s)\,\bigl(1-\delta f_{\mathrm{conf}}(s)\bigr). \tag{17}$$

Here, $f_{\mathrm{sem}}$ is query–skill similarity, $f_{\mathrm{user}}$ user–skill similarity, $f_{\mathrm{comm}}$ community relevance, $f_{\mathrm{comp}}$ complement boost, and $f_{\mathrm{conf}}$ a conflict penalty; $\alpha,\beta,\gamma,\delta$ are fixed graph-level hyperparameters.
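The multiplicative score of Eq. (17) is simple enough to write out directly. The function name `graph_score` and the default hyperparameter values are placeholders chosen for this sketch; the source states only that $\alpha,\beta,\gamma,\delta$ are fixed.

```python
def graph_score(f_sem, f_user, f_comm, f_comp, f_conf,
                alpha=0.5, beta=0.5, gamma=0.2, delta=0.5):
    """Graph-aware ranking score of Eq. (17); hyperparameter defaults are placeholders."""
    return (f_sem
            * (alpha + beta * f_user)
            * (1.0 + gamma * f_comm)
            * f_comp
            * (1.0 - delta * f_conf))


# Same semantic match, but the second skill conflicts with already selected skills.
print(graph_score(f_sem=0.8, f_user=1.0, f_comm=0.0, f_comp=1.0, f_conf=0.0))  # 0.8
print(graph_score(f_sem=0.8, f_user=1.0, f_comm=0.0, f_comp=1.0, f_conf=1.0))  # 0.4
```

Because the factors multiply, a skill with zero semantic similarity is never retrieved regardless of its graph signals, while the user, community, complement, and conflict terms rescale semantically relevant candidates.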
The top\-ranked skills are inserted into the rollout context as structured personalized memory\. Thus, PSGM improves rollout\-time personalization by exposing preference\-relevant and graph\-consistent skills before each decision step, without changing policy parameters\.
## 5 Experiments
### 5.1 Experimental Setup
#### 5.1.1 Baselines
We compare against prompting, memory, RL, and memory-RL baselines, including ReAct, Mem0, GRPO, DAPO, GSPO, GiGPO, MemRL, and SkillRL. We also report GPT-4o and Claude Sonnet 4 as closed-source references. All open-source methods are evaluated under the same model scales, task settings, and tool interfaces whenever applicable. We further include a GDPO-style variant in the ablation study, which removes PARPO’s user-anchor calibration while keeping the same personalized reward and memory components. For baselines in personalized environments, we additionally provide the native personalization reward to ensure fair comparison.
#### 5.1.2 Benchmarks, Evaluation, and Training Details
We evaluate on ETAPP and SJAgent. ETAPP is a public benchmark for personal assistant agents, covering user behaviors in daily-life scenarios; we further construct a more challenging split, ETAPP-Hard. SJAgent is a realistic environment for merchant decision-making and recommendation, built from merchant data on a major Chinese e-commerce platform. Environment details are provided in Appendix [D.2](https://arxiv.org/html/2605.23382#A4.SS2), and the construction pipeline of ETAPP-Hard is given in Appendix [D.3](https://arxiv.org/html/2605.23382#A4.SS3).
For ETAPP, we report Judge, Personal, Proactive, and Procedure. For SJAgent, we report Reward, Data Authenticity, Business Logic, Merchant Profile Match, Task Completion, and Market Analysis Depth. Metric definitions, official evaluation prompts, and scoring criteria are provided in Appendix [D.4](https://arxiv.org/html/2605.23382#A4.SS4), and training hyperparameters are listed in Appendix [D.5](https://arxiv.org/html/2605.23382#A4.SS5).
### 5.2 Main Results on Personalized Decision-Making Benchmarks
Table 1: Main results on ETAPP, ETAPP-Hard, and SJAgent. GPT-4o and Claude S4 are closed-source references; all other columns are open-source models. Ours significantly outperforms the strongest baseline, SkillRL, under the paired t-test (p < 0.005).

Scale: Qwen3-4B models

| Benchmark | Metric | GPT-4o | Claude S4 | ReAct | Mem0 | GRPO | DAPO | GSPO | GiGPO | MemRL | SkillRL | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ETAPP-Original | Personal | 2.5882 | 3.9705 | 2.3765 | 2.5294 | 3.5000 | 3.7406 | 3.8094 | 3.9375 | 3.1832 | 4.0312 | 4.2344 |
| ETAPP-Original | Proactive | 2.1529 | 3.1871 | 1.5176 | 1.7882 | 3.1250 | 3.2125 | 3.4062 | 3.4156 | 2.3105 | 3.3662 | 3.4844 |
| ETAPP-Original | Procedure | 2.6353 | 3.5365 | 3.4941 | 3.5882 | 3.4688 | 3.3125 | 3.2250 | 3.3906 | 2.9031 | 3.7188 | 3.8438 |
| ETAPP-Original | Judge | 0.4918 | 0.7129 | 0.4925 | 0.5270 | 0.6729 | 0.6844 | 0.6960 | 0.7162 | 0.5598 | 0.7411 | 0.7708 |
| ETAPP-Hard | Personal | 2.1560 | 3.6081 | 1.6156 | 2.6823 | 3.4875 | 3.6250 | 3.7187 | 3.7188 | 3.2875 | 3.8656 | 4.0469 |
| ETAPP-Hard | Proactive | 2.3449 | 3.5550 | 0.8688 | 1.8470 | 2.8750 | 2.9719 | 3.2938 | 3.3375 | 2.6344 | 3.2406 | 3.3313 |
| ETAPP-Hard | Procedure | 2.1750 | 3.0488 | 1.9156 | 3.8588 | 3.1031 | 3.4156 | 3.0500 | 3.0531 | 2.8281 | 3.4094 | 3.5344 |
| ETAPP-Hard | Judge | 0.4451 | 0.6808 | 0.2933 | 0.4318 | 0.6310 | 0.6675 | 0.6708 | 0.6740 | 0.5833 | 0.7010 | 0.7275 |
| SJAgent | Data Auth. | 1.0000 | 1.1350 | 1.8540 | 2.1450 | 2.6142 | 2.7480 | 2.9412 | 2.9550 | 2.2150 | 3.0523 | 3.3790 |
| SJAgent | Business Logic | 2.9170 | 2.8290 | 1.6215 | 1.9520 | 2.3085 | 2.6515 | 2.7210 | 2.7845 | 2.0450 | 2.8703 | 3.0830 |
| SJAgent | Profile Match | 3.8860 | 3.7920 | 1.9840 | 2.2140 | 2.5241 | 2.8845 | 2.8640 | 2.8941 | 2.3120 | 2.8543 | 2.8960 |
| SJAgent | Task Compl. | 3.9030 | 3.2610 | 2.1450 | 2.4580 | 2.8120 | 3.1415 | 3.1613 | 3.1842 | 2.5480 | 3.2442 | 3.4150 |
| SJAgent | Market Depth | 2.8060 | 2.7080 | 2.0155 | 2.3110 | 2.9292 | 2.9025 | 3.2345 | 3.2442 | 2.4200 | 3.2269 | 3.4450 |
| SJAgent | Reward ↑ | 0.7256 | 0.6862 | 0.4810 | 0.5540 | 0.6594 | 0.7164 | 0.7461 | 0.7531 | 0.5770 | 0.7625 | 0.8109 |

Scale: Qwen3-8B models

| Benchmark | Metric | GPT-4o | Claude S4 | ReAct | Mem0 | GRPO | DAPO | GSPO | GiGPO | MemRL | SkillRL | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ETAPP-Original | Personal | 2.5882 | 3.9705 | 2.1529 | 2.6823 | 3.9531 | 4.0000 | 4.0469 | 3.9375 | 3.1406 | 4.0938 | 4.2344 |
| ETAPP-Original | Proactive | 2.1529 | 3.1871 | 1.4706 | 1.8470 | 3.5000 | 3.5094 | 3.4531 | 3.5625 | 2.7656 | 3.7656 | 4.0938 |
| ETAPP-Original | Procedure | 2.6353 | 3.5365 | 3.4000 | 3.8588 | 3.7344 | 3.7469 | 3.9688 | 3.9375 | 2.9062 | 3.8281 | 4.1719 |
| ETAPP-Original | Judge | 0.4918 | 0.7129 | 0.4682 | 0.5592 | 0.7458 | 0.7504 | 0.7646 | 0.7625 | 0.5875 | 0.7792 | 0.8333 |
| ETAPP-Hard | Personal | 2.1560 | 3.6081 | 1.7187 | 2.3312 | 3.8193 | 3.9219 | 3.9625 | 3.9531 | 2.9969 | 4.1188 | 4.3187 |
| ETAPP-Hard | Proactive | 2.3449 | 3.5550 | 0.9000 | 1.4656 | 3.2854 | 3.2938 | 3.3031 | 3.2781 | 2.2860 | 3.4094 | 3.7844 |
| ETAPP-Hard | Procedure | 2.1750 | 3.0488 | 1.8813 | 2.6812 | 3.3721 | 3.4625 | 3.5156 | 3.5688 | 2.5578 | 3.4625 | 3.8719 |
| ETAPP-Hard | Judge | 0.4451 | 0.6808 | 0.3000 | 0.4318 | 0.6995 | 0.7119 | 0.7188 | 0.7200 | 0.5227 | 0.7327 | 0.7983 |
| SJAgent | Data Auth. | 1.0000 | 1.1350 | 1.9420 | 2.3120 | 2.9105 | 3.0540 | 3.0410 | 3.1250 | 2.4580 | 3.1016 | 3.4120 |
| SJAgent | Business Logic | 2.9170 | 2.8290 | 1.7850 | 2.0450 | 2.6450 | 2.9120 | 2.6845 | 2.9410 | 2.1850 | 2.9884 | 3.1080 |
| SJAgent | Profile Match | 3.8860 | 3.7920 | 2.0120 | 2.3580 | 2.8142 | 3.1050 | 2.7425 | 3.1020 | 2.5140 | 3.0243 | 3.1415 |
| SJAgent | Task Compl. | 3.9030 | 3.2610 | 2.3140 | 2.6140 | 3.0453 | 3.2840 | 3.1040 | 3.3150 | 2.7840 | 3.3687 | 3.4240 |
| SJAgent | Market Depth | 2.8060 | 2.7080 | 2.2470 | 2.4710 | 3.0850 | 3.2450 | 3.3480 | 3.3570 | 2.6590 | 3.3670 | 3.4545 |
| SJAgent | Reward ↑ | 0.7256 | 0.6862 | 0.5150 | 0.5900 | 0.7250 | 0.7800 | 0.7461 | 0.7920 | 0.6300 | 0.7925 | 0.8270 |

Table [1](https://arxiv.org/html/2605.23382#S5.T1) reports results on ETAPP, ETAPP-Hard, and SJAgent. Our method achieves the best overall performance under both 4B and 8B settings and outperforms all open-source baselines on most metrics. On ETAPP and ETAPP-Hard, it obtains the highest Judge scores, with larger gains on ETAPP-Hard, indicating stronger robustness in challenging personalized scenarios. It also performs best on Personal, Proactive, and Procedure in every setting except ETAPP-Hard at the 4B scale, where GiGPO and Claude S4 score higher on Proactive and Mem0 scores higher on Procedure, showing better preference alignment and decision quality overall.
On SJAgent, our method again achieves the highest Reward and leads all open-source baselines on Data Authenticity, Business Logic, Task Completion, and Market Analysis Depth, demonstrating strong cross-domain generalization. The closed-source references remain ahead on Merchant Profile Match, and GPT-4o also on Task Completion.
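For readers unfamiliar with the significance test named in the Table 1 caption, the following is a minimal sketch of how a paired t statistic is computed. It assumes the pairing unit is a per-task score for each method; the paper's actual pairing unit, sample size, and test implementation are not stated in this section. The scores in the example are made-up placeholders, not results from the paper.

```python
import math


def paired_t_statistic(ours, baseline):
    """Return (t, degrees_of_freedom) for paired samples.

    t = mean(d) / (std(d) / sqrt(n)), where d are the per-pair differences
    and std uses the sample (n - 1) denominator.
    """
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(var / n), n - 1


# Placeholder per-task scores for two methods (illustrative only).
ours_scores = [0.81, 0.77, 0.90, 0.68, 0.85]
baseline_scores = [0.78, 0.75, 0.84, 0.69, 0.80]

t_value, dof = paired_t_statistic(ours_scores, baseline_scores)
print(f"t = {t_value:.3f} with {dof} degrees of freedom")
```

The resulting statistic is compared against a Student t distribution with the reported degrees of freedom to obtain a p-value. Pairing matters because it removes per-task difficulty from the variance estimate, which is why it suits comparisons in which both methods are evaluated on the same tasks.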
### 5.3 Ablation Study
Table 2: Ablation results on ETAPP. For each variant, we report the Personal, Proactive, Procedure, and Judge scores. Δ Judge denotes the absolute difference in judge\_best compared with the full model.

| Group | Variant | Personal ↑ | Proactive ↑ | Procedure ↑ | Judge ↑ | Δ Judge |
|---|---|---|---|---|---|---|
| – | Full (baseline) | 4.2344 | 3.4844 | 3.8438 | 0.7708 | 0.0000 |
| A | A1 no\_memory | 3.6562 | 3.1938 | 3.6594 | 0.7006 | -0.0702 |
| A | A2 skill\_original | 4.1562 | 3.4188 | 3.7625 | 0.7558 | -0.0150 |
| B | B1 no\_community | 4.1625 | 3.4031 | 3.7906 | 0.7571 | -0.0137 |
| B | B2 louvain | 4.0625 | 3.6719 | 3.7656 | 0.7667 | -0.0041 |
| B | B3 no\_dynupd | 3.9688 | 3.3438 | 3.7969 | 0.7406 | -0.0302 |
| B | B4 bm25 | 4.1094 | 3.4656 | 3.8125 | 0.7592 | -0.0116 |
| C | C1 grpo\_only | 3.9250 | 3.3625 | 3.5500 | 0.7225 | -0.0483 |
| C | C2 no\_base | 3.7656 | 3.3594 | 3.6719 | 0.7198 | -0.0510 |
| C | C3 no\_pers | 3.7500 | 3.3281 | 3.7344 | 0.7208 | -0.0500 |
| C | C4 GDPO-style (w/o anchor) | 3.9033 | 3.3694 | 3.5971 | 0.7247 | -0.0461 |
| C | C5 no\_rm\_model | 4.0000 | 3.5312 | 3.8750 | 0.7604 | -0.0104 |
| C | C6 no\_interest | 3.8464 | 3.2986 | 3.6372 | 0.7188 | -0.0588 |
| C | C7 no\_conformity | 3.7215 | 3.4108 | 3.6884 | 0.7214 | -0.0494 |

Table [2](https://arxiv.org/html/2605.23382#S5.T2) reports ablation results on ETAPP. A1 removes skill memory, and A2 replaces PSGM with the original flat skill pipeline. Within PSGM, B1 removes hierarchical communities, B2 replaces Leiden with Louvain, B3 disables dynamic graph updates, and B4 replaces dense retrieval with BM25. Within reward learning, C1 replaces PARPO with GRPO, C2/C3 remove the base/personalized reward branch, C4 adopts a GDPO-style variant without user-anchor calibration, C5 disables the personalized neural reward model, and C6/C7 remove the interest/conformity branch.
All ablations reduce the Judge score, although a few variants score slightly higher on individual dimensions (B2 and C5 on Proactive, C5 on Procedure). Removing skill memory causes the largest Judge drop (0.7708 → 0.7006), showing that memory is central to personalized decision-making. Weakening PSGM also consistently lowers the Judge score, validating structured retrieval and dynamic graph updates. In reward learning, all variants underperform PARPO; notably, the GDPO-style variant remains worse than PARPO, showing that reward decoupling alone is insufficient without user-anchor calibration.
### 5.4 Rollout Evaluation with Human and LLM Judges on ETAPP
To evaluate personalized behavior on user-conditioned tasks, we conduct a blinded rollout study on 20 ETAPP tasks with rich user-specific contexts. For each task, PARPO, SkillRL, Claude 3.5 Sonnet, and GPT-4o independently generate responses, which are then anonymized and scored by 15 human experts and 4 LLM judges (GPT-5.4, GLM-5.1, Kimi-K2.6, and Gemini-3.1-Pro) along Question Relevance, User Relevance, and Readability.
Figures [3(a)](https://arxiv.org/html/2605.23382#S5.F3.sf1) and [3(b)](https://arxiv.org/html/2605.23382#S5.F3.sf2) show that PARPO achieves the highest average score under both human and LLM evaluation, and all five judge groups rank it above all baselines. The largest gain appears on User Relevance, indicating stronger personalization rather than merely better fluency. Figure [3(c)](https://arxiv.org/html/2605.23382#S5.F3.sf3) further shows that the gains are consistent across individual human experts, indicating robust preference rather than isolated wins. A representative case illustrating this behavior is shown in Appendix [G](https://arxiv.org/html/2605.23382#A7).
\(a\)Human scores by dimension\.
\(b\)Overall scores from judges\.
\(c\)Expert agreement and consistency\.
Figure 3:Blinded evaluation on 20 personalized ETAPP tasks\. Left: human scores by dimension\. Middle: overall scores from human and LLM judges\. Right: expert\-level agreement, including the expert\-by\-model heatmap and PARPO’s margin over the strongest baseline for each human expert\.
### 5.5 Training Dynamics and Skill Evolution Analysis
\(a\)RL training dynamics on ETAPP\.
\(b\)Personalized reward decomposition on ETAPP\.
\(c\)Final EMA scores across reward dimensions on ETAPP\.
\(d\)Hierarchical growth of general skills during training\.
Figure 4: Training dynamics and skill evolution analysis of Qwen3-8B on ETAPP. Top: RL training dynamics. Bottom-left: reward decomposition during training. Bottom-middle: final EMA scores by reward dimension. Bottom-right: hierarchical growth of general skills during training. All experiments exclude the personalized reward model and Skill Graph.

All experiments in this section are conducted without the personalized reward model or Skill Graph to isolate the effect of RL optimization.
#### 5.5.1 Comparison of RL Training Strategies
We compare GRPO, GSPO, GiGPO, and PARPO on ETAPP. Figure [4(a)](https://arxiv.org/html/2605.23382#S5.F4.sf1) shows that PARPO yields the strongest training dynamics, with higher reward, better training/validation success rates, stronger tool-call success, and stable KL divergence. Similar trends hold on ETAPP-Hard (Appendix [E.1](https://arxiv.org/html/2605.23382#A5.SS1)).
#### 5.5.2 Effect of Personalized Reward Optimization and Skill Evolution
We further analyze the four ETAPP reward dimensions: personalization, judge, proactivity, and procedure. Figures [4(b)](https://arxiv.org/html/2605.23382#S5.F4.sf2) and [4(c)](https://arxiv.org/html/2605.23382#S5.F4.sf3) show that PARPO improves all dimensions during training and achieves the best EMA scores, especially on personalization, indicating that explicit personalized reward optimization provides a stronger learning signal. Moreover, Figure [4(d)](https://arxiv.org/html/2605.23382#S5.F4.sf4) shows that the cumulative number of general skills grows steadily with training steps, while the composition across application scenarios remains broadly stable. This suggests that training continuously consolidates reusable cross-domain skills rather than improving only a narrow subset of scenarios. ETAPP-Hard results are deferred to Appendix [E.2](https://arxiv.org/html/2605.23382#A5.SS2).
## 6 Conclusion and Limitations
This work studied personalized Agentic RL for user\-conditioned tasks, where optimal behavior depends on user preferences rather than a single correct answer\. We proposed a unified framework centered on PARPO, which decouples general\-quality and personalized rewards and stabilizes training with user\-specific anchors\. Experiments on ETAPP, ETAPP\-Hard, and SJAgent showed consistent and significant gains in personalization while preserving overall task quality, highlighting the value of training\-time personalized optimization for user\-centric agents\. A key limitation of this work is the scale of the human evaluation: due to annotation cost and limited expert availability, we reported judgments from only 15 experts on 20 randomly sampled examples\. Although these results broadly agree with the LLM evaluations, future work should expand the human study to a larger and more diverse pool of annotators and evaluation instances to better assess robustness, generalizability, and real\-world validity\.
## References
- H. Cai, Y. Li, W. Wang, F. Zhu, X. Shen, W. Li, and T. Chua (2025). Large language models empowered personalized web agents. In The Web Conference 2025. [Link](https://openreview.net/forum?id=kAzqfqsCC5)
- C. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu, and Z. Liu (2024). ChatEval: towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=FQepisCUWu)
- P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav (2025). Mem0: building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413.
- J. Feng, S. Huang, X. Qu, G. Zhang, Y. Qin, B. Zhong, C. Jiang, J. Chi, and W. Zhong (2026a). ReTool: reinforcement learning for strategic tool use in LLMs. In The Fourteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=tRk1nofSmz)
- L. Feng, Z. Xue, T. Liu, and B. An (2026b). Group-in-group policy optimization for LLM agent training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=QXEhBMNrCW)
- J. Gehring, K. Zheng, J. Copet, V. Mella, T. Cohen, and G. Synnaeve (2025). RLEF: grounding code LLMs in execution feedback with reinforcement learning. In Forty-second International Conference on Machine Learning. [Link](https://openreview.net/forum?id=PzSG5nKe1q)
- A. Gunjal, A. Wang, E. Lau, V. Nath, B. Liu, and S. Hendryx (2025). Rubrics as rewards: reinforcement learning beyond verifiable domains. CoRR abs/2507.17746. [Link](https://doi.org/10.48550/arXiv.2507.17746)
- Y. Hao, P. Cao, Z. Jin, H. Liao, Y. Chen, K. Liu, and J. Zhao (2025). Evaluating personalized tool-augmented llms from the perspectives of personalization and proactivity. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 21897–21935.
- B. Jin, H. Zeng, Z. Yue, J. Yoon, S. O. Arik, D. Wang, H. Zamani, and J. Han (2025). Search-r1: training LLMs to reason and leverage search engines with reinforcement learning. In Second Conference on Language Modeling. [Link](https://openreview.net/forum?id=Rwhi91ideu)
- H. Lai, X. Liu, I. L. Iong, S. Yao, Y. Chen, P. Shen, H. Yu, H. Zhang, X. Zhang, Y. Dong, et al. (2024). Autowebglm: a large language model-based web navigating agent. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 5295–5306.
- H. Le, Y. Wang, A. D. Gotmare, S. Savarese, and S. C. H. Hoi (2022). Coderl: mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems 35, pp. 21314–21328.
- K. Liang, J. Kruk, S. Qian, X. Yang, S. Bi, Y. Yao, S. Nie, M. Zhang, L. Liu, J. F. Fisac, et al. (2026). Learning personalized agents from human feedback. arXiv preprint arXiv:2602.16173.
- J. Liu, Y. Su, P. Xia, S. Han, Z. Zheng, C. Xie, M. Ding, and H. Yao (2026a). SimpleMem: efficient lifelong memory for llm agents. arXiv preprint arXiv:2601.02553.
- S. Liu, X. Dong, X. Lu, S. Diao, P. Belcak, M. Liu, M. Chen, H. Yin, Y. F. Wang, K. Cheng, et al. (2026b). Gdpo: group reward-decoupled normalization policy optimization for multi-reward rl optimization. arXiv preprint arXiv:2601.05242.
- T. Liu, R. Xu, T. Yu, I. Hong, C. Yang, T. Zhao, and H. Wang (2025). Openrubrics: towards scalable synthetic rubric generation for reward modeling and llm alignment. arXiv preprint arXiv:2510.07743.
- Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu (2023). G-eval: NLG evaluation using gpt-4 with better human alignment. In The 2023 Conference on Empirical Methods in Natural Language Processing. [Link](https://openreview.net/forum?id=puMfaHb1hY)
- Y. Lyu, G. Chen, R. Shao, W. Guan, and L. Nie (2026). PersonalAlign: hierarchical implicit intent alignment for personalized gui agent with long-term user-centric records. arXiv preprint arXiv:2601.09636.
- R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. (2021). Webgpt: browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
- Y. Ning, R. Liu, J. Wang, K. Chen, W. Li, J. Fang, K. Zheng, N. Tan, and H. Liu (2025). Deeptravel: an end-to-end agentic reinforcement learning framework for autonomous travel planning agents. arXiv preprint arXiv:2509.21842.
- W. Ou, Y. Zheng, S. Sun, W. Zhang, B. Dong, H. Zhu, R. Huang, G. Yu, P. Yan, and Y. Qiao (2026). Serl: self-examining reinforcement learning on open-domain. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40, pp. 32619–32627.
- J. Peng, Y. Liu, R. Zhou, C. Fleming, Z. Wang, A. Garcia, and M. Hong (2026). HiPER: hierarchical reinforcement learning with explicit credit assignment for large language model agents. arXiv preprint arXiv:2602.16165.
- Z. Qi, X. Liu, I. L. Iong, H. Lai, X. Sun, J. Sun, X. Yang, Y. Yang, S. Yao, W. Xu, J. Tang, and Y. Dong (2025). WebRL: training LLM web agents via self-evolving online curriculum reinforcement learning. In The Thirteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=oVKEAFjEqv)
- C. Qian, E. C. Acikgoz, Q. He, H. Wang, X. Chen, D. Hakkani-Tür, G. Tur, and H. Ji (2026). ToolRL: reward is all tool learning needs. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=eOLdGbXT6t)
- S. Schmidgall, R. Ziaei, C. Harris, E. Reis, J. Jopling, and M. Moor (2024). Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments. arXiv preprint arXiv:2405.07960.
- Z. Shao, P. Wang, Q. Zhu, R. Yang, J. Xu, M. Wu, N. Li, Z. Chen, K. Zhang, Z. Hu, et al. (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- M. D. Skarlinski, S. Cox, J. M. Laurent, J. D. Braza, M. Hinks, M. J. Hammerling, M. Ponnapati, S. G. Rodriques, and A. D. White (2024). Language agents achieve superhuman synthesis of scientific knowledge. arXiv preprint arXiv:2409.13740.
- M. Su, Y. Guo, Z. Hou, L. Bai, Z. Li, Y. Zhang, G. Yin, W. Lin, X. Jin, J. Guo, et al. (2026). Beyond dialogue time: temporal semantic memory for personalized llm agents. arXiv preprint arXiv:2601.07468.
- H. Sun, Z. Zhang, and S. Zeng (2025). Preference-aware memory update for long-term llm agents. arXiv preprint arXiv:2510.09720.
- W. Tan, X. Qu, M. Tu, M. Ge, A. T. Liu, P. Koehn, and L. Lu (2025). Process-supervised reinforcement learning for interactive multimodal tool-use agents. arXiv preprint arXiv:2509.14480.
- Z. Tang, X. Yin, W. Chen, Z. Chen, Y. Zheng, W. Ye, K. Wang, and L. Lin (2026). AlphaAgentEvo: evolution-oriented alpha mining via self-evolving agentic reinforcement learning. In The Fourteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=lNmZrawUMu)
- P. Wang, M. Tian, J. Li, Y. Liang, Y. Wang, Q. Chen, T. Wang, Z. Lu, J. Ma, Y. E. Jiang, et al. (2025). O-mem: omni memory system for personalized, long horizon, self-evolving agents. arXiv preprint arXiv:2511.13593.
- Z. Xi, J. Huang, C. Liao, B. Huang, J. Liu, H. Guo, Y. Yang, R. Zheng, J. Ye, J. Zhang, W. Chen, W. He, Y. Ding, G. Li, Z. Chen, Z. Du, X. Yao, Y. Xu, J. Chen, T. Gui, Z. Wu, Q. Zhang, X. Huang, and Y. Jiang (2026). AgentGym-RL: an open-source framework to train LLM agents for long-horizon decision making via multi-turn RL. In The Fourteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=ZgCCDwcGwn)
- P. Xia, J. Chen, H. Wang, J. Liu, K. Zeng, Y. Wang, S. Han, Y. Zhou, X. Zhao, H. Chen, et al. (2026). Skillrl: evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234.
- J. Xie, K. Zhang, J. Chen, T. Zhu, R. Lou, Y. Tian, Y. Xiao, and Y. Su (2024). Travelplanner: a benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622.
- Y. Xu, T. Chakraborty, S. Sharma, L. Nunes, S. Sharma, K. D. Demopulos, E. Kıcıman, S. Lu, and R. Chandra (2025). Direct reasoning optimization: constrained rl with token-level dense reward and rubric-gated constraints for open-ended tasks. arXiv preprint arXiv:2506.13351.
- J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press (2024). Swe-agent: agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37, pp. 50528–50652.
- J. Ye, Y. Wang, Y. Huang, D. Chen, Q. Zhang, N. Moniz, T. Gao, W. Geyer, C. Huang, P. Chen, N. V. Chawla, and X. Zhang (2025a). Justice or prejudice? quantifying biases in LLM-as-a-judge. In The Thirteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=3GTtZFiajM)
- Z. Ye, Y. Yue, H. Wang, X. Han, J. Jiang, C. Wei, L. Fan, J. Liang, S. Zhang, J. Li, et al. (2025b). Self-rewarding rubric-based reinforcement learning for open-ended reasoning. arXiv preprint arXiv:2509.25534.
- Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. (2025). Dapo: an open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
- S. Zhang, J. Wang, R. Zhou, J. Liao, Y. Feng, Z. Li, Y. Zheng, W. Zhang, Y. Wen, Z. Li, et al. (2026). Memrl: self-evolving agents via runtime reinforcement learning on episodic memory. arXiv preprint arXiv:2601.03192.
- W. Zhang, X. Zhang, C. Zhang, L. Yang, J. Shang, Z. Wei, H. P. Zou, Z. Huang, Z. Wang, Y. Gao, X. Pan, L. Xiong, J. Liu, P. S. Yu, and X. Li (2025). PersonaAgent: when large language model agents meet personalization at test time. In First Workshop on Multi-Turn Interactions in Large Language Models. [Link](https://openreview.net/forum?id=fgCOkyJG3f)
- C. Zheng, S. Liu, M. Li, X. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. (2025). Group sequence policy optimization. arXiv preprint arXiv:2507.18071.
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica (2023). Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track. [Link](https://openreview.net/forum?id=uccHPGDlao)
- Y. Zheng, C. Gao, X. Li, X. He, Y. Li, and D. Jin (2021). Disentangling user interest and conformity for recommendation with causal embedding. In Proceedings of the Web Conference 2021, pp. 2980–2991.
- Z. Zhou, A. Qu, Z. Wu, S. Kim, A. Prakash, D. Rus, B. K. H. Low, and P. P. Liang (2026). MEM1: learning to synergize memory and reasoning for efficient long-horizon agents. In The Fourteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=XY8AaxDSLb)
## Appendix A Extended Related Work
### A.1 RL Optimization for LLMs and Agentic Decision Making
##### GDPO\.
GDPO [Liu et al., [2026b](https://arxiv.org/html/2605.23382#bib.bib40)] is particularly relevant to our work from the perspective of multi-reward reinforcement learning. It shows that directly applying GRPO-style normalization to the *summed* reward in multi-reward settings can lead to training signal collapse, where distinct reward combinations are mapped to nearly identical advantages, thereby reducing optimization resolution and harming convergence. To address this issue, GDPO decouples group-wise normalization across individual reward components before aggregation, leading to more fine-grained advantage estimates and substantially improved training stability.
This insight directly informs our design\. In personalized Agentic RL, the reward is naturally multi\-faceted: a trajectory should be evaluated not only by its generic task quality, but also by how well it aligns with a specific user’s preferences\. Our PARPO follows the same high\-level principle that reward components should not be naively pooled before normalization\. However, we extend this idea to a more challenging personalized setting, where the reward components are not only heterogeneous in semantics, but also heterogeneous across users\. This motivates our reward decoupling between generic quality and personalized preference, as well as our user\-anchor calibration for stabilizing optimization under cross\-user reward\-scale mismatch\.
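The collapse described above can be reproduced with a small numerical sketch. It contrasts normalizing the summed reward with normalizing each component separately before summing, for a single rollout group. This is a simplified illustration of the decoupling idea only: the published GDPO procedure may include additional steps, PARPO's user-anchor calibration is not modeled, and the reward values and function names are hypothetical.

```python
import math


def zscore(values, eps=1e-8):
    """Standardize rewards within one rollout group (population std)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (math.sqrt(var) + eps) for v in values]


def summed_advantages(components):
    """GRPO-style: add the reward components per rollout, then normalize."""
    totals = [sum(rewards) for rewards in zip(*components)]
    return zscore(totals)


def decoupled_advantages(components):
    """Decoupled: normalize each component within the group, then add."""
    normalized = [zscore(component) for component in components]
    return [sum(zs) for zs in zip(*normalized)]


# Hypothetical group of four rollouts with two binary reward components.
task_quality = [1.0, 0.0, 1.0, 1.0]
preference = [0.0, 1.0, 1.0, 0.0]

summed = summed_advantages([task_quality, preference])
decoupled = decoupled_advantages([task_quality, preference])

# Rollouts 0 and 1 have different reward combinations, (1, 0) and (0, 1),
# but the same total, so summed normalization cannot tell them apart.
print("summed:   ", [round(a, 3) for a in summed])
print("decoupled:", [round(a, 3) for a in decoupled])
```

In this toy group, rollouts 0 and 1 receive identical advantages under summed normalization, whereas decoupled normalization separates them because each component is scaled by its own within-group statistics.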
##### GiGPO\.
GiGPO [Feng et al., [2026b](https://arxiv.org/html/2605.23382#bib.bib41)] extends group-relative policy optimization by introducing a group-in-group comparison structure for LLM agent training. Its key insight is that finer-grained grouping can improve optimization when trajectories are diverse and difficult to compare globally. This perspective is particularly relevant to personalized settings, where user heterogeneity makes pooled comparisons unreliable. Our work shares the motivation that structured comparison is important, but goes further by explicitly decoupling generic quality from personalized preference and introducing user-anchor calibration for cross-user reward heterogeneity.
##### SkillRL\.
SkillRL [Xia et al., [2026](https://arxiv.org/html/2605.23382#bib.bib45)] demonstrates that successful trajectories of LLM agents can be distilled into reusable skills and recursively integrated back into future policy learning. This work is central to our memory design because it shows that long-horizon agent improvement benefits from behavioral abstraction, rather than relying only on flat trajectory replay or prompt concatenation. Our PSGM is inspired by this principle of reusable skill evolution, but differs in that it organizes skills with explicit user-, scenario-, and trajectory-level structure. In this sense, SkillRL motivates why skills should be retained and reused, while our work studies how such skills should be personalized and retrieved under heterogeneous user preferences.
##### MemRL\.
MemRL [Zhang et al., [2026](https://arxiv.org/html/2605.23382#bib.bib43)] studies how memory retrieval and memory maintenance can themselves be optimized as part of agent behavior. Instead of treating memory as a passive storage backend, it casts memory access as an active component of sequential decision making. This is closely aligned with our perspective that personalization depends not only on what the agent currently observes, but also on what user-relevant history it can recover during rollout. However, MemRL mainly optimizes memory for general task utility, whereas our work focuses on structured personalized memory and on how such memory should support user-contingent policy optimization.
### A.2 Non-verifiable and Open-ended Reward Design
##### OpenRubrics\.
OpenRubrics [Liu et al., [2025](https://arxiv.org/html/2605.23382#bib.bib21)] addresses a major bottleneck in non-verifiable RL: how to obtain structured and scalable supervision when direct correctness signals are unavailable. By automatically synthesizing high-quality rubrics for open-ended tasks, it provides a more interpretable and decomposable supervision interface than scalar scores alone. This is relevant to our work because it reinforces the broader premise that open-ended agent tasks require richer reward design than verifiable domains do. However, OpenRubrics remains largely centered on generic evaluation criteria, whereas our work requires rewards that can differ across users even for the same action.
### A.3 Personalized Agents, Preference Modeling, and User-aware Memory
##### PersonaAgent\.
PersonaAgent [Zhang et al., [2025](https://arxiv.org/html/2605.23382#bib.bib26)] is one of the most directly relevant prior works on personalized agents. It shows that test-time personalization can substantially improve agent behavior by combining personalized memory with user-specific action adaptation, demonstrating the practical importance of explicitly modeling user context rather than adopting a one-size-fits-all policy. Our work shares the same high-level motivation, but differs fundamentally in where personalization enters the system: PersonaAgent primarily personalizes at inference time, whereas we study how user specificity should be incorporated into the *training-time* policy optimization objective. Thus, PersonaAgent motivates the need for personalization, while our framework turns personalization into a native RL optimization problem.
##### Mem0\.
Mem0 \[Chhikara et al., [2025](https://arxiv.org/html/2605.23382#bib.bib33)\] emphasizes scalable long-term memory for practical AI agents and shows that memory extraction, consolidation, and retrieval are crucial for maintaining useful cross-session context. Its core value for our work lies in the observation that user alignment depends on persistent historical information, rather than only on the current prompt. This insight directly supports our use of structured long-term memory for personalized agents. However, Mem0 is primarily a general-purpose memory infrastructure, while PSGM is explicitly designed as a preference-aligned skill memory that feeds into policy optimization, rather than serving only as a memory backend.
##### O\-Mem\.
O-Mem \[Zhou et al., [2026](https://arxiv.org/html/2605.23382#bib.bib34)\] pushes personalized memory further by proposing an omni memory system for personalized, long-horizon, self-evolving agents. A particularly important contribution of O-Mem is its emphasis on active user profiling and hierarchical retrieval, which helps distinguish salient user characteristics from topical or contextual interaction records. This is closely related to our motivation for PSGM, since personalization requires not just storing more history, but organizing history in a way that supports targeted retrieval under long-horizon interaction. The key difference is that O-Mem mainly improves personalized consistency and memory quality at inference time, whereas our framework integrates structured personalized retrieval directly into rollout-time policy optimization.
##### Preference\-Aware Memory Update\.
Preference-Aware Memory Update for Long-Term LLM Agents \[Sun et al., [2025](https://arxiv.org/html/2605.23382#bib.bib47)\] focuses on a challenge that is highly relevant to our setting: user preferences are non-stationary, and long-term memories must be updated in a preference-sensitive way rather than simply accumulated. By explicitly modeling evolving user tendencies and refining memory representations over time, this work shows that static memory is insufficient for personalized agents. This insight informs our own design philosophy in two ways. First, it supports the view that user-aware memory must be adaptive rather than passive. Second, it motivates our preference-disentangled reward modeling, since changing or noisy user behavior should not be naively treated as a direct and stationary supervision signal.
##### CoPD\.
CoPD \[Zheng et al., [2021](https://arxiv.org/html/2605.23382#bib.bib36)\] is particularly important to our work from the perspective of preference modeling. It shows that observed user behavior is often a mixture of genuine interest and conformity effects, implying that naive behavior-based learning can recover distorted preference signals. This observation directly inspires our two-stage preference-disentangled reward model: we similarly view user trajectories as noisy behavioral evidence in which true preference may be entangled with external, contextual, or socially induced factors. In our setting, this is critical because personalized policy optimization depends on the quality of user-conditioned rewards; if those rewards are derived from conflated behavior, the policy will optimize the wrong target.
### A.4 Summary of Positioning
Taken together, these prior works each address an important but partial aspect of the problem\. GDPO and GiGPO improve the stability and effectiveness of RL optimization for LLMs, especially in settings with heterogeneous or multi\-component rewards; OpenRubrics extends reward design beyond strictly verifiable tasks; SkillRL and MemRL demonstrate the value of reusable skills and active memory in long\-horizon agents; PersonaAgent, Mem0, O\-Mem, and Preference\-Aware Memory Update highlight the importance of personalization, long\-term memory, and adaptive user modeling; and CoPD reveals that observed user behavior must be disentangled before it can serve as a reliable preference signal\.
## Appendix B Method Implementation Details
### B.1 Implementation-Aligned Formulation of the Personalized Preference Reward Model
This appendix provides the implementation-aligned training objectives of the personalized preference reward model in Section [4.2](https://arxiv.org/html/2605.23382#S4.SS2). The model is trained in two stages: multi-view profile representation learning and collaborative preference disentanglement.
##### Stage 1: Multi\-view profile representation learning\.
For each user $u$, we decompose the profile into $K$ semantic views $\{x_u^{(k)}\}_{k=1}^{K}$ and encode them with a frozen sentence encoder:

$$\mathbf{h}_u^{(k)} = E(x_u^{(k)}). \tag{18}$$

An attention-based fusion module produces the profile embedding:

$$\alpha_u^{(k)} = \frac{\exp\!\left(\mathbf{w}^{\top}\tanh(\mathbf{W}_{\mathrm{attn}}\mathbf{h}_u^{(k)}+\mathbf{b}_{\mathrm{attn}})\right)}{\sum_{k'=1}^{K}\exp\!\left(\mathbf{w}^{\top}\tanh(\mathbf{W}_{\mathrm{attn}}\mathbf{h}_u^{(k')}+\mathbf{b}_{\mathrm{attn}})\right)}, \tag{19}$$

$$\mathbf{u}_{\mathrm{profile}} = \mathrm{LayerNorm}\!\left(\mathbf{W}_{\mathrm{out}}\sum_{k=1}^{K}\alpha_u^{(k)}\mathbf{h}_u^{(k)}\right). \tag{20}$$

To make user embeddings discriminative, we optimize a user-level InfoNCE loss:

$$\mathcal{L}_{\mathrm{user\text{-}InfoNCE}} = -\sum_{u}\log\frac{\exp\!\left(\operatorname{sim}(\mathbf{u}_{\mathrm{profile}},\mathbf{u}^{+}_{\mathrm{profile}})/\tau_c\right)}{\sum_{v}\exp\!\left(\operatorname{sim}(\mathbf{u}_{\mathrm{profile}},\mathbf{u}^{(v)}_{\mathrm{profile}})/\tau_c\right)}. \tag{21}$$

To preserve view information, we reconstruct each encoded view from the fused profile representation:

$$\hat{\mathbf{h}}_u^{(k)} = \mathbf{W}_{\mathrm{rec}}^{(k)}\mathbf{u}_{\mathrm{profile}}+\mathbf{b}_{\mathrm{rec}}^{(k)}, \tag{22}$$

$$\mathcal{L}_{\mathrm{recon}} = \sum_{u}\sum_{k=1}^{K}\left\|\hat{\mathbf{h}}_u^{(k)}-\mathbf{h}_u^{(k)}\right\|_2^2. \tag{23}$$

The stage-1 objective is

$$\mathcal{L}_{\mathrm{stage1}} = \mathcal{L}_{\mathrm{user\text{-}InfoNCE}}+\lambda_{\mathrm{recon}}\mathcal{L}_{\mathrm{recon}}. \tag{24}$$
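As a rough illustration of the attention fusion in Eqs. (19) and (20), the sketch below computes the view weights and their weighted sum in plain Python. It is not the authors' code: the function name `attention_fuse` and the toy parameter values are assumptions, and the output projection $\mathbf{W}_{\mathrm{out}}$ and LayerNorm of Eq. (20) are left out for brevity.

```python
import math

def attention_fuse(views, w, W_attn, b_attn):
    """Return (alphas, fused) for one user's K view embeddings.

    views: list of K vectors h_u^(k).
    w, W_attn, b_attn: attention parameters (toy values here, learned in practice).
    W_out and LayerNorm from Eq. (20) are intentionally omitted.
    """
    scores = []
    for h in views:
        hidden = [
            math.tanh(sum(W_attn[r][c] * h[c] for c in range(len(h))) + b_attn[r])
            for r in range(len(W_attn))
        ]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    # Numerically stable softmax over the K views.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(views[0])
    fused = [sum(a * h[d] for a, h in zip(alphas, views)) for d in range(dim)]
    return alphas, fused

# Hypothetical toy input: three 2-dimensional views, two of them identical.
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
alphas, fused = attention_fuse(views, w=[1.0, 0.0], W_attn=identity, b_attn=[0.0, 0.0])
```

Identical views receive identical weights, and the weights always sum to one, which is the property the fusion relies on.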
##### Stage 2: LightGCN\-based collaborative preference disentanglement\.
We build a user–item interaction graph and apply LightGCN:
$$\mathbf{E}^{(\ell+1)}=\hat{\mathbf{A}}\mathbf{E}^{(\ell)},\qquad \mathbf{E}_{\mathrm{final}}=\frac{1}{L+1}\sum_{\ell=0}^{L}\mathbf{E}^{(\ell)}. \tag{25}$$

After splitting $\mathbf{E}_{\mathrm{final}}$, we obtain user and item collaborative embeddings $\mathbf{u}_{\mathrm{cf}}$ and $\mathbf{i}_{\mathrm{cf}}$.
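The propagation rule in Eq. (25) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `lightgcn_propagate` is hypothetical, and the normalized adjacency $\hat{\mathbf{A}}$ is taken as an input because this appendix does not specify how it is normalized.

```python
def lightgcn_propagate(A_hat, E0, num_layers):
    """Average the embeddings of layers 0..L, where E^(l+1) = A_hat @ E^(l)."""
    def matmul(A, E):
        return [
            [sum(A[i][k] * E[k][d] for k in range(len(E))) for d in range(len(E[0]))]
            for i in range(len(A))
        ]

    layers = [E0]
    for _ in range(num_layers):
        layers.append(matmul(A_hat, layers[-1]))
    n, dim = len(E0), len(E0[0])
    return [
        [sum(layer[i][d] for layer in layers) / (num_layers + 1) for d in range(dim)]
        for i in range(n)
    ]

# Toy graph (assumed): one user (node 0) linked to one item (node 1).
A_hat = [[0.0, 1.0], [1.0, 0.0]]
E0 = [[1.0, 0.0], [0.0, 1.0]]
E_final = lightgcn_propagate(A_hat, E0, num_layers=2)
```

With two layers on this toy graph, each node keeps two thirds of its own initial embedding and takes one third from its neighbor.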
The collaborative user embedding is disentangled into two branches:
$$\mathbf{u}_{\mathrm{int}}=\mathrm{InterestEncoder}(\mathbf{u}_{\mathrm{cf}}),\qquad \mathbf{u}_{\mathrm{conf}}=\mathrm{ConformityEncoder}(\mathbf{u}_{\mathrm{cf}}). \tag{26}$$

Let $\tilde{p}_i\in[0,1]$ denote the normalized popularity of item $i$. The implementation uses two popularity-weighted contrastive objectives.

For the interest branch, let

$$\omega_i^{\mathrm{int}}=\exp(1-\tilde{p}_i). \tag{27}$$

Then

$$\mathcal{L}_{\mathrm{int}}=\frac{1}{B}\sum_{(u,i^{+})}\left[-\log\!\left(\omega_{i^{+}}^{\mathrm{int}}+\epsilon\right)-\frac{\mathbf{u}_{\mathrm{int}}^{\top}\mathbf{i}_{\mathrm{cf}}^{+}}{\tau}+\log\sum_{j}\exp\!\left(\frac{\mathbf{u}_{\mathrm{int}}^{\top}\mathbf{i}_{\mathrm{cf}}^{(j)}}{\tau}\right)\right]. \tag{28}$$

For the conformity branch, let

$$\omega_i^{\mathrm{conf}}=\exp(\tilde{p}_i). \tag{29}$$

Then

$$\mathcal{L}_{\mathrm{conf}}=\frac{1}{B}\sum_{(u,i^{+})}\left[-\log\!\left(\omega_{i^{+}}^{\mathrm{conf}}+\epsilon\right)-\frac{\mathbf{u}_{\mathrm{conf}}^{\top}\mathbf{i}_{\mathrm{cf}}^{+}}{\tau}+\log\sum_{j}\exp\!\left(\frac{\mathbf{u}_{\mathrm{conf}}^{\top}\mathbf{i}_{\mathrm{cf}}^{(j)}}{\tau}\right)\right]. \tag{30}$$
##### Branch fusion and recommendation loss\.
The two branch embeddings are normalized and fused by branch attention:
$$[\alpha_{\mathrm{int}},\alpha_{\mathrm{conf}}]=\mathrm{softmax}\!\left(\frac{\mathbf{W}_2\,\mathrm{ReLU}\!\bigl(\mathbf{W}_1[\hat{\mathbf{u}}_{\mathrm{int}};\hat{\mathbf{u}}_{\mathrm{conf}}]\bigr)}{T}\right), \tag{31}$$

$$\mathbf{u}_{\mathrm{fused}}=s\left(\alpha_{\mathrm{int}}\hat{\mathbf{u}}_{\mathrm{int}}+\alpha_{\mathrm{conf}}\hat{\mathbf{u}}_{\mathrm{conf}}\right). \tag{32}$$

Using positive and negative items $(i^{+},i^{-})$, the BPR-style recommendation loss is

$$\mathcal{L}_{\mathrm{rec}}=\frac{1}{B}\sum_{(u,i^{+},i^{-})}\operatorname{softplus}\left(\mathbf{u}_{\mathrm{fused}}^{\top}\mathbf{i}_{\mathrm{cf}}^{-}-\mathbf{u}_{\mathrm{fused}}^{\top}\mathbf{i}_{\mathrm{cf}}^{+}\right). \tag{33}$$
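To make the BPR-style loss in Eq. (33) concrete, here is a minimal plain-Python sketch. The helper names (`softplus`, `dot`, `bpr_loss`) and the toy vectors are assumptions for illustration, not the released implementation.

```python
import math

def softplus(x):
    # Numerically stable form of log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bpr_loss(triples):
    """triples: list of (u_fused, i_pos, i_neg) embedding vectors."""
    total = sum(softplus(dot(u, neg) - dot(u, pos)) for u, pos, neg in triples)
    return total / len(triples)

# Positive and negative items score the same: loss equals log 2.
tied = bpr_loss([([1.0, 0.0], [0.5, 0.0], [0.5, 0.0])])
# Positive item scores far above the negative one: loss is close to zero.
separated = bpr_loss([([1.0, 0.0], [5.0, 0.0], [-5.0, 0.0])])
```

The loss shrinks toward zero as the fused user embedding scores the positive item above the negative one, which is the ranking behavior the objective encourages.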
##### Orthogonality, user contrast, and regularization\.
To encourage branch specialization, we use an orthogonality penalty:
$$\mathcal{L}_{\mathrm{orth}}=\frac{1}{B}\sum_{u}\left(\hat{\mathbf{u}}_{\mathrm{int}}^{\top}\hat{\mathbf{u}}_{\mathrm{conf}}\right)^{2}. \tag{34}$$

For user contrast, let $\{\mathbf{g}_m\}_{m=1}^{M}$ be the fused embeddings of the unique users in the current batch, normalized as $\hat{\mathbf{g}}_m$. We form the similarity matrix

$$S_{mn}=\frac{\hat{\mathbf{g}}_m^{\top}\hat{\mathbf{g}}_n}{\tau}, \tag{35}$$

and optimize

$$\mathcal{L}_{\mathrm{user}}=\mathrm{CE}(S,\mathbf{y}), \tag{36}$$

where $\mathbf{y}=[1,2,\dots,M]$ denotes the identity target.

The $\ell_2$ regularizer is

$$\mathcal{L}_{\mathrm{reg}}=\frac{1}{2B}\left(\|\mathbf{u}_{\mathrm{cf}}\|_2^2+\|\mathbf{i}_{\mathrm{cf}}^{+}\|_2^2+\|\mathbf{i}_{\mathrm{cf}}^{-}\|_2^2\right). \tag{37}$$
##### Action alignment loss\.
When item text embeddings are available, the implementation additionally trains the action encoder to stay aligned with the collaborative filtering space\. Let
$$\mathbf{q}^{+}=\mathrm{ActionEncoder}(E(i^{+})),\qquad \mathbf{q}^{-}=\mathrm{ActionEncoder}(E(i^{-})). \tag{38}$$

We first align the projected positive item text to the positive collaborative item embedding:

$$\mathcal{L}_{\mathrm{align\text{-}cos}}=\frac{1}{B}\sum_{(u,i^{+})}\left(1-\cos(\mathbf{q}^{+},\operatorname{sg}(\mathbf{i}_{\mathrm{cf}}^{+}))\right), \tag{39}$$

where $\operatorname{sg}(\cdot)$ denotes stop-gradient. We further add an action-space BPR term:

$$\mathcal{L}_{\mathrm{align\text{-}bpr}}=\frac{1}{B}\sum_{(u,i^{+},i^{-})}\operatorname{softplus}\left(\mathbf{u}_{\mathrm{fused}}^{\top}\mathbf{q}^{-}-\mathbf{u}_{\mathrm{fused}}^{\top}\mathbf{q}^{+}\right). \tag{40}$$

Thus,

$$\mathcal{L}_{\mathrm{align}}=\mathcal{L}_{\mathrm{align\text{-}cos}}+\mathcal{L}_{\mathrm{align\text{-}bpr}}. \tag{41}$$
##### Total stage\-2 objective\.
The full stage\-2 loss is
$$\mathcal{L}_{\mathrm{stage2}}=\mathcal{L}_{\mathrm{rec}}+\lambda_{\mathrm{int}}\mathcal{L}_{\mathrm{int}}+\lambda_{\mathrm{conf}}\mathcal{L}_{\mathrm{conf}}+\lambda_{\mathrm{orth}}\mathcal{L}_{\mathrm{orth}}+\lambda_{\mathrm{user}}\mathcal{L}_{\mathrm{user}}+\lambda_{\mathrm{reg}}\mathcal{L}_{\mathrm{reg}}+\lambda_{\mathrm{align}}\mathcal{L}_{\mathrm{align}}. \tag{42}$$

In the current implementation, the default coefficients are $\lambda_{\mathrm{int}}=0.2$, $\lambda_{\mathrm{conf}}=0.2$, $\lambda_{\mathrm{orth}}=0.1$, $\lambda_{\mathrm{user}}=3.0$, $\lambda_{\mathrm{reg}}=10^{-4}$, and $\lambda_{\mathrm{align}}=0.5$.
##### Implementation\-time action alignment for reward inference\.
At inference time, the action text embedding is not used directly. Instead, the model constructs an aligned collaborative representation by nearest-neighbor retrieval over known item text embeddings. Given an action text embedding $\mathbf{e}_a$, let $\mathcal{N}_K(a)$ be its top-$K$ nearest items in text space, with weights

$$\pi_j=\mathrm{softmax}\!\left(\frac{\operatorname{sim}(\mathbf{e}_a,\mathbf{e}_j)}{0.1}\right). \tag{43}$$

The aligned collaborative action embedding is

$$\mathbf{a}_{\mathrm{cf}}=\sum_{j\in\mathcal{N}_K(a)}\pi_j\,\mathbf{i}_{\mathrm{cf}}^{(j)}. \tag{44}$$

The final action embedding is

$$\mathbf{a}_{\mathrm{final}}=0.5\,\widehat{\mathbf{a}}_{\mathrm{cf}}+0.5\,\widehat{\mathbf{a}}_{\mathrm{proj}}, \tag{45}$$

where $\widehat{\mathbf{a}}_{\mathrm{cf}}$ and $\widehat{\mathbf{a}}_{\mathrm{proj}}$ are normalized embeddings. The deployed branch and fused scores are then

$$r_{\mathrm{int}}(u,a)=\hat{\mathbf{u}}_{\mathrm{int}}^{\top}\hat{\mathbf{a}}_{\mathrm{final}},\qquad r_{\mathrm{conf}}(u,a)=\hat{\mathbf{u}}_{\mathrm{conf}}^{\top}\hat{\mathbf{a}}_{\mathrm{final}}, \tag{46}$$

$$r_{\mathrm{fused}}(u,a)=\hat{\mathbf{u}}_{\mathrm{fused}}^{\top}\hat{\mathbf{a}}_{\mathrm{final}}. \tag{47}$$
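The inference-time pipeline of Eqs. (43) to (47) can be sketched as follows. This is a minimal illustration, not the deployed code: the function names are hypothetical, cosine similarity is assumed for $\operatorname{sim}$, the softmax is assumed to run over the top-$K$ neighbors only, and the projected embedding `a_proj` is passed in rather than computed by an encoder.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def aligned_action_embedding(e_a, item_text, item_cf, a_proj, k=2, temp=0.1):
    """Blend a kNN-retrieved collaborative embedding with a projected one."""
    # Cosine similarity to every known item text embedding (assumed sim).
    sims = [dot(normalize(e_a), normalize(e_j)) for e_j in item_text]
    top = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)[:k]
    # Temperature softmax over the retrieved neighbors, as in Eq. (43).
    best = max(sims[j] for j in top)
    exps = [math.exp((sims[j] - best) / temp) for j in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(item_cf[0])
    a_cf = [sum(w * item_cf[j][d] for w, j in zip(weights, top)) for d in range(dim)]
    # Equal-weight blend of the two normalized embeddings, as in Eq. (45).
    return [0.5 * x + 0.5 * y for x, y in zip(normalize(a_cf), normalize(a_proj))]

def branch_reward(u_branch, a_final):
    # Inner product of normalized user-branch and action embeddings, Eqs. (46)-(47).
    return dot(normalize(u_branch), normalize(a_final))

# Toy data (assumed): the action text matches item 0 exactly.
item_text = [[1.0, 0.0], [0.0, 1.0]]
item_cf = [[2.0, 0.0], [0.0, 3.0]]
a_final = aligned_action_embedding([1.0, 0.0], item_text, item_cf, a_proj=[1.0, 0.0], k=1)
reward = branch_reward([4.0, 0.0], a_final)
```

Because all vectors are normalized before scoring, the reward depends only on direction, so the toy user aligned with item 0 receives the maximum score of 1.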
### B.2 Implementation-Aligned Retrieval and Scoring in PSGM
This appendix formalizes the graph-based retrieval mechanism in Section [4.3](https://arxiv.org/html/2605.23382#S4.SS3) according to the current implementation.
##### Semantic initialization and 2\-hop expansion\.
Given query embedding $\mathbf{q}$, PSGM first retrieves an initial candidate set:

$$\mathcal{S}_{\mathrm{init}}(q)=\operatorname{TopM}_{s\in\mathcal{S}}\cos(\mathbf{q},\mathbf{s}). \tag{48}$$

For each candidate skill, the system performs a 2-hop expansion:

$$\text{skill}\rightarrow\text{owner user}\rightarrow\text{sibling skills}, \tag{49}$$

where the first hop traverses an incoming `OWNS` edge and the second hop traverses outgoing `OWNS` edges from the owner user node.
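The 2-hop expansion can be pictured with a small dictionary-based sketch. The data layout (`owns` mapping a user to its skills) and the choice to return the union of the initial candidates and their siblings are assumptions made for illustration; the source does not specify how the expanded set is merged.

```python
def two_hop_expand(initial_skills, owns):
    """owns maps a user id to the list of skill ids it has OWNS edges to."""
    # Invert the OWNS relation so each skill can find its owner(s).
    owners_of = {}
    for user, skills in owns.items():
        for skill in skills:
            owners_of.setdefault(skill, []).append(user)

    expanded = list(initial_skills)
    seen = set(initial_skills)
    for skill in initial_skills:
        for user in owners_of.get(skill, []):   # hop 1: incoming OWNS edge
            for sibling in owns[user]:          # hop 2: outgoing OWNS edges
                if sibling not in seen:
                    seen.add(sibling)
                    expanded.append(sibling)
    return expanded

# Hypothetical graph: u1 owns s1 and s2, u2 owns s3.
owns = {"u1": ["s1", "s2"], "u2": ["s3"]}
candidates = two_hop_expand(["s1"], owns)
```

Starting from `s1`, the expansion reaches `s2` through the shared owner `u1` but never touches `s3`, which belongs to a different user.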
##### Scoring components\.
For a user node $u$ and candidate skill node $s$, the implementation uses

$$f_{\mathrm{sem}}(q,s)=\cos(\mathbf{q},\mathbf{s}), \tag{50}$$

$$f_{\mathrm{user}}(u,s)=\cos(\mathbf{u},\mathbf{s}), \tag{51}$$

$$f_{\mathrm{comm}}(u,s)=\begin{cases}1.0,&\text{if }u\text{ and }s\text{ belong to the same selected community},\\ 0.3,&\text{if both are assigned but belong to different communities},\\ 0,&\text{otherwise},\end{cases} \tag{52}$$

$$f_{\mathrm{comp}}(s)=1+\kappa\sum_{e\in\mathcal{E}_{\mathrm{comp}}(s)}w_e, \tag{53}$$

$$f_{\mathrm{conf}}(s)=\min\!\left(\sum_{e\in\mathcal{E}_{\mathrm{conf}}(s)}w_e,\;1.0\right), \tag{54}$$

where $\kappa$ is the complement boost factor from the graph configuration.
##### Final graph\-aware score\.
The final score is
$$\operatorname{score}(q,s,u)=f_{\mathrm{sem}}(q,s)\cdot\bigl(\alpha+\beta f_{\mathrm{user}}(u,s)\bigr)\cdot\bigl(1+\gamma f_{\mathrm{comm}}(u,s)\bigr)\cdot f_{\mathrm{comp}}(s)\cdot\bigl(1-\delta f_{\mathrm{conf}}(s)\bigr). \tag{55}$$

The hyperparameters $\alpha,\beta,\gamma,\delta$ are fixed in the graph configuration. In the current default implementation, they are set to $\alpha=0.3$, $\beta=0.3$, $\gamma=0.2$, and $\delta=0.7$.
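A direct transcription of Eqs. (53) to (55) into plain Python is shown below as a sketch. The function name `graph_score` and its argument layout are hypothetical; the defaults for $\alpha,\beta,\gamma,\delta$ are the values quoted above, while $\kappa$ has no stated default in this appendix and is therefore a required argument (the value used in the example is arbitrary).

```python
def graph_score(f_sem, f_user, f_comm, comp_weights, conf_weights, kappa,
                alpha=0.3, beta=0.3, gamma=0.2, delta=0.7):
    """Multiplicative graph-aware score for one (query, skill, user) triple."""
    f_comp = 1.0 + kappa * sum(comp_weights)   # complement boost, Eq. (53)
    f_conf = min(sum(conf_weights), 1.0)       # conflict penalty capped at 1, Eq. (54)
    return (f_sem
            * (alpha + beta * f_user)
            * (1.0 + gamma * f_comm)
            * f_comp
            * (1.0 - delta * f_conf))

# Same community, perfect semantic and user match, no complement/conflict edges.
clean = graph_score(1.0, 1.0, 1.0, comp_weights=[], conf_weights=[], kappa=0.1)
# Identical skill, but with conflict edges whose weights exceed the cap.
conflicted = graph_score(1.0, 1.0, 1.0, comp_weights=[], conf_weights=[0.6, 0.9], kappa=0.1)
```

Because the terms multiply, a saturated conflict signal scales the score by $1-\delta=0.3$ regardless of how well the skill matches otherwise.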
##### Implementation note\.
The current code includes tool\- and scenario\-validation hooks in the retrieval pipeline, but these checks are implemented as permissive placeholders and therefore do not yet impose additional filtering beyond the graph\-aware score above\.
### B.3 Implementation-Aligned Dual-Track Optimization
This appendix formalizes the current dual-track policy optimization logic corresponding to Section [4.1](https://arxiv.org/html/2605.23382#S4.SS1).
##### Evaluation\-time fused reward\.
Given an episode, the judge score is $R_{\mathrm{judge}}$. Let $r_{\mathrm{int}}$ and $r_{\mathrm{conf}}$ denote the aggregated branch-specific neural rewards for that episode. Using the reward statistics from the training distribution,

$$\tilde{r}_{\mathrm{int}}=\sigma\!\left(\frac{r_{\mathrm{int}}-\mu_{\mathrm{int}}}{\sigma_{\mathrm{int}}}\right),\qquad \tilde{r}_{\mathrm{conf}}=\sigma\!\left(\frac{r_{\mathrm{conf}}-\mu_{\mathrm{conf}}}{\sigma_{\mathrm{conf}}}\right). \tag{56}$$
##### Training\-time dual\-track decomposition\.
For PARPO-style policy optimization, we use $R_{\mathrm{base}}$ and $R_{\mathrm{pers}}$, which denote the generic reward and the personalized reward, respectively.
##### Base advantage\.
For prompt group $g$,

$$A_i^{\mathrm{base}}=\frac{R_{\mathrm{base}}(\tau_i)-\bar{R}_{\mathrm{base}}^{(g)}}{\operatorname{Std}\!\bigl(\{R_{\mathrm{base}}(\tau_j)\}_{j\in g}\bigr)+\epsilon}. \tag{57}$$
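Eq. (57) is the familiar group-normalized advantage. A minimal sketch follows; the function name `base_advantages`, the use of the population standard deviation, and the value of `eps` are assumptions, since the source does not state which estimator or stabilizer the implementation uses.

```python
import math

def base_advantages(rewards, eps=1e-8):
    """Group-normalized advantages for the base rewards of one prompt group."""
    mean = sum(rewards) / len(rewards)
    # Population standard deviation (assumed; a sample estimator is also plausible).
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

adv = base_advantages([1.0, 3.0])
flat = base_advantages([2.0, 2.0, 2.0])
```

When every rollout in the group earns the same base reward, all advantages are zero, so the base track contributes no gradient for that group.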
##### User\-anchor update\.
For each user $u$, the implementation stores a persistent anchor

$$\mathcal{A}_u=\{m_u,v_u,c_u\}, \tag{58}$$

where $m_u$ is the EMA mean, $v_u$ is the EMA variance, and $c_u$ counts the number of updates. Given the current batch personalized rewards for that user, the anchor is updated as

$$m_u\leftarrow\begin{cases}\bar{R}_u^{\mathrm{pers}},&c_u=0,\\ \rho\,m_u+(1-\rho)\bar{R}_u^{\mathrm{pers}},&c_u>0,\end{cases} \tag{59}$$

$$v_u\leftarrow\begin{cases}\max\bigl(\operatorname{Var}(\mathcal{R}_u^{\mathrm{pers}}),10^{-6}\bigr),&c_u=0,\\ \rho\,v_u+(1-\rho)\operatorname{Var}(\mathcal{R}_u^{\mathrm{pers}}),&c_u>0.\end{cases} \tag{60}$$
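The anchor update in Eqs. (59) and (60) amounts to a cold-start initialization followed by an exponential moving average. The sketch below is illustrative only: the `UserAnchor` class, the default `rho=0.9`, and the population variance are assumptions, and the counter is incremented once per update to match the description of $c_u$.

```python
from dataclasses import dataclass

@dataclass
class UserAnchor:
    mean: float = 0.0   # m_u, EMA mean
    var: float = 0.0    # v_u, EMA variance
    count: int = 0      # c_u, number of updates so far

def update_anchor(anchor, rewards, rho=0.9):
    """Update one user's anchor from that user's personalized rewards in a batch."""
    batch_mean = sum(rewards) / len(rewards)
    batch_var = sum((r - batch_mean) ** 2 for r in rewards) / len(rewards)
    if anchor.count == 0:
        # Cold start: copy batch statistics, flooring the variance at 1e-6.
        anchor.mean = batch_mean
        anchor.var = max(batch_var, 1e-6)
    else:
        anchor.mean = rho * anchor.mean + (1.0 - rho) * batch_mean
        anchor.var = rho * anchor.var + (1.0 - rho) * batch_var
    anchor.count += 1
    return anchor

anchor = UserAnchor()
update_anchor(anchor, [1.0, 3.0], rho=0.5)
first = (anchor.mean, anchor.var, anchor.count)
update_anchor(anchor, [4.0, 4.0], rho=0.5)
second = (anchor.mean, anchor.var, anchor.count)
```

Note that, as written in Eq. (60), the variance floor applies only at initialization; later EMA updates can pull the variance below it if a user's rewards become constant.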
##### Personalized advantage\.
Let $\bar{R}_{\mathrm{pers}}^{(g)}$ be the within-group mean personalized reward. The personalized baseline is

$$b_{u,g}=\max\!\left(\bar{R}_{\mathrm{pers}}^{(g)},\,m_u-\gamma_p\sqrt{v_u}\right). \tag{61}$$

The personalized advantage is

$$A_i^{\mathrm{pers}}=\frac{R_{\mathrm{pers}}(\tau_i)-b_{u_i,g}}{\sqrt{v_{u_i}}+\epsilon}. \tag{62}$$
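Eqs. (61) and (62) can be sketched for a single user and prompt group as follows. The function name and the values of `gamma_p` and `eps` are hypothetical; the source does not report them here.

```python
import math

def personalized_advantages(rewards, anchor_mean, anchor_var, gamma_p=1.0, eps=1e-8):
    """Anchor-calibrated advantages for one user's rollouts in one prompt group."""
    group_mean = sum(rewards) / len(rewards)
    # Baseline: the larger of the group mean and the anchor-based floor, Eq. (61).
    baseline = max(group_mean, anchor_mean - gamma_p * math.sqrt(anchor_var))
    # Scale by the user's historical spread rather than the group's, Eq. (62).
    scale = math.sqrt(anchor_var) + eps
    return [(r - baseline) / scale for r in rewards]

# Group mean (2.0) exceeds the anchor floor (1.0): behaves like a centered advantage.
centered = personalized_advantages([1.0, 3.0], anchor_mean=2.0, anchor_var=1.0)
# Whole group falls below the user's history: the anchor floor (1.0) takes over.
below_history = personalized_advantages([0.0, 0.0], anchor_mean=2.0, anchor_var=1.0)
```

The second case shows the effect of the anchor: a group of uniformly poor rollouts still receives negative advantages, whereas a purely within-group baseline would assign them zero.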
##### Fused advantage and PPO\-style update\.
The total trajectory\-level advantage is
$$A_i^{\mathrm{total}}=w_{\mathrm{base}}A_i^{\mathrm{base}}+w_{\mathrm{pers}}A_i^{\mathrm{pers}}. \tag{63}$$

This advantage is broadcast to the token level and used in the standard PPO-style clipped policy loss:

$$\mathcal{L}_{\mathrm{policy}}=\frac{1}{B}\sum_{i}\max\!\left(-r_i(\theta)A_i^{\mathrm{total}},\;-\operatorname{clip}\!\bigl(r_i(\theta),1-\eta,1+\eta\bigr)A_i^{\mathrm{total}}\right). \tag{64}$$

When enabled in training, KL regularization is applied separately by the actor update loop rather than being absorbed into the advantage definition.
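For reference, the clipped loss of Eq. (64) can be written in a few lines. This sketch works at the trajectory level with one importance ratio per sample, which simplifies the token-level broadcast described above; the function name and `eta=0.2` are assumptions.

```python
def clipped_policy_loss(ratios, advantages, eta=0.2):
    """Mean PPO-style clipped loss over a batch of (ratio, advantage) pairs."""
    losses = []
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eta), 1.0 + eta)
        # Taking the max of the negated terms is the pessimistic (clipped) objective.
        losses.append(max(-r * a, -clipped * a))
    return sum(losses) / len(losses)

# Sample 1: positive advantage with ratio above 1 + eta, so the clipped term is used.
# Sample 2: negative advantage with ratio below 1 - eta, so the clipped term is used.
loss = clipped_policy_loss(ratios=[1.5, 0.5], advantages=[1.0, -1.0], eta=0.2)
```

In both samples the ratio has already moved past the trust region in the direction the advantage favors, so clipping removes any incentive to move it further.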
## Appendix C Theoretical Analysis of PARPO
In this section, we provide a theoretical analysis of PARPO\. Our goal is not to establish global convergence guarantees for the full PPO/GRPO training dynamics, but rather to address three questions: \(1\) why personalized optimization is preferable to user\-agnostic optimization under heterogeneous user preferences; \(2\) why standard GRPO incurs structural bias in personalized settings; and \(3\) why PARPO can reduce such bias through reward decomposition, user\-aware grouping, and user\-specific anchor calibration\.
Our analysis follows the same problem formulation as in the main text, but focuses on trajectory\-level rewards and advantage estimation errors in order to highlight the key structure of personalized policy optimization\.
Given a user profile $p_u\in\mathcal{P}$, a user query $q\in\mathcal{Q}$, and a trajectory $\tau$ generated by policy $\pi$, the total reward is defined as

$$R(\tau,p_u,q)=\alpha R_{\text{base}}(\tau,q)+(1-\alpha)R_{\text{pers}}(\tau,p_u,h_u), \tag{65}$$

where $R_{\text{base}}$ denotes the general-quality reward, $R_{\text{pers}}$ denotes the personalized preference reward, and $h_u$ denotes the historical interaction records of user $u$.

For a fixed user $u$ and query $q$, we define the user-specific value function as

$$V_u(q)=\mathbb{E}_{\tau\sim\pi(\cdot\mid p_u,q)}\left[R(\tau,p_u,q)\right]. \tag{66}$$

If user identities are ignored and rewards are pooled across users, we define the pooled value function as

$$V_{\mathrm{pool}}(q)=\mathbb{E}_{u,\tau}\left[R(\tau,p_u,q)\mid q\right]. \tag{67}$$

Accordingly, the true personalized advantage of trajectory $\tau$ for user $u$ under query $q$ is

$$A_u^{*}(\tau\mid q)=R(\tau,p_u,q)-V_u(q). \tag{68}$$
Since both GRPO and PARPO use normalized relative advantages, we further define the normalized user\-specific advantage as
$$\bar{A}_u^{*}(\tau\mid q)=\frac{R(\tau,p_u,q)-V_u(q)}{\sigma_u(q)+\epsilon}, \tag{69}$$

where $\sigma_u(q)$ denotes the standard deviation of trajectory rewards for user $u$ under query $q$, and $\epsilon>0$ is a numerical stabilizer. Correspondingly, the pooled normalized advantage used by standard GRPO is written as

$$\bar{A}^{\mathrm{GRPO}}(\tau\mid q)=\frac{R(\tau,p_u,q)-V_{\mathrm{pool}}(q)}{\sigma_{\mathrm{pool}}(q)+\epsilon}, \tag{70}$$

where $\sigma_{\mathrm{pool}}(q)$ denotes the pooled reward standard deviation across users.
To characterize the heterogeneity induced by personalized preferences, we further define the mean personalized reward of user $u$ under query $q$ as

$$\mu_u(q)=\mathbb{E}_{\tau}\left[R_{\text{pers}}(\tau,p_u,h_u)\mid u,q\right], \tag{71}$$

the pooled mean personalized reward as

$$\mu_{\mathrm{pool}}(q)=\mathbb{E}_{u,\tau}\left[R_{\text{pers}}(\tau,p_u,h_u)\mid q\right], \tag{72}$$

and the mean personalized reward within the similar-user group $G(u)$ as

$$\mu_{G(u)}(q)=\mathbb{E}_{u'\in G(u),\tau}\left[R_{\text{pers}}(\tau,p_{u'},h_{u'})\mid q\right]. \tag{73}$$
We also define the global heterogeneity measure
$$\mathcal{H}(q)=\mathbb{E}_u\left[\left(\mu_u(q)-\mu_{\mathrm{pool}}(q)\right)^{2}\right], \tag{74}$$

and the local within-group heterogeneity measure

$$\mathcal{H}_G(q)=\mathbb{E}_u\left[\left(\mu_u(q)-\mu_{G(u)}(q)\right)^{2}\right]. \tag{75}$$

In PARPO, the personalized advantage does not directly use the true user-specific mean $\mu_u(q)$ as the baseline. Instead, it constructs an approximate user-specific baseline by combining similar-user group statistics and a user-specific historical anchor:

$$\tilde{\mu}_u(q)=\max\bigl(\mu_{G(u)}(q),\,b_u(q)-\epsilon_u\bigr), \tag{76}$$

where $b_u(q)$ denotes the historical reward anchor of user $u$, and $\epsilon_u$ denotes an adaptive margin term. In the following analysis, we assume that there exists $\sigma_{\min}>0$ such that

$$\sigma_u(q)\geq\sigma_{\min},\qquad \sigma_{\mathrm{pool}}(q)\geq\sigma_{\min}. \tag{77}$$
### C.1 Necessity of Personalized Optimization
We first show that under heterogeneous user preferences, personalized decision\-making is never worse than user\-agnostic decision\-making, and that the gain can be explicitly characterized by preference heterogeneity\.
Consider a fixed query $q$ and two candidate trajectories $\tau_1,\tau_2$. For each user $u$, define

$$z_u=\mathbb{P}(\tau_1\succ\tau_2\mid u,q), \tag{78}$$

namely, the probability that user $u$ prefers trajectory $\tau_1$ over $\tau_2$ under query $q$.

If the policy is user-agnostic, it must choose the same trajectory for all users. Its optimal value is therefore

$$V_{\mathrm{avg}}(q)=\max\big(\mathbb{E}_u[z_u],\,1-\mathbb{E}_u[z_u]\big). \tag{79}$$

If the policy is user-aware, it may choose different trajectories for different users, yielding

$$V_{\mathrm{pers}}(q)=\mathbb{E}_u\big[\max(z_u,1-z_u)\big]. \tag{80}$$
###### Theorem 1\.
For any fixed query $q$, we have

$$V_{\mathrm{pers}}(q)\geq V_{\mathrm{avg}}(q). \tag{81}$$

Moreover, the gain can be written as

$$\Delta_{\mathrm{pers}}(q):=V_{\mathrm{pers}}(q)-V_{\mathrm{avg}}(q)=\mathbb{E}_u\left[\left|z_u-\frac{1}{2}\right|\right]-\left|\mathbb{E}_u[z_u]-\frac{1}{2}\right|. \tag{82}$$

Hence, when user preferences are heterogeneous, $\Delta_{\mathrm{pers}}(q)$ is typically positive and increases with the degree of preference disagreement.
###### Proof\.
Define

$$f(z)=\max(z,1-z)=\frac{1}{2}+\left|z-\frac{1}{2}\right|. \tag{83}$$

Since the absolute value function is convex, $f(z)$ is also convex. By Jensen's inequality,

$$f(\mathbb{E}_u[z_u])\leq\mathbb{E}_u[f(z_u)]. \tag{84}$$

Substituting the definition of $f$, we obtain

$$\max\big(\mathbb{E}_u[z_u],\,1-\mathbb{E}_u[z_u]\big)\leq\mathbb{E}_u[\max(z_u,1-z_u)], \tag{85}$$

which gives

$$V_{\mathrm{avg}}(q)\leq V_{\mathrm{pers}}(q). \tag{86}$$

Furthermore, since $f(z)=\frac{1}{2}+|z-\frac{1}{2}|$, we directly obtain

$$V_{\mathrm{pers}}(q)-V_{\mathrm{avg}}(q)=\mathbb{E}_u\left[\left|z_u-\frac{1}{2}\right|\right]-\left|\mathbb{E}_u[z_u]-\frac{1}{2}\right|. \tag{87}$$

This completes the proof. ∎
The above theorem shows not only that personalized decision\-making is preferable to user\-agnostic decision\-making, but also that the gain is explicitly governed by the heterogeneity of user preferences\. Therefore, the more diverse the user preferences are, the more necessary personalized optimization becomes\.
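A small numerical check makes Theorem 1 tangible. The sketch below evaluates Eqs. (79), (80), and the closed-form gain of Eq. (82) for a finite user population with uniform weights; the function name and the example preference vectors are illustrative assumptions.

```python
def personalization_gain(z):
    """Return (V_avg, V_pers, closed-form gain) for preference probabilities z_u."""
    mean_z = sum(z) / len(z)
    v_avg = max(mean_z, 1.0 - mean_z)                       # Eq. (79)
    v_pers = sum(max(x, 1.0 - x) for x in z) / len(z)       # Eq. (80)
    gain = sum(abs(x - 0.5) for x in z) / len(z) - abs(mean_z - 0.5)  # Eq. (82)
    return v_avg, v_pers, gain

# Two users with opposite preferences: pooling can do no better than a coin flip.
split = personalization_gain([0.9, 0.1])
# Two users who agree: personalization brings no extra value.
agree = personalization_gain([0.8, 0.8])
```

With opposite preferences the user-agnostic value is 0.5 while the personalized value is 0.9, a gain of 0.4; when users agree, the gain is zero, matching the equality case of Jensen's inequality.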
### C.2 Structural Bias of Standard GRPO
The previous result establishes the necessity of personalized optimization\. We now explain why, even when personalized rewards are included, standard GRPO still incurs structural bias in personalized settings\.
The pooled normalized advantage used by standard GRPO is
$$\bar{A}^{\mathrm{GRPO}}(\tau\mid q)=\frac{R(\tau,p_u,q)-V_{\mathrm{pool}}(q)}{\sigma_{\mathrm{pool}}(q)+\epsilon}. \tag{88}$$

In contrast, the true normalized user-specific advantage should be

$$\bar{A}_u^{*}(\tau\mid q)=\frac{R(\tau,p_u,q)-V_u(q)}{\sigma_u(q)+\epsilon}. \tag{89}$$

Therefore, the estimation error consists of both baseline mismatch and normalization-scale mismatch.
###### Proposition 2\.
For any user $u$, query $q$, and trajectory $\tau$, the normalized advantage estimation error of standard GRPO satisfies

$$\left|\bar{A}^{\mathrm{GRPO}}(\tau\mid q)-\bar{A}_u^{*}(\tau\mid q)\right|\leq\frac{|V_u(q)-V_{\mathrm{pool}}(q)|}{\sigma_{\min}+\epsilon}+\frac{|R(\tau,p_u,q)-V_u(q)|\cdot|\sigma_u(q)-\sigma_{\mathrm{pool}}(q)|}{(\sigma_{\min}+\epsilon)^{2}}. \tag{90}$$

Moreover, by reward decomposition,

$$|V_u(q)-V_{\mathrm{pool}}(q)|\leq\alpha\,|V_u^{\text{base}}(q)-V_{\mathrm{pool}}^{\text{base}}(q)|+(1-\alpha)\,|V_u^{\text{pers}}(q)-V_{\mathrm{pool}}^{\text{pers}}(q)|. \tag{91}$$

If the general-quality term varies only mildly across users, then the dominant bias term of standard GRPO is controlled by the personalized heterogeneity $\mathcal{H}(q)$.
###### Proof\.
Taking the difference between the two normalized advantages yields
$$\bar{A}^{\mathrm{GRPO}}(\tau\mid q)-\bar{A}_{u}^{*}(\tau\mid q)=\frac{R(\tau,p_{u},q)-V_{\mathrm{pool}}(q)}{\sigma_{\mathrm{pool}}(q)+\epsilon}-\frac{R(\tau,p_{u},q)-V_{u}(q)}{\sigma_{u}(q)+\epsilon}. \tag{92}$$
Let
$$x=R(\tau,p_{u},q),\quad a=V_{\mathrm{pool}}(q),\quad b=V_{u}(q),\quad s_{1}=\sigma_{\mathrm{pool}}(q),\quad s_{2}=\sigma_{u}(q).$$
Then
$$\frac{x-a}{s_{1}+\epsilon}-\frac{x-b}{s_{2}+\epsilon}=\frac{(x-a)(s_{2}+\epsilon)-(x-b)(s_{1}+\epsilon)}{(s_{1}+\epsilon)(s_{2}+\epsilon)}. \tag{93}$$
Rearranging the numerator gives
$$(x-a)(s_{2}+\epsilon)-(x-b)(s_{1}+\epsilon)=(x-b)(s_{2}-s_{1})+(b-a)(s_{2}+\epsilon). \tag{94}$$
Hence,
$$\left|\bar{A}^{\mathrm{GRPO}}-\bar{A}_{u}^{*}\right|\leq\frac{|b-a|(s_{2}+\epsilon)}{(s_{1}+\epsilon)(s_{2}+\epsilon)}+\frac{|x-b||s_{2}-s_{1}|}{(s_{1}+\epsilon)(s_{2}+\epsilon)}. \tag{95}$$
Using $s_{1},s_{2}\geq\sigma_{\min}$, we obtain
$$\left|\bar{A}^{\mathrm{GRPO}}-\bar{A}_{u}^{*}\right|\leq\frac{|V_{u}(q)-V_{\mathrm{pool}}(q)|}{\sigma_{\min}+\epsilon}+\frac{|R(\tau,p_{u},q)-V_{u}(q)|\cdot|\sigma_{u}(q)-\sigma_{\mathrm{pool}}(q)|}{(\sigma_{\min}+\epsilon)^{2}}. \tag{96}$$
By reward decomposition,
$$R=\alpha R_{\text{base}}+(1-\alpha)R_{\text{pers}}, \tag{97}$$
which implies
$$V_{u}(q)-V_{\mathrm{pool}}(q)=\alpha\big(V_{u}^{\text{base}}(q)-V_{\mathrm{pool}}^{\text{base}}(q)\big)+(1-\alpha)\big(V_{u}^{\text{pers}}(q)-V_{\mathrm{pool}}^{\text{pers}}(q)\big). \tag{98}$$
Applying the triangle inequality yields
$$|V_{u}(q)-V_{\mathrm{pool}}(q)|\leq\alpha|V_{u}^{\text{base}}(q)-V_{\mathrm{pool}}^{\text{base}}(q)|+(1-\alpha)|V_{u}^{\text{pers}}(q)-V_{\mathrm{pool}}^{\text{pers}}(q)|. \tag{99}$$
This completes the proof. ∎
This proposition shows that the bias of standard GRPO has two sources: (1) baseline mismatch, since it uses $V_{\mathrm{pool}}(q)$ instead of the true $V_{u}(q)$; and (2) normalization-scale mismatch, since it uses $\sigma_{\mathrm{pool}}(q)$ instead of $\sigma_{u}(q)$. In personalized settings, the dominant bias typically comes from the personalized component, whose strength is characterized by
$$\mathcal{H}(q)=\mathbb{E}_{u}\left[(\mu_{u}(q)-\mu_{\mathrm{pool}}(q))^{2}\right]. \tag{100}$$
When $\mathcal{H}(q)$ is large, standard GRPO suffers from significant cross-user preference mixing bias.
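To make the two bias sources concrete, the following sketch computes the pooled advantage of Eq. (88), the user-specific advantage of Eq. (89), and the right-hand side of the bound in Eq. (90) for scalar inputs. It is an illustration under assumptions: the function name and the numeric values are hypothetical, and the lower bound on the two standard deviations is taken to be their minimum, which is the tightest choice the proof allows.

```python
def grpo_error_and_bound(r, v_u, v_pool, s_u, s_pool, eps=1e-6):
    """Return (actual error, Proposition 2 bound) for one trajectory reward `r`.

    v_u / v_pool are the user-specific and pooled baselines; s_u / s_pool are
    the user-specific and pooled standard deviations.
    """
    a_grpo = (r - v_pool) / (s_pool + eps)   # Eq. (88)
    a_true = (r - v_u) / (s_u + eps)         # Eq. (89)
    error = abs(a_grpo - a_true)

    s_min = min(s_u, s_pool)                 # assumed choice of the lower bound
    baseline_term = abs(v_u - v_pool) / (s_min + eps)
    scale_term = abs(r - v_u) * abs(s_u - s_pool) / (s_min + eps) ** 2
    return error, baseline_term + scale_term  # Eq. (90)


# Hypothetical user whose preferences differ from the pooled population.
error, bound = grpo_error_and_bound(r=0.8, v_u=0.7, v_pool=0.4, s_u=0.1, s_pool=0.3)
assert error <= bound
```

When the pooled and user-specific statistics coincide, both terms of the bound are zero, so the error disappears exactly in the homogeneous case.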
### C\.3Bias Reduction Mechanism of PARPO
We now analyze why PARPO can reduce the above bias\. Since our implementation is centered on single\-user anchor calibration, we first study the individual\-level bias bound induced by user\-specific historical anchors\. We then present a more general extension in which anchor calibration is combined with local group statistics\.
For completeness, we first write a generalized personalized advantage form that combines a batch\-level statistic with a user\-specific anchor:
$$A_{i}^{\text{pers}}=\frac{R_{\text{pers}}(\tau_{i},p_{u},h_{u})-\max\left(\mathrm{mean}(\{R_{\text{pers}}(\tau_{j},p_{u},h_{u})\}_{j=1}^{G}),\;b_{u}-\epsilon_{u}\right)}{\sigma_{u}+\epsilon}. \tag{101}$$
For theoretical analysis, we abstract it as
$$\bar{A}_{\text{pers}}^{\mathrm{P\text{-}GRPO}}(\tau\mid u,q)=\frac{R_{\text{pers}}(\tau,p_{u},h_{u})-\tilde{\mu}_{u}(q)}{\sigma_{u}+\epsilon}, \tag{102}$$
where the approximate user-specific baseline is
$$\tilde{\mu}_{u}(q)=\max(\mu_{G(u)}(q),\,b_{u}(q)-\epsilon_{u}). \tag{103}$$
Correspondingly, the true normalized personalized advantage is
$$\bar{A}_{\text{pers}}^{*}(\tau\mid u,q)=\frac{R_{\text{pers}}(\tau,p_{u},h_{u})-\mu_{u}(q)}{\sigma_{u}+\epsilon}. \tag{104}$$
We first consider an anchor\-only abstraction that isolates the effect of user\-specific historical calibration, where the personalized baseline is determined purely by the user’s own anchor\.
###### Theorem 3\(Individual\-level bias bound with user\-specific anchor\)\.
For any user $u$, query $q$, and trajectory $\tau$, consider the anchor-calibrated personalized advantage
$$\bar{A}_{\mathrm{pers}}^{\mathrm{anchor}}(\tau\mid u,q)=\frac{R_{\mathrm{pers}}(\tau,p_{u},h_{u})-(b_{u}(q)-\epsilon_{u})}{\sigma_{u}(q)+\epsilon}. \tag{105}$$
If the user-specific historical anchor satisfies
$$|b_{u}(q)-\mu_{u}(q)|\leq\delta_{u}, \tag{106}$$
then
$$\left|\bar{A}_{\mathrm{pers}}^{\mathrm{anchor}}(\tau\mid u,q)-\bar{A}_{\mathrm{pers}}^{*}(\tau\mid u,q)\right|\leq\frac{\delta_{u}+\epsilon_{u}}{\sigma_{u}(q)+\epsilon}. \tag{107}$$
Moreover, in expectation over users,
$$\mathbb{E}_{u}\left[\left|\bar{A}_{\mathrm{pers}}^{\mathrm{anchor}}(\tau\mid u,q)-\bar{A}_{\mathrm{pers}}^{*}(\tau\mid u,q)\right|\right]\leq\frac{\bar{\delta}+\bar{\epsilon}}{\sigma_{\min}+\epsilon}. \tag{108}$$
###### Proof\.
By definition,
$$\begin{aligned}
\bar{A}_{\mathrm{pers}}^{\mathrm{anchor}}(\tau\mid u,q)-\bar{A}_{\mathrm{pers}}^{*}(\tau\mid u,q)
&=\frac{R_{\mathrm{pers}}(\tau,p_{u},h_{u})-(b_{u}(q)-\epsilon_{u})}{\sigma_{u}(q)+\epsilon}-\frac{R_{\mathrm{pers}}(\tau,p_{u},h_{u})-\mu_{u}(q)}{\sigma_{u}(q)+\epsilon} \\
&=\frac{\mu_{u}(q)-b_{u}(q)+\epsilon_{u}}{\sigma_{u}(q)+\epsilon}.
\end{aligned} \tag{109}$$
Hence,
$$\begin{aligned}
\left|\bar{A}_{\mathrm{pers}}^{\mathrm{anchor}}(\tau\mid u,q)-\bar{A}_{\mathrm{pers}}^{*}(\tau\mid u,q)\right|
&=\frac{|\mu_{u}(q)-b_{u}(q)+\epsilon_{u}|}{\sigma_{u}(q)+\epsilon} \\
&\leq\frac{|b_{u}(q)-\mu_{u}(q)|+\epsilon_{u}}{\sigma_{u}(q)+\epsilon} \\
&\leq\frac{\delta_{u}+\epsilon_{u}}{\sigma_{u}(q)+\epsilon}.
\end{aligned} \tag{110}$$
Taking expectation over users and using $\sigma_{u}(q)\geq\sigma_{\min}$ yields
$$\mathbb{E}_{u}\left[\left|\bar{A}_{\mathrm{pers}}^{\mathrm{anchor}}-\bar{A}_{\mathrm{pers}}^{*}\right|\right]\leq\frac{\bar{\delta}+\bar{\epsilon}}{\sigma_{\min}+\epsilon}. \tag{111}$$
This completes the proof. ∎
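The individual-level bound of Theorem 3 can be checked numerically with a few lines of code. The sketch below is illustrative only: the function names and sample values are hypothetical, the anchor error is taken to be exactly the distance between the anchor and the true preference center, and the margin is assumed non-negative.

```python
def anchor_advantage(r_pers, b_u, eps_u, sigma_u, eps=1e-6):
    """Eq. (105): the baseline is the historical anchor minus the margin."""
    return (r_pers - (b_u - eps_u)) / (sigma_u + eps)


def true_advantage(r_pers, mu_u, sigma_u, eps=1e-6):
    """Eq. (104): the baseline is the true user preference center."""
    return (r_pers - mu_u) / (sigma_u + eps)


def theorem3_bound(b_u, mu_u, eps_u, sigma_u, eps=1e-6):
    """Right-hand side of Eq. (107), with delta_u set to |b_u - mu_u|."""
    return (abs(b_u - mu_u) + eps_u) / (sigma_u + eps)


# Hypothetical user: the anchor overshoots the true center by 0.1.
mu_u, b_u, eps_u, sigma_u = 0.6, 0.7, 0.05, 0.2
for r_pers in (0.1, 0.5, 0.9):
    gap = abs(anchor_advantage(r_pers, b_u, eps_u, sigma_u)
              - true_advantage(r_pers, mu_u, sigma_u))
    assert gap <= theorem3_bound(b_u, mu_u, eps_u, sigma_u) + 1e-12
```

The gap is the same for every reward value, because the reward cancels in Eq. (109); only the anchor error and the margin remain.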
We next consider a more general extension in which the user\-specific anchor is further combined with local group statistics\.
To analyze this more general group\-augmented form of PARPO, we state the following conditions\.
##### Condition 1 \(Within\-group heterogeneity contraction\)\.
Let $G(u)$ be the local user group constructed based on user-profile similarity. We call the grouping *effective* if it reduces personalized preference heterogeneity, namely
$$\mathcal{H}_{G}(q)\leq\mathcal{H}(q). \tag{112}$$
This condition reflects the design goal of user-aware grouping: if users with similar preferences are grouped together, then the local group mean $\mu_{G(u)}(q)$ should be closer to the true user preference center $\mu_{u}(q)$ than the global pooled mean $\mu_{\mathrm{pool}}(q)$.
##### Condition 2 \(Anchor consistency\)\.
We assume that the historical anchor $b_{u}(q)$ is a bounded-error estimate of the true user preference center $\mu_{u}(q)$, i.e.,
$$|b_{u}(q)-\mu_{u}(q)|\leq\delta_{u}, \tag{113}$$
and denote the average anchor error by
$$\bar{\delta}:=\mathbb{E}_{u}[\delta_{u}]<\infty. \tag{114}$$
##### Condition 3 \(Bounded margin term\)\.
We further assume that the adaptive margin term is bounded in expectation:
$$\bar{\epsilon}:=\mathbb{E}_{u}[\epsilon_{u}]<\infty. \tag{115}$$
Under these conditions, the personalized normalized advantage estimation error of PARPO admits the following bound\.
###### Theorem 4\.
For any user $u$ and query $q$, the personalized normalized advantage estimation error of PARPO satisfies
$$\left|\bar{A}_{\text{pers}}^{\mathrm{P\text{-}GRPO}}(\tau\mid u,q)-\bar{A}_{\text{pers}}^{*}(\tau\mid u,q)\right|\leq\frac{|\mu_{G(u)}(q)-\mu_{u}(q)|+\delta_{u}+\epsilon_{u}}{\sigma_{u}+\epsilon}. \tag{116}$$
Moreover, in expectation over users,
$$\mathbb{E}_{u}\left[\left|\bar{A}_{\text{pers}}^{\mathrm{P\text{-}GRPO}}(\tau\mid u,q)-\bar{A}_{\text{pers}}^{*}(\tau\mid u,q)\right|\right]\leq\frac{\sqrt{\mathcal{H}_{G}(q)}+\bar{\delta}+\bar{\epsilon}}{\sigma_{\min}+\epsilon}. \tag{117}$$
###### Proof\.
By definition,
$$\begin{aligned}
\bar{A}_{\text{pers}}^{\mathrm{P\text{-}GRPO}}(\tau\mid u,q)-\bar{A}_{\text{pers}}^{*}(\tau\mid u,q)
&=\frac{R_{\text{pers}}(\tau,p_{u},h_{u})-\tilde{\mu}_{u}(q)}{\sigma_{u}+\epsilon}-\frac{R_{\text{pers}}(\tau,p_{u},h_{u})-\mu_{u}(q)}{\sigma_{u}+\epsilon} \\
&=\frac{\mu_{u}(q)-\tilde{\mu}_{u}(q)}{\sigma_{u}+\epsilon}.
\end{aligned} \tag{118}$$
Hence,
$$\left|\bar{A}_{\text{pers}}^{\mathrm{P\text{-}GRPO}}(\tau\mid u,q)-\bar{A}_{\text{pers}}^{*}(\tau\mid u,q)\right|=\frac{|\tilde{\mu}_{u}(q)-\mu_{u}(q)|}{\sigma_{u}+\epsilon}. \tag{119}$$
Since
$$\tilde{\mu}_{u}(q)=\max(\mu_{G(u)}(q),\,b_{u}(q)-\epsilon_{u}), \tag{120}$$
and $|\max(x,y)-c|\leq\max(|x-c|,|y-c|)\leq|x-c|+|y-c|$ for any real $x$, $y$, and $c$, the triangle inequality gives
$$|\tilde{\mu}_{u}(q)-\mu_{u}(q)|\leq|\mu_{G(u)}(q)-\mu_{u}(q)|+|b_{u}(q)-\mu_{u}(q)|+\epsilon_{u}. \tag{121}$$
By Condition 2,
$$|\tilde{\mu}_{u}(q)-\mu_{u}(q)|\leq|\mu_{G(u)}(q)-\mu_{u}(q)|+\delta_{u}+\epsilon_{u}. \tag{122}$$
Substituting this into the previous expression yields
$$\left|\bar{A}_{\text{pers}}^{\mathrm{P\text{-}GRPO}}(\tau\mid u,q)-\bar{A}_{\text{pers}}^{*}(\tau\mid u,q)\right|\leq\frac{|\mu_{G(u)}(q)-\mu_{u}(q)|+\delta_{u}+\epsilon_{u}}{\sigma_{u}+\epsilon}. \tag{123}$$
Taking expectation over users and using $\sigma_{u}\geq\sigma_{\min}$, we obtain
$$\mathbb{E}_{u}\left[\left|\bar{A}_{\text{pers}}^{\mathrm{P\text{-}GRPO}}-\bar{A}_{\text{pers}}^{*}\right|\right]\leq\frac{\mathbb{E}_{u}[|\mu_{G(u)}(q)-\mu_{u}(q)|]+\bar{\delta}+\bar{\epsilon}}{\sigma_{\min}+\epsilon}. \tag{124}$$
By Cauchy–Schwarz,
$$\mathbb{E}_{u}[|\mu_{G(u)}(q)-\mu_{u}(q)|]\leq\sqrt{\mathbb{E}_{u}[(\mu_{G(u)}(q)-\mu_{u}(q))^{2}]}=\sqrt{\mathcal{H}_{G}(q)}. \tag{125}$$
Therefore,
$$\mathbb{E}_{u}\left[\left|\bar{A}_{\text{pers}}^{\mathrm{P\text{-}GRPO}}-\bar{A}_{\text{pers}}^{*}\right|\right]\leq\frac{\sqrt{\mathcal{H}_{G}(q)}+\bar{\delta}+\bar{\epsilon}}{\sigma_{\min}+\epsilon}. \tag{126}$$
This completes the proof. ∎
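The group-augmented baseline of Eqs. (101) and (103) can be sketched as follows. This is an illustration under assumptions rather than the released implementation: the function names are hypothetical, the user-level scale is passed in as a given number, the group statistic is the plain mean of the sampled rewards, and the anchor error is taken to be exactly the distance between the anchor and the true preference center.

```python
from statistics import fmean


def parpo_baseline(group_rewards, b_u, eps_u):
    """Eq. (103)-style baseline: max of the group mean and the margin-shifted anchor."""
    return max(fmean(group_rewards), b_u - eps_u)


def parpo_personal_advantages(group_rewards, b_u, eps_u, sigma_u, eps=1e-6):
    """Eq. (101): personalized advantage of each rollout in the group."""
    baseline = parpo_baseline(group_rewards, b_u, eps_u)
    return [(r - baseline) / (sigma_u + eps) for r in group_rewards]


def baseline_error_bound(mu_group, mu_u, b_u, eps_u):
    """Numerator of Eq. (116), with delta_u set to |b_u - mu_u|."""
    return abs(mu_group - mu_u) + abs(b_u - mu_u) + eps_u


rewards = [0.2, 0.4, 0.6]   # hypothetical personalized rewards of G = 3 rollouts
mu_u = 0.5                  # hypothetical true user preference center
for b_u in (0.1, 0.7):      # a low anchor and a high anchor
    baseline = parpo_baseline(rewards, b_u, eps_u=0.1)
    bound = baseline_error_bound(fmean(rewards), mu_u, b_u, eps_u=0.1)
    assert abs(baseline - mu_u) <= bound + 1e-12
```

With the low anchor the group mean sets the baseline, and with the high anchor the margin-shifted anchor does, so a user with a strong history is not credited for rollouts that merely beat a weak batch.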
###### Corollary 5\.
To further compare the upper bounds of PARPO and standard GRPO, define the local heterogeneity contraction ratio as
$$\rho(q)=\frac{\mathcal{H}_{G}(q)}{\mathcal{H}(q)},\qquad 0<\rho(q)\leq 1, \tag{127}$$
and define the anchor–margin residual as
$$\eta(q)=\bar{\delta}+\bar{\epsilon}. \tag{128}$$
Here, $\rho(q)$ quantifies how much user-aware grouping compresses cross-user preference heterogeneity, while $\eta(q)$ measures the additional error introduced by anchor estimation and conservative margins.
Theorem [4](https://arxiv.org/html/2605.23382#Thmtheorem4) should be viewed as a generalized extension of the anchor-calibrated analysis above. In the main implementation studied in this paper, the core practical mechanism is the user-specific historical anchor itself, while the group-augmented form provides a broader perspective on how local preference statistics can further reduce bias when such structure is available.
By Theorem [4](https://arxiv.org/html/2605.23382#Thmtheorem4), we have
$$\mathbb{E}_{u}\left[\left|\bar{A}_{\text{pers}}^{\mathrm{P\text{-}GRPO}}(\tau\mid u,q)-\bar{A}_{\text{pers}}^{*}(\tau\mid u,q)\right|\right]\leq\frac{\sqrt{\rho(q)\mathcal{H}(q)}+\eta(q)}{\sigma_{\min}+\epsilon}. \tag{129}$$
On the other hand, the dominant bias term of standard GRPO is controlled by the global heterogeneity:
$$\mathbb{E}_{u}\left[\left|\bar{A}^{\mathrm{GRPO}}(\tau\mid q)-\bar{A}_{u}^{*}(\tau\mid q)\right|\right]\lesssim\frac{\sqrt{\mathcal{H}(q)}}{\sigma_{\min}+\epsilon}. \tag{130}$$
If the grouping is strictly contractive, i.e., $\rho(q)<1$, and the residual satisfies
$$\eta(q)\leq(1-\sqrt{\rho(q)})\sqrt{\mathcal{H}(q)}, \tag{131}$$
then
$$\sqrt{\rho(q)\mathcal{H}(q)}+\eta(q)\leq\sqrt{\mathcal{H}(q)}. \tag{132}$$
Consequently,
$$\mathbb{E}_{u}\left[\left|\bar{A}_{\text{pers}}^{\mathrm{P\text{-}GRPO}}(\tau\mid u,q)-\bar{A}_{\text{pers}}^{*}(\tau\mid u,q)\right|\right]\leq\mathbb{E}_{u}\left[\left|\bar{A}^{\mathrm{GRPO}}(\tau\mid q)-\bar{A}_{u}^{*}(\tau\mid q)\right|\right] \tag{133}$$
at the level of dominant heterogeneity terms.
This corollary shows that the advantage of PARPO over standard GRPO does not come merely from adding a personalized reward\. Rather, it comes from a structural trade\-off: the heterogeneity reduction achieved by user\-aware grouping can outweigh the additional error introduced by anchor estimation and conservative margins\. In this sense, the improvement of PARPO is structural rather than heuristic\.
In particular, when the quality of user-profile clustering improves, $\rho(q)$ becomes smaller; when richer user histories are available, the anchor estimation error $\bar{\delta}$ typically also decreases. Therefore, the theoretical improvement margin of PARPO increases jointly with grouping quality and anchor estimation quality.
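The trade-off in Corollary 5 reduces to a one-line inequality, which the sketch below evaluates for given heterogeneity and residual values. The function name and the numbers are hypothetical; the code compares only the numerators of the two upper bounds, since both share the same denominator, and it assumes the global heterogeneity is strictly positive.

```python
import math


def compare_bound_numerators(h_global, h_group, delta_bar, eps_bar):
    """Check the residual condition of Eq. (131) and return both bound numerators.

    Returns (condition_holds, parpo_numerator, grpo_numerator).
    """
    rho = h_group / h_global                      # Eq. (127), requires h_global > 0
    eta = delta_bar + eps_bar                     # Eq. (128)
    condition_holds = eta <= (1 - math.sqrt(rho)) * math.sqrt(h_global)
    parpo_numerator = math.sqrt(rho * h_global) + eta
    grpo_numerator = math.sqrt(h_global)
    return condition_holds, parpo_numerator, grpo_numerator


# Hypothetical case: grouping removes 75% of the heterogeneity, small residual.
ok, parpo_num, grpo_num = compare_bound_numerators(0.04, 0.01, 0.03, 0.02)
assert ok and parpo_num <= grpo_num

# Hypothetical case: same grouping, but noisy anchors and large margins.
ok, parpo_num, grpo_num = compare_bound_numerators(0.04, 0.01, 0.2, 0.1)
assert not ok and parpo_num > grpo_num
```

The second case shows why the condition matters: a good grouping does not help if the anchor error and the margin together exceed the heterogeneity that the grouping removed.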
Theorem [3](https://arxiv.org/html/2605.23382#Thmtheorem3), Theorem [4](https://arxiv.org/html/2605.23382#Thmtheorem4), and Corollary [5](https://arxiv.org/html/2605.23382#Thmtheorem5) jointly explain the key design components of PARPO from the perspective of bias control:
- Reward decomposition: It explicitly separates the general-quality reward from the personalized preference reward, so that user heterogeneity is handled mainly within $R_{\text{pers}}$ rather than being mixed into a single reward scale.
- User-specific anchor calibration: In the implementation studied in this paper, it uses $b_{u}(q)$ as a stable user-level historical reference, avoiding over-reliance on pooled cross-user statistics and reducing user-specific baseline mismatch.
- Group-augmented extension: When reliable local preference structure is available, the local group mean $\mu_{G(u)}(q)$ can further refine the personalized baseline, thereby shrinking global heterogeneity $\mathcal{H}(q)$ into local heterogeneity $\mathcal{H}_{G}(q)$.
These results clarify how PARPO controls personalized advantage estimation bias in both the anchor\-centered setting studied in this paper and the more general group\-augmented extension\.
### C\.4Summary
In summary, the theoretical analysis leads to the following three conclusions:
1. Under heterogeneous user preferences, the optimal value of personalized decision-making is no smaller than that of user-agnostic decision-making, and the gain can be explicitly characterized by preference dispersion. This justifies the necessity of personalized optimization.
2. In personalized settings, standard GRPO uses pooled baselines and pooled normalization scales for relative comparison, which introduces cross-user preference mixing bias governed by the global heterogeneity $\mathcal{H}(q)$.
3. In the implementation studied in this paper, PARPO reduces this bias primarily through reward decomposition and user-specific anchor calibration, yielding an individual-level personalized advantage-estimation bound controlled by anchor error and conservative margin. More generally, when local preference grouping is available, PARPO can further shrink the dominant error term from global heterogeneity $\mathcal{H}(q)$ to local heterogeneity $\mathcal{H}_{G}(q)$.
Therefore, PARPO is not merely GRPO with an additional personalized reward term\. Instead, it moves toward user\-conditioned policy optimization at both the objective level and the advantage\-estimation level\. In the setting studied in this paper, this improvement is driven primarily by user\-specific anchor calibration, while group\-based refinement provides a more general extension when local preference structure is available\.
## Appendix DAdditional Experimental Details
### D\.1Compute Resources
Unless otherwise specified, our experiments were conducted on NVIDIA H100 GPUs\. For experiments using 8B models, we used 8 H100 GPUs\. For experiments using 4B models, we used 4 H100 GPUs\. Additional implementation and runtime configuration details, including model loading and execution parameters, are provided in the released codebase\.
### D\.2Environment Details
##### ETAPP\.
ETAPP is a public benchmark for personalized personal\-assistant agents\. It evaluates whether an agent can solve daily\-life tasks while adapting to user\-specific preferences and behaviors\. The benchmark contains 16 user personas, each associated with a structured profile and environment state\. User profiles include demographic information and preferences across multiple aspects such as entertainment, lifestyle, technology, exercise, shopping, and travel\. In addition, each user is associated with tool\-specific profile information that captures fine\-grained behavioral preferences\.
The environment covers a set of daily\-life scenarios, including scheduling, email management, health monitoring, music, navigation, shopping, smart\-home control, weather inquiry, and web browsing\. These functionalities are instantiated through tool APIs together with sandbox databases, enabling the agent to interact with a realistic but controlled user environment\.
##### ETAPP\-Hard\.
ETAPP-Hard is a more challenging split constructed on top of the same ETAPP environment. Compared with the original ETAPP split, ETAPP-Hard places stronger demands on multi-tool coordination, personalization, proactive behavior, and implicit reasoning. Hard instances are designed to require the agent to jointly reason over 3–5 tool categories, infer unstated constraints from user profiles or environment states, and resolve trade-offs through multi-step decision-making.
The final ETAPP\-Hard dataset used in this paper contains 200 synthesized hard instructions\. After expanding them over 16 user personas, we obtain 3,200 total instances, which are split into 2,880 training examples and 320 test examples\.
##### SJAgent\.
SJAgent is a realistic merchant\-assistant environment built from merchant data on 1688, a major Chinese B2B e\-commerce platform\. It is designed for merchant\-side decision\-making and recommendation tasks, including product\-opportunity diagnosis, industry insight, competitive analysis, and selection recommendation\. Given a merchant query together with merchant profiles, historical interactions, and candidate products, the agent performs multi\-step analysis and outputs a structured report\.
The environment is organized as a long\-horizon analytical pipeline\. A planner first selects relevant skills and decomposes the task into a directed acyclic graph of sub\-steps, after which the system retrieves market evidence, performs intermediate analyses, and generates the final report\. The underlying skill library contains 18 specialized analytic skills covering industry trend analysis, ranking analysis, competition diagnosis, pricing and sales forecast, merchant–category matching, review mining, audience profiling, risk screening, and trend analysis\.
### D\.3ETAPP\-Hard Construction Pipeline
ETAPP\-Hard is constructed through a two\-stage pipeline that synthesizes hard user instructions and converts them into training instances for agent learning\.
##### Overview\.
The goal of ETAPP\-Hard is to create more demanding personalized assistant tasks than those in the original ETAPP split\. Compared with the original benchmark, ETAPP\-Hard emphasizes multi\-tool orchestration, deep personalization, implicit constraints, multi\-step reasoning, and conflict resolution\.
##### Source data\.
The synthesis pipeline uses the original ETAPP instruction set, 16 user profiles, user\-specific environment databases, shared environment resources, and tool definitions\. To ensure grounded generation, each synthesis call includes both a user profile summary and sampled rows from the relevant databases\.
##### Stage I: hard instruction synthesis\.
In the first stage, we synthesize 200 hard instructions with a large language model\. For each synthesis attempt, we randomly sample a user, a timestamp, a location, and a cross\-category tool combination\. We then retrieve several seed instructions with overlapping tool usage as inspirations, sample concrete tools from the selected categories, and construct a prompt containing the user profile, sampled environment states, current time, location, and target tools\.
The language model is instructed to generate a natural but challenging user request together with personalization and proactivity keypoints\. To control difficulty, we predefine combinations involving 3, 4, or 5 tool categories\. The default synthesis setting uses a temperature of 0\.85 and a random seed of 42\.
##### Stage II: conversion to training records\.
In the second stage, each synthesized hard instruction is expanded over all 16 user personas to form training instances\. For each instance, we assemble the corresponding tool schemas, construct a coarse user\-status description from the timestamp, and fill the ETAPP system template with user profile information, preferences, current status, available tools, and the task description\. The resulting dataset is shuffled with seed 42 and split into train and test subsets with a ratio of 9:1\.
##### Dataset scale\.
The final ETAPP\-Hard dataset contains 200 hard instructions and 3,200 total instances after expansion over 16 user personas\. We use 2,880 instances for training and 320 for testing\.
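The expansion and split arithmetic can be reproduced with a short sketch. Only the counts (200 instructions, 16 personas), the seed of 42, and the 9:1 ratio come from the text; the function name, the use of Python's `random.Random`, and the integer placeholders standing in for instructions and personas are assumptions, and the released pipeline may shuffle differently.

```python
import random


def expand_and_split(instructions, personas, seed=42):
    """Pair every instruction with every persona, shuffle, and split 9:1."""
    records = [(instruction, persona)
               for instruction in instructions
               for persona in personas]
    rng = random.Random(seed)          # local generator, so global state is untouched
    rng.shuffle(records)
    n_train = len(records) * 9 // 10   # 9:1 train/test ratio
    return records[:n_train], records[n_train:]


# Integer ids stand in for the 200 hard instructions and 16 user personas.
train, test = expand_and_split(range(200), range(16))
print(len(train), len(test))  # 2880 320
```

Because the shuffle uses a seeded local generator, repeated calls return the same split, which is what makes the reported train and test subsets reproducible.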
##### Prompt templates\.
Below we include the core prompts used in the ETAPP-Hard synthesis pipeline.
Prompt A.1: ETAPP-Hard synthesis system prompt

You are an expert at designing challenging evaluation tasks for AI personal assistants.

Your goal is to create HARD, COMPLEX instructions that test an AI assistant’s ability to:

1. Multi-tool orchestration: require using 3–5 different tool categories together.
2. Deep personalisation: the assistant must deeply leverage the user’s profile, preferences, and current data.
3. Implicit constraints: the user does NOT explicitly state all constraints; the assistant must infer them from context, such as schedule conflicts, dietary restrictions, budget limits, and health conditions.
4. Multi-step reasoning: information from one tool call is needed to decide what to do with another tool.
5. Conflict resolution: the task involves trade-offs or requires the assistant to propose alternatives.

Important rules:

- The query should sound natural, like a real person talking to their AI assistant.
- The query should be 1–3 sentences, not a detailed specification.
- The complexity should come from the context rather than the query length.
- The query should not mention specific tool names.
- The query should be in English.
- Each instruction must include keypoints explaining the expected personalization and proactivity.
- Output valid JSON only, with no extra text before or after the JSON object.
Prompt A.2: ETAPP-Hard synthesis user prompt template

Generate ONE hard/complex instruction for the user "{user_name}".

\#\# User Profile

{profile_summary}

\#\# User’s Current Database State (sample data)

{db_context_str}

\#\# Context

- Current time: {timestamp}
- Location: {location}
- Tool categories to involve: {list_of_combo_categories}

\#\# Seed Instructions for Inspiration (simpler versions — your output should be HARDER)

{seed_queries}

\#\# Available Tools for This Task

{tools_json}

\#\# Output Format (strict JSON only, no markdown fences)

Return a JSON object with the following fields:

- "timestamp": "{timestamp}"
- "query": the natural-language user query (1–3 sentences)
- "keypoint for personal": a list of personalization keypoints
- "keypoint for proactive": a list of proactive keypoints
- "available_tools_name": {tools_json}
- "location": "{location}"
- "difficulty": "hard"
- "complexity_reason": a brief explanation of why this instruction is complex

Important:

- The query must sound natural and conversational.
- Include 3–5 keypoints for personal and 3–6 keypoints for proactive.
- Each keypoint should reference specific tools by name.
- Ground the query in the user’s actual database state shown above.
- Output only the JSON object and nothing else.
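Since Prompt A.2 asks for strict JSON with fixed fields and keypoint counts, a synthesized record can be screened before it enters the dataset. The sketch below is one possible check, not a description of the authors' pipeline: the field names and count ranges are taken from the prompt, while the function name, the returned list of problem strings, and the decision to run such a filter at all are assumptions.

```python
import json

# Field names as listed in the output format of Prompt A.2.
REQUIRED_FIELDS = (
    "timestamp", "query", "keypoint for personal", "keypoint for proactive",
    "available_tools_name", "location", "difficulty", "complexity_reason",
)


def validate_synthesized(raw):
    """Return a list of problems found in one model output; empty means it passes."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not a JSON object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in record]
    personal = record.get("keypoint for personal")
    proactive = record.get("keypoint for proactive")
    if not (isinstance(personal, list) and 3 <= len(personal) <= 5):
        problems.append("expected 3-5 personal keypoints")
    if not (isinstance(proactive, list) and 3 <= len(proactive) <= 6):
        problems.append("expected 3-6 proactive keypoints")
    if record.get("difficulty") != "hard":
        problems.append('difficulty must be "hard"')
    return problems
```

A record that fails could be regenerated rather than repaired, which keeps the synthesized instructions consistent with the format the evaluation keypoints rely on.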
### D\.4Evaluation Metrics and Official Prompts
##### ETAPP and ETAPP\-Hard\.
ETAPP and ETAPP\-Hard are evaluated with four metrics: Procedure, Personal, Proactive, and Judge\. Procedure evaluates whether the agent follows a correct and coherent problem\-solving process, including task completion, avoidance of unnecessary actions, accurate tool use, and clear summarization\. Personal evaluates whether the agent appropriately reflects the user’s profile and preferences\. Proactive evaluates whether the agent provides meaningful assistance beyond the explicitly stated request, such as identifying latent needs, surfacing risks, or proposing useful alternatives\.
Judge is the normalized aggregate score:
$$\mathrm{Judge}=\frac{\mathrm{Procedure}+\mathrm{Personal}+\mathrm{Proactive}}{15},$$
which ranges from 0 to 1. The three fine-grained dimensions are scored on a 0–5 scale by an LLM judge conditioned on the user query, user profile, the assistant trajectory, and instance-level keypoints. In the original ETAPP split, these keypoints are manually annotated; in ETAPP-Hard, they are synthesized together with the instructions.
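The aggregate is simple enough to state as a function. The sketch below follows the formula above; the function name and the range check are illustrative additions.

```python
def judge_score(procedure, personal, proactive):
    """Normalized aggregate of three 0-5 dimension scores, in [0, 1]."""
    for score in (procedure, personal, proactive):
        if not 0 <= score <= 5:
            raise ValueError(f"dimension score out of range: {score}")
    return (procedure + personal + proactive) / 15


print(judge_score(4, 3, 5))  # 0.8
```

Dividing by 15 weights the three dimensions equally, so a one-point change in any dimension moves Judge by the same amount.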
##### Official ETAPP evaluation prompt\.
The official ETAPP evaluation prompt defines the three dimensions above, gives detailed scoring criteria, and requires the evaluator to return a structured JSON output containing both explanations and final scores\. The prompt also specifies that successful tool usage must be supported by actual tool feedback rather than self\-inferred answers, so invalid or hallucinated tool invocations are penalized\.
Prompt A.3: Official ETAPP evaluation prompt template

I need you to evaluate whether the solution provided by my artificial intelligence assistant completes user instructions and meets user preferences.

Evaluation Metrics:

- Procedure
- Personalization
- Proactivity

Evaluation Guidelines:

1. Procedure Analysis: assess the AI assistant’s entire solution process, including tool usage, logic, and final output.
2. Personalization Assessment: evaluate whether the assistant considered the user’s specific preferences, profile details, and context.
3. Proactivity Behavior Assessment: evaluate whether the assistant anticipated additional needs or proposed meaningful helpful actions.

Analysis Format:

Your analysis should follow this JSON structure:

{output_format}

Evaluation Input:

1. User Query: {query}
2. User Profile: {profile}
3. Personal LLM Assistant Solution: {output}
##### SJAgent\.
SJAgent is evaluated with an LLM\-as\-a\-judge protocol\. The judge reads the full trajectory, including the merchant query, merchant profile, planner output, retrieved evidence, intermediate analyses, and final report, and assigns five scores, each ranging from 0 to 4: Data Authenticity, Business Logic, Merchant Profile Match, Task Completion, and Market Analysis Depth\. The final scalar reward is defined as
$$r=\frac{1}{20}\sum_{i=1}^{5}D_{i}\in[0,1].$$
Data Authenticity measures whether factual claims are grounded in traceable evidence\. Business Logic evaluates whether the recommendation follows a coherent evidence\-to\-judgment\-to\-recommendation chain\. Merchant Profile Match measures personalization\. Task Completion evaluates whether the report fully closes the analysis\-to\-recommendation loop\. Market Analysis Depth evaluates the breadth and depth of market analysis\. The official SJAgent judge prompt follows a structured scoring rubric and returns both per\-dimension scores and detailed rationales\. If parsing or evaluation fails, the reward is set to zero\.
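The reward computation can be sketched as follows\. This is a hedged illustration only: the JSON layout \(a top\-level `scores` object keyed by snake\-case dimension names\) and the function name `sjagent_reward` are assumptions, since the judge's exact output schema is not reproduced here\. Treating out\-of\-range scores as an evaluation failure is likewise an assumption\.

```python
import json

# Hypothetical keys for the five judged dimensions, each scored from 0 to 4.
DIMENSIONS = (
    "data_authenticity",
    "business_logic",
    "merchant_profile_match",
    "task_completion",
    "market_analysis_depth",
)


def sjagent_reward(judge_output: str) -> float:
    """Map raw judge output to a scalar reward in [0, 1]; failures yield 0."""
    try:
        scores = json.loads(judge_output)["scores"]
        values = [float(scores[name]) for name in DIMENSIONS]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing dimension, or a non-numeric score.
        return 0.0
    if any(not 0 <= value <= 4 for value in values):
        # Assumption: scores outside the 0-4 rubric count as a failed evaluation.
        return 0.0
    # Five dimensions with a maximum of 4 each give a maximum sum of 20.
    return sum(values) / 20


if __name__ == "__main__":
    example = json.dumps({"scores": {name: 3 for name in DIMENSIONS}})
    print(sjagent_reward(example))
```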
### D\.5 Training Details
##### SJAgent\.
For SJAgent, only the planner model is optimized during reinforcement learning\. The analyzer, report generator, and execution engine remain fixed and continue to call an external frozen model endpoint\. We experiment with Qwen\-based planner models of different scales, including a 4B default setting and an 8B larger variant\.
Our RL training uses a shared configuration across algorithmic variants to ensure fair comparison\. The default setup uses a training batch size of 32, a validation batch size of 64, a maximum prompt length of 15,000, a maximum response length of 4,000, and 4 rollout samples per prompt\. Training is conducted on 4 H100 GPUs on a single node, and the actor is optimized with AdamW using a learning rate of $1\times 10^{-6}$\.
To improve determinism and efficiency, SJAgent uses a skill\-cache mechanism that intercepts skill execution and serves results from a pre\-warmed cache instead of repeatedly issuing live backend requests\. All RL\-family baselines on SJAgent are implemented through the same training pipeline, with algorithm\-specific differences controlled by configuration switches\.
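One way to picture the skill\-cache idea is the sketch below\. It is not the SJAgent implementation: the class name `SkillCache`, the `warm` and `execute` methods, the hashing scheme, and the choice to raise on a cache miss are all assumptions made for illustration\.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class SkillCache:
    """Intercepts skill execution and serves results from a pre-warmed cache."""

    def __init__(self, backend: Callable[[str, dict], Any], allow_live: bool = False):
        self._backend = backend        # live backend, used for warming
        self._allow_live = allow_live  # assumption: live calls are off during RL
        self._store: Dict[str, Any] = {}

    @staticmethod
    def _key(skill: str, args: dict) -> str:
        # Canonical JSON makes the key independent of argument ordering.
        payload = json.dumps({"skill": skill, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def warm(self, skill: str, args: dict) -> None:
        """Call the backend once ahead of training and remember the result."""
        self._store[self._key(skill, args)] = self._backend(skill, args)

    def execute(self, skill: str, args: dict) -> Any:
        """Return the cached result; fall back to the backend only if allowed."""
        key = self._key(skill, args)
        if key in self._store:
            return self._store[key]
        if not self._allow_live:
            raise KeyError(f"no cached result for skill {skill!r}")
        result = self._backend(skill, args)
        self._store[key] = result
        return result
```

Serving identical results for identical calls means repeated rollouts of the same prompt see the same observations, which is the determinism benefit described above\.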
##### ETAPP\-Hard\.
For ETAPP\-Hard, the hard training set is generated by the two\-stage construction pipeline described above and converted into RLHFDataset parquet files\. The ETAPP environment uses a unified tool interface based on OpenAI\-style function schemas, with a maximum interaction budget of 20 turns per episode\. The ETAPP\-Hard dataset contains 2,880 training instances and 320 test instances\.
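To make the interface concrete, the sketch below pairs an OpenAI\-style function schema with an episode loop capped at 20 turns\. The tool `search_products`, the action dictionary format, and the function `run_episode` are hypothetical; the sketch also assumes that one policy action counts as one turn, which the description above does not specify\.

```python
from typing import Any, Callable, Dict, List

MAX_TURNS = 20  # interaction budget per episode, as stated in the text

# Hypothetical tool described with an OpenAI-style function schema.
SEARCH_PRODUCTS_SCHEMA = {
    "type": "function",
    "function": {
        "name": "search_products",
        "description": "Search the product catalog for items matching a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}


def run_episode(
    policy: Callable[[List[dict]], dict],
    tools: Dict[str, Callable[..., Any]],
    max_turns: int = MAX_TURNS,
) -> List[dict]:
    """Roll out one episode; stop at a final answer or when the budget runs out."""
    history: List[dict] = []
    for _ in range(max_turns):
        action = policy(history)
        if action["type"] == "final":
            history.append(action)
            return history
        # Otherwise the action is a tool call: run it and record the observation.
        observation = tools[action["name"]](**action["arguments"])
        history.append({**action, "observation": observation})
    return history  # budget exhausted without a final answer
```

A policy that never emits a final answer therefore yields exactly `max_turns` history entries, which makes truncated episodes easy to detect\.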
##### Implementation consistency\.
Across all experiments, we keep model scales, benchmark settings, tool interfaces, and evaluation protocols aligned whenever possible\. Closed\-source baselines are evaluated under the same benchmark settings as open\-source methods, and open\-source baselines share the same environment setup for each benchmark\.
## Appendix E Additional Reinforcement Learning Analysis on ETAPP\-Hard
In this appendix, we provide additional reinforcement learning results on the more challenging ETAPP\-Hard benchmark\. Consistent with the main text, all reinforcement learning experiments in this section are conducted without using our personalized reward model or Skill Graph\. Therefore, the analysis here is not intended as a comparison between full systems, but rather to further examine the behavior of different RL optimization methods and reward optimization objectives under a more difficult setting\.
### E\.1 Comparison of RL Training Strategies on ETAPP\-Hard
We first compare GRPO, GSPO, GiGPO, and PARPO on ETAPP\-Hard\. Since ETAPP\-Hard involves more complex tasks and more challenging interaction scenarios, it provides a stricter test of optimization stability, planning ability, and effective tool use in multi\-step environments\.
As shown in Figure [5](https://arxiv.org/html/2605.23382#A5.F5), PARPO still exhibits the best overall training dynamics even under the more challenging ETAPP\-Hard setting\. In particular, PARPO consistently achieves higher training reward than the other baselines, indicating that it learns a more effective policy update direction during optimization\. At the same time, PARPO also maintains superior training and validation success rates, suggesting that its advantage is not limited to fitting the training environment but also transfers better to held\-out evaluation settings\.
From a behavioral perspective, PARPO also achieves a higher tool\-call success rate\. This suggests that PARPO improves not only the final task completion outcome, but also the intermediate capabilities required for effective interaction with the environment, especially in scenarios that require proactive decision\-making and external tool usage\. In addition, while achieving better performance, PARPO keeps the KL divergence low and stable, indicating that the performance gain does not come from excessively deviating from the initial policy, but rather from more stable and efficient policy improvement\.
Overall, the ETAPP\-Hard results are consistent with the observations on ETAPP in the main text: PARPO shows stronger performance in training efficiency, optimization stability, and final success rate\. This suggests that the advantage of PARPO is not limited to the standard setting, but also generalizes to more difficult proactive task\-oriented interaction scenarios\.
Figure 5: RL training dynamics of Qwen3\-8B on ETAPP\-Hard, comparing GRPO, GSPO, GiGPO, and PARPO\. Consistent with the main text, all RL training experiments in this section are conducted without using our personalized reward model or Skill Graph\.
### E\.2 Effect of Personalized Reward Optimization on ETAPP\-Hard
We further analyze the effect of personalized reward optimization on ETAPP\-Hard\. As in the main text, we compare GRPO, GSPO, GiGPO, and PARPO, and decompose the training process according to the four reward dimensions defined in ETAPP, namely personalization, judge, proactivity, and procedure\. Here, *judge* is a weighted aggregate metric computed from multiple evaluation criteria, and is therefore better interpreted as a summary of overall response quality rather than a single atomic capability\.
As shown in Figure [6](https://arxiv.org/html/2605.23382#A5.F6), PARPO continues to show consistent advantages across all four reward dimensions under the more difficult ETAPP\-Hard setting\. Although the absolute scores of all methods are lower than those on standard ETAPP due to the increased difficulty, PARPO still achieves higher personalization, proactivity, and procedure scores throughout training, while the judge metric also shows a better overall trend\. These observations suggest that the benefits of PARPO are not confined to one isolated aspect, but instead lead to more robust improvements across multiple complementary dimensions\.
More specifically, the improvement on personalization remains the most pronounced, indicating that PARPO is still better able to learn behaviors aligned with user preferences and contextual needs under challenging scenarios\. In addition, the sustained gain on proactivity suggests that PARPO is more effective at helping the model learn when to take initiative, when to invoke tools, and how to advance the interaction process more appropriately\. PARPO also achieves better results on the procedure dimension, indicating that it improves not only the final output quality, but also the organization and execution quality of the intermediate steps used to complete the task\.
Figure [7](https://arxiv.org/html/2605.23382#A5.F7) further summarizes the final EMA scores at the last training step\. PARPO achieves the best results across all four reward dimensions, with the most notable advantage on personalization\. Meanwhile, the improvement on the judge score, as a weighted aggregate metric, also reflects a consistent gain in overall response quality\. Taken together, both the training curves and the final summary results show that the benefits of personalized reward optimization extend beyond the standard setting and remain effective in the more complex ETAPP\-Hard scenario\.
Overall, these results indicate that explicitly optimizing decomposed personalized rewards provides a more effective learning signal in difficult interactive environments\. Its advantage is reflected not only in the final scores, but also in more stable training, more balanced improvements across reward dimensions, and stronger proactive behaviors, further demonstrating the effectiveness and generalization ability of PARPO\.
Figure 6: Personalized reward decomposition of Qwen3\-8B on ETAPP\-Hard, comparing GRPO, GSPO, GiGPO, and PARPO\.

Figure 7: Final EMA scores at the last training step across different reward dimensions on ETAPP\-Hard\.
## Appendix F Broader Impacts
This work studies personalized Agentic RL for user\-conditioned tasks, aiming to improve how agents adapt to diverse user preferences in applications such as e\-commerce assistance, travel planning, and daily scheduling\. A potential positive societal impact is that personalized agents may provide more relevant and user\-aligned recommendations, improving usability and decision support quality\.
At the same time, this work may introduce risks\. Personalized optimization depends on modeling user preferences and interaction histories, which may raise privacy concerns\. Inaccurate or biased preference modeling may also reinforce stereotypes, amplify filter bubbles, or produce unfair recommendations across users\. In addition, in domains such as e\-commerce, personalized agents could be misused to optimize for engagement in manipulative ways rather than users’ long\-term interests\.
To mitigate these risks, future deployments should incorporate privacy protection, user control over preference data, bias auditing across user groups, and monitoring for harmful or manipulative behavior\. We hope this work encourages future research on personalized agents that are not only more effective, but also safer, fairer, and more transparent\.
## Appendix G Case Study: Shopping\-Cart Alternatives
Question

User Query\. *Take a look at my shopping cart and recommend some cost\-effective alternatives for this type of product\.*

User Context\. Jordan Carter is a 28\-year\-old social media blogger in Los Angeles\. Her profile suggests preferences for trendy and aesthetically appealing products, vegan\-friendly lifestyle choices, travel gear, fashion, and tech accessories\.

PARPO \(Ours\)

Here is your personalized cost\-effective alternatives guide\. Current cart total: $374\.95\. Potential savings: up to $117\.12\.

• Smart Home Speaker \($129\.99\): recommends Amazon Echo alternatives priced at $49\.99 and $54\.99, highlighting both savings and aesthetic fit for Jordan’s home office\.

• Organic Cotton Bed Sheets \($59\.99\): suggests lower\-cost cotton or microfiber options such as UTOPIA Bedding and Mellanni, with estimated savings of $25–$37\.

• The Midnight Library \($14\.99\): suggests Kindle, audiobook, or similar literary alternatives with lower prices and better portability\.

The response explicitly connects recommendations to Jordan’s lifestyle, including home aesthetics, travel convenience, and day\-to\-day content creation\.

SkillRL

The system reports that tool\-based search does not return valid results, then falls back to practical recommendations based on Jordan’s profile\.

• Confirms the shopping cart contents\.

• Recommends Amazon Echo Dot as a lower\-cost smart\-speaker alternative\.

• Suggests vegan sneaker alternatives from Reformation or Pact\.

• Suggests lower\-cost travel backpack alternatives from Columbia or Patagonia\.

The response also states that these options remain compatible with Jordan’s preference for sustainable, trendy, and content\-creation\-friendly products\.

GPT\-4o

Here is a breakdown of the items in your cart and some lower\-cost alternatives\.

• Smart speaker: suggests Anker or Echo Dot alternatives in the $50–80 range\.

• Bed sheets: suggests Mellanni or Amazon Basics in the $30–40 range\.

• Book: suggests eBook or second\-hand versions at lower prices\.

• Vegan sneakers: suggests NAE Vegan Shoes or Will’s Vegan Store\.

• Travel backpack: suggests Lowepro or Amazon Basics options in the $40–60 range\.

The response ends by asking whether the user wants more specific alternatives for any item\.
Claude 3\.5 Sonnet

The assistant says that the current shopping cart has not been viewed or accessed in the provided context\. It asks the user to clarify:

• what product or category is in the cart,

• the desired budget range,

• and preferred brand or quality level\.

It then states that, once such information is provided, it can recommend lower\-cost options aligned with Jordan’s preferences\.