LATTE: Forecasting Peer Anchored Preference Trajectories for Personalized LLM Generation
Summary
Latte introduces a framework that represents personalization as forecasting a peer-anchored relative preference state using latent trajectories, injecting a soft token into a frozen LLM to achieve personalized generation. It outperforms existing personalization methods on Amazon Reviews 2023 and MemoryCD datasets.
# Latte: Forecasting Peer Anchored Preference Trajectories for Personalized LLM Generation
Source: [https://arxiv.org/html/2605.26612](https://arxiv.org/html/2605.26612)
Jinze Li¹, Xiaoyan Yang², Shuo Yang¹, Jinfeng Xu¹, Yue Shen², Jian Wang², Jinjie Gu², Edith Cheuk-Han Ngai¹,†
¹The University of Hong Kong ²Ant Healthcare, Ant Group †Corresponding authors
###### Abstract
Personalized generation with frozen large language models requires a conditioning signal that is both compact and current. Existing personalization methods typically retrieve or summarize user histories in text, or compress them into static latent profiles and soft prompts. These approaches are efficient, but they treat a user’s past behavior as an aggregate profile and therefore mix stable identity, recent drift, and item content in the same representation. We propose LAtent Trajectory Tracking and Extrapolation (Latte), a framework that represents personalization as forecasting a peer anchored relative preference state. For each historical session, Latte subtracts a time masked baseline formed from comparable users who responded to the same item, producing a state that measures how the target user differs from peers under a shared item context. A lightweight sequence predictor then forecasts the next state in this trajectory, and a State to Token Bridge injects the forecast into a frozen instruction tuned LLM through a single anchored soft token. We provide a latent factor analysis showing when peer anchoring cancels shared item variation and why temporal forecasting trades off stale averages against noisy recent states. Experiments on Amazon Reviews 2023 and MemoryCD show that Latte consistently outperforms retrieval, summary memory, static latent profiles, difference aware latent profiles, and soft prompt compression baselines. On Amazon Reviews 2023, Latte improves average ROUGE-L from 0.219 for a static latent profile and 0.245 for the strongest added latent compression baseline to 0.259. Additional pairwise comparisons and diagnostic analyses suggest that the improvement is mainly due to forecasting user-specific trajectory information, rather than merely adding a soft prompt interface.
## 1 Introduction
Large language models (LLMs) (Dubey et al., [2024](https://arxiv.org/html/2605.26612#bib.bib36)) are increasingly used in settings where the same input should lead to different outputs for different users. A review assistant should reflect a user’s writing style. A recommendation explanation should emphasize the criteria that the user cares about. A long running conversational agent should adapt as a user changes goals, tone, or interests across sessions. These scenarios raise a basic question for personalized generation. What should a frozen LLM condition on when it generates for a particular user at a particular moment?
Most existing systems answer this question with a user profile. Prompt based methods retrieve or summarize previous interactions and place the resulting text in the context (Salemi et al., [2024b](https://arxiv.org/html/2605.26612#bib.bib6); Kumar et al., [2024](https://arxiv.org/html/2605.26612#bib.bib7); Mysore et al., [2023](https://arxiv.org/html/2605.26612#bib.bib8); Salemi et al., [2024a](https://arxiv.org/html/2605.26612#bib.bib11)). Latent methods compress the history into an embedding, a steering direction, a user module, or a small set of soft prompt vectors (Qiu et al., [2025a](https://arxiv.org/html/2605.26612#bib.bib1), [b](https://arxiv.org/html/2605.26612#bib.bib2); Hebert and others, [2024](https://arxiv.org/html/2605.26612#bib.bib3); Liu and others, [2024](https://arxiv.org/html/2605.26612#bib.bib4); Ning et al., [2024](https://arxiv.org/html/2605.26612#bib.bib5)). These approaches are compact and often effective, but they usually treat the user as a static object. They aggregate past behavior into one representation and reuse it for future generation.
Static aggregation is limiting because the relevant user signal is often temporal\. A reviewer may become more technical after years of short impressions\. A reader may move from genre fiction to literary criticism\. A conversational user may revise constraints over several sessions\. In such cases, the useful signal is not only what the user has preferred on average, but where the user appears to be now\. This distinction is especially important for frozen LLM personalization, because the model can often generate fluent and item relevant text from the target metadata alone\. A stale or content dominated profile may therefore look plausible while failing to match the user’s current behavior\.
This paper studies personalized generation as latent state forecasting\. Instead of compressing the full history into a single profile, we construct a sequence of relative preference states and forecast the state that should condition the next generation\. This view separates three problems that static profiles merge\. First, each historical response should be converted into a state that is comparable across different items\. Second, the current state should be predicted from the ordered trajectory rather than estimated by an unordered average\. Third, the predicted state should be injected into a frozen LLM without adding user specific parameters\.
We propose LAtent Trajectory Tracking and Extrapolation (Latte). For each historical session, Latte forms a time masked peer baseline from comparable users who responded to the same item before the target timestamp. It subtracts this baseline from the target user’s response embedding and normalizes the residual. The resulting vector is a peer anchored relative state. It asks how the user responded relative to similar peers under the same item context, which reduces shared item variation before temporal modeling.
Given the sequence of peer anchored states, Latte trains a lightweight predictor to forecast the next state with a direct regression objective. This decouples state prediction from the language modeling loss. The separation is useful because generation loss alone can allow the conditioning vector to collapse into a low rank shortcut while the frozen LLM relies on item metadata. After forecasting, a State to Token Bridge maps the predicted state into the token embedding space of the frozen LLM. At inference time, the bridge replaces one placeholder token, and a natural language anchor tells the model how to interpret the injected state.
We evaluate Latte on Amazon Reviews 2023 and MemoryCD against retrieval, summary memory, static latent profiles, recent and time decayed latent profiles, a DEP style difference aware static profile, and a PERSOMA style soft prompt compression baseline. The strongest comparisons are therefore not only against simple static profiles, but also against latent compression and difference aware user modeling. Across datasets, Latte improves lexical overlap and history aware preference judgments. On Amazon Reviews 2023, it improves average ROUGE-L from 0.245 for the strongest added latent compression baseline to 0.259. Preference fidelity metrics, peer leakage controls, bootstrap intervals, and collapse diagnostics indicate that the gains come from forecasting user specific trajectory information rather than from the soft prompt slot alone.
Our contributions are as follows\.
- We formulate frozen LLM personalization as forecasting a peer anchored relative preference state, turning user history from a static profile into a time ordered latent trajectory.
- We introduce Latte, a modular framework that constructs same item peer residual states, predicts the next state with a lightweight sequence model, and injects the forecast into a frozen LLM through one anchored soft token.
- We provide analytical and empirical evidence that peer anchoring reduces shared item variation, trajectory forecasting improves over static latent compression, and the observed gains are not explained by peer leakage, bridge mismatch, or representation collapse.
## 2 Related Work
Personalized generation and retrieval profiles. Prompt based personalization conditions LLMs on user histories, retrieved examples, or textual summaries. LaMP (Salemi et al., [2024b](https://arxiv.org/html/2605.26612#bib.bib6)), LongLaMP (Kumar et al., [2024](https://arxiv.org/html/2605.26612#bib.bib7)), PEARL (Mysore et al., [2023](https://arxiv.org/html/2605.26612#bib.bib8)), and retrieval optimization methods (Salemi et al., [2024a](https://arxiv.org/html/2605.26612#bib.bib11)) establish retrieval as a strong baseline. Recent benchmarks make the evaluation more demanding. PersonalLLM studies individual preference variation at scale (Zollo et al., [2025](https://arxiv.org/html/2605.26612#bib.bib38)). PrefEval evaluates whether LLMs infer and follow user preferences in long multi session conversations (Zhao et al., [2025](https://arxiv.org/html/2605.26612#bib.bib17)). HYDRA factorizes black box personalization into shared and user specific components over retrieved histories (Zhuang et al., [2024](https://arxiv.org/html/2605.26612#bib.bib39)). These works motivate our use of retrieval and history aware evaluation. Latte differs by replacing retrieved or summarized text with a forecast latent state.
Latent personalization. Latent methods compress user information into embeddings, soft prompts, steering vectors, or user modules. PERSOMA compresses extensive history into soft prompt embeddings (Hebert and others, [2024](https://arxiv.org/html/2605.26612#bib.bib3)). PPlug and User-LLM introduce plug in user profile representations (Liu and others, [2024](https://arxiv.org/html/2605.26612#bib.bib4); Ning et al., [2024](https://arxiv.org/html/2605.26612#bib.bib5)). DEP and DPL show that inter user differences are useful for personalization (Qiu et al., [2025a](https://arxiv.org/html/2605.26612#bib.bib1), [b](https://arxiv.org/html/2605.26612#bib.bib2)). Personalized steering vectors and parameter efficient adaptation provide alternative latent control mechanisms (Cao et al., [2024](https://arxiv.org/html/2605.26612#bib.bib40); Tan et al., [2024](https://arxiv.org/html/2605.26612#bib.bib22)). This line of work is closest to ours. We use the same core insight that relative signals can be more informative than user only signals. The key difference is temporal. DEP and DPL construct static difference aware representations from selected or aggregated histories. Latte constructs a sequence of peer anchored states and forecasts the next state before generation. Our DEP style baseline uses the same peer anchored states but averages them, which isolates the contribution of trajectory forecasting.
Long horizon memory. Long term memory benchmarks such as LoCoMo (Maharana et al., [2024](https://arxiv.org/html/2605.26612#bib.bib12)), LongMemEval (Wu et al., [2025](https://arxiv.org/html/2605.26612#bib.bib13)), PersonaMem (Jiang et al., [2025](https://arxiv.org/html/2605.26612#bib.bib14)), PerLTQA (Du et al., [2024](https://arxiv.org/html/2605.26612#bib.bib15)), PrefEval (Zhao et al., [2025](https://arxiv.org/html/2605.26612#bib.bib17)), and MemoryCD (Zhang et al., [2026](https://arxiv.org/html/2605.26612#bib.bib16)) document failures of long context and retrieval based personalization. Memory architectures store hidden states, key value memories, or retrieved chunks, as in Memorizing Transformers (Wu et al., [2022](https://arxiv.org/html/2605.26612#bib.bib43)), Recurrent Memory Transformer (Bulatov et al., [2022](https://arxiv.org/html/2605.26612#bib.bib44)), RETRO (Borgeaud et al., [2022](https://arxiv.org/html/2605.26612#bib.bib45)), and MEMORYLLM (Wang et al., [2024](https://arxiv.org/html/2605.26612#bib.bib42)). Latte addresses a complementary bottleneck. Instead of storing more text or activations, it learns a compact current preference state that can be injected through one token.
Sequential user modeling. Sequential recommendation models evolving user behavior with recurrent, attentive, bidirectional, diffusion, and instance adaptive architectures (Hidasi et al., [2016](https://arxiv.org/html/2605.26612#bib.bib23); Kang and McAuley, [2018](https://arxiv.org/html/2605.26612#bib.bib24); Sun et al., [2019](https://arxiv.org/html/2605.26612#bib.bib25); Yang et al., [2023](https://arxiv.org/html/2605.26612#bib.bib46); Kong et al., [2024](https://arxiv.org/html/2605.26612#bib.bib41)). Time series forecasting studies long horizon prediction with decomposed Transformers, frequency models, patching, and task general temporal backbones (Wu et al., [2021](https://arxiv.org/html/2605.26612#bib.bib47); Zhou et al., [2022](https://arxiv.org/html/2605.26612#bib.bib48); Nie et al., [2023](https://arxiv.org/html/2605.26612#bib.bib29); Wu et al., [2023](https://arxiv.org/html/2605.26612#bib.bib49); Zeng et al., [2023](https://arxiv.org/html/2605.26612#bib.bib28)). We borrow the trajectory view, but the predicted object is not the next item and not a scalar time series. It is a peer normalized latent state used to condition a frozen language model.
## 3 Method
Figure 1: LATTE forecasts peer anchored preference trajectories for personalized generation. Top: a static latent profile aggregates the user’s history into one vector and can miss recent preference shifts, while LATTE forecasts the user’s current preference state. Bottom: LATTE first constructs peer anchored relative states from historical sessions, then uses a trajectory predictor to forecast the current state, and finally injects the forecast into a frozen LLM through a State-to-Token Bridge.
Latte has three stages. Stage 1 constructs one peer anchored relative state for each historical session. Stage 2 trains a predictor to forecast the next state from the observed trajectory. Stage 3 maps the forecast into the hidden dimension of a frozen LLM with a State-to-Token Bridge and injects it as one anchored soft prompt token. The stages are trained separately, which makes the representation target, predictor, and bridge module independently testable.
### 3.1 Preliminaries
Dynamic personalization setting. A user $u$ has a chronological history $\mathcal{H}_{u,T-1}=\{(i_1,u_1,\tau_1),\ldots,(i_{T-1},u_{T-1},\tau_{T-1})\}$, where $i_t$ is an item or context, $u_t$ is the user’s textual response, and $\tau_t$ is the timestamp. At time $T$, the model receives target metadata $x_T$ for item $i_T$ and must generate $y_T$ in the user’s current style and preference state. We train and evaluate with temporal splits, where later responses are never used to build earlier histories.
Encoder and base model. Let $\mathrm{enc}(\cdot)$ be a frozen sentence encoder that maps text to a $d$ dimensional unit norm embedding. We use bge m3 (Chen et al., [2024](https://arxiv.org/html/2605.26612#bib.bib35)) with $d=1024$. The base generator is a frozen instruction tuned LLM $\mathcal{M}$ with token embedding matrix $E\in\mathbb{R}^{|V|\times h}$. Personalization is performed by adding a placeholder token `[PREF_TOKEN]` and replacing its embedding at runtime. All LLM weights remain frozen.
Static latent profile. The standard latent profile compresses the history into a single vector

$$\boldsymbol{\pi}(u)=A\big(\mathrm{enc}(u_1),\ldots,\mathrm{enc}(u_{T-1})\big),\tag{1}$$

where $A$ may be mean pooling, attention pooling, or a learned encoder (Qiu et al., [2025a](https://arxiv.org/html/2605.26612#bib.bib1); Hebert and others, [2024](https://arxiv.org/html/2605.26612#bib.bib3)). Latte replaces this static object with a sequence $\mathbf{p}_1(u),\ldots,\mathbf{p}_{T-1}(u)$ and a forecast $\hat{\mathbf{p}}_T(u)$.
Notation. We write $\tilde{\mathbf{p}}_t(u)$ for the unnormalized peer anchored residual, $\mathbf{p}_t(u)\in\mathbb{R}^d$ for its normalized state, $\bar{q}_t(u)$ for the peer baseline, $\boldsymbol{\Delta}_t(u)=\mathbf{p}_t(u)-\mathbf{p}_{t-1}(u)$ for an adjacent change, and $\hat{\mathbf{p}}_T(u)$ for the predicted current state. Bold lowercase symbols denote vectors.
### 3.2 Latent Preference Trajectory
Peer anchored state. Raw response embeddings are not suitable trajectory states because their changes often reflect item changes. We therefore construct a relative state for each historical session. Let $\mathcal{N}_m(i_t,t)$ be up to $m$ peer users who reviewed the same item $i_t$ before timestamp $\tau_t$. The target user is excluded. Peers with fewer than four earlier interactions are excluded. Let $r_{v,i_t}$ be peer $v$’s response to item $i_t$. We define the earlier interaction set $\mathcal{I}_{u,t}=\{\ell:\tau_\ell<\tau_t\}$ and use the profile summary

$$\phi_u(t)=\begin{cases}\dfrac{\sum_{\ell\in\mathcal{I}_{u,t}}\mathrm{enc}(u_\ell)}{\left\|\sum_{\ell\in\mathcal{I}_{u,t}}\mathrm{enc}(u_\ell)\right\|_2},&\mathcal{I}_{u,t}\neq\emptyset,\\[6pt]\mathbf{0},&\mathcal{I}_{u,t}=\emptyset.\end{cases}\tag{2}$$

If either the target profile or all peer profiles are zero, we use uniform weights. Otherwise, peer weights are computed with temperature $\gamma$,

$$w_{u,v,t}=\frac{\exp(\gamma\langle\phi_u(t),\phi_v(t)\rangle)}{\sum_{v'\in\mathcal{N}_m(i_t,t)}\exp(\gamma\langle\phi_u(t),\phi_{v'}(t)\rangle)}.\tag{3}$$

The baseline, residual, and normalized state are

$$\bar{q}_t(u)=\sum_{v\in\mathcal{N}_m(i_t,t)}w_{u,v,t}\,\mathrm{enc}(r_{v,i_t}),\quad\tilde{\mathbf{p}}_t(u)=\mathrm{enc}(u_t)-\bar{q}_t(u),\quad\mathbf{p}_t(u)=\frac{\tilde{\mathbf{p}}_t(u)}{\|\tilde{\mathbf{p}}_t(u)\|_2}.\tag{4}$$

The peer baseline is used only for historical sessions. At test time, no peer review of the target item $i_T$ is placed in the LLM prompt or used to construct $\hat{\mathbf{p}}_T(u)$.
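The construction in Eqs. 2 to 4 can be sketched in a few lines. The following is a minimal illustration rather than the authors' implementation: it uses plain Python lists in place of 1024 dimensional encoder outputs, the function names are hypothetical, and it assumes the peer set has already been selected and is non-empty.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale v to unit L2 norm; a zero vector is returned unchanged."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)

def profile_summary(earlier_embs, dim):
    """Eq. 2: normalized sum of earlier response embeddings, zero if there are none."""
    if not earlier_embs:
        return [0.0] * dim
    return unit([sum(col) for col in zip(*earlier_embs)])

def peer_weights(phi_u, phi_peers, gamma=10.0):
    """Eq. 3: softmax over gamma * <phi_u, phi_v>, with the uniform fallback from the text."""
    m = len(phi_peers)  # assumed >= 1
    target_is_zero = not any(phi_u)
    peers_all_zero = all(not any(p) for p in phi_peers)
    if target_is_zero or peers_all_zero:
        return [1.0 / m] * m
    scores = [gamma * dot(phi_u, p) for p in phi_peers]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def peer_anchored_state(resp_emb, peer_resp_embs, weights):
    """Eq. 4: weighted peer baseline, raw residual, and unit normalized state."""
    baseline = [sum(w * r[k] for w, r in zip(weights, peer_resp_embs))
                for k in range(len(resp_emb))]
    residual = [x - b for x, b in zip(resp_emb, baseline)]
    return baseline, residual, unit(residual)
```

Because the weights sum to one, adding the same vector to the target response and to every peer response leaves the residual unchanged, which is the cancellation that Proposition 1 formalizes.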
Why subtraction helps. The following proposition states the sense in which peer anchoring removes shared item variation. It applies to the raw residual $\tilde{\mathbf{p}}_t$. The predictor uses the normalized direction $\mathbf{p}_t$, which preserves the relative direction of $\tilde{\mathbf{p}}_t$ when the residual norm is nonzero.
###### Proposition 1 (Peer anchoring under an additive embedding model).
Condition on the peer set and weights in Eq. [3](https://arxiv.org/html/2605.26612#S3.E3). Assume a response embedding for user $u$ on item $i_t$ has the form

$$\mathrm{enc}(u_t)=\mathbf{c}_{i_t,t}+\mathbf{s}_{u,t}+\boldsymbol{\epsilon}_{u,t},\tag{5}$$

where $\mathbf{c}_{i_t,t}$ is an item component shared by users at time $t$, $\mathbf{s}_{u,t}$ is a user state, and $\boldsymbol{\epsilon}_{u,t}$ is zero mean noise independent across users after conditioning on the weights. Assume each peer response has the same item component and peer state $\mathbf{s}_{v,t}$. Then

$$\mathbb{E}\big[\tilde{\mathbf{p}}_t(u)\mid\mathbf{s}_{u,t},\{\mathbf{s}_{v,t}\}_{v\in\mathcal{N}_m}\big]=\mathbf{s}_{u,t}-\sum_{v\in\mathcal{N}_m(i_t,t)}w_{u,v,t}\mathbf{s}_{v,t}.\tag{6}$$

Thus the shared item term is removed in expectation. If item variation has covariance $\Sigma_c$ and encoder noise has covariance $\sigma^2 I$, the raw embedding includes the additional covariance $\Sigma_c$, while the anchored residual has noise covariance $\sigma^2(1+\sum_v w_{u,v,t}^2)I$ around the relative state.
###### Proof.
Substitute Eq. [5](https://arxiv.org/html/2605.26612#S3.E5) into Eq. [4](https://arxiv.org/html/2605.26612#S3.E4). The weighted peer baseline contains $\mathbf{c}_{i_t,t}$ because the weights sum to one. The item component cancels, and the remaining expected value is the user state minus the weighted peer state. The covariance comparison follows from the independence of the item component and the zero mean encoder noise. ∎
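The substitution in the proof can be written out step by step. The derivation below is a sketch under the proposition's own assumptions (a common noise covariance $\sigma^2 I$ and independence across users); the shorthand $w_v$ for $w_{u,v,t}$ is introduced here only for readability.

```latex
\begin{aligned}
\tilde{\mathbf{p}}_t(u)
 &= \big(\mathbf{c}_{i_t,t} + \mathbf{s}_{u,t} + \boldsymbol{\epsilon}_{u,t}\big)
  - \sum_{v} w_v \big(\mathbf{c}_{i_t,t} + \mathbf{s}_{v,t} + \boldsymbol{\epsilon}_{v,t}\big) \\
 &= \Big(1 - \sum_{v} w_v\Big)\mathbf{c}_{i_t,t}
  + \Big(\mathbf{s}_{u,t} - \sum_{v} w_v \mathbf{s}_{v,t}\Big)
  + \Big(\boldsymbol{\epsilon}_{u,t} - \sum_{v} w_v \boldsymbol{\epsilon}_{v,t}\Big) \\
 &= \Big(\mathbf{s}_{u,t} - \sum_{v} w_v \mathbf{s}_{v,t}\Big)
  + \Big(\boldsymbol{\epsilon}_{u,t} - \sum_{v} w_v \boldsymbol{\epsilon}_{v,t}\Big)
  \qquad \text{since } \sum_{v} w_v = 1, \\
\operatorname{Cov}\Big(\boldsymbol{\epsilon}_{u,t} - \sum_{v} w_v \boldsymbol{\epsilon}_{v,t}\Big)
 &= \sigma^2 I + \sum_{v} w_v^2\, \sigma^2 I
  = \sigma^2\Big(1 + \sum_{v} w_v^2\Big) I .
\end{aligned}
```

As a worked special case, uniform weights over $m$ peers give $\sum_v w_v^2 = 1/m$, so the noise inflation factor is $1 + 1/m$, for example $1.0625$ when $m = 16$. Softmax weights that concentrate on a few peers give a larger factor, up to $2$ when a single peer receives all the weight.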
Why forecasting helps\.A static profile can be optimal when the state is stationary and noise is high, but it becomes temporally stale under drift\. The next proposition separates this bias effect from the variance reduction obtained by averaging\.
###### Proposition 2 (Bias and variance under local linear drift).
Assume normalized relative states are locally approximated by $\mathbf{p}_t=\mathbf{a}+t\mathbf{g}+\boldsymbol{\eta}_t$, where $\mathbb{E}[\boldsymbol{\eta}_t]=0$ and $\boldsymbol{\eta}_t$ is independent across sessions with covariance $\sigma^2 I$. Let $\boldsymbol{\mu}_T=\mathbf{a}+T\mathbf{g}$ be the conditional mean next state. The static average estimator $\bar{\mathbf{p}}=\frac{1}{T-1}\sum_{t=1}^{T-1}\mathbf{p}_t$ has squared bias $\frac{T^2}{4}\|\mathbf{g}\|_2^2$ and variance trace $d\sigma^2/(T-1)$ for estimating $\boldsymbol{\mu}_T$. The last state estimator $\mathbf{p}_{T-1}$ has squared bias $\|\mathbf{g}\|_2^2$ and variance trace $d\sigma^2$. Thus the static average can have lower mean squared error than the last state when drift is small relative to observation noise, while an order aware linear forecast can remove the static lag under the local linear model.
###### Proof.
The expectation of the static average is $\mathbf{a}+\frac{T}{2}\mathbf{g}$, while the target mean is $\mathbf{a}+T\mathbf{g}$. Its squared bias is therefore $\|\frac{T}{2}\mathbf{g}\|_2^2$, and its variance trace is $d\sigma^2/(T-1)$. The expectation of the last state is $\mathbf{a}+(T-1)\mathbf{g}$, so its squared bias is $\|\mathbf{g}\|_2^2$, and its variance trace is $d\sigma^2$. For $T-1\geq 2$, an ordinary least squares extrapolator fitted to the ordered pairs $(t,\mathbf{p}_t)$ is unbiased for $\boldsymbol{\mu}_T$ under this model because the design contains both an intercept and the time index. ∎
Remark. Proposition [2](https://arxiv.org/html/2605.26612#Thmproposition2) is a bias and variance statement, not a claim that the most recent state should always beat an average. Averaging can improve downstream generation when residual states are noisy or when stable user traits dominate short term drift. This is why the experiments compare static averages, last state forecasts, exponential smoothing, learned attention, and recurrent predictors. The learned predictors are intended to use order while still smoothing across multiple observations.
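The trade-off in Proposition 2 can be checked numerically from the closed forms. The sketch below assumes the local linear model of the proposition; the function names and the example values are illustrative and do not come from the paper.

```python
def static_average_mse(T, g_sq_norm, sigma2, d):
    """Squared bias plus variance trace of the mean of p_1..p_{T-1} as an estimate of mu_T."""
    return (T ** 2 / 4.0) * g_sq_norm + d * sigma2 / (T - 1)

def last_state_mse(g_sq_norm, sigma2, d):
    """Squared bias plus variance trace of p_{T-1} as an estimate of mu_T."""
    return g_sq_norm + d * sigma2

def ols_forecast(states):
    """Fit p_t = a + t * g per coordinate on t = 1..T-1 and extrapolate to t = T.

    Requires at least two observed states, matching the T - 1 >= 2 condition in the proof.
    """
    n = len(states)                      # n = T - 1
    ts = list(range(1, n + 1))
    t_mean = sum(ts) / n
    denom = sum((t - t_mean) ** 2 for t in ts)
    forecast = []
    for k in range(len(states[0])):
        ys = [s[k] for s in states]
        y_mean = sum(ys) / n
        slope = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) / denom
        forecast.append(y_mean + slope * (n + 1 - t_mean))
    return forecast
```

With $T=10$, $d=4$, and $\sigma^2=1$, the static average has the lower error when $\|\mathbf{g}\|_2^2=10^{-4}$ and the last state has the lower error when $\|\mathbf{g}\|_2^2=1$, which matches the qualitative statement of the proposition.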
Temporal prediction target. Adjacent changes are defined as

$$\boldsymbol{\Delta}_t(u)=\mathbf{p}_t(u)-\mathbf{p}_{t-1}(u).\tag{7}$$

The subtraction in Eq. [4](https://arxiv.org/html/2605.26612#S3.E4) reduces the item component before the predictor models these changes. The current state is forecast as

$$\hat{\mathbf{p}}_T(u)=f_\theta\big(\mathbf{p}_1(u),\ldots,\mathbf{p}_{T-1}(u)\big).\tag{8}$$
### 3.3 Trajectory Forecasting
Objective. The predictor is trained offline to regress to a held out constructed state $\mathbf{p}_T(u)$ from the next chronological session. This target is a well defined derived statistic, namely the next response embedding after peer normalization. It is not a directly observed latent quantity. The loss is

$$\mathcal{L}_{\mathrm{pred}}(\theta)=\mathbb{E}_{u,T}\left[1-\cos\big(\hat{\mathbf{p}}_T(u),\mathbf{p}_T(u)\big)+\lambda\left\|\hat{\mathbf{p}}_T(u)-\mathbf{p}_T(u)\right\|_2^2\right].\tag{9}$$

Rolling temporal holdouts provide training pairs. A prefix predicts the next session state, and validation and test responses are never used to construct earlier histories, peer weights, or predictor inputs. Predictor training starts once at least four constructed states are available. Earlier sessions still provide states for later prefixes.
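A per-example version of Eq. 9 and the rolling prefix construction can be sketched as follows. This is an illustration under stated assumptions: both vectors are assumed to have nonzero norm, `min_prefix=4` is one reading of "at least four constructed states", and the caller is assumed to pass only training states so that validation and test sessions never appear.

```python
import math

def prediction_loss(pred, target, lam=0.01):
    """One term of Eq. 9: 1 - cos(pred, target) + lam * ||pred - target||_2^2."""
    inner = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    cosine = inner / (norm_p * norm_t)
    sq_dist = sum((p - t) ** 2 for p, t in zip(pred, target))
    return (1.0 - cosine) + lam * sq_dist

def rolling_pairs(states, min_prefix=4):
    """Build (prefix, next_state) training pairs in chronological order."""
    return [(states[:k], states[k]) for k in range(min_prefix, len(states))]
```

For unit vectors the two terms are closely related, since $\|\hat{\mathbf{p}}-\mathbf{p}\|_2^2 = 2(1-\cos)$, so with $\lambda=0.01$ the squared distance term mainly matters when the prediction is not unit norm.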
Predictor family. We evaluate six predictors. P0 uses the last state, $\hat{\mathbf{p}}_T=\mathbf{p}_{T-1}$. P1 uses a linear trend, $\hat{\mathbf{p}}_T=\mathbf{p}_{T-1}+(\mathbf{p}_{T-1}-\mathbf{p}_{T-2})$. P2 is an exponential moving average, $\hat{\mathbf{p}}_T=\beta\mathbf{p}_{T-1}+(1-\beta)\bar{\mathbf{p}}$, where $\bar{\mathbf{p}}=\frac{1}{T-1}\sum_{t=1}^{T-1}\mathbf{p}_t$. P3 is learned attention pooling with the most recent state as query. It scores each previous state by

$$\begin{aligned}s_t&=\mathbf{v}_a^\top\tanh(W_h\mathbf{p}_t+W_q\mathbf{p}_{T-1}+\mathbf{b}_a),\\\alpha_t&=\frac{\exp(s_t)}{\sum_{\ell=1}^{T-1}\exp(s_\ell)},\\\hat{\mathbf{p}}_T&=\mathrm{norm}\left(W_o\,\mathrm{concat}\left(\sum_{t=1}^{T-1}\alpha_t\mathbf{p}_t,\mathbf{p}_{T-1}\right)\right),\end{aligned}\tag{10}$$

where $\mathrm{norm}$ denotes unit normalization. P4 is a one layer GRU (Cho et al., [2014](https://arxiv.org/html/2605.26612#bib.bib26)) followed by a linear head. P5 is a two layer Transformer encoder (Vaswani et al., [2017](https://arxiv.org/html/2605.26612#bib.bib27)). All predictors output one $d$ dimensional vector.
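The three closed-form predictors translate directly into code. The sketch below covers P0 to P2 only; the learned predictors P3 to P5 need trained parameters and are omitted. The function names are hypothetical, and `beta` is left as a required argument because the text does not report its value.

```python
def p0_last(states):
    """P0: reuse the most recent state."""
    return list(states[-1])

def p1_linear_trend(states):
    """P1: p_{T-1} + (p_{T-1} - p_{T-2}); needs at least two states."""
    last, prev = states[-1], states[-2]
    return [2 * a - b for a, b in zip(last, prev)]

def p2_ema(states, beta):
    """P2 as defined in the text: beta * p_{T-1} + (1 - beta) * mean of all states."""
    mean = [sum(col) / len(states) for col in zip(*states)]
    return [beta * a + (1 - beta) * m for a, m in zip(states[-1], mean)]
```

As written in the text, only P3 applies an explicit unit normalization, so these sketches return the raw combination without renormalizing.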
### 3.4 State-to-Token Bridge and Decoupled Training
Anchored soft token. The forecast state is exposed to the frozen LLM through a lightweight State-to-Token Bridge, denoted STB. The bridge is an interface between the predicted state space and the token embedding space, not the source of the personalization representation. Formally,

$$\mathbf{e}_{\mathrm{pref}}=B_\psi(\hat{\mathbf{p}}_T(u))=\mathrm{Proj}_{\psi_2}\big(g_{\psi_1}(\hat{\mathbf{p}}_T(u))\big),\quad\mathrm{embed}(\texttt{[PREF\_TOKEN]})=\mathbf{e}_{\mathrm{pref}}.\tag{11}$$

Here $g_{\psi_1}$ is a 512 dimensional bottleneck state filter, and $\mathrm{Proj}_{\psi_2}$ maps the filtered state to the LLM hidden size. The token is preceded by a natural language anchor shown in Appendix [A](https://arxiv.org/html/2605.26612#A1). Without the anchor, the LLM receives an embedding in an otherwise unmarked slot and cannot reliably infer its role. We use the same STB architecture for all one token latent baselines, so the experiments isolate the representation being injected rather than the injection mechanism.
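The data flow of Eq. 11 can be illustrated with a toy forward pass. Everything beyond the two-stage structure is an assumption made for this sketch: the sigmoid in the state filter, the tanh in the two layer projection, the random initialization, and the tiny dimensions are all hypothetical, and a real system would use a tensor library with trained weights.

```python
import math
import random

def linear(W, b, x):
    """Affine map W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def init(rows, cols, rng):
    """Hypothetical small random initialization, for illustration only."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return W, [0.0] * rows

class StateToTokenBridge:
    def __init__(self, d, bottleneck, hidden, rng):
        self.Wg, self.bg = init(bottleneck, d, rng)        # state filter g
        self.W1, self.b1 = init(hidden, bottleneck, rng)   # projection, layer 1
        self.W2, self.b2 = init(hidden, hidden, rng)       # projection, layer 2

    def __call__(self, state):
        # Assumed sigmoid bottleneck, so activations lie in (0, 1).
        z = [1.0 / (1.0 + math.exp(-a)) for a in linear(self.Wg, self.bg, state)]
        # Assumed tanh nonlinearity inside the two layer projection.
        h = [math.tanh(a) for a in linear(self.W1, self.b1, z)]
        return linear(self.W2, self.b2, h)

def inject(token_embeddings, token_ids, pref_id, e_pref):
    """Replace the embedding at each placeholder position with e_pref."""
    return [e_pref if tid == pref_id else emb
            for tid, emb in zip(token_ids, token_embeddings)]
```

The `inject` step is the only point where the frozen model sees the user state: all other token embeddings, including those of the natural language anchor, pass through unchanged.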
Training the bridge. The STB is trained with generation loss and auxiliary state regularization,

$$\mathcal{L}_{\mathrm{bridge}}=\mathcal{L}_{\mathrm{NLL}}+\alpha\big(\mathcal{L}_{\mathrm{recon}}+\beta\mathcal{L}_{\mathrm{sparsity}}\big),\tag{12}$$

where $\mathcal{L}_{\mathrm{NLL}}$ is the negative log likelihood of the user response, $\mathcal{L}_{\mathrm{recon}}$ reconstructs the input state from the bottleneck through an auxiliary decoder, and $\mathcal{L}_{\mathrm{sparsity}}$ is a KL activation penalty targeting rate $\rho=0.05$. The auxiliary decoder is used only during bridge training. The base LLM, encoder, and predictor remain frozen. The main model trains the STB with observed training session states. For latent baselines, we train a representation specific STB with the same architecture, optimizer, number of examples, and validation criterion.
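The composition of Eq. 12 can be sketched with scalar inputs. The text does not give the exact form of the auxiliary terms, so two assumptions are made here and flagged in the code: the sparsity term is taken to be the Bernoulli KL penalty commonly used in sparse autoencoders, and the reconstruction term is taken to be a mean squared error.

```python
import math

def kl_sparsity(mean_activations, rho=0.05, eps=1e-8):
    """Assumed form: sum over units of KL(Bernoulli(rho) || Bernoulli(mean activation))."""
    total = 0.0
    for r in mean_activations:
        r = min(max(r, eps), 1.0 - eps)  # keep the logarithms finite
        total += rho * math.log(rho / r) + (1 - rho) * math.log((1 - rho) / (1 - r))
    return total

def recon_mse(state, reconstruction):
    """Assumed form: mean squared error between the input state and its reconstruction."""
    return sum((s - r) ** 2 for s, r in zip(state, reconstruction)) / len(state)

def bridge_loss(nll, recon, sparsity, alpha=0.01, beta=1e-3):
    """Eq. 12 with the alpha and beta values reported in the experimental setup."""
    return nll + alpha * (recon + beta * sparsity)
```

With these defaults the auxiliary terms are small relative to the generation loss: for example, `bridge_loss(2.0, 1.0, 10.0)` adds only about 0.0101 to the NLL of 2.0.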
Why decoupling matters. End to end training can optimize the predictor, state filter, and token projection only through $\mathcal{L}_{\mathrm{NLL}}$. This gives a weak identifiability signal. A frozen LLM can often generate plausible text from the item metadata alone, which allows a constant or low rank conditioning vector to obtain reasonable loss. Decoupled regression prevents this shortcut by requiring the predictor output to match the held out state before the LLM sees it. This is analogous to collapse avoidance in self supervised representation learning (Grill et al., [2020](https://arxiv.org/html/2605.26612#bib.bib34); Bardes et al., [2022](https://arxiv.org/html/2605.26612#bib.bib33)), where constant outputs can satisfy part of the objective unless variance preserving structure is imposed.
## 4 Experiments
### 4.1 Experimental Setup
Datasets. We evaluate on Amazon Reviews 2023 (Hou et al., [2024](https://arxiv.org/html/2605.26612#bib.bib37)) and MemoryCD (Zhang et al., [2026](https://arxiv.org/html/2605.26612#bib.bib16)). For Amazon Reviews 2023, we use Books, Movies_and_TV, and CDs_and_Vinyl. We follow the preprocessing protocol of Au and Lin ([2026](https://arxiv.org/html/2605.26612#bib.bib10)): reviews after 2016, at least eight earlier reviews per user, at least four time valid peer reviews per historical item, and at least 30 characters per response. After filtering, each category has about 12.6K users and 31.4K target instances. The last review of each user is test, the second last review is validation, and the third last review is the train target for the injection module. Rolling prefixes before these held out sessions train the predictor. For MemoryCD, we use the Books domain and sample 500 users with at least 100 reviews each. Table [1](https://arxiv.org/html/2605.26612#S4.T1) reports the peer coverage and history statistics induced by these filters.
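The per-user split described above can be sketched as follows. The function name and the dictionary keys are hypothetical, and the sketch only assigns roles to sessions; how much of the later history is visible to each held out target is not specified in the text and is left to the caller.

```python
def chronological_split(sessions):
    """Assign held out roles by time order.

    sessions: list of (timestamp, payload) tuples for one user, in any order.
    """
    ordered = sorted(sessions, key=lambda s: s[0])
    if len(ordered) < 4:
        raise ValueError("need three held out sessions plus at least one earlier session")
    return {
        "prefix_pool": ordered[:-3],          # sessions available for rolling prefixes
        "bridge_train_target": ordered[-3],   # third last review
        "validation": ordered[-2],            # second last review
        "test": ordered[-1],                  # last review
    }
```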
Table 1: Dataset coverage after chronological and peer availability filtering. Target instances are rolling prediction instances used for training, validation, and test. Peer coverage counts time valid same item peers for historical sessions.

| Dataset | Users | Target inst. | Median history | Median peers | User retention |
|---|---|---|---|---|---|
| Amazon Books | 12.9K | 32.1K | 18 | 13 | 19.4% |
| Amazon Movies_and_TV | 12.5K | 31.0K | 15 | 11 | 21.7% |
| Amazon CDs_and_Vinyl | 12.4K | 31.1K | 14 | 12 | 24.1% |
| MemoryCD Books | 500 | 48.2K | 103 | 17 | 100.0% |

Models. The base LLM is Llama 3.1 8B Instruct (Dubey et al., [2024](https://arxiv.org/html/2605.26612#bib.bib36)). The encoder is bge m3 (Chen et al., [2024](https://arxiv.org/html/2605.26612#bib.bib35)). The STB maps 1024 dimensional states to a 512 dimensional compressive bottleneck with KL sparsity target $\rho=0.05$. The bottleneck is intentionally undercomplete because its role is to denoise the state before token projection, not to learn an overcomplete dictionary. The token projection is a two layer MLP. The default predictor is P4, a one layer GRU with hidden size 512 trained for 15 epochs with AdamW and learning rate $3\times 10^{-4}$. The predictor loss uses $\lambda=0.01$ unless stated otherwise. The STB is trained for six epochs with $\alpha=0.01$ and $\beta=10^{-3}$.
Baselines. We compare to text, memory, and latent baselines. *No personalization* uses only target metadata. *Recent text* concatenates the most recent $K$ reviews. *Retrieved text* retrieves the $K$ user reviews most similar to the target metadata by bge m3 cosine similarity. *Summary memory* summarizes the full history into a compact textual profile and places the summary in the prompt. *Static latent profile* averages all past response embeddings and injects the result with its own trained STB. *Recent latent* averages only the latest eight states. *Time decayed latent* uses exponential time decay over all states. *DEP style static* averages the same peer anchored states used by Latte and injects the average without trajectory prediction. *PERSOMA style* uses 16 learned soft prompt tokens produced by a history compression encoder trained with the same frozen LLM and generation objective, following the soft prompt compression paradigm of Hebert et al. ([2024](https://arxiv.org/html/2605.26612#bib.bib3)). *Latte zero* uses the last state as the forecast. *Latte EMA* uses the EMA predictor P2.
Metrics. We report ROUGE-1, ROUGE-L, and BLEU. Because lexical overlap does not fully measure personalization, we also report history aware pairwise win rate. The judge is Qwen3 235B (Yang et al., [2025](https://arxiv.org/html/2605.26612#bib.bib50)). It sees target metadata, a compact early history, a compact recent history, and two anonymized generations. It is asked which generation better matches the user's current writing style and content preference while remaining faithful to the item. Each pair is evaluated twice with candidate order swapped. Contradictory choices are counted as ties. We additionally report preference fidelity metrics in Section [4.4](https://arxiv.org/html/2605.26612#S4.SS4).
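The order swapped judging rule can be sketched as below. The `judge` callable is a hypothetical stand-in for the LLM judge and is assumed to return `"first"` or `"second"`. How ties enter the reported win rate is not specified in this section, so the sketch only tallies outcomes.

```python
from collections import Counter
from typing import Callable, Dict, Iterable

# Hypothetical judge signature: (context, first_candidate, second_candidate)
# -> "first" or "second".
Judge = Callable[[str, str, str], str]


def swapped_verdict(judge: Judge, context: str, ours: str, baseline: str) -> str:
    """Query the judge twice with candidate order swapped; disagreement is a tie."""
    ours_wins_shown_first = judge(context, ours, baseline) == "first"
    ours_wins_shown_second = judge(context, baseline, ours) == "second"
    if ours_wins_shown_first and ours_wins_shown_second:
        return "win"
    if not ours_wins_shown_first and not ours_wins_shown_second:
        return "loss"
    return "tie"  # contradictory choices across the two orders


def tally(verdicts: Iterable[str]) -> Dict[str, int]:
    """Count wins, ties, and losses over a set of evaluated pairs."""
    counts = Counter(verdicts)
    return {k: counts[k] for k in ("win", "tie", "loss")}
```

A judge that always prefers whichever candidate is shown first produces a tie on every pair under this rule, which is the position bias the swap is meant to neutralize.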
Reproducibility details. We use $m=16$ peers and temperature $\gamma=10$ for Eq. [3](https://arxiv.org/html/2605.26612#S3.E3). Since profile summaries are unit normalized, this temperature concentrates the peer baseline on behaviorally similar same item peers rather than an unweighted item average. Predictor batch size is 256. STB batch size is 32 with gradient accumulation 4. Decoding uses temperature 0.7, top-$p$ sampling with $p=0.9$, and maximum generation length 160 tokens. All runs use user level chronological splits. Peer retrieval excludes the target user and every peer interaction after the relevant session timestamp.
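To make the role of $m$, $\gamma$, and the time mask concrete, here is a rough sketch of a peer baseline. Eq. 3 is not reproduced in this section, so the weighting is an assumption: a softmax over $\gamma$ times the cosine between unit normalized profile summaries, restricted to the $m$ most similar valid peers. The dictionary keys and the function name are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def peer_baseline(target_user: str, target_profile: Sequence[float],
                  session_time: float, peers: List[dict],
                  m: int = 16, gamma: float = 10.0) -> Optional[List[float]]:
    """Softmax-weighted mean of peer response embeddings (assumed form of Eq. 3).

    Each peer is a hypothetical dict with keys "user", "time",
    "profile" (unit norm), and "response" (embedding of the peer's review).
    """
    # Time mask: drop the target user and any interaction after the session.
    valid = [p for p in peers
             if p["user"] != target_user and p["time"] <= session_time]
    if not valid:
        return None
    # Keep the m peers whose profiles are most similar to the target's.
    valid.sort(key=lambda p: _dot(target_profile, p["profile"]), reverse=True)
    top = valid[:m]
    logits = [gamma * _dot(target_profile, p["profile"]) for p in top]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    z = sum(weights)
    dim = len(top[0]["response"])
    return [sum(w * p["response"][d] for w, p in zip(weights, top)) / z
            for d in range(dim)]
```

With unit norm profiles the cosine lies in $[-1, 1]$, so $\gamma=10$ gives a peer at cosine 1 a weight roughly $e^{10}$ times that of a peer at cosine 0, which is why the baseline concentrates on behaviorally similar peers.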
### 4\.2Main Results
Table 2: Main results on Amazon Reviews 2023 averaged over Books, Movies_and_TV, and CDs_and_Vinyl. HistWin is history aware pairwise win rate against the static latent profile. Higher is better for all metrics.

| Method | R-1 | R-L | BLEU | HistWin |
| --- | --- | --- | --- | --- |
| No personalization | .199 | .176 | .057 | 27.1 |
| Recent text, $K=8$ | .216 | .192 | .072 | 35.2 |
| Retrieved text, $K=8$ | .224 | .204 | .079 | 41.4 |
| Retrieved text, $K=32$ | .231 | .211 | .084 | 44.8 |
| Summary memory | .233 | .214 | .085 | 46.2 |
| Static latent profile | .237 | .219 | .088 | 50.0 |
| Recent latent, last 8 | .241 | .223 | .090 | 51.8 |
| Time decayed latent | .248 | .231 | .096 | 54.9 |
| DEP style static | .255 | .240 | .101 | 57.4 |
| PERSOMA style | .260 | .245 | .104 | 58.6 |
| Latte zero | .239 | .222 | .090 | 51.0 |
| Latte EMA | .252 | .235 | .099 | 55.8 |
| Latte learned attention | .266 | .252 | .110 | 61.4 |
| Latte GRU | .273 | .259 | .114 | 64.0 |

Table [2](https://arxiv.org/html/2605.26612#S4.T2) shows that latent conditioning is stronger than text profiles and that static latent profiles are not the strongest non trajectory baseline. Increasing the retrieval budget from $K=8$ to $K=32$ helps, but it remains below static latent compression. Summary memory is also below the latent baselines, which suggests that the gain is not simply due to exposing more history in natural language. The closest comparisons are DEP style static and PERSOMA style. They use difference aware or soft prompt compression but do not forecast a current state. Latte GRU improves average ROUGE-L by 1.4 points over the PERSOMA style baseline and by 1.9 points over the DEP style static baseline. The comparison also shows that the trajectory problem is not solved by using the latest state alone. Latte zero has lower temporal lag than a static average but higher variance, while EMA, learned attention, and GRU smooth over multiple states while preserving order information.
Table 3: Direct comparisons against the strongest non trajectory baselines on Amazon Reviews 2023. Each row compares Latte GRU to one baseline over the same user test cases. Intervals are 95 percent user bootstrap intervals.

| Comparison | $\Delta$ ROUGE-L | Direct HistWin | $p$ value |
| --- | --- | --- | --- |
| Latte GRU vs Time decayed latent | +.028 [+.023, +.033] | 61.8 [59.7, 63.9] | <.001 |
| Latte GRU vs DEP style static | +.019 [+.014, +.024] | 60.3 [58.0, 62.5] | <.001 |
| Latte GRU vs PERSOMA style | +.014 [+.009, +.019] | 57.9 [55.8, 60.1] | <.001 |
| Latte GRU vs Latte learned attention | +.007 [+.003, +.011] | 53.6 [51.7, 55.5] | .002 |

Table [3](https://arxiv.org/html/2605.26612#S4.T3) tests the main claim against the strongest baselines directly. The comparison to DEP style static isolates trajectory prediction because both methods use the same peer anchored session states and the same STB architecture. The comparison to PERSOMA style isolates forecasting from learned soft prompt compression. The comparison to learned attention shows that the GRU gain is smaller than the gain from replacing static compression with trajectory forecasting.
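A user level bootstrap interval of the kind reported in Table 3 can be sketched as follows. This is an illustrative percentile bootstrap under stated assumptions, not the authors' evaluation script: users are resampled with replacement, all of a resampled user's test cases are kept together, and the number of resamples and the seed are arbitrary choices.

```python
import random
from typing import Dict, List, Tuple


def user_bootstrap_ci(per_user_deltas: Dict[str, List[float]],
                      n_boot: int = 2000, alpha: float = 0.05,
                      seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap over users for the mean paired metric difference.

    `per_user_deltas` maps a user id to that user's per-case differences
    (for example ROUGE-L of method A minus method B); every list is non-empty.
    """
    rng = random.Random(seed)
    users = sorted(per_user_deltas)

    def mean_over(sample: List[str]) -> float:
        values = [d for u in sample for d in per_user_deltas[u]]
        return sum(values) / len(values)

    point = mean_over(users)
    # Resample whole users so that cases from one user stay together.
    stats = sorted(mean_over([rng.choice(users) for _ in users])
                   for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lo, hi
```

Resampling users rather than individual test cases respects the dependence between cases from the same user, which would otherwise make the interval too narrow.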
### 4\.3Long Horizon Scaling Under Matched History Budgets
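Table 4: MemoryCD Books results under matched history budgets. Columns 8 through All are history budgets.

| Method | 8 | 16 | 32 | 64 | All | HistWin |
| --- | --- | --- | --- | --- | --- | --- |
| Recent text | .190 | .197 | .201 | .203 | .204 | 35.1 |
| Retrieved text | .206 | .212 | .216 | .219 | .220 | 43.6 |
| Summary memory | .213 | .220 | .225 | .229 | .231 | 47.3 |
| Static latent profile | .216 | .220 | .222 | .223 | .223 | 50.0 |
| Time decayed latent | .220 | .229 | .235 | .238 | .239 | 54.0 |
| DEP style static | .226 | .235 | .242 | .246 | .247 | 56.1 |
| Latte GRU | .239 | .251 | .264 | .275 | .278 | 66.9 |

Table 5: Preference fidelity on Amazon Reviews 2023.

| Method | StyleSim | SentAlign | VerbErr | Recency | Faith |
| --- | --- | --- | --- | --- | --- |
| Static latent profile | .613 | 73.8 | .284 | .512 | 86.4 |
| Time decayed latent | .631 | 75.0 | .266 | .546 | 86.7 |
| DEP style static | .638 | 75.6 | .262 | .531 | 87.0 |
| PERSOMA style | .646 | 76.1 | .254 | .536 | 87.2 |
| Latte GRU | .681 | 79.4 | .219 | .602 | 87.5 |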
Table [4](https://arxiv.org/html/2605.26612#S4.T4) tests whether the long horizon gain is only a history budget artifact. It is not. At every matched budget, Latte is above retrieval, summary memory, static latent compression, and DEP style static compression. The gap over DEP style static grows from .013 at budget 8 to .031 with the full history, because the predictor can exploit more trajectory observations, while static compression saturates (the static latent profile moves only from .216 to .223).
### 4\.4Preference Fidelity Beyond Lexical Overlap
Table [5](https://arxiv.org/html/2605.26612#S4.T5) evaluates the personalization claim more directly than ROUGE or BLEU. Latte improves style similarity, sentiment alignment, verbosity matching, and recency alignment while keeping item faithfulness comparable to the strongest baselines. The largest gain appears in Recency, which is the metric most directly tied to the trajectory formulation.
### 4\.5Ablation Studies
Table 6: Component ablation.

| Variant | ROUGE-L | Win vs Full |
| --- | --- | --- |
| Latte full | .265 | 50.0 |
| without peer anchor | .226 | 38.4 |
| without prompt anchor | .252 | 45.9 |
| end to end training | .215 | 31.2 |
| without bridge filter | .258 | 46.7 |

Table 7: Peer construction diagnostics on Books.

| Variant | ROUGE-L | PeerCos | Copy | Win vs Full |
| --- | --- | --- | --- | --- |
| Latte full | .265 | .31 | 1.8 | 50.0 |
| category peers only | .255 | .27 | 1.6 | 44.8 |
| random peers | .238 | .22 | 1.3 | 37.6 |
| peer baseline only | .218 | .68 | 8.9 | 30.4 |
| future peers unmasked | .272 | .49 | 5.8 | 53.1 |
Predictor architecture ablations are reported in Appendix [B](https://arxiv.org/html/2605.26612#A2). Learned attention captures most of the gain, the GRU gives the best downstream generation, and the oracle state estimates the remaining injection ceiling. Table [6](https://arxiv.org/html/2605.26612#S4.T6) shows that peer anchoring is the largest component. Replacing anchored states with raw response embeddings nearly removes the gain over static latent compression on Books and remains far below the full model. Removing the prompt anchor or bridge filter also hurts. End to end training produces both weak generation performance and a collapsed representation, as quantified in Appendix [C](https://arxiv.org/html/2605.26612#A3). Hyperparameter sweeps are reported in Appendix [E](https://arxiv.org/html/2605.26612#A5) and show that performance is stable around the default settings.
### 4\.6Peer Leakage and Collapse Diagnostics
Table [7](https://arxiv.org/html/2605.26612#S4.T7) separates the value of same item anchoring from item leakage. Category peers reduce item similarity but lose performance, which indicates that same item peers provide a useful control for content. Random peers remove too much structure. The peer baseline alone performs poorly and copies more peer text, which argues against the interpretation that Latte succeeds by injecting peer consensus. The invalid future peer condition improves ROUGE-L, but it raises PeerCos and increases copied peer 8 grams from 1.8% to 5.8%. This more than three fold increase in copying is the clearest leakage signal, and it confirms why time masking is necessary.
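A copy rate of this kind can be approximated with a short script. The exact definition of the Copy column is not given in this section, so the sketch below makes its own assumptions: lowercase whitespace tokenization, and the rate defined as the percentage of 8 grams in the generation that also occur in at least one peer review. The function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def _ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def copied_ngram_rate(generation: str, peer_texts: Iterable[str],
                      n: int = 8) -> float:
    """Percentage of the generation's n-grams that appear in any peer text."""
    generated = _ngrams(generation.lower().split(), n)
    if not generated:
        return 0.0  # generation shorter than n tokens
    peer_ngrams: Set[Tuple[str, ...]] = set()
    for text in peer_texts:
        peer_ngrams.update(_ngrams(text.lower().split(), n))
    copied = sum(1 for g in generated if g in peer_ngrams)
    return 100.0 * copied / len(generated)
```

Long n-grams such as 8 grams rarely match by chance, so a rise in this rate points to verbatim reuse of peer text rather than shared vocabulary.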
## 5Conclusion
We introduced Latte, a latent trajectory framework for personalized generation with frozen LLMs. Instead of compressing a user's history into a static profile, Latte constructs peer anchored relative states, forecasts the next state from the user trajectory, and injects the forecast through one anchored soft prompt token. This design separates the personalization object from the injection interface and provides a compact way to condition a frozen LLM on the user's current preference state. Our analysis explains why peer anchoring can reduce shared item variation and why forecasting can improve over static averaging when user states drift. Experiments on Amazon Reviews 2023 and MemoryCD show consistent gains over retrieval, summary memory, static latent profiles, difference aware static profiles, and soft prompt compression baselines. These results support modeling users as evolving trajectories rather than fixed aggregates, especially in long history settings where recent behavior matters.
## References
- Au and Lin \(2026\)PeReGrINE: evaluating personalized review fidelity with user\-item graph context\.arXiv preprint arXiv:2604\.07788\.Cited by:[§4\.1](https://arxiv.org/html/2605.26612#S4.SS1.p1.1)\.
- A\. Bardes, J\. Ponce, and Y\. LeCun \(2022\)VICReg: variance\-invariance\-covariance regularization for self\-supervised learning\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§3\.4](https://arxiv.org/html/2605.26612#S3.SS4.p3.1)\.
- S\. Borgeaud, A\. Mensch, J\. Hoffmann, T\. Cai, E\. Rutherford, K\. Millican, G\. B\. Van Den Driessche, J\. Lespiau, B\. Damoc, A\. Clark, D\. De Las Casas, A\. Guy, J\. Menick, R\. Ring, T\. Hennigan, S\. Huang, L\. Maggiore, C\. Jones, A\. Cassirer, A\. Brock, M\. Paganini, G\. Irving, O\. Vinyals, S\. Osindero, K\. Simonyan, J\. Rae, E\. Elsen, and L\. Sifre \(2022\)Improving language models by retrieving from trillions of tokens\.InProceedings of the 39th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.162,pp\. 2206–2240\.External Links:[Link](https://proceedings.mlr.press/v162/borgeaud22a.html)Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p3.1)\.
- A\. Bulatov, Y\. Kuratov, and M\. Burtsev \(2022\)Recurrent memory transformer\.InAdvances in Neural Information Processing Systems 35 \(NeurIPS\),Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p3.1)\.
- Y\. Cao, T\. Zhang, B\. Cao, Z\. Yin, L\. Lin, F\. Ma, and J\. Chen \(2024\)Personalized steering of large language models: versatile steering vectors through bi\-directional preference optimization\.InAdvances in Neural Information Processing Systems 37 \(NeurIPS\),Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p2.1)\.
- J\. Chen, S\. Xiao, P\. Zhang, K\. Luo, D\. Lian, and Z\. Liu \(2024\)BGE M3\-Embedding: multi\-lingual, multi\-functionality, multi\-granularity text embeddings through self\-knowledge distillation\.arXiv preprint arXiv:2402\.03216\.Cited by:[§3\.1](https://arxiv.org/html/2605.26612#S3.SS1.p2.5),[§4\.1](https://arxiv.org/html/2605.26612#S4.SS1.p2.5)\.
- K\. Cho, B\. van Merriënboer, Ç\. Gülçehre, D\. Bahdanau, F\. Bougares, H\. Schwenk, and Y\. Bengio \(2014\)Learning phrase representations using RNN encoder\-decoder for statistical machine translation\.InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Cited by:[§3\.3](https://arxiv.org/html/2605.26612#S3.SS3.p2.6)\.
- Y\. Du, H\. Wang, Z\. Zhao, B\. Liang, B\. Wang, W\. Zhong, Z\. Wang, and K\. Wong \(2024\)PerLTQA: a personal long\-term memory dataset for memory classification, retrieval, and synthesis in question answering\.InarXiv preprint arXiv:2402\.16288,Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p3.1)\.
- A\. Dubey, A\. Jauhri, A\. Pandey,et al\.\(2024\)The Llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§1](https://arxiv.org/html/2605.26612#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.26612#S4.SS1.p2.5)\.
- J\. Grill, F\. Strub, F\. Altché, C\. Tallec, P\. H\. Richemond, E\. Buchatskaya, C\. Doersch, B\. A\. Pires, Z\. D\. Guo, M\. G\. Azar, B\. Piot, K\. Kavukcuoglu, R\. Munos, and M\. Valko \(2020\)Bootstrap your own latent: a new approach to self\-supervised learning\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§3\.4](https://arxiv.org/html/2605.26612#S3.SS4.p3.1)\.
- L\. Hebertet al\.\(2024\)PERSOMA: personalized soft prompt adapter architecture for personalized language prompting\.InKDD GenAIRecP Workshop,Cited by:[§1](https://arxiv.org/html/2605.26612#S1.p2.1),[§2](https://arxiv.org/html/2605.26612#S2.p2.1),[§3\.1](https://arxiv.org/html/2605.26612#S3.SS1.p3.3),[§4\.1](https://arxiv.org/html/2605.26612#S4.SS1.p3.2)\.
- B\. Hidasi, A\. Karatzoglou, L\. Baltrunas, and D\. Tikk \(2016\)Session\-based recommendations with recurrent neural networks\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p4.1)\.
- Y\. Hou, J\. Li, Z\. He, A\. Yan, X\. Chen, and J\. McAuley \(2024\)Bridging language and items for retrieval and recommendation\.arXiv preprint arXiv:2403\.03952\.Cited by:[§4\.1](https://arxiv.org/html/2605.26612#S4.SS1.p1.1)\.
- B\. Jiang, Z\. Hao, Y\. Cho, B\. Li, Y\. Yuan, S\. Chen, L\. Ungar, C\. J\. Taylor, and D\. Roth \(2025\)Know me, respond to me: benchmarking llms for dynamic user profiling and personalized responses at scale\.External Links:2504\.14225,[Link](https://arxiv.org/abs/2504.14225)Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p3.1)\.
- W\. Kang and J\. McAuley \(2018\)Self\-attentive sequential recommendation\.InProceedings of the 18th IEEE International Conference on Data Mining \(ICDM\),Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p4.1)\.
- X\. Kong, J\. Wu, A\. Zhang, L\. Sheng, H\. Lin, X\. Wang, and X\. He \(2024\)Customizing language models with instance\-wise LoRA for sequential recommendation\.InAdvances in Neural Information Processing Systems 37 \(NeurIPS\),Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p4.1)\.
- I\. Kumar, S\. Viswanathan, S\. Yerra, A\. Salemi, R\. A\. Rossi, F\. Dernoncourt, H\. Deilamsalehy, X\. Chen, R\. Zhang, S\. Agarwal, N\. Lipka, C\. V\. Nguyen, T\. H\. Nguyen, and H\. Zamani \(2024\)LongLaMP: a benchmark for personalized long\-form text generation\.External Links:2407\.11016,[Link](https://arxiv.org/abs/2407.11016)Cited by:[§1](https://arxiv.org/html/2605.26612#S1.p2.1),[§2](https://arxiv.org/html/2605.26612#S2.p1.1)\.
- J\. Liuet al\.\(2024\)PPlug: personalized plug\-and\-play profile models for LLM personalization\.arXiv preprint arXiv:2409\.11901\.Cited by:[§1](https://arxiv.org/html/2605.26612#S1.p2.1),[§2](https://arxiv.org/html/2605.26612#S2.p2.1)\.
- A\. Maharana, D\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fang \(2024\)Evaluating very long\-term conversational memory of LLM agents\.arXiv preprint arXiv:2402\.17753\.Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p3.1)\.
- S\. Mysore, Z\. Lu, M\. Wan, L\. Yang, S\. Menezes, T\. Baghaee, E\. B\. Gonzalez, J\. Neville, and T\. Safavi \(2023\)PEARL: personalizing large language model writing assistants with generation\-calibrated retrievers\.arXiv preprint arXiv:2311\.09180\.Cited by:[§1](https://arxiv.org/html/2605.26612#S1.p2.1),[§2](https://arxiv.org/html/2605.26612#S2.p1.1)\.
- Y\. Nie, N\. H\. Nguyen, P\. Sinthong, and J\. Kalagnanam \(2023\)A time series is worth 64 words: long\-term forecasting with transformers\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p4.1)\.
- L\. Ning, L\. Liu, J\. Wu, N\. Wu, D\. Berlowitz, S\. Prakash, B\. Green, S\. O’Banion, and J\. Xie \(2024\)User\-LLM: efficient LLM contextualization with user embeddings\.arXiv preprint arXiv:2402\.13598\.Cited by:[§1](https://arxiv.org/html/2605.26612#S1.p2.1),[§2](https://arxiv.org/html/2605.26612#S2.p2.1)\.
- Y\. Qiu, T\. Shi, X\. Zhao, F\. Zhu, Y\. Zhang, and F\. Feng \(2025a\)Latent inter\-user difference modeling for LLM personalization\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Note:Oral presentationCited by:[§1](https://arxiv.org/html/2605.26612#S1.p2.1),[§2](https://arxiv.org/html/2605.26612#S2.p2.1),[§3\.1](https://arxiv.org/html/2605.26612#S3.SS1.p3.3)\.
- Y\. Qiu, X\. Zhao, Y\. Zhang, Y\. Bai, W\. Wang, H\. Cheng, F\. Feng, and T\. Chua \(2025b\)Measuring what makes you unique: difference\-aware user modeling for enhancing LLM personalization\.InFindings of the Association for Computational Linguistics: ACL 2025,Cited by:[§1](https://arxiv.org/html/2605.26612#S1.p2.1),[§2](https://arxiv.org/html/2605.26612#S2.p2.1)\.
- A\. Salemi, S\. Kallumadi, and H\. Zamani \(2024a\)Optimization methods for personalizing large language models through retrieval augmentation\.arXiv preprint arXiv:2404\.05970\.Cited by:[§1](https://arxiv.org/html/2605.26612#S1.p2.1),[§2](https://arxiv.org/html/2605.26612#S2.p1.1)\.
- A\. Salemi, S\. Mysore, M\. Bendersky, and H\. Zamani \(2024b\)LaMP: when large language models meet personalization\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\),Cited by:[§1](https://arxiv.org/html/2605.26612#S1.p2.1),[§2](https://arxiv.org/html/2605.26612#S2.p1.1)\.
- F\. Sun, J\. Liu, J\. Wu, C\. Pei, X\. Lin, W\. Ou, and P\. Jiang \(2019\)BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer\.InProceedings of the 28th ACM International Conference on Information and Knowledge Management \(CIKM\),Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p4.1)\.
- Z\. Tan, Z\. Liu, and M\. Jiang \(2024\)Personalized pieces: efficient personalized large language models through collaborative efforts\.arXiv preprint arXiv:2406\.10471\.Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p2.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§3\.3](https://arxiv.org/html/2605.26612#S3.SS3.p2.6)\.
- Y\. Wang, Y\. Gao, X\. Chen, H\. Jiang, S\. Li, J\. Yang, Q\. Yin, Z\. Li, X\. Li, B\. Yin, J\. Shang, and J\. McAuley \(2024\)MEMORYLLM: towards self\-updatable large language models\.InProceedings of the 41st International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.235,pp\. 50453–50466\.External Links:[Link](https://proceedings.mlr.press/v235/wang24s.html)Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p3.1)\.
- D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu \(2025\)LongMemEval: benchmarking chat assistants on long\-term interactive memory\.arXiv preprint arXiv:2410\.10813\.Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p3.1)\.
- H\. Wu, T\. Hu, Y\. Liu, H\. Zhou, J\. Wang, and M\. Long \(2023\)TimesNet: temporal 2d\-variation modeling for general time series analysis\.InInternational Conference on Learning Representations \(ICLR\),External Links:[Link](https://openreview.net/forum?id=ju_Uqw384Oq)Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p4.1)\.
- H\. Wu, J\. Xu, J\. Wang, and M\. Long \(2021\)Autoformer: decomposition transformers with auto\-correlation for long\-term series forecasting\.InAdvances in Neural Information Processing Systems 34 \(NeurIPS\),Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p4.1)\.
- Y\. Wu, M\. N\. Rabe, D\. Hutchins, and C\. Szegedy \(2022\)Memorizing transformers\.InInternational Conference on Learning Representations \(ICLR\),Note:Spotlight presentationExternal Links:[Link](https://openreview.net/forum?id=TrjbxzRcnf-)Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p3.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§4\.1](https://arxiv.org/html/2605.26612#S4.SS1.p4.1)\.
- Z\. Yang, J\. Wu, Z\. Wang, X\. Wang, Y\. Yuan, and X\. He \(2023\)Generate what you prefer: reshaping sequential recommendation via guided diffusion\.InAdvances in Neural Information Processing Systems 36 \(NeurIPS\),Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p4.1)\.
- A\. Zeng, M\. Chen, L\. Zhang, and Q\. Xu \(2023\)Are transformers effective for time series forecasting?\.InProceedings of the AAAI Conference on Artificial Intelligence,Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p4.1)\.
- W\. Zhang, X\. Wei, W\. Huang, Z\. Hui, C\. Wang, M\. Gong, and P\. S\. Yu \(2026\)MemoryCD: benchmarking long\-context user memory of LLM agents for lifelong cross\-domain personalization\.arXiv preprint arXiv:2603\.25973\.Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p3.1),[§4\.1](https://arxiv.org/html/2605.26612#S4.SS1.p1.1)\.
- S\. Zhao, M\. Hong, Y\. Liu, D\. Hazarika, and K\. Lin \(2025\)Do LLMs recognize your preferences? evaluating personalized preference following in LLMs\.InInternational Conference on Learning Representations \(ICLR\),Note:Oral presentationExternal Links:[Link](https://openreview.net/forum?id=QWunLKbBGF)Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p1.1),[§2](https://arxiv.org/html/2605.26612#S2.p3.1)\.
- T\. Zhou, Z\. Ma, Q\. Wen, X\. Wang, L\. Sun, and R\. Jin \(2022\)FEDformer: frequency enhanced decomposed transformer for long\-term series forecasting\.InProceedings of the 39th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.162,pp\. 27268–27286\.External Links:[Link](https://proceedings.mlr.press/v162/zhou22g.html)Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p4.1)\.
- Y\. Zhuang, H\. Sun, Y\. Yu, R\. Qiang, Q\. Wang, C\. Zhang, and B\. Dai \(2024\)HYDRA: model factorization framework for black\-box LLM personalization\.InAdvances in Neural Information Processing Systems 37 \(NeurIPS\),Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p1.1)\.
- T\. P\. Zollo, A\. W\. T\. Siah, N\. Ye, A\. Li, and H\. Namkoong \(2025\)PersonalLLM: tailoring LLMs to individual preferences\.InInternational Conference on Learning Representations \(ICLR\),External Links:[Link](https://openreview.net/forum?id=2R7498e2Tx)Cited by:[§2](https://arxiv.org/html/2605.26612#S2.p1.1)\.
## Appendix APrompt Template
The full prompt template used by Latte is shown below. <PREF\_TOKEN\> is a single token whose embedding is overridden at runtime by $B_{\psi}(\hat{\mathbf{p}}_{T}(u))$.
\[System\] You are a personalized writing assistant that produces reviews in the user’s individual style\.

\[User context\] The user has interacted with this platform across many sessions\. A compact latent representation of their current preference follows\. Based on this user’s preference trajectory, their current preference is represented as: <PREF\_TOKEN\>

\[Target item\] Title: \{title\} Description: \{description\}

\[Task\] Write a review for this item in the user’s current style\. Output only the review text\.
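The runtime override of the placeholder embedding can be illustrated without any model code. The sketch below uses plain Python lists in place of tensors; `embedding_table`, `pref_token_id`, and `soft_token` are hypothetical names, and `soft_token` stands in for the bridge output. In practice the resulting rows would be passed to the frozen LLM as input embeddings instead of token ids.

```python
from typing import Dict, List, Sequence


def build_input_embeddings(token_ids: Sequence[int],
                           embedding_table: Dict[int, Sequence[float]],
                           pref_token_id: int,
                           soft_token: Sequence[float]) -> List[List[float]]:
    """Look up prompt embeddings, then overwrite the placeholder position."""
    positions = [i for i, t in enumerate(token_ids) if t == pref_token_id]
    if len(positions) != 1:
        raise ValueError("prompt must contain exactly one preference token")
    rows = [list(embedding_table[t]) for t in token_ids]
    if len(soft_token) != len(rows[0]):
        raise ValueError("soft token must match the embedding width")
    # Only this row changes; all other rows come from the frozen table.
    rows[positions[0]] = list(soft_token)
    return rows
```

Because only one row is replaced and the lookup table is copied rather than modified, the frozen model's vocabulary embeddings stay untouched across users.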
## Appendix BPredictor Architecture Ablation
Table [8](https://arxiv.org/html/2605.26612#A2.T8) compares trajectory predictor architectures on Amazon Books. The last-state and linear-trend predictors reduce temporal lag but remain noisy. EMA improves over these simple forecasts by smoothing across the trajectory. Learned attention captures most of the downstream gain, while the GRU gives the best generation quality. The Transformer obtains slightly higher validation cosine similarity to the held-out constructed state than the GRU, but its downstream ROUGE-L is lower, suggesting that state prediction accuracy alone is not perfectly aligned with generation quality after STB injection. The oracle row conditions on the unavailable held-out state and therefore estimates the remaining ceiling of the injection interface rather than a deployable method.

Table 8: Predictor architecture ablation on Books. Cosine similarity is measured against the held out constructed state on validation.

| Predictor | Cos sim | ROUGE-L |
| --- | --- | --- |
| P0 last state | .785 | .241 |
| P1 linear trend | .792 | .244 |
| P2 EMA | .831 | .252 |
| P3 learned attention | .876 | .258 |
| P4 GRU | .884 | .265 |
| P5 Transformer | .886 | .262 |
| Oracle state | 1.000 | .274 |
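The three parameter free predictors can be sketched in a few lines. The exact definitions of P0 to P2 are given elsewhere in the paper, so the forms below are assumptions chosen for illustration: P1 is taken to be a one step linear extrapolation from the last two states, and P2 an exponential moving average with a hypothetical smoothing weight `alpha`.

```python
from typing import List, Sequence

State = Sequence[float]


def last_state(states: Sequence[State]) -> List[float]:
    """P0 (assumed form): reuse the most recent state as the forecast."""
    return list(states[-1])


def linear_trend(states: Sequence[State]) -> List[float]:
    """P1 (assumed form): extrapolate one step from the last two states."""
    if len(states) < 2:
        return last_state(states)
    return [cur + (cur - prev) for cur, prev in zip(states[-1], states[-2])]


def ema_forecast(states: Sequence[State], alpha: float = 0.5) -> List[float]:
    """P2 (assumed form): exponential moving average in chronological order."""
    ema = list(states[0])
    for state in states[1:]:
        ema = [alpha * x + (1 - alpha) * e for x, e in zip(state, ema)]
    return ema
```

On a steadily drifting trajectory the EMA lags behind the last state while the linear trend overshoots on noise, which is one way to read the lag versus variance trade off described above.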
## Appendix CTrajectory Diagnostics
Table [9](https://arxiv.org/html/2605.26612#A3.T9) measures whether learned states retain user variation without collapsing to item identity or a constant vector. Same item peer cosine measures leakage from item content. Adjacent user cosine measures temporal smoothness for the same user. Effective rank is computed in the 1024 dimensional normalized state space before STB mapping, using the covariance spectrum of 1,000 predicted vectors as $(\sum_{j}\sigma_{j})^{2}/\sum_{j}\sigma_{j}^{2}$.

Table 9: Trajectory diagnostics on Books validation. Lower same item peer cosine and higher effective rank indicate less item leakage and less collapse.

| Representation | Same item peer cosine | Adjacent user cosine | Effective rank |
| --- | --- | --- | --- |
| Raw encoder $\mathrm{enc}(u_{t})$ | .72 | .34 | 96 |
| Static latent profile | .48 | .41 | 143 |
| Peer anchored $\mathbf{p}_{t}$ | .28 | .51 | 211 |
| Latte predicted $\hat{\mathbf{p}}_{T}$ | .31 | .49 | 178 |
| End to end variant | .94 | .08 | 12 |

We also quantify non collapse by sampling 100 random test users and computing pairwise cosine similarity of $\hat{\mathbf{p}}_{T}(u)$ across all pairs. Latte has mean cosine 0.42 and standard deviation 0.18. The end to end variant has mean cosine 0.97 and standard deviation 0.02, which confirms that the weak downstream result in Table [6](https://arxiv.org/html/2605.26612#S4.T6) coincides with representation collapse.
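Both collapse diagnostics are simple to compute. The sketch below is illustrative and uses only the standard library: `effective_rank` takes an already computed covariance spectrum (obtaining the spectrum itself would need a linear algebra routine that is not shown), and `pairwise_cosine_stats` returns the mean and population standard deviation over all unordered pairs. Both function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def effective_rank(spectrum: Sequence[float]) -> float:
    """Participation ratio (sum sigma)^2 / sum sigma^2 of a spectrum."""
    sigma = [s for s in spectrum if s > 0]
    if not sigma:
        raise ValueError("spectrum has no positive entries")
    return sum(sigma) ** 2 / sum(s * s for s in sigma)


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def pairwise_cosine_stats(vectors: List[Sequence[float]]) -> Tuple[float, float]:
    """Mean and standard deviation of cosine similarity over all pairs."""
    sims = [_cosine(a, b) for a, b in combinations(vectors, 2)]
    mean = sum(sims) / len(sims)
    var = sum((s - mean) ** 2 for s in sims) / len(sims)
    return mean, math.sqrt(var)
```

A flat spectrum with $k$ equal values gives an effective rank of $k$, while a spectrum dominated by one direction gives a value near 1. Collapsed representations also show a pairwise mean cosine near 1 with almost no spread.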
## Appendix DCoverage and Sparsity Analysis
Table [10](https://arxiv.org/html/2605.26612#A4.T10) stratifies Amazon results by peer availability and history length. The method is strongest when both signals are abundant, but it remains above DEP style static in low coverage buckets. This analysis separates the claim that peer trajectories help in peer rich settings from the stronger claim that dense peers are always available.

Table 10: Stratified ROUGE-L on Amazon Reviews 2023. Buckets are computed per target user prefix and then averaged across categories.

| Bucket | Static latent | DEP style static | Latte GRU | $\Delta$ vs DEP |
| --- | --- | --- | --- | --- |
| Peer count 4 to 7 | .213 | .229 | .243 | +.014 |
| Peer count 8 to 15 | .221 | .241 | .259 | +.018 |
| Peer count 16 or more | .226 | .246 | .266 | +.020 |
| History length 8 to 15 | .218 | .231 | .243 | +.012 |
| History length 16 to 31 | .221 | .240 | .258 | +.018 |
| History length 32 or more | .224 | .247 | .271 | +.024 |
## Appendix EHyperparameter Sensitivity
Table 11: Hyperparameter sensitivity on Books, ROUGE-L. Defaults are bold.

| Hyperparameter | Value | R-L |
| --- | --- | --- |
| History cap $N$ | 16 | .245 |
| History cap $N$ | 32 | .255 |
| History cap $N$ | 64 | .262 |
| History cap $N$ | **all** | .265 |
| $\lambda$ | 0 | .252 |
| $\lambda$ | **0.01** | .265 |
| $\lambda$ | 0.1 | .258 |
| $\lambda$ | 1.0 | .241 |
| STB $\rho$ | 0.01 | .260 |
| STB $\rho$ | **0.05** | .265 |
| STB $\rho$ | 0.10 | .255 |

Table [11](https://arxiv.org/html/2605.26612#A5.T11) reports sensitivity to the main hyperparameters used by Latte on Amazon Books. We sweep the maximum number of historical states used by the predictor, the balance coefficient $\lambda$ in the predictor regression loss, and the sparsity target $\rho$ in the State-to-Token Bridge. Performance is stable around the default configuration. Longer histories improve performance until the full available trajectory is used, supporting the use of sequence-level preference information. The predictor performs best with a small amount of MSE regularization in addition to cosine regression; pure cosine training loses some scale information, whereas an MSE-dominated objective is more sensitive to residual magnitude. The bridge sparsity target is best at $\rho=0.05$, which balances denoising with retaining enough state information for generation.
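The trade off between the cosine and MSE terms can be seen in a small numeric sketch. The predictor loss is defined in the method section, which is not reproduced here, so the form below is an assumption: one minus cosine similarity plus $\lambda$ times the mean squared error. The function name is hypothetical.

```python
import math
from typing import Sequence


def predictor_loss(pred: Sequence[float], target: Sequence[float],
                   lam: float = 0.01) -> float:
    """Assumed loss form: (1 - cosine similarity) + lam * mean squared error."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    cosine = dot / (norm_p * norm_t)
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return (1.0 - cosine) + lam * mse
```

With `pred = [2.0, 0.0]` and `target = [1.0, 0.0]` the direction is exactly right but the scale is off by a factor of two. The cosine term alone (`lam = 0`) gives zero loss, so it cannot correct the scale, while a small `lam` adds a penalty proportional to the squared residual.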
## Appendix FQualitative Case Study
To illustrate the behavior of Latte, we present a representative Books validation case. The user has 18 prior reviews. Early reviews in sessions 1 to 6 are poetic and impression focused. Recent reviews in sessions 14 to 17 are more analytical and focus on structure.

Case study: User #4218, target is a literary novel published in 2023

Sample early review (session 3, 2018): “A haunting, beautiful book that lingers like the smell of rain on summer pavement. The prose is spare yet luminous, and every page felt like a quiet revelation about loss and memory.”

Sample recent review (session 16, 2023): “Three converging POV threads, with the second act shift handled cleanly though the chronology in chapters 4 to 7 is needlessly opaque. The author’s debt to mid period DeLillo is plain in the dialogue tags.”

Static profile output: “A beautiful and thoughtful book with an interesting plot and compelling characters. The writing is well crafted and the themes resonate.”

Latte output: “The narrative structure follows a nonlinear chronology with three converging POV threads, deftly assembled in the second act. The prose carries a clear postmodernist influence, though the pacing flags through the middle chapters.”

Reference: “Nonlinear structure with three POV threads converging in the second act. Prose is competent and reflects clear postmodern training, but the middle act loses momentum and the resolution is rushed.”

The static profile output averages over early and recent styles, while Latte follows the user’s recent analytical style and technical vocabulary.
## Appendix GLimitations
Latte is designed for settings where users have enough history to estimate a trajectory. For cold start users, the predictor falls back to the recent text or static latent baselines used in our experiments. The peer anchored state also assumes that comparable peers exist for historical items. This assumption is natural for review platforms and many recommendation settings, and Appendix [D](https://arxiv.org/html/2605.26612#A4) quantifies performance under lower peer availability. Sparse domains may require approximate peer construction from item categories or semantic neighbors. Finally, the method tracks user level changes over time. Deployment should treat latent trajectories as private user data and apply access control, retention limits, and deletion mechanisms.
## NeurIPS Paper Checklist
1. 1\.Claims
2. Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
3. Answer:\[Yes\]
4. Justification: The abstract and introduction state the main contributions and scope of LATTE, including peer anchored state construction, trajectory forecasting, and one\-token injection\. The method and empirical support are provided in Sections[3](https://arxiv.org/html/2605.26612#S3)and[4](https://arxiv.org/html/2605.26612#S4)\. The claims are limited to frozen\-LLM personalization under the datasets, temporal splits, and evaluation settings described in Section[4\.1](https://arxiv.org/html/2605.26612#S4.SS1)and the limitations discussed in Section[G](https://arxiv.org/html/2605.26612#A7)\.
2. Limitations
   - Question: Does the paper discuss the limitations of the work performed by the authors?
   - Answer: \[Yes\]
   - Justification: The paper includes a dedicated limitations section in Section [G](https://arxiv.org/html/2605.26612#A7)\. It discusses the need for sufficient user history, the assumption that comparable peers exist, sparse\-domain issues, and privacy considerations for latent trajectories\.
3. Theory assumptions and proofs
   - Question: For each theoretical result, does the paper provide the full set of assumptions and a complete \(and correct\) proof?
   - Answer: \[Yes\]
   - Justification: The paper includes Proposition 1 and Proposition 2 in Section [3\.2](https://arxiv.org/html/2605.26612#S3.SS2)\. Both propositions state their modeling assumptions explicitly and include proofs immediately after the statements, with the relevant equations numbered in the method section\.
4. Experimental result reproducibility
   - Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper \(regardless of whether the code and data are provided or not\)?
   - Answer: \[Yes\]
   - Justification: The method, datasets, preprocessing filters, chronological splits, baselines, metrics, hyperparameters, and decoding settings are described in Sections [3](https://arxiv.org/html/2605.26612#S3) and [4\.1](https://arxiv.org/html/2605.26612#S4.SS1)\. Additional prompt, predictor, diagnostic, coverage, and hyperparameter details are provided in Appendices A–E\.
5. Open access to data and code
   - Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
   - Answer: \[No\]
   - Justification: At submission time, we do not provide an anonymized code release with exact reproduction scripts\. The paper uses publicly described datasets and models, and it provides the main experimental and implementation details in Sections [3](https://arxiv.org/html/2605.26612#S3) and [4\.1](https://arxiv.org/html/2605.26612#S4.SS1) and Appendices A–E\.
6. Experimental setting/details
   - Question: Does the paper specify all the training and test details \(e\.g\., data splits, hyperparameters, how they were chosen, type of optimizer\) necessary to understand the results?
   - Answer: \[Yes\]
   - Justification: Section [4\.1](https://arxiv.org/html/2605.26612#S4.SS1) specifies the datasets, chronological splits, model choices, baselines, metrics, hyperparameters, and decoding settings\. Section [3](https://arxiv.org/html/2605.26612#S3) and Appendices A–E provide the prompt, predictor, bridge, diagnostic, coverage, and hyperparameter details needed to understand the reported results\.
7. Experiment statistical significance
   - Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
   - Answer: \[Yes\]
   - Justification: The paper reports 95% user\-bootstrap confidence intervals and p\-values for the main direct comparisons in Table 3\. These comparisons cover the strongest non\-trajectory baselines and the learned\-attention variant, which are the experiments most directly supporting the main empirical claim\.
8. Experiments compute resources
   - Question: For each experiment, does the paper provide sufficient information on the computer resources \(type of compute workers, memory, time of execution\) needed to reproduce the experiments?
   - Answer: \[No\]
   - Justification: The current manuscript describes model sizes and training settings, but it does not provide the GPU type, number of devices, wall\-clock time, GPU memory, or total compute estimates for each experiment\. These details should be added for a complete compute\-resource disclosure\.
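9. Code of ethics
   - Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics [https://neurips\.cc/public/EthicsGuidelines](https://neurips.cc/public/EthicsGuidelines)?
   - Answer: \[Yes\]
   - Justification: The research uses previously released datasets and models, preserves anonymity in the submission, and does not involve new human\-subject experiments\. We have reviewed the NeurIPS Code of Ethics and are not aware of any deviation\.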
10. Broader impacts
    - Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
    - Answer: \[Yes\]
    - Justification: The paper motivates potential benefits of personalized generation for user\-adapted assistance in the introduction\. Potential risks include over\-personalization, privacy leakage from latent user trajectories, and incorrect adaptation\. Section [G](https://arxiv.org/html/2605.26612#A7) discusses privacy\-relevant deployment considerations, including access control, retention limits, and deletion mechanisms\.
11. Safeguards
    - Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse \(e\.g\., pre\-trained language models, image generators, or scraped datasets\)?
    - Answer: \[N/A\]
    - Justification: The paper does not release a new high\-risk pretrained model, image generator, or scraped dataset\. The proposed method is an algorithmic framework evaluated with existing datasets and frozen LLMs, so this item is not applicable\.
12. Licenses for existing assets
    - Question: Are the creators or original owners of assets \(e\.g\., code, data, models\), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
    - Answer: \[No\]
    - Justification: The paper credits existing datasets, models, and related methods through citations in Sections [2](https://arxiv.org/html/2605.26612#S2) and [4\.1](https://arxiv.org/html/2605.26612#S4.SS1)\. However, the current manuscript does not explicitly enumerate the licenses, versions, or terms of use for each existing asset\.
13. New assets
    - Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
    - Answer: \[N/A\]
    - Justification: The paper does not introduce or release a new dataset, benchmark, or pretrained model asset\. If code, processed data, or checkpoints are released later, they should be accompanied by documentation, licenses, and reproduction instructions\.
14. Crowdsourcing and research with human subjects
    - Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation \(if any\)?
    - Answer: \[N/A\]
    - Justification: The paper does not involve crowdsourcing, newly recruited participants, or human\-subject experiments\. Evaluation uses automatic metrics and an LLM\-based history\-aware judge protocol described in Section [4\.1](https://arxiv.org/html/2605.26612#S4.SS1)\.
15. Institutional review board \(IRB\) approvals or equivalent for research with human subjects
    - Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board \(IRB\) approvals \(or an equivalent approval/review based on the requirements of your country or institution\) were obtained?
    - Answer: \[N/A\]
    - Justification: The paper does not involve newly recruited human subjects or crowdsourcing\. It uses previously released datasets and automatic or LLM\-based evaluation, so IRB approval is not applicable to the reported experiments\.
16. Declaration of LLM usage
    - Question: Does the paper describe the usage of LLMs if it is an important, original, or non\-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required\.
    - Answer: \[Yes\]
    - Justification: LLMs are central to the method and evaluation\. Section [4\.1](https://arxiv.org/html/2605.26612#S4.SS1) specifies the frozen base generator, the encoder, and the Qwen3\-235B history\-aware judge used for pairwise evaluation\.
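Checklist item 7 refers to 95% user\-bootstrap confidence intervals and p\-values for the direct comparisons in Table 3\. The checklist does not spell out the resampling procedure, so the following is only a sketch of one common way to compute such quantities: a paired percentile bootstrap that resamples users with replacement. The function name, the dictionary input layout, and the two\-sided p\-value convention are assumptions, not the authors' code.

```python
import random
from statistics import mean
from typing import Dict, List, Tuple


def user_bootstrap_ci(
    scores_a: Dict[str, List[float]],
    scores_b: Dict[str, List[float]],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, Tuple[float, float], float]:
    """Paired user-level bootstrap for the mean score difference (A minus B).

    scores_a / scores_b map a user id to that user's per-example metric values.
    Returns (point_estimate, (ci_low, ci_high), two_sided_p_value).
    """
    users = sorted(set(scores_a) & set(scores_b))
    # Collapse each user to one paired difference so that users, not examples,
    # are the unit of resampling.
    diffs = [mean(scores_a[u]) - mean(scores_b[u]) for u in users]
    rng = random.Random(seed)
    stats = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    # Percentile interval with symmetric tail indices.
    low_idx = int(n_resamples * alpha / 2)
    high_idx = n_resamples - 1 - low_idx
    # Assumed convention: twice the smaller tail mass around zero, capped at 1.
    frac_le_zero = sum(1 for s in stats if s <= 0) / n_resamples
    frac_ge_zero = sum(1 for s in stats if s >= 0) / n_resamples
    p_value = min(1.0, 2 * min(frac_le_zero, frac_ge_zero))
    return mean(diffs), (stats[low_idx], stats[high_idx]), p_value
```

Resampling at the user level rather than the example level keeps all of a user's sessions together, which matters when scores within a user are correlated\.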