Re-Centering Humans in LLM Personalization
Summary
This paper investigates the effectiveness of LLM personalization by putting real humans back into the evaluation loop, revealing systematic gaps between human judgments and LLM outputs at every stage of the personalization pipeline, and highlighting the limitations of synthetic data and LLM judges.
Cached at: 06/18/26, 11:59 PM
Source: https://huggingface.co/papers/2606.06614
Personalization is becoming a core promise of LLM systems: chatbots remember your job, interests, preferences, and past conversations to tailor responses. But “personalized” does not always mean helpful; it can also feel uncomfortable, offensive, or just unnecessary.
This raises a basic but surprisingly under-examined question: who decides whether personalization is actually helpful, the model or the person being personalized for?
Most existing benchmarks rely heavily on synthetic personas, simulated conversations, and LLM judges. In this work, we put real humans back into the loop.
We study personalization as a three-stage pipeline:
🧠 Attribute extraction: what should the system infer from conversation history?
🎯 Relevance matching: which attributes actually matter for the current request?
✍️ Response generation: does personalization improve the user experience?
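The three stages above can be read as three composable functions. The following is a minimal sketch of that structure only, not the authors' implementation: the names `Attribute`, `personalize`, `toy_extract`, `toy_match`, and `toy_generate` are hypothetical, and the toy stages use simple string heuristics where a real system would call a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class Attribute:
    """A single inferred fact about the user (hypothetical type)."""
    text: str


# Hypothetical stage signatures, one per pipeline stage.
Extractor = Callable[[List[str]], List[Attribute]]
Matcher = Callable[[str, List[Attribute]], List[Attribute]]
Generator = Callable[[str, List[Attribute]], str]


def personalize(history: List[str], request: str,
                extract: Extractor, match: Matcher, generate: Generator) -> str:
    """Run extraction, relevance matching, and generation in order."""
    attributes = extract(history)          # stage 1: what to infer
    relevant = match(request, attributes)  # stage 2: what matters now
    return generate(request, relevant)     # stage 3: tailored response


def _words(text: str) -> Set[str]:
    """Lowercased words with basic punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def toy_extract(history: List[str]) -> List[Attribute]:
    # Toy heuristic: treat turns starting with "I " as self-descriptions.
    return [Attribute(turn) for turn in history if turn.startswith("I ")]


def toy_match(request: str, attributes: List[Attribute]) -> List[Attribute]:
    # Toy heuristic: keep attributes sharing a word longer than 3 letters.
    request_words = {w for w in _words(request) if len(w) > 3}
    return [a for a in attributes if _words(a.text) & request_words]


def toy_generate(request: str, relevant: List[Attribute]) -> str:
    # Fall back to a generic answer when nothing is judged relevant.
    if not relevant:
        return f"Generic answer to: {request}"
    notes = "; ".join(a.text for a in relevant)
    return f"Answer to: {request} (tailored using: {notes})"


history = ["I work as a nurse on night shifts.", "What's the weather?", "I love hiking."]
request = "Any hiking trails nearby?"
print(personalize(history, request, toy_extract, toy_match, toy_generate))
```

Keeping the stages separate mirrors the evaluation design described here: each stage can be swapped out or judged by humans independently of the others.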
Using 550 real user conversations and nearly 19,000 human judgments, we find systematic human–LLM gaps at every stage:
• Models extract noisy and overgeneralized attributes from real conversations. Synthetic data underestimates this difficulty.
• LLMs and humans disagree on which attributes should be used for a new question (κ=0.30), even though each group agrees well internally (κ=0.60 and 0.43).
• LLMs select 2–3× more attributes as relevant than humans do, suggesting a tendency to over-personalize.
• Even with human-selected relevant attributes, 54.6% of personalized responses are judged no better than generic ones by humans.
• LLM judges often overestimate personalization quality, sometimes rewarding surface-level attribute mentions that humans do not find useful.
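To make the agreement and over-selection numbers above concrete, here is a small sketch of how such statistics can be computed. It assumes Cohen's kappa for two raters with binary relevance labels, κ = (p_o − p_e) / (1 − p_e); the summary does not say which kappa variant the paper uses, so treat that as an assumption. The labels below are invented toy data, not results from the paper, and `cohens_kappa` and `selection_ratio` are hypothetical helper names.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label]
              for label in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


def selection_ratio(llm: Sequence[int], human: Sequence[int]) -> float:
    """How many times more attributes the LLM marks relevant (label 1)."""
    return sum(llm) / sum(human)


# Toy labels: 1 = attribute judged relevant to the request, 0 = not relevant.
human_labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
llm_labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

print(round(cohens_kappa(human_labels, llm_labels), 2))  # 0.4
print(selection_ratio(llm_labels, human_labels))         # 2.5
```

In this toy case the LLM agrees with the human on every attribute the human selected but adds three more, which is the over-personalization pattern the bullets describe: a high selection ratio alongside only moderate chance-corrected agreement.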
We also find that lightweight training improves attribute verification and relevance matching substantially. But response-level personalization remains much harder, likely because “good personalization” is inherently individual.
Our takeaway is simple: synthetic users and LLM judges aren’t enough to capture the complex nature of human preferences. We highlight the importance of human data and call for re-centering humans in LLM personalization.
Similar Articles
Re-Centering Humans in LLM Personalization
This paper studies the gap between synthetic and human data for evaluating LLM personalization across three stages: attribute extraction, relevance matching, and response generation. Results show models perform worse on real human data, and the authors introduce lightweight training interventions to improve alignment.
Evaluating LLMs as Human Surrogates in Controlled Experiments
This paper evaluates whether off-the-shelf LLMs can reliably simulate human responses in controlled behavioral experiments by comparing LLM-generated data with human survey responses on accuracy perception. The findings show that while LLMs capture directional effects and aggregate belief-updating patterns, they do not consistently match human-scale effect magnitudes, clarifying when synthetic LLM data can serve as behavioral proxies.
Review Arcade: On the Human Alignment and Gameability of LLM Reviews
This paper investigates how well LLM-generated reviews align with human judgment, using 1k real ACL 2025 submissions. It finds limited agreement and instability across models and prompts, and presents a method to artificially inflate scores without meaningful changes. The authors advise against relying solely on LLM reviews and call for discussion on their use in handling increasing submission volumes.
The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment
This paper geometrically analyzes why LLMs acting as judges agree strongly with each other but weakly with humans, finding that inter-LLM consensus reflects a collapsed subspace rather than true human alignment on subjective rubrics. Post-hoc calibration on human data improves alignment, but even calibrated LLMs fall short of human reliability.
LLMs Can Better Capture Human Judgments--With the Right Prompts
This paper presents simple prompting strategies that help large language models better capture the full distribution of human judgments, improving alignment on moral scenarios and beliefs. The authors show that asking models to report standard deviations and response proportions, along with ensuring scenario clarity, yields better agreement with human responses.