Prompt Optimization for User Simulation in Conversational Recommender Systems: A Multi-Objective Framework
Summary
This paper proposes a framework to automatically optimize prompts for LLM-based user simulators in conversational recommender systems, addressing issues like positive bias and limited behavioral diversity.
# Prompt Optimization for User Simulation in Conversational Recommender Systems: A Multi-Objective Framework
Source: [https://arxiv.org/html/2607.00010](https://arxiv.org/html/2607.00010)
###### Abstract
Conversational recommender systems \(CRSs\) are a core component of next\-generation intelligent recommender systems because they enable users to actively elicit preferences, clarify intentions, and adapt recommendations in real time\. However, there are two key obstacles in the CRS domain: evaluation and access to training data\. Evaluating CRSs through real human studies is more critical than for traditional recommender systems, yet such studies are both costly and time\-consuming\. Moreover, CRS interaction data are often difficult to obtain for model training due to privacy concerns\. Large language model \(LLM\)\-based user simulators have shown promise in addressing both challenges by generating synthetic user interactions for evaluation and training\. However, existing approaches suffer from systematic positive bias, data leakage, and limited behavioral diversity, and they rely on brittle manual prompt engineering that requires extensive domain expertise\. In this paper, we propose a framework to automatically optimize prompts for LLM\-based user simulators in CRSs, simultaneously mitigating these issues\. Experimental results demonstrate that the proposed framework achieves improved behavioral alignment with human interaction patterns compared to baseline methods across diverse prompt settings\.
## I Introduction
Recommender systems\[[5](https://arxiv.org/html/2607.00010#bib.bib16),[12](https://arxiv.org/html/2607.00010#bib.bib17)\] play a critical role in information seeking by enabling users to efficiently discover relevant items, content, and information\. These systems deliver both user value and business impact and continue to drive strong academic and industrial interest\[[34](https://arxiv.org/html/2607.00010#bib.bib15),[5](https://arxiv.org/html/2607.00010#bib.bib16),[26](https://arxiv.org/html/2607.00010#bib.bib47),[32](https://arxiv.org/html/2607.00010#bib.bib49)\]\. Conversational recommender systems \(CRSs\) enhance this capability by engaging users in natural language to infer their preferences and the reasons behind them\. CRSs allow users to articulate preferences, explore options interactively, and provide fine\-grained feedback\[[6](https://arxiv.org/html/2607.00010#bib.bib20),[35](https://arxiv.org/html/2607.00010#bib.bib13)\], which is fundamentally different from traditional recommender systems, and they have been recognized as a core component of next\-generation intelligent recommender systems in the Large Language Model \(LLM\) era\[[10](https://arxiv.org/html/2607.00010#bib.bib21),[23](https://arxiv.org/html/2607.00010#bib.bib43),[6](https://arxiv.org/html/2607.00010#bib.bib20)\]\.
Conversational recommender systems typically require extensive user testing prior to deployment, creating costly and time\-consuming bottlenecks in industrial development pipelines, as evaluation depends on real users interacting with the system\. Additionally, collecting conversational data introduces privacy concerns that hinder deployment in regulated settings \(e\.g\., healthcare and finance\)\. User simulators provide a promising solution to these challenges by enabling scalable, low\-cost, and privacy\-preserving evaluation and training of CRSs without requiring extensive interaction with real users\.
User simulators are automated agents that emulate human interaction patterns in recommendation dialogues by generating responses that approximate real user behavior\. LLM\-based user simulators have shown great potential by serving as synthetic users that enable reproducible, cost\-effective simulation in recommender systems across diverse user populations and interaction contexts\[[1](https://arxiv.org/html/2607.00010#bib.bib40),[21](https://arxiv.org/html/2607.00010#bib.bib2),[2](https://arxiv.org/html/2607.00010#bib.bib36),[16](https://arxiv.org/html/2607.00010#bib.bib39),[27](https://arxiv.org/html/2607.00010#bib.bib61)\]\.
Existing LLM user simulators can be categorized into fine\-tuned/training\-based simulators\[[3](https://arxiv.org/html/2607.00010#bib.bib58),[37](https://arxiv.org/html/2607.00010#bib.bib3),[36](https://arxiv.org/html/2607.00010#bib.bib1)\] and prompt\-based simulators\[[28](https://arxiv.org/html/2607.00010#bib.bib10),[38](https://arxiv.org/html/2607.00010#bib.bib55),[39](https://arxiv.org/html/2607.00010#bib.bib57),[14](https://arxiv.org/html/2607.00010#bib.bib60),[4](https://arxiv.org/html/2607.00010#bib.bib30)\]\. In this paper, we focus on prompt\-based simulators, as they are cost\-effective compared to training/fine\-tuning\-based methods, which require fine\-tuning a large language model\. LLM\-based user simulators operate by taking a structured user profile and a recommended item as input, then generating a natural language response that approximates real user behavior\. In prompt\-based simulators, this behavior is controlled entirely through instructions provided to a frozen LLM, without modifying model weights\. Fine\-tuned approaches, by contrast, require large labeled datasets and significant compute, making them impractical in privacy\-sensitive or resource\-constrained settings\.
Current prompt\-based LLM user simulators face three critical challenges that limit practical deployment in conversational recommender systems:
\(1\) Systematic positive bias: LLM\-based simulators exhibit unrealistically high acceptance rates, generating overwhelmingly positive responses that fail to capture realistic rejection behavior\. For instance, when presented with movie recommendations, LLM\-based simulators tend to accept most suggestions regardless of stated user preferences, while real users reject nearly half of the presented recommendations\[[28](https://arxiv.org/html/2607.00010#bib.bib10),[35](https://arxiv.org/html/2607.00010#bib.bib13)\]\.
\(2\) Data leakage: Unlike standard train\-test leakage in machine learning, data leakage in user simulation occurs when items from the user’s profile history appear verbatim as item mentions\. Rather than reasoning from genuine preferences, the simulator echoes these items back in its responses, inflating apparent accuracy and artificially suppressing diversity metrics in ways that misrepresent true simulator capability\[[38](https://arxiv.org/html/2607.00010#bib.bib55),[28](https://arxiv.org/html/2607.00010#bib.bib10)\]\.
\(3\) Severely constrained behavioral diversity: Simulators exhibit popularity bias, disproportionately favoring well\-known mainstream items over niche content, and temporal clustering, concentrating recommendations within narrow time windows\. This leads to homogenized item mentions and fails to capture the diverse preferences of real user populations, limiting their effectiveness for evaluating recommender systems across varied user personas\[[35](https://arxiv.org/html/2607.00010#bib.bib13),[37](https://arxiv.org/html/2607.00010#bib.bib3),[22](https://arxiv.org/html/2607.00010#bib.bib7)\]\.
While prompt\-based simulators are more accessible, their performance is highly sensitive to prompt quality\[[30](https://arxiv.org/html/2607.00010#bib.bib4),[9](https://arxiv.org/html/2607.00010#bib.bib32),[24](https://arxiv.org/html/2607.00010#bib.bib46)\], and recent frameworks addressing this lack comprehensive solutions that tackle all three challenges simultaneously\[[39](https://arxiv.org/html/2607.00010#bib.bib57),[3](https://arxiv.org/html/2607.00010#bib.bib58),[14](https://arxiv.org/html/2607.00010#bib.bib60)\]\. A critical gap therefore remains in principled methods for automatically optimizing simulator behavior to achieve realistic behavioral patterns\. Beyond CRS evaluation, this work addresses a core challenge in LLM\-driven data engineering: how to automatically generate, optimize, and validate synthetic behavioral data at scale\.
To solve these issues, we propose a framework that jointly optimizes simulated user profiles and response patterns and addresses all three challenges simultaneously\. The framework also avoids heavy manual prompt engineering through automatic prompt optimization via interpretable text\-based gradients\. In this paper, we make the following contributions:
- Proposal of optimized framework: To the best of our knowledge, we are the first to propose a framework that automatically optimizes prompts for LLM\-based user simulators in CRSs while addressing all three challenges simultaneously\.
- Solution of Challenges: In the framework, we design entropy\-aware and textual\-gradient\-based scoring functions that address positive bias and popularity bias\. Additionally, we design a profile summarization approach that reduces the impact of data leakage while retaining the essential profile information\. To evaluate the simulator’s ability to overcome over\-acceptance bias, we propose NegFeedback, a proof\-of\-concept metric specifically designed to assess the correctness and rationale of negative user feedback\.
- Improved behaviour alignment: We demonstrate improved behavioral alignment with human interaction patterns compared to GPT\-3\.5 and GPT\-4 baselines, using local execution with Llama3\.3\. We further validate the proposed metric NegFeedback using both an LLM\-based evaluator and human evaluators\. The LLM evaluator enables us to scale the evaluation, and its alignment with human judgments supports the validity of the LLM evaluator\.
## II Related Work
In this section, we organize our review around key challenges to establish the foundation for our approach\.
### II\-A LLM\-Based User Simulation in Recommender Systems
Recent implementations demonstrate significant advances in generating human\-like responses and maintaining coherent user personas\. Many studies have explored LLM\-powered simulators adapted for recommendation contexts that leverage contextual understanding and incorporate personality traits and demographic features into user modeling to generate more realistic user interactions\[[21](https://arxiv.org/html/2607.00010#bib.bib2),[36](https://arxiv.org/html/2607.00010#bib.bib1),[15](https://arxiv.org/html/2607.00010#bib.bib12),[31](https://arxiv.org/html/2607.00010#bib.bib11)\]\. However, existing implementations rely predominantly on manual prompt engineering, where researchers craft static templates based on intuition and limited experimentation\[[28](https://arxiv.org/html/2607.00010#bib.bib10),[30](https://arxiv.org/html/2607.00010#bib.bib4)\]\. This manual approach leads to several fundamental problems: prompt brittleness across different domains, inability to systematically optimize for real user behavior, and lack of principled methods for incorporating domain\-specific requirements\. Recent comprehensive surveys\[[25](https://arxiv.org/html/2607.00010#bib.bib14),[12](https://arxiv.org/html/2607.00010#bib.bib17)\] acknowledge these limitations but do not provide systematic solutions for prompt optimization in user simulation contexts\. The need for automatic prompt optimization in user simulation is further highlighted by the scalability challenges of manual approaches\. As recommendation systems become more complex and diverse, manually crafting prompts for each domain and user type becomes increasingly impractical\.
### II\-B Prompt Optimization and Engineering for LLMs
Recent Automated Prompt Optimization \(APO\) methods include gradient\-based, evolutionary, and RL\-based approaches\[[11](https://arxiv.org/html/2607.00010#bib.bib31),[18](https://arxiv.org/html/2607.00010#bib.bib33)\]\. TextGrad\[[29](https://arxiv.org/html/2607.00010#bib.bib35)\] represents a significant advancement by optimizing prompts through iterative refinement based on textual feedback from the target LLM itself\. This black\-box approach offers two key advantages: \(1\) it supports optimization without requiring access to model internals, and \(2\) it allows for dynamic domain\-specific adaptation\. However, existing APO methods have not addressed user simulation challenges\. Current APO applications focus on traditional NLP tasks with well\-defined success metrics, while user simulation requires handling conflicting behavioral objectives, complex behavioral patterns beyond simple accuracy, and authentic human behavioral diversity\.
### II\-C Evaluation Metrics and Feedback Mechanisms
Existing user simulation evaluations rely on surface\-level NLP metrics \(e\.g\., BLEU, F1\) that overlook behavioral fidelity, such as realistic acceptance rates and informative rejections\[[38](https://arxiv.org/html/2607.00010#bib.bib55),[33](https://arxiv.org/html/2607.00010#bib.bib56),[3](https://arxiv.org/html/2607.00010#bib.bib58)\]\. While recent work acknowledges these gaps, it lacks systematic methods to assess rejection quality or alignment with human behavior\[[39](https://arxiv.org/html/2607.00010#bib.bib57),[3](https://arxiv.org/html/2607.00010#bib.bib58)\]\. These gaps motivate our development of NegFeedback, a proof\-of\-concept metric designed to assess authenticity and rationale quality of user rejection in user simulation contexts\.
### II\-D The Bias and Leakage Problem in LLM\-Based User Simulation
LLM\-based simulators suffer from systematic positive bias leading to unrealistically high acceptance rates and popularity skewed responses\[[19](https://arxiv.org/html/2607.00010#bib.bib5),[7](https://arxiv.org/html/2607.00010#bib.bib8),[13](https://arxiv.org/html/2607.00010#bib.bib6)\]\. Prompt sensitivity exacerbates these issues, while data leakage allows access to target items, compromising evaluation integrity\[[38](https://arxiv.org/html/2607.00010#bib.bib55)\]\. Existing methods lack principled tools for aligning simulator behavior with human patterns\. Our framework extends automatic prompt\-tuning with bias correction to address these challenges and improve behavioral fidelity\.
Figure 1: Automatic prompt\-tuning\-based user simulator optimization pipeline
## III Our Framework
Compared with existing LLM\-based user simulators in CRSs, our framework leverages TextGrad\[[29](https://arxiv.org/html/2607.00010#bib.bib35)\] for automatic prompt optimization, avoiding heavy manual prompt engineering\. To adapt effectively to the user simulation task in CRSs, our framework optimizes behavioral fidelity rather than linguistic similarity\. It introduces entropy\-aware and textual\-gradient\-based scoring functions that guide the optimizer toward prompts that reduce positive bias and popularity bias\. We also design a profile summarization approach that balances reduced data leakage risk against retaining essential profile information\. Lastly, we propose a new metric, validated against human judgments, that evaluates simulators’ rejection quality and thereby their ability to overcome over\-acceptance bias\. The framework runs on a local LLM, Llama\-3\.3\-70B served through Ollama, for privacy\-preserving execution\.
Figure[1](https://arxiv.org/html/2607.00010#S2.F1) illustrates the overall pipeline of our approach\. The process starts with raw user interaction histories extracted from the Amazon dataset \(Section[IV\-A](https://arxiv.org/html/2607.00010#S4.SS1)\)\. These histories are then used to generate and summarize user profiles \(Section[III\-B](https://arxiv.org/html/2607.00010#S3.SS2)\)\. Based on the resulting profiles, task\-specific initial evaluation prompts are constructed \(Section[III\-E](https://arxiv.org/html/2607.00010#S3.SS5)\)\. Automated prompt tuning is subsequently performed using the TextGrad optimization loop \(Section[III\-D](https://arxiv.org/html/2607.00010#S3.SS4)\) to refine the simulator behavior across tasks\.
### III\-A Bias Correction via Entropy\-Based Diversity Metrics
LLM simulators exhibit systematic biases that undermine evaluation validity: positive bias \(over\-acceptance of recommendations\)\[[28](https://arxiv.org/html/2607.00010#bib.bib10)\], popularity bias \(over\-representation of well\-known items\)\[[35](https://arxiv.org/html/2607.00010#bib.bib13),[37](https://arxiv.org/html/2607.00010#bib.bib3)\], and temporal clustering \(narrow time\-window recommendations\)\. Prompt sensitivity further amplifies these issues\. To model behavioral diversity, we use entropy\-based metrics where higher entropy reflects greater unpredictability\. We define five complementary measures, each designed to capture a different aspect of human behavioral fidelity: \(1\) Item Entropy captures diversity across genre, era, region, and cultural origin, penalizing duplicates and IMDb Top 250 overuse; \(2\) BinPref quantifies alignment between simulator and human binary preferences; \(3\) Aspect Entropy measures justification diversity \(e\.g\., plot, acting, pacing\); \(4\) Sentiment Entropy reflects tone variation, exposing positive bias; \(5\) Semantic Richness assesses vocabulary diversity\. Our method iteratively optimizes prompts using these metrics, comparing against ground truth patterns and generating feedback to reduce bias:
1. ItemsTalk: Item entropy quantifying diversity in movie selections\. Let $X=\{x_1,\dots,x_n\}$ be the list of movies binned by \(genre, era, region\):
   $$H(X) = -\sum_{i=1}^{n} p(x_i)\log p(x_i) \tag{1}$$
2. BinPref: Pearson correlation between simulated acceptance rates $(x_i)$ and human movie ratings $(y_i)$:
   $$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}\sum_{i=1}^{n}(y_i-\bar{y})^{2}}} \tag{2}$$
3. OpenPref: Aspect and sentiment entropy measuring review richness\. Let $X=\{x_1,\dots,x_n\}$ be aspects \(plot, acting, pacing\) and $Y=\{y_1,\dots,y_m\}$ be sentiments \(neg, neu, pos\):
   $$H_{\text{aspect}}(X) = -\sum_{i=1}^{n} p(x_i)\log p(x_i) \tag{3}$$
   $$H_{\text{sentiment}}(Y) = -\sum_{i=1}^{m} p(y_i)\log p(y_i) \tag{4}$$
4. RecRequest: Multiple diversity measures\. Let $\mathbf{w}_i$ be word vectors and $\mathbf{s}_i$ be sentence vectors:
   $$\text{TTR} = \frac{|\text{Types}|}{|\text{Tokens}|} \tag{5}$$
   $$\text{W2V} = 1-\frac{1}{|\text{Words}|^{2}}\sum_{i,j=1}^{|\text{Words}|}\cos(\mathbf{w}_i,\mathbf{w}_j) \tag{6}$$
   $$\text{SentDiv} = 1-\frac{1}{|\text{Sentences}|^{2}}\sum_{i,j=1}^{|\text{Sentences}|}\cos(\mathbf{s}_i,\mathbf{s}_j) \tag{7}$$
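The measures in Eqs. (1) to (7) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the logarithm base is not stated in the text (natural log is assumed here and exposed as a parameter), and the embedding vectors are assumed to be supplied by the caller.

```python
import math
from collections import Counter
from itertools import product


def shannon_entropy(labels, base=math.e):
    """Eqs. (1), (3), (4): entropy of the empirical distribution over labels.

    The log base is an assumption; the text does not specify it.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    return -sum(p * math.log(p) / math.log(base) for p in probs)


def pearson_r(xs, ys):
    """Eq. (2): Pearson correlation (undefined if either input is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return num / den


def type_token_ratio(tokens):
    """Eq. (5): number of distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_diversity(vectors):
    """Eqs. (6), (7): one minus the mean cosine over all ordered pairs (i, j).

    As in the equations, the sum includes the i == j terms.
    """
    n = len(vectors)
    total = sum(cosine(u, v) for u, v in product(vectors, repeat=2))
    return 1 - total / n ** 2
```

For instance, `shannon_entropy` would be called on the list of (genre, era, region) bins for ItemsTalk and on aspect or sentiment labels for OpenPref, while `mean_pairwise_diversity` would receive word or sentence embeddings for RecRequest.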
A potential concern with entropy\-based optimization is that maximizing entropy could produce outputs that are diverse but behaviorally random rather than human\-like\. Our framework addresses this directly by anchoring all scoring functions to empirically derived human baseline distributions rather than rewarding raw entropy maximization\. Concretely, our evaluation prompts define target entropy ranges derived from real Amazon user behavior\. Outputs that fall outside these ranges in either direction are penalized, meaning the optimizer is explicitly discouraged from overshooting into unrealistic diversity\.
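One way to picture the two-sided penalty is the sketch below. The function name and the linear form of the penalty are assumptions made for illustration; in the paper the target ranges are encoded in the evaluation prompts, and their numeric bounds are not given in this section.

```python
def range_penalty(value, low, high):
    """Return 0.0 when value lies in [low, high], else the distance to the
    nearest bound. Penalizes undershooting and overshooting alike."""
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0.0
```

Under this form, an output whose entropy is too high is penalized exactly like one that is too low by the same margin, so raw entropy maximization is never rewarded.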
### III\-B Profile Summarization for Data Leakage Prevention
Using complete user histories creates context overload and data leakage when target items appear in the history\. We introduce profile summarization: we randomly sample a subset of each user’s history and abstract it into natural\-language preferences, e\.g\., transforming *The Godfather, Goodfellas, Casino* into “prefers character\-driven crime dramas with moral ambiguity\.” For each user, we sample up to $k=5$ items from their history and pass the titles and years to Llama\-3\.3\-70B with the following constrained prompt:
> “Summarize the user’s taste and behavior in simple English\. Keep the description short without naming specific movies or TV shows\.”
Refer to Section[IV\-C](https://arxiv.org/html/2607.00010#S4.SS3) for results\.
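The sampling step can be sketched as follows. Only the instruction text and the cap of $k=5$ items come from the description above; the function name, the `(title, year)` tuple format, and the "Watched items" layout are hypothetical choices for illustration.

```python
import random

# Instruction text quoted from the section above.
SUMMARY_INSTRUCTION = (
    "Summarize the user's taste and behavior in simple English. "
    "Keep the description short without naming specific movies or TV shows."
)


def build_summary_prompt(history, k=5, seed=None):
    """Sample up to k (title, year) pairs and format a summarization prompt.

    The prompt layout below is an assumption, not the authors' template.
    """
    rng = random.Random(seed)
    sample = rng.sample(history, min(k, len(history)))
    lines = [f"- {title} ({year})" for title, year in sample]
    return SUMMARY_INSTRUCTION + "\n\nWatched items:\n" + "\n".join(lines)
```

The returned string would then be sent to the summarizer LLM, and only the resulting summary (not the sampled titles) would be placed in the simulator prompt.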
### III\-C NegFeedback: Rejection Quality Evaluation
We designed a new metric, NegFeedback, extending\[[28](https://arxiv.org/html/2607.00010#bib.bib10)\]’s feedback task to assess rejection quality\. Unlike prior work focused on accept/reject coherence, NegFeedback evaluates whether simulators give persona\-aligned rejection rationales and informative reformulation requests, revealing deeper reasoning capability\. NegFeedback requires simulators to: reject inappropriate recommendations, provide preference\-based justifications, generate semantically rich reformulation requests without naming movies, and maintain persona\-consistent tone\. An LLM\-based evaluator scores responses \(0–10\) on rejection clarity, justification specificity, reformulation quality, and behavioral consistency\. For each NegFeedback instance, the evaluator LLM \(Llama\-3\.3\-70B\) receives the user profile summary, recommended item description, and simulator response, along with a rubric describing four criteria: \(A\) *rejection clarity*, \(B\) *persona\-grounded justification specificity*, \(C\) *reformulation quality without naming specific titles*, and \(D\) *tone consistency with persona*\. The evaluator outputs four integer scores $s_A, s_B, s_C, s_D \in [0,10]$, and we define
$$\text{NegFeedback} = \frac{1}{4}\left(s_A + s_B + s_C + s_D\right). \tag{8}$$
The same rubric is used for human evaluation, and we report inter\-annotator agreement and human–LLM correlation in Section[IV\-F2](https://arxiv.org/html/2607.00010#S4.SS6.SSS2)\.
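As a worked instance of Eq. (8), the aggregation is a plain average of the four rubric scores. The function name and the input validation are illustrative additions, not part of the paper.

```python
def negfeedback_score(s_a, s_b, s_c, s_d):
    """Eq. (8): mean of the four rubric scores, each an integer in [0, 10]."""
    scores = (s_a, s_b, s_c, s_d)
    for s in scores:
        if not isinstance(s, int) or not 0 <= s <= 10:
            raise ValueError(f"score out of range: {s!r}")
    return sum(scores) / 4
```

For example, rubric scores of 8, 6, 7, and 9 sum to 30 and give a NegFeedback value of 7.5.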
### III\-D TextGrad Optimization Loop
We apply TextGrad optimization separately for each task, allowing task\-specific prompt refinement\. For task $i$, we:
1. Initialize prompt $\theta_i^{(0)}$ with a task\-specific template\.
2. For iteration $t=0,\dots,T-1$:
   - \(a\) Sample a batch of users $B$ and generate responses using $\theta_i^{(t)}$\.
   - \(b\) Compute the task\-specific metric $m_i(\theta_i^{(t)})$\.
   - \(c\) Query the backward engine LLM in TextGrad for feedback on improving $m_i$\.
   - \(d\) Update the prompt to $\theta_i^{(t+1)}$\.
3. Use the optimized $\theta_i^{(T)}$ for final evaluation\.
In all experiments we run $T=3$ iterations with batch size $B=100$ users\. There is no early stopping; optimization always runs all $T$ iterations\. Only the guideline and persona sections of the prompt are editable; task instructions and system boilerplate remain fixed\. Edits are constrained to preserve field structure and keep total prompt length below $L=2048$ tokens\.
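The loop above can be sketched schematically as follows. This is not the TextGrad API: the four callables (`generate`, `metric`, `feedback`, `apply_edit`) are hypothetical stand-ins for the simulator call, the task metric, the feedback LLM, and the prompt update. Rejecting over-length candidates and counting tokens by whitespace are assumptions about how the length constraint might be enforced.

```python
import random


def optimize_prompt(initial_prompt, users, generate, metric, feedback,
                    apply_edit, iterations=3, batch_size=100,
                    max_tokens=2048,
                    count_tokens=lambda text: len(text.split()), seed=0):
    """Schematic per-task loop: sample, generate, score, critique, update."""
    rng = random.Random(seed)
    prompt = initial_prompt
    for _ in range(iterations):  # fixed number of iterations, no early stop
        batch = rng.sample(users, min(batch_size, len(users)))
        responses = [generate(prompt, user) for user in batch]
        score = metric(responses)
        critique = feedback(prompt, responses, score)
        candidate = apply_edit(prompt, critique)
        # Assumed enforcement of the length limit: keep the old prompt
        # when the edited one would reach or exceed max_tokens.
        if count_tokens(candidate) < max_tokens:
            prompt = candidate
    return prompt
```

Passing the stages as callables keeps the control flow visible while leaving the LLM calls out of the sketch; in practice a tokenizer matching the simulator model would replace the whitespace count.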
### III\-E Text Gradient Design
##### TextGrad Evaluation Prompt Design
For each task, we first established human baselines using responses from 10 randomly selected Amazon users\. These baselines defined the target distributions for the five tasks\. Using these values, we constructed an initial evaluation prompt and performed prompt tuning\. After each iteration, the resulting textual gradients were manually analyzed to extract qualitative feedback in the form of natural language\. Insights from this analysis were then used to refine and update the evaluation prompt by incorporating additional constraints, ensuring that the evaluation prompt progressively and accurately captured and evaluated the intended aspects of each task\. Evaluation prompts were iteratively refined over multiple optimization runs using textual\-gradient analysis until the simulator outputs aligned closely with the human targets\. This refinement was performed separately for each task and concluded once the evaluation prompt stabilized\. The finalized prompts were then used for large\-scale evaluation\. Once the final evaluation prompt was established, we scaled the evaluation by replacing textual\-gradient analysis with numerical scoring functions\. These functions mapped deviations from the human baseline to scores in the range \[1,10\], where 1 indicates the largest deviation and 10 indicates the closest match\.
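The text does not give the exact form of the numerical scoring functions, so the following is only one plausible sketch: a linear map from absolute deviation to the range [1, 10]. The function name and the `max_deviation` normalizer are assumptions.

```python
def deviation_to_score(value, baseline, max_deviation):
    """Map |value - baseline| linearly onto [1, 10].

    10 means an exact match with the human baseline; 1 means the deviation
    is at least max_deviation. The linear form is an assumption.
    """
    fraction = min(abs(value - baseline) / max_deviation, 1.0)
    return 10.0 - 9.0 * fraction
```

With a baseline of 5.0 and `max_deviation` of 4.0, a value of 7.0 is halfway to the cap and scores 5.5.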
TABLE I: Prompt Evolution Across Five Tasks Optimization\. Key patterns observed after prompt tuning: \(1\) ItemsTalk: Adds explicit diversity constraints \(geography, time, genre\); \(2\) BinPref: Introduces threshold rule \(3\.5 cutoff\) and binary output format; \(3\) OpenPref: Demands multi\-aspect analysis, scene\-specific examples, and personal reflection; \(4\) RecRequest: Removes verbatim titles, adds rich semantic descriptors; \(5\) NegFeedback: Enhances rejection quality, reformulation ability, and rationale coherence\.
##### Task\-Specific Evaluation Prompts
Each task employs a specialized evaluation prompt $e_i$ that defines hard constraints, scoring criteria, and target behavioral ranges informed by human baselines\. In ItemsTalk, $e_i$ enforces strict formatting before evaluating content, penalizing popular or redundant selections and rewarding high\-entropy lists spanning various genres, decades, and continents\. In BinPref, $e_i$ uses deterministic correctness scoring, assigning positive credit only to exact Yes/No outputs with no additional text, aligned with human decision boundaries\. In OpenPref, $e_i$ constrains responses with respect to aspect and sentiment entropy baselines derived from human review distributions\. In RecRequest, $e_i$ assesses fluency, implicitness, and diversity without explicit item references, targeting baselines based on human\-written queries\. In NegFeedback, $e_i$ emphasizes clear rejection justification tied to persona preferences, semantically rich reformulation without naming items, and tone consistency, with a 10\-point rubric penalizing generic reasoning, verbatim movie mentions, and excessive prompt length\. All tasks use a fixed $K=3$ epochs without early stopping\. Table[I](https://arxiv.org/html/2607.00010#S3.T1) illustrates the concrete prompt transformations resulting from this optimization process\. Table[II](https://arxiv.org/html/2607.00010#S3.T2) presents the complete evaluation prompts used for optimization\.
TABLE II: Task\-Specific Evaluation for Prompt Optimization\. Note: All prompts instruct the evaluator LLM \(Llama\-3\.3\-70B\) to return only a numeric score \(0–10 or 0–1 for binary tasks\) with no additional text, ensuring deterministic optimization signals for automated prompt tuning\. Target ranges for entropy and diversity metrics were empirically derived from human baseline distributions \(Section[IV\-A](https://arxiv.org/html/2607.00010#S4.SS1)\)\.
## IV Experiments
Our experimental evaluation addresses three core research questions to validate the proposed framework\.
RQ1: Does our framework improve performance over existing prompt\-based user simulators in conversational recommender systems? To answer this question, we compare our method against GPT\-3\.5\-turbo and GPT\-4 baselines, which represent the state\-of\-the\-art prompt\-based user simulation approaches used in prior research\[[28](https://arxiv.org/html/2607.00010#bib.bib10),[38](https://arxiv.org/html/2607.00010#bib.bib55),[39](https://arxiv.org/html/2607.00010#bib.bib57)\]\. Results are presented in Section[IV\-F1](https://arxiv.org/html/2607.00010#S4.SS6.SSS1)\.
RQ2: How does our framework perform under different prompting configurations? To answer this question, we conduct comprehensive comparisons across four settings: zero\-shot, In\-Context Learning \(ICL\), Retrieval\-Augmented Generation \(RAG\), and Profile\-Augmented Generation \(PAG\), following established evaluation protocols\[[40](https://arxiv.org/html/2607.00010#bib.bib63)\]\. Comparative results across these configurations are reported in Section[IV\-F1](https://arxiv.org/html/2607.00010#S4.SS6.SSS1)\.
RQ3: Is the proposed NegFeedback metric reliable for measuring simulators’ rejection quality and mitigating over\-acceptance bias? To answer this question, we recognize that while human evaluation provides the gold standard for assessing rejection behavior\[[28](https://arxiv.org/html/2607.00010#bib.bib10),[20](https://arxiv.org/html/2607.00010#bib.bib37)\], it does not scale to large\-scale evaluation\[[38](https://arxiv.org/html/2607.00010#bib.bib55),[33](https://arxiv.org/html/2607.00010#bib.bib56)\]\. Therefore, we design an LLM\-based evaluator to enable scalable assessment\. To check the quality of the LLM evaluator, we conduct a pilot human evaluation study and calculate the alignment between human judgments and LLM evaluator scores using Pearson correlation\.
The human validation methodology is detailed in Section[IV\-E](https://arxiv.org/html/2607.00010#S4.SS5), with comprehensive NegFeedback performance analysis presented in Section[IV\-F2](https://arxiv.org/html/2607.00010#S4.SS6.SSS2)\.
### IV\-A Dataset
We use the Amazon Reviews 2023 dataset \(Movies & TV subset\)\[[8](https://arxiv.org/html/2607.00010#bib.bib38)\], which provides large\-scale real\-world user interactions \(\>1\.4 million reviews\) with diverse behavioral patterns\. We develop a pre\-processing pipeline that transforms raw review data into structured formats suitable for our five simulation tasks\. The Movies & TV subset aligns with conversational recommendation research and provides comparable baselines from prior work\. The methodology generalizes across other Amazon Reviews 2023 corpus subsets, supporting scalability and adaptability for future research\.
### IV\-B Deployment Infrastructure and Practical Considerations
We implemented a custom OllamaEngine extending TextGrad’s EngineLM for local deployment using Llama 3\.3:70B, selected for its instruction\-following capability and compatibility with privacy\-preserving execution\. Smaller models \(e\.g\., mistral\-small, deepseek\-r1\) failed to meet TextGrad’s requirements\. Our system ran on a single NVIDIA A100 GPU via institutional HPC, completing simulation and evaluation for 100 users across 5 tasks in 15 GPU hours\. Profiles are generated independently, and optimized prompts are reusable, enabling low\-cost, parallelized evaluation\. Cross\-domain adaptation requires minimal changes to attribute schemas and prompt templates, demonstrating feasibility for organizations with diverse recommendation systems\.
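The paper does not show the OllamaEngine code, so the following is a hypothetical sketch of what such an adapter might look like. It assumes TextGrad's `EngineLM` base class (with a fallback stub when TextGrad is not installed), a local Ollama server on its default port, and Ollama's `/api/generate` REST endpoint; the method `build_payload` and all default values are illustrative choices.

```python
import json
import urllib.request

try:
    from textgrad.engine.base import EngineLM
except ImportError:
    class EngineLM:
        """Stand-in base so the sketch imports without TextGrad installed."""


class OllamaEngine(EngineLM):
    """Hypothetical adapter that routes generation calls to a local Ollama."""

    def __init__(self, model_string="llama3.3:70b",
                 host="http://localhost:11434",
                 system_prompt="You are a helpful assistant."):
        self.model_string = model_string
        self.host = host
        self.system_prompt = system_prompt

    def build_payload(self, prompt, system_prompt=None):
        """Assemble the JSON body for a non-streaming generate request."""
        return {
            "model": self.model_string,
            "prompt": prompt,
            "system": system_prompt or self.system_prompt,
            "stream": False,
        }

    def generate(self, prompt, system_prompt=None, **kwargs):
        """Send the prompt to the local server and return the generated text."""
        body = json.dumps(self.build_payload(prompt, system_prompt))
        request = urllib.request.Request(
            f"{self.host}/api/generate",
            data=body.encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as reply:
            return json.loads(reply.read())["response"]

    def __call__(self, prompt, **kwargs):
        return self.generate(prompt, **kwargs)
```

Keeping all traffic on `localhost` is what makes the setup privacy-preserving in the sense described above: user profiles and responses never leave the machine running the model.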
### IV\-CProfile Generation and Summarization
We implement a privacy\-preserving profile generation system using a locally deployed Llama 3\.3:70B served through Ollama, which avoids cloud\-related data concerns while maintaining generation quality\. Profiles include demographic and behavioral traits based on prior work\[[36](https://arxiv.org/html/2607.00010#bib.bib1)\]\. Each profile is generated through structured LLM interaction to ensure diversity and coherence\. To reduce the potential for direct data leakage, we summarize each user’s history by sampling up to five movies and prompting the LLM to produce a concise preference description\. This two\-stage process yields compact, natural\-language profiles suitable for simulation while anonymizing sensitive data\[[38](https://arxiv.org/html/2607.00010#bib.bib55),[39](https://arxiv.org/html/2607.00010#bib.bib57),[17](https://arxiv.org/html/2607.00010#bib.bib48)\]\. The profile summary replaces the raw history in all initial prompts\. This substitution *shortens the prompt significantly, reduces the potential for direct data leakage, improves item entropy, and substantially lowers context\-window pressure*\.
To assess leakage risk and diversity impact, we measured item entropy across 100 users: with raw histories the simulator’s item entropy was 6\.415, while with summaries it rose to 9\.269 \(human baseline: 10\.067\)\. We further analyzed 100 responses in which movie names from the user history appeared\. The raw histories contained the target titles verbatim in almost all cases, which made the mentioned titles highly predictable and led to the low entropy score \(6\.415\)\. This is strong evidence that data leakage caused the reduction in item entropy\. The profile summaries abstracted these movie titles into genre and preference descriptions in all cases: manual inspection of 100 samples found no verbatim title reproduction\. The resulting entropy was the closest to the human baseline\.
This diagnostic indicates that profile summarization reduces direct leakage exposure and substantially improves behavioral diversity\.
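The sampling step and the verbatim\-title check described above can be sketched as follows\. This is a minimal illustration rather than the authors’ pipeline: the function names, the prompt wording, and the word\-boundary matching rule are all assumptions, and the LLM call itself is left out\.

```python
import random
import re


def sample_history(titles, k=5, seed=0):
    """Sample up to k titles from a user's history (k=5 follows the text)."""
    titles = list(titles)
    if len(titles) <= k:
        return titles
    return random.Random(seed).sample(titles, k)


def build_summary_prompt(sampled_titles):
    """Build a summarization prompt. The wording is hypothetical."""
    listing = "\n".join(f"- {title}" for title in sampled_titles)
    return (
        "Describe this user's movie preferences in two or three sentences. "
        "Mention genres, themes, and tone only; do not name any title.\n"
        + listing
    )


def leaked_titles(summary, titles):
    """Return the titles that appear verbatim (case-insensitive) in the summary."""
    text = summary.lower()
    return [
        title
        for title in titles
        if re.search(r"\b" + re.escape(title.lower()) + r"\b", text)
    ]


history = ["Alien", "Blade Runner", "Heat", "Arrival", "Dune", "Se7en", "Moon"]
prompt = build_summary_prompt(sample_history(history))
# Stand-in for the LLM reply; a real pipeline would send `prompt` to the model.
summary = "Enjoys cerebral science fiction and tense crime dramas."
print(leaked_titles(summary, history))  # []
```

Matching on word boundaries avoids false alarms such as the title *Heat* matching inside the word "theater"; catching paraphrased or misspelled titles would need fuzzier matching\.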
### IV\-D Task Design and Baselines
We evaluate on five tasks\. Four follow\[[28](https://arxiv.org/html/2607.00010#bib.bib10)\]: ItemsTalk measures item\-mention diversity through entropy; BinPref measures preference alignment between the user simulator and the real user; OpenPref assesses nuanced aspect and sentiment expression; and RecRequest evaluates lexical diversity and request granularity\. The fifth, NegFeedback, is introduced in Section[IV\-E](https://arxiv.org/html/2607.00010#S4.SS5)\. We compare against GPT\-3\.5\-turbo and GPT\-4 in four configurations: zero\-shot, In\-Context Learning \(ICL\), Retrieval Augmented Generation \(RAG\), and Profile Augmented Generation \(PAG\), following previous research\[[40](https://arxiv.org/html/2607.00010#bib.bib63)\]\. These baselines represent user simulators that do not require heavy model training\. For NegFeedback, we conduct a human evaluation in which independent evaluators assess rejection clarity, behavioral consistency, and reformulation quality on 10 randomly sampled responses\. We analyze inter\-evaluator correlation and measure the agreement between human evaluators and LLM\-based scores across our method and the baseline models\.
### IV\-E NegFeedback Human and Baseline Evaluation
We introduce NegFeedback, which evaluates rejection quality, reformulation ability, and rationale coherence, all of which are core behaviors in conversational recommendation\.
Study Design: We conducted a pilot human evaluation on 10 samples to establish proof\-of\-concept validity\. Three native English\-speaking graduate students independently scored LLM\-generated rejection responses on a 0–10 scale across four criteria: \(A\) Rejection clarity \(0 = implicit acceptance, 10 = unambiguous rejection\), \(B\) Justification specificity \(0 = generic, 10 = persona\-grounded\), \(C\) Reformulation quality \(0 = no guidance, 10 = semantically rich without naming titles\), \(D\) Tone consistency \(0 = contradictory, 10 = perfect match\)\. While $n=10$ is insufficient for definitive conclusions, the results provide initial evidence of metric validity; we commit to scaling to $n \geq 100$ in future work\.
Protocol: Evaluators were blinded to model identity, received randomized samples, and scored each response given the user profile, the movie name, and the simulator’s rejection response\. The same rubric was used for both human and LLM evaluation to reduce subjectivity and LLM bias\. Inter\-rater agreement among the three evaluators shows moderate consistency \(average pairwise correlation: 0\.52\)\.
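An average pairwise correlation of this kind can be computed as sketched below, assuming Pearson correlation \(the text does not say which coefficient was used\)\. The scores are made\-up placeholders rather than the study data, and the function names are illustrative\.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns None when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # the coefficient is undefined for constant scores
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def mean_pairwise_correlation(ratings):
    """Average Pearson r over all rater pairs; `ratings` maps rater -> scores."""
    values = [pearson(ratings[a], ratings[b]) for a, b in combinations(ratings, 2)]
    values = [r for r in values if r is not None]
    return sum(values) / len(values) if values else None


# Placeholder scores on the 0-10 rubric for five responses (not real data).
ratings = {
    "rater_1": [8, 6, 9, 4, 7],
    "rater_2": [7, 5, 9, 5, 6],
    "rater_3": [9, 4, 8, 6, 5],
}
print(round(mean_pairwise_correlation(ratings), 3))
```

Returning `None` for a constant score vector makes the undefined case explicit instead of raising a division\-by\-zero error, which matters when an evaluator assigns the same score to every sample\.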
### IV\-F Results and Discussion
#### IV\-F1 Performance Across Behavioral Tasks
TABLE III: OpenPref Task Results
Our simulator demonstrates substantial improvements across multiple behavioral dimensions through automated prompt optimization\. In addition to the GPT baselines and our method, the tables report a human baseline for every task\. The Human baseline is computed directly from Amazon user data and serves as the gold standard for behavioral fidelity\. Unlike synthetic responses from simulators, these data reflect genuine diversity, preferences, and communication patterns\.
In the OpenPref task \(Table [III](https://arxiv.org/html/2607.00010#S4.T3)\), post\-tuning aspect entropy \(7\.523\) is close to the human value \(6\.761\), slightly overshooting it, and far exceeds the GPT baselines \(4\.17 to 5\.087\), while sentiment entropy \(1\.274\) closely matches the human pattern \(1\.277\)\. GPT variants uniformly collapse to zero sentiment entropy across all context strategies \(ICL, RAG, PAG\), revealing severe positive bias\. This failure stems from GPT’s tendency to generate safe, uniformly positive responses when conditioned on user preferences, a bias that our entropy\-based optimization explicitly addresses through iterative feedback that penalizes sentiment uniformity\.
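The entropy figures above can be read against the following sketch of Shannon entropy over a list of labels\. It assumes base\-2 logarithms and an empirical distribution over already\-extracted labels; the text states neither the logarithm base nor the extraction procedure, so both are assumptions\.

```python
from collections import Counter
from math import log2


def shannon_entropy(labels):
    """Shannon entropy in bits of the empirical distribution over `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    # p * log2(1 / p), summed over the observed categories
    return sum((c / total) * log2(total / c) for c in counts.values())


# A simulator that is always positive has no sentiment diversity at all.
print(shannon_entropy(["pos"] * 8))                   # 0.0
# A mix of sentiments yields a strictly positive entropy.
print(shannon_entropy(["pos", "pos", "neg", "neu"]))  # 1.5
```

The same function applies to item mentions or aspects by passing item identifiers or aspect labels instead of sentiment labels\. A simulator that marks every response as positive scores exactly zero, which is the collapse reported for the GPT variants\.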
TABLE IV: ItemsTalk Task Results
For ItemsTalk \(Table [IV](https://arxiv.org/html/2607.00010#S4.T4)\), the optimized simulator achieves an item entropy of 9\.269, close to human diversity \(10\.067\) and substantially above the GPT baselines \(6\.496 to 6\.644\)\. This improvement reflects the shorter, leakage\-reduced summaries, which prevent over\-reliance on history\-popular items\. The uniform collapse across the ICL, RAG, and PAG variants suggests that these context\-augmentation strategies introduce information that overwhelms the GPT model, which then treats the augmented context as noise that degrades its inherent diversity\. Even vanilla GPT\-3\.5\-turbo underperforms base Llama\-3\.3\-70b, likely for two reasons that both bias it toward high\-frequency items: its extensive training constrains item selection toward safer mainstream recommendations, and its training cutoff underrepresents niche item diversity\. Our iterative optimization amplifies the LLM’s diversity through multiple iterations of tuning and reward\-based learning\.
TABLE V: BinPref Task Results
In BinPref \(Table [V](https://arxiv.org/html/2607.00010#S4.T5)\), the models based on GPT\-3\.5\-turbo and GPT\-4 show relatively low correlation\. Here the llama3\.3:70b base model \(0\.731\) slightly outperforms our tuned method \(0\.726\)\. A likely explanation is that the base model already has strong instruction\-following ability, so prompt tuning was not necessary for this task\. Our tuned method still performs better than the GPT baselines, whose instruction\-following ability is weaker\. The same problem was identified for the GPT family in previous research\[[28](https://arxiv.org/html/2607.00010#bib.bib10)\]\. Using a larger model with better instruction following appears to resolve it\.
TABLE VI: RecRequest Task Results
In RecRequest \(Table [VI](https://arxiv.org/html/2607.00010#S4.T6)\), the tuned simulator achieves a Type\-Token Ratio of 0\.235, close to human lexical patterns \(0\.188\), while the GPT baselines exhibit inflated diversity \(0\.42\+\)\. This inflation indicates that GPT generates vocabulary\-rich but stylistically unnatural requests, prioritizing lexical variety over conversational authenticity\. By optimizing toward the human baseline, our method calibrates this trade\-off, producing requests that are appropriately repetitive in structure while remaining semantically meaningful\.
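For reference, a Type\-Token Ratio can be computed as in the sketch below\. The regular\-expression tokenizer and the choice to pool all requests into one corpus are assumptions; the text does not specify how tokens were extracted or whether the ratio was computed per request or per corpus\.

```python
import re


def type_token_ratio(texts):
    """Corpus-level TTR: number of distinct tokens divided by total tokens."""
    tokens = []
    for text in texts:
        # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
        tokens.extend(re.findall(r"[a-z0-9']+", text.lower()))
    return len(set(tokens)) / len(tokens) if tokens else 0.0


requests = ["Recommend a funny movie", "Recommend a scary movie"]
print(type_token_ratio(requests))  # 0.625 (5 distinct tokens out of 8)
```

Because the ratio falls as a corpus grows and common words repeat, comparisons are only meaningful when the simulator and human samples are of similar size\.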
#### IV\-F2 NegFeedback: Evaluating Rejection Realism
TABLE VII: NegFeedback Task Results
Table [VII](https://arxiv.org/html/2607.00010#S4.T7) presents correlations between model outputs and human evaluation for the NegFeedback task\. As detailed in Section [IV\-E](https://arxiv.org/html/2607.00010#S4.SS5), our method achieves the highest human\-LLM correlation \($r=0.484$, $p=0.156$\), substantially outperforming base Llama \($r=0.258$\) and all GPT\-3\.5\-turbo variants \(negative or near\-zero correlations\)\. Although the result is not statistically significant because of the limited sample size, the moderate positive correlation suggests that the LLM evaluator captures a meaningful human\-aligned signal\. Most GPT\-3\.5\-turbo configurations produced zero or near\-zero variance outputs, which leaves the correlation undefined; Laplace smoothing was applied to obtain values, and even then these configurations correlated poorly with human scores\.
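As a consistency check on the reported significance level, assume a Pearson correlation tested with the standard two\-sided t\-test on $n=10$ paired scores \(the text does not state which test was used\)\. The reported $r=0.484$ then gives:

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
  = \frac{0.484\,\sqrt{8}}{\sqrt{1-0.484^{2}}}
  \approx 1.56,
\qquad \nu = n-2 = 8,
\qquad p_{\text{two-sided}} \approx 0.156
```

This agrees with the reported $p=0.156$\. Under the same assumptions, a sample of ten would need $|r|$ above roughly 0\.63 to reach $p<0.05$, which illustrates why a larger evaluation set is needed before drawing firm conclusions\.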
## V Conclusion
We presented a multi\-objective framework for automatic prompt optimization of LLM\-based user simulators in conversational recommender systems\. This work jointly targets systematic positive bias, data leakage, and constrained behavioral diversity, while avoiding brittle manual prompt engineering through textual gradients and entropy\-aware scoring\. Profile summarization is integrated to abstract verbatim histories\. Our approach brings item entropy, aspect and sentiment entropy, preference alignment, semantic richness, and rejection quality substantially closer to human baselines\. The proposed NegFeedback task and metric further capture persona\-grounded rejection quality and over\-acceptance bias, with an LLM evaluator whose scores show meaningful correlation with human judgments, suggesting that scalable, human\-aligned evaluation of rejection behavior is feasible without exhaustive human annotation\. The approach generalizes across recommendation domains with minimal adaptation, enabling scalable evaluation for industrial use\. This makes our approach immediately applicable for practitioners seeking reproducible, privacy\-compliant simulation pipelines in regulated or resource\-constrained environments\.
## VI AI\-Generated Content Acknowledgement
The authors acknowledge the use of Claude \(Anthropic\) to refine the language and improve the clarity of this manuscript\. Specifically, the tool was used in the Introduction, Our Framework, and Experiments sections to polish author written text for better flow and grammatical precision\. All AI\-generated suggestions were critically reviewed and modified by the authors, who maintain full responsibility for the accuracy and integrity of the final content\.
## References
- \[1\]K\. Balog, N\. Bernard, S\. Zerhoudi, and C\. Zhai\(2025\)Theory and toolkits for user simulation in the era of generative AI: user modeling, synthetic data generation, and system evaluation\.InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2025, Padua, Italy, July 13\-18, 2025,Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p3.1)\.
- \[2\]A\. Bernadić, G\. Kujundžić, and I\. Primorac\(2025\)Large language models in power systems: enhancing control and decision\-making\.International Journal of Innovative Solutions in Engineering\.External Links:[Document](https://dx.doi.org/10.47960/3029-3200.2025.1.1.10)Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p3.1)\.
- \[3\]L\. Chen, Q\. Dai, Z\. Zhang, X\. Feng, M\. Zhang, P\. Tang, X\. Chen, Y\. Zhu, and Z\. Dong\(2025\)RecUserSim: a realistic and diverse user simulator for evaluating conversational recommender systems\.InCompanion Proceedings of the ACM on Web Conference 2025,pp\. 133–142\.Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p4.1),[§I](https://arxiv.org/html/2607.00010#S1.p9.1),[§II\-C](https://arxiv.org/html/2607.00010#S2.SS3.p1.1)\.
- \[4\]M\. Chen, H\. Fu, C\. Liu, X\. Wang, Z\. Li, and J\. Sun\(2025\)Build a good human\-free prompt tuning: jointly pre\-trained template and verbalizer for few\-shot classification\.IEEE Transactions on Knowledge and Data Engineering37,pp\. 2253–2265\.External Links:[Document](https://dx.doi.org/10.1109/TKDE.2025.3543422)Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p4.1)\.
- \[5\]W\. Fan, Z\. Zhao, J\. Li, Y\. Liu, X\. Mei, Y\. Wang, J\. Tang, and Q\. Li\(2024\)Recommender systems in the era of large language models \(LLMs\)\.IEEE Transactions on Knowledge and Data Engineering36\(11\),pp\. 6889–6907\.External Links:[Document](https://dx.doi.org/10.1109/TKDE.2024.3392335)Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p1.1)\.
- \[6\]C\. Gao, W\. Lei, X\. He, M\. De Rijke, and T\. S\. Chua\(2021\)Advances and challenges in conversational recommender systems: a survey\.AI Open2,pp\. 100–126\.Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p1.1)\.
- \[7\]G\. Gui and O\. Toubia\(2023\)The challenge of using llms to simulate human behavior: a causal inference perspective\.Columbia Business School Research Paper\.Cited by:[§II\-D](https://arxiv.org/html/2607.00010#S2.SS4.p1.1)\.
- \[8\]Y\. Hou, J\. Li, Z\. He, A\. Yan, X\. Chen, and J\. McAuley\(2024\)Bridging language and items for retrieval and recommendation\.arXiv preprint arXiv:2403\.03952\.Cited by:[§IV\-A](https://arxiv.org/html/2607.00010#S4.SS1.p1.1)\.
- \[9\]B\. Lester, R\. Al\-Rfou, and N\. Constant\(2021\)The power of scale for parameter\-efficient prompt tuning\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 3045–3059\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.243)Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p9.1)\.
- \[10\]R\. Li, S\. Ebrahimi Kahou, H\. Schulz, V\. Michalski, L\. Charlin, and C\. Pal\(2018\)Towards deep conversational recommendations\.InAdvances in Neural Information Processing Systems,Vol\.31\.Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p1.1)\.
- \[11\]W\. Li, X\. Wang, W\. Li, and B\. Jin\(2025\)A survey of automatic prompt engineering: an optimization perspective\.arXiv preprint arXiv:2502\.11560\.Cited by:[§II\-B](https://arxiv.org/html/2607.00010#S2.SS2.p1.1)\.
- \[12\]J\. Lin, X\. Dai, Y\. Xi, W\. Liu, B\. Chen, X\. Li, C\. Zhu, H\. Guo, Y\. Yu, R\. Tang, and W\. Zhang\(2023\)How can recommender systems benefit from large language models: a survey\.ACM Transactions on Information Systems43,pp\. 1–47\.External Links:[Document](https://dx.doi.org/10.1145/3678004)Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p1.1),[§II\-A](https://arxiv.org/html/2607.00010#S2.SS1.p1.1)\.
- \[13\]Y\. Lu, A\. Aleta, C\. Du, L\. Shi, and Y\. Moreno\(2024\)LLMs and generative agent\-based models for complex systems research\.Physics of Life Reviews51,pp\. 283–293\.External Links:[Document](https://dx.doi.org/10.1016/j.plrev.2024.10.013)Cited by:[§II\-D](https://arxiv.org/html/2607.00010#S2.SS4.p1.1)\.
- \[14\]X\. Luo, Z\. Tang, J\. Wang, and X\. Zhang\(2024\)DuetSim: building user simulator with dual large language models for task\-oriented dialogues\.Proceedings of LREC\-COLING 2024\.Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p4.1),[§I](https://arxiv.org/html/2607.00010#S1.p9.1)\.
- \[15\]J\. S\. Park, J\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein\(2023\)Generative agents: interactive simulacra of human behavior\.InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology,pp\. 1–22\.Cited by:[§II\-A](https://arxiv.org/html/2607.00010#S2.SS1.p1.1)\.
- \[16\]K\. Przystalski, J\. K\. Argasiński, N\. Lipp, and D\. Pacholczyk\(2025\)The rise of the large language models\.InBuilding Personality\-Driven Language Models: How Neurotic is ChatGPT,pp\. 3–9\.Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p3.1)\.
- \[17\]J\. Qin, W\. Liu, W\. Zhang, and Y\. Yu\(2025\)D2K: turning historical data into retrievable knowledge for recommender systems\.InProceedings of the ACM on Web Conference 2025,WWW ’25,New York, NY, USA,pp\. 472–482\.External Links:ISBN 9798400712746,[Link](https://doi.org/10.1145/3696410.3714664),[Document](https://dx.doi.org/10.1145/3696410.3714664)Cited by:[§IV\-C](https://arxiv.org/html/2607.00010#S4.SS3.p1.1)\.
- \[18\]K\. Ramnath, K\. Zhou, S\. Guan, S\. S\. Mishra, X\. Qi, Z\. Shen, S\. Wang, S\. Woo, S\. Jeoung, Y\. Wang,et al\.\(2025\)A systematic survey of automatic prompt optimization techniques\.arXiv preprint arXiv:2502\.16923\.Cited by:[§II\-B](https://arxiv.org/html/2607.00010#S2.SS2.p1.1)\.
- \[19\]I\. Sekulić, M\. Alinannejadi, and F\. Crestani\(2024\)Analysing utterances in LLM\-based user simulation for conversational search\.ACM Transactions on Intelligent Systems and Technology15,pp\. 1–22\.External Links:[Document](https://dx.doi.org/10.1145/3650041)Cited by:[§II\-D](https://arxiv.org/html/2607.00010#S2.SS4.p1.1)\.
- \[20\]W\. Sun, S\. Zhang, K\. Balog, Z\. Ren, P\. Ren, Z\. Chen, and M\. De Rijke\(2021\)Simulating user satisfaction for the evaluation of task\-oriented dialogue systems\.InProceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval,External Links:[Document](https://dx.doi.org/10.1145/3404835.3463241)Cited by:[§IV](https://arxiv.org/html/2607.00010#S4.p1.1)\.
- \[21\]L\. Wang, J\. Zhang, H\. Yang, Z\. Chen, J\. Tang, Z\. Zhang, X\. Chen, Y\. Lin, H\. Sun, R\. Song, W\. Zhao, J\. Xu, Z\. Dou, J\. Wang, and J\. Wen\(2024\)User behavior simulation with large language model\-based agents for recommender systems\.ACM Transactions on Information Systems\.External Links:[Document](https://dx.doi.org/10.1145/3708985)Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p3.1),[§II\-A](https://arxiv.org/html/2607.00010#S2.SS1.p1.1)\.
- \[22\]Q\. Wang, J\. Wu, Z\. Tang, B\. Luo, N\. Chen, W\. Chen, and B\. He\(2025\)What limits llm\-based human simulation: llms or our design?\.CoRRabs/2501\.08579\.External Links:[Link](https://doi.org/10.48550/arXiv.2501.08579),[Document](https://dx.doi.org/10.48550/ARXIV.2501.08579),2501\.08579Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p8.1)\.
- \[23\]Y\. Wang, Z\. Jiang, Z\. Chen, F\. Yang, Y\. Zhou, E\. Cho, X\. Fan, Y\. Lu, X\. Huang, and Y\. Yang\(2024\)RecMind: large language model powered agent for recommendation\.InFindings of the Association for Computational Linguistics: NAACL 2024,pp\. 4351–4364\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-naacl.271),[Link](https://aclanthology.org/2024.findings-naacl.271/)Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p1.1)\.
- \[24\]W\. Wei, J\. Tang, L\. Xia, Y\. Jiang, and C\. Huang\(2024\)PromptMM: multi\-modal knowledge distillation for recommendation with prompt\-tuning\.InProceedings of the ACM Web Conference 2024,WWW ’24,New York, NY, USA,pp\. 3217–3228\.External Links:ISBN 9798400701719,[Link](https://doi.org/10.1145/3589334.3645359),[Document](https://dx.doi.org/10.1145/3589334.3645359)Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p9.1)\.
- \[25\]L\. Wu, Z\. Zheng, Z\. Qiu, H\. Wang, H\. Gu, T\. Shen, C\. Qin, C\. Zhu, H\. Zhu, Q\. Liu,et al\.\(2024\)A survey on large language models for recommendation\.World Wide Web27\(5\),pp\. 60\.Cited by:[§II\-A](https://arxiv.org/html/2607.00010#S2.SS1.p1.1)\.
- \[26\]Z\. Wu\(2024\)Optimizing e\-commerce recommender systems: a comprehensive review of techniques and future directions\.Applied and Computational Engineering\.External Links:[Link](https://api.semanticscholar.org/CorpusID:274337327)Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p1.1)\.
- \[27\]W\. Xiang, H\. Zhu, S\. Lou, X\. Chen, Z\. Pan, Y\. Jin, S\. Chen, and L\. Sun\(2024\)Simuser: generating usability feedback by simulating various users interacting with mobile applications\.InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems,pp\. 1–17\.Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p3.1)\.
- \[28\]S\. Yoon, Z\. He, J\. M\. Echterhoff, and J\. J\. McAuley\(2024\)Evaluating large language models as generative user simulators for conversational recommendation\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), NAACL 2024, Mexico City, Mexico, June 16\-21, 2024,Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p4.1),[§I](https://arxiv.org/html/2607.00010#S1.p6.1),[§I](https://arxiv.org/html/2607.00010#S1.p7.1),[§II\-A](https://arxiv.org/html/2607.00010#S2.SS1.p1.1),[§III\-A](https://arxiv.org/html/2607.00010#S3.SS1.p1.1),[§III\-C](https://arxiv.org/html/2607.00010#S3.SS3.p1.1),[§IV\-D](https://arxiv.org/html/2607.00010#S4.SS4.p1.1),[§IV\-F1](https://arxiv.org/html/2607.00010#S4.SS6.SSS1.p4.1),[§IV](https://arxiv.org/html/2607.00010#S4.p1.1)\.
- \[29\]M\. Yuksekgonul, F\. Bianchi, J\. Boen, S\. Liu, P\. Lu, Z\. Huang, C\. Guestrin, and J\. Zou\(2025\)Optimizing generative ai by backpropagating language model feedback\.Nature639\(8055\),pp\. 609–616\.Cited by:[§II\-B](https://arxiv.org/html/2607.00010#S2.SS2.p1.1),[§III](https://arxiv.org/html/2607.00010#S3.p1.1)\.
- \[30\]Q\. Zhan, X\. Guo, H\. Ji, and L\. Wu\(2023\-07\)User simulator assisted open\-ended conversational recommendation system\.InProceedings of the 5th Workshop on NLP for Conversational AI \(NLP4ConvAI 2023\),Y\. Chen and A\. Rastogi \(Eds\.\),Toronto, Canada,pp\. 89–101\.External Links:[Link](https://aclanthology.org/2023.nlp4convai-1.8/),[Document](https://dx.doi.org/10.18653/v1/2023.nlp4convai-1.8)Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p9.1),[§II\-A](https://arxiv.org/html/2607.00010#S2.SS1.p1.1)\.
- \[31\]A\. Zhang, Y\. Chen, L\. Sheng, X\. Wang, and T\. S\. Chua\(2024\)On generative agents in recommendation\.InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 1807–1817\.Cited by:[§II\-A](https://arxiv.org/html/2607.00010#S2.SS1.p1.1)\.
- \[32\]D\. Zhang, Y\. Geng, W\. Gong, Z\. Qi, Z\. Chen, X\. Tang, Y\. Shan, Y\. Dong, and J\. Tang\(2024\)RecDCL: dual contrastive learning for recommendation\.InProceedings of the ACM Web Conference 2024,WWW ’24,New York, NY, USA,pp\. 3655–3666\.External Links:ISBN 9798400701719,[Link](https://doi.org/10.1145/3589334.3645533),[Document](https://dx.doi.org/10.1145/3589334.3645533)Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p1.1)\.
- \[33\]E\. Zhang, X\. Wang, P\. Gong, Y\. Lin, and J\. Mao\(2024\)Usimagent: large language models for simulating search users\.InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 2687–2692\.Cited by:[§II\-C](https://arxiv.org/html/2607.00010#S2.SS3.p1.1),[§IV](https://arxiv.org/html/2607.00010#S4.p1.1)\.
- \[34\]J\. Zhang, R\. Xie, Y\. Hou, W\. Zhao, L\. Lin, and J\. Wen\(2023\)Recommendation as instruction following: a large language model empowered recommendation approach\.ACM Transactions on Information Systems\.External Links:[Document](https://dx.doi.org/10.1145/3708882)Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p1.1)\.
- \[35\]S\. Zhang and K\. Balog\(2020\)Evaluating conversational recommender systems via user simulation\.InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,External Links:[Document](https://dx.doi.org/10.1145/3394486.3403202)Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p1.1),[§I](https://arxiv.org/html/2607.00010#S1.p6.1),[§I](https://arxiv.org/html/2607.00010#S1.p8.1),[§III\-A](https://arxiv.org/html/2607.00010#S3.SS1.p1.1)\.
- \[36\]Z\. Zhang, S\. Liu, Z\. Liu, R\. Zhong, Q\. Cai, X\. Zhao, C\. Zhang, Q\. Liu, and P\. Jiang\(2025\)LLM\-powered user simulator for recommender system\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 13339–13347\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v39i12.33456)Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p4.1),[§II\-A](https://arxiv.org/html/2607.00010#S2.SS1.p1.1),[§IV\-C](https://arxiv.org/html/2607.00010#S4.SS3.p1.1)\.
- \[37\]X\. Zhao, L\. Xia, L\. Zou, H\. Liu, D\. Yin, and J\. Tang\(2021\)UserSim: user simulation via supervised generative adversarial network\.InProceedings of the Web Conference 2021,External Links:[Document](https://dx.doi.org/10.1145/3442381.3450125)Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p4.1),[§I](https://arxiv.org/html/2607.00010#S1.p8.1),[§III\-A](https://arxiv.org/html/2607.00010#S3.SS1.p1.1)\.
- \[38\]L\. Zhu, X\. Huang, and J\. Sang\(2024\)How reliable is your simulator? analysis on the limitations of current llm\-based user simulators for conversational recommendation\.InCompanion Proceedings of the ACM Web Conference 2024,pp\. 1726–1732\.Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p4.1),[§I](https://arxiv.org/html/2607.00010#S1.p7.1),[§II\-C](https://arxiv.org/html/2607.00010#S2.SS3.p1.1),[§II\-D](https://arxiv.org/html/2607.00010#S2.SS4.p1.1),[§IV\-C](https://arxiv.org/html/2607.00010#S4.SS3.p1.1),[§IV](https://arxiv.org/html/2607.00010#S4.p1.1)\.
- \[39\]L\. Zhu, X\. Huang, and J\. Sang\(2025\)A LLM\-based controllable, scalable, human\-involved user simulator framework for conversational recommender systems\.InProceedings of the ACM on Web Conference 2025,pp\. 4653–4661\.Cited by:[§I](https://arxiv.org/html/2607.00010#S1.p4.1),[§I](https://arxiv.org/html/2607.00010#S1.p9.1),[§II\-C](https://arxiv.org/html/2607.00010#S2.SS3.p1.1),[§IV\-C](https://arxiv.org/html/2607.00010#S4.SS3.p1.1),[§IV](https://arxiv.org/html/2607.00010#S4.p1.1)\.
- \[40\]Y\. Zhuang, H\. Sun, Y\. Yu, R\. Qiang, Q\. Wang, C\. Zhang, and B\. Dai\(2024\)HYDRA: model factorization framework for black\-box LLM personalization\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=CKgNgKmHYp)Cited by:[§IV\-D](https://arxiv.org/html/2607.00010#S4.SS4.p1.1),[§IV](https://arxiv.org/html/2607.00010#S4.p1.1)\.