Communication Policy Evolution for Proactive LLM Agents

arXiv cs.AI Papers

Summary

This paper formalizes communication policy for LLM agents and proposes Communication Policy Evolution (CPE), a self-evolution framework that refines communication policies through rollout and prompt-level evolving, achieving best task success across multiple settings.

arXiv:2606.14314v1 Announce Type: new Abstract: LLM agents have rapidly evolved into autonomous systems, yet a persistent information gap remains between users and agents: communication is costly, while users' identical preferences further limit information exchange. To investigate how agents should communicate across modalities, this paper formalizes Communication Policy, establishes textual and UI-based policies, and then evaluates communication policies across diverse environments, personas, and model combinations. Building information asymmetry for proactive agents, we set up two complementary settings, User-Agent and Planner-Executor. Experimental results reveal complementary strengths between interaction channels: text-based interaction often facilitates task performance, while structured UI improves agents' response quality and persona compliance. Motivated by that, a hybrid method combines these advantages. We further propose Communication Policy Evolution (CPE), a self-evolution framework for refining communication policies through rollout and prompt-level evolving. Without model modification, CPE achieves the best task success across multiple settings using prompt refinement alone. Our findings identify communication behavior as a critical yet underexplored design dimension for LLM agents.
Original Article
View Cached Full Text

Cached at: 06/15/26, 09:11 AM

# Communication Policy Evolution for Proactive LLM Agents
Source: [https://arxiv.org/html/2606.14314](https://arxiv.org/html/2606.14314)
Xinbei Ma1, Jiyang Qiu1,∗, Yao Yao1, Zheng Wu1, Yijie Lu1, Xiangmou Qu2, Jiaxin Yin2, Xingyu Lou2,†, Jun Wang2,†, Weiwen Liu1, Weinan Zhang1, Zhuosheng Zhang1,†, Hai Zhao1 1Shanghai Jiao Tong University,2OPPO Research Institute \{sjtumaxb, qiujiyang, zhangzs\}@sjtu\.edu\.cn,zhaohai@cs\.sjtu\.edu\.cn louxingyu@oppo\.com,junwang\.lu@gmail\.com

###### Abstract

LLM agents have rapidly evolved into autonomous systems, yet a persistent information gap remains between users and agents: communication is costly, while users’ identical preferences further limit information exchange\. To investigate*how*agents should communicate across modalities, this paper formalizes*Communication Policy*, establishes textual and UI\-based policies, and then evaluates communication policies across diverse environments, personas, and model combinations\. Building information asymmetry for proactive agents, we set up two complementary settings,*User–Agent*and*Planner–Executor*\. Experimental results reveal complementary strengths between interaction channels: text\-based interaction often facilitates task performance, while structured UI improves agents’ response quality and persona compliance\. Motivated by that, a hybrid method combines these advantages\. we further propose*Communication Policy Evolution*\(CPE\), a self\-evolution framework for refining communication policies through rollout and prompt\-level evolving\. Without model modification, CPE achieves the best task success across multiple settings using prompt refinement alone\. Our findings identify communication behavior as a critical yet underexplored design dimension for LLM agents\.

Communication Policy Evolution for Proactive LLM Agents

Xinbei Ma1††thanks:Equal contribution\. Work done during internship at OPPO\., Jiyang Qiu1,∗, Yao Yao1, Zheng Wu1, Yijie Lu1,Xiangmou Qu2, Jiaxin Yin2, Xingyu Lou2,†, Jun Wang2,†,Weiwen Liu1, Weinan Zhang1, Zhuosheng Zhang1,†, Hai Zhao1††thanks:Corresponding authors\.1Shanghai Jiao Tong University,2OPPO Research Institute\{sjtumaxb, qiujiyang, zhangzs\}@sjtu\.edu\.cn,zhaohai@cs\.sjtu\.edu\.cnlouxingyu@oppo\.com,junwang\.lu@gmail\.com,

## 1Introduction

Large language model \(LLM\) agents have rapidly evolved into autonomous systems capable of reasoning, tool use, and extended interaction with users and environments\(Yaoet al\.,[2022](https://arxiv.org/html/2606.14314#bib.bib64); Weiet al\.,[2022](https://arxiv.org/html/2606.14314#bib.bib65); Patilet al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib66); Qinet al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib67); Wanget al\.,[2025b](https://arxiv.org/html/2606.14314#bib.bib20)\)\. Yet despite these advances, a fundamental bottleneck remains: users often hold the complete task they want to achieve, but natural language cannot fully convey all constraints, preferences, and edge cases at once\. Important information emerges only gradually through interaction\. Therefore, agents always have access topartial information, where successful task completion depends not only on reasoning and execution, but also on the agent’s ability to recover missing information through interaction\(Fang and Ke,[2025](https://arxiv.org/html/2606.14314#bib.bib76)\)\. The challenge is further compounded by differences in user expertise, patience, and communication style, all of which directly shape interaction quality\(Bhattacharjeeet al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib74); Wu and Osawa,[2024](https://arxiv.org/html/2606.14314#bib.bib75)\)\.

Proactive agents actively acquire missing information through clarification dialogues\(Denget al\.,[2023](https://arxiv.org/html/2606.14314#bib.bib30); Liaoet al\.,[2023](https://arxiv.org/html/2606.14314#bib.bib31)\), proactive planning\(Zhanget al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib7)\), and sequential decision\-making under uncertainty\(Suriet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib10); Huanget al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib9)\)\. However, this line of work focuses primarily on*what*the agent should ask, while largely overlooking an equally important question:*how should the agent communicate?*In practice, information recovery is fundamentally shaped by the communication channel itself\(Sachdevaet al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib80)\)\. Free\-form natural language enables flexible and open\-ended interaction, while structured interfaces constrain user responses into organized and lower\-ambiguity formats\. Choosing between these interaction channels therefore becomes a critical decision for effective task\-state alignment\.

Generative UI uses LLMs to produce HTML and associated code, which is then rendered as multimodal interfaces for users to view and interact with\. Recent work shows that structured interfaces can substantially improve information collection quality\(Chenet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib62)\): compared to unconstrained free\-text interaction, structured forms shift users from recall to recognition, improving precision through input constraints and visual organization\(Weiet al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib44); Anbalaganet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib78)\)\. Moreover, HTML interfaces can now be generated fluently by LLMs themselves\(Caoet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib43); Nandyet al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib83)\), and Agent–Computer Interface research further highlights the value of structured interaction surfaces for LLM agents\(Jimenezet al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib84)\)\. However, existing work primarily aims to improve Generative UI itself rather than deploy it in real\-world applications\.

In this paper, we formalize the channel\-selection decision asCommunication Policy\. We equip agents with two communication primitives:ask\_questionfor free\-form language interaction andgenerate\_uifor HTML\-based forms\. This defines two single\-channel settings,MtextM\_\{\\mathrm\{text\}\}andMuiM\_\{\\mathrm\{ui\}\}, and a hybrid setting,MhybridM\_\{\\mathrm\{hybrid\}\}, where the agent dynamically selects between the two channels\. Across four environments and diverse user personas, we find that text and UI have complementary strengths: text drives task completion, while UI improves response quality and persona compliance\. Hybrid access achieves the best overall results in most settings, though the optimal strategy depends on task structure and persona\. To optimize communication policy automatically, we proposeCommunicationPolicyEvolution \(CPE\)\. In each round, CPE evaluates the current policy on a training batch, prompts an LLM to analyze the rollout results and propose targeted edits, and accepts or rejects the candidate via a two\-stage gate that guarantees monotonic improvement on held\-out data\. The optimized policies achieve best task completion across all evaluated configurations, using only prompt\-level refinement without modifying model weights\.

Our contributions are three\-fold:

1. 1\.We identify channel selection as a fundamental problem in LLM agent interaction and formalize Communication Policy for hybrid text\+UI communication\.
2. 2\.We systematically evaluate hybrid communication under partial information, showing that text and UI interaction exhibit complementary strengths across tasks and users\.
3. 3\.We propose CPE, a self\-evolution framework that discovers effective communication policies through iterative rollout analysis, achieving best productivity via prompt optimization\.

## 2Related Work

#### LLM Agents’ Proactivity

Proactive LLM agents actively seek missing information rather than passively following incomplete instructions\. Work in this area spans prompting LLMs for clarification dialogues\(Denget al\.,[2023](https://arxiv.org/html/2606.14314#bib.bib30)\), structured clarification in dialogue systems\(Sahayet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib34); Siroet al\.,[2026](https://arxiv.org/html/2606.14314#bib.bib13)\), proactive planning where agents ask before acting\(Zhanget al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib7)\), anticipating user needs from environmental events\(Luet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib8)\), sequential decision\-making under uncertainty\(Suriet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib10); Huanget al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib9)\), and open\-ended proactive assistance\(Abbaset al\.,[2026](https://arxiv.org/html/2606.14314#bib.bib36)\)\. Despite this diversity, these efforts focus on*what*to ask: the communication channel itself,*how*the agent asks, remains unexamined\.

#### User\-centric Agent

Designs of user\-centric agents focus on how agents adapt to individual user preferences instead of interacting with all users in the same way\. Persona\-conditioned user simulators embed diverse interaction styles into LLM\-based user proxies, enabling controlled evaluation without costly human studies\(Douet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib57); Gromadaet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib15); Samuelet al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib16); Wanget al\.,[2025a](https://arxiv.org/html/2606.14314#bib.bib14)\), and power benchmarks probing personalization and human–agent conversational gaps\(Haoet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib52); Wanget al\.,[2025b](https://arxiv.org/html/2606.14314#bib.bib20)\)\. Personalization techniques leverage explicit user profiles, latent preference models, or curiosity\-driven rewards to tailor responses\(Liet al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib54); Qiuet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib55); Shiet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib56); Wanet al\.,[2026](https://arxiv.org/html/2606.14314#bib.bib59)\)\. Building on this line of work, we introduce the communication channel itself as a design variable, investigating whether structured UI\-based interaction can improve an agent’s ability to align with user personas\.

![Refer to caption](https://arxiv.org/html/2606.14314v1/x1.png)Figure 1:Overview of our communication policy formulation and evolution\.\(a\)A full task specificationzzcontains information dimensions with different sensitivity costs, while the agent observes only a vague versionz~=vague​\(z\)\\tilde\{z\}=\\text\{vague\}\(z\)\.\(b\)We study two settings: User–Agent interaction and Planner–Executor interaction, where the agent/executor interacts with both a simulator and the environment\.\(c\)We compare text\-onlyMuiM\_\{\\text\{ui\}\}, UI\-onlyMuiM\_\{\\text\{ui\}\}, and hybrid communication modesMhybridM\_\{\\text\{hybrid\}\}, selecting channels viaπcomm\\pi\_\{\\text\{comm\}\}at each turn\. The simulator records the disclosed Cost level and optional persona\-alignment Reward\.
#### Test\-time Evolving

Recent test\-time optimization methods enable LLM agents to improve behavior through interaction experience without parameter updates\. Existing approaches include black\-box prompt optimization\(Yanget al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib21); Wanget al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib22)\), textual gradient descent\(Yuksekgonulet al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib60)\), trace\-driven reflection with gating\(Agrawalet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib23); Yiet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib24)\), and multi\-agent optimization\(Zhanget al\.,[2026](https://arxiv.org/html/2606.14314#bib.bib26)\)\. Related self\-reflective agents update memory from failure traces\(Shinnet al\.,[2023](https://arxiv.org/html/2606.14314#bib.bib28); Zhaoet al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib61)\), although their effectiveness often depends on the quality of verification signals\(Huanget al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib29)\)\. While these methods optimize prompts to improve agents’ task execution capabilities, our work instead focuses on optimizing communication behavior, training agents to decide when to communicate through natural language and when to rely on structured UI interactions in order to recover user intent under partial information\.

## 3Communication Policy Evaluation

### 3\.1Problem Formulation

We consider an interaction setting with three components: a user, an agent, and an environment\. An achievable task is fully specified by a*task specification*zz, which contains all information necessary for successful execution and is held by the proposer, e\.g\., the user\. Due to information loss during communication, the agent only observes a*vague specification*z~\\tilde\{z\}, in which part of the task\-relevant information inzzis missing or underspecified\.

z~=vague​\(z\),where​\|z~\|≪\|z\|\.\\tilde\{z\}=\\text\{vague\}\(z\),\\quad\\text\{where \}\|\\tilde\{z\}\|\\ll\|z\|\.\(1\)
The interaction proceeds in stepst=1,2,…t=1,2,\\ldots\. At steptt, the agent holds a beliefbtb\_\{t\}aboutzz, initialized fromz~\\tilde\{z\}, and selects a communication actionata\_\{t\}from the action space determined by the interaction mode \(§[3\.2](https://arxiv.org/html/2606.14314#S3.SS2)\)\. The user seesata\_\{t\}together with the interaction historyh<t=\(a1,d1,…,at−1,dt−1\)h\_\{<t\}=\(a\_\{1\},d\_\{1\},\\ldots,a\_\{t\-1\},d\_\{t\-1\}\), and produces a disclosure:

dt∼ℳ​\(z,at,h<t\),d\_\{t\}\\sim\\mathcal\{M\}\(z,a\_\{t\},h\_\{<t\}\),\(2\)whereℳ\\mathcal\{M\}is the user’s response policy\. The agent updates its belief:

bt\+1=Update​\(bt,dt\)\.b\_\{t\+1\}=\\text\{Update\}\(b\_\{t\},d\_\{t\}\)\.\(3\)
The interaction ends when the agent issues a task\-execution action\. The agent’s objective is to choose communication actions that maximize task success while keeping the disclosure cost in check:

max\{at\}t=1T⁡𝒯​\(oT;z\)s\.t\.∑t=1Tct≤Cmax,\\max\_\{\\\{a\_\{t\}\\\}\_\{t=1\}^\{T\}\}\\;\\mathcal\{T\}\(o\_\{T\};z\)\\quad\\text\{s\.t\.\}\\quad\\sum\_\{t=1\}^\{T\}c\_\{t\}\\leq C\_\{\\max\},\(4\)whereTTis the terminal step,oTo\_\{T\}is the output produced from the final beliefbTb\_\{T\},𝒯​\(⋅;z\)\\mathcal\{T\}\(\\cdot;z\)is an environment\-specific success function \(e\.g\., patch correctness, terminal reward, database\-state match\), andctc\_\{t\}is the sensitivity cost of disclosuredtd\_\{t\}\. Effective communication in this setting requires agents to complete tasks successfully \(*productivity*\), proactively recover missing information \(*proactivity*\), and adapt to user\-specific interaction preferences \(*personalization*\)Sunet al\.\([2025](https://arxiv.org/html/2606.14314#bib.bib92)\)\.

### 3\.2Communication Modes

We first introduce two communication primitives beyond the task actions𝒜env\\mathcal\{A\}\_\{\\text\{env\}\}\. The primitiveask\_questionsends a natural\-language query and receives a free\-text response, whilegenerate\_uirenders an HTML form as a screenshot for the user to complete\. Based on these primitives, we define two single\-channel modes and one hybrid mode:

𝒜text\\displaystyle\\mathcal\{A\}\_\{\\text\{text\}\}=𝒜env∪\{ask\_question\},\\displaystyle=\\mathcal\{A\}\_\{\\text\{env\}\}\\cup\\\{\\texttt\{ask\\\_question\}\\\},\(5\)𝒜ui\\displaystyle\\mathcal\{A\}\_\{\\text\{ui\}\}=𝒜env∪\{generate\_ui\},\\displaystyle=\\mathcal\{A\}\_\{\\text\{env\}\}\\cup\\\{\\texttt\{generate\\\_ui\}\\\},𝒜hybrid\\displaystyle\\mathcal\{A\}\_\{\\text\{hybrid\}\}=𝒜env∪\{ask\_question,generate\_ui\}\.\\displaystyle=\\mathcal\{A\}\_\{\\text\{env\}\}\\cup\\\{\\texttt\{ask\\\_question\},\\texttt\{generate\\\_ui\}\\\}\.
We denote the resulting modes asMtextM\_\{\\text\{text\}\},MuiM\_\{\\text\{ui\}\}, andMhybridM\_\{\\text\{hybrid\}\}, respectively\.MtextM\_\{\\text\{text\}\}andMuiM\_\{\\text\{ui\}\}are single\-channel modes: the former supports flexible free\-text interaction, while the latter supports structured information collection through fixed fields\.MhybridM\_\{\\text\{hybrid\}\}further combines both primitives, allowing the agent to choose between text and UI communication at each turn\. We formalize this channel\-selection mechanism as the*Communication Policy*\. Figure[1](https://arxiv.org/html/2606.14314#S2.F1)summarizes the three modes and the interaction loop\.

### 3\.3Two Interaction Settings

We design two settings that isolate different sources of the information gap:

Setting A: User–Agent\.This is our primary setting, designed to approximate realistic user\-facing agent interaction\. The user is simulated by aUser Simulator, an LLM parameterized by a persona \(§[D](https://arxiv.org/html/2606.14314#A4)\)\. The user holds the full task specificationzz, while the agent receives only the vague specificationz~\\tilde\{z\}and must recover missing information through communication\. The persona captures user\-specific interaction preferences and shapes disclosure behavior, making responses potentially uncertain, incomplete, indirect, or evolving over time\. Accordingly, this setting evaluates task completion, proactive recovery of missing information, and adaptation to user preferences\. The fullUser Simulatorsystem prompt is detailed in §[B](https://arxiv.org/html/2606.14314#A2)\.

Setting B: Planner–Executor\.This auxiliary setting abstracts away subjective user preferences and focuses on agent–agent collaboration\. ThePlanner Simulatorholds the full task specificationzzand provides global planning guidance, while the executor receives onlyz~\\tilde\{z\}and must query the planner to recover missing execution details\. Since both roles are agents, no persona is introduced; the planner discloses information objectively under the cost schema\. Thus, this setting contains an information gap between planner and executor, but removes personalization\-related factors and evaluates only collaborative task completion\. The fullPlanner Simulatorsystem prompt is detailed in §[C](https://arxiv.org/html/2606.14314#A3)\.

## 4Communication Policy Evolution

### 4\.1The Communication Policy

WhileMhybridM\_\{\\text\{hybrid\}\}provides access to both communication channels and shows strong potential, it does not specify how the channels should be used\. To further exploit the hybrid setting, we proposeCommunication Policy Evolution\(CPE\), a training\-free prompt evolving method, which optimizes this policy to guide when the agent should use text or UI interaction\.

πcomm=\(system\_prompt,examples,appendixes\)\.\\pi\_\{\\text\{comm\}\}=\(\\text\{system\\\_prompt\},\\;\\text\{examples\},\\;\\text\{appendixes\}\)\.
The system prompt specifies the available tools, their invocation formats, and heuristics for channel selection\. The examples provide few\-shot demonstrations of choosing betweenask\_questionandgenerate\_uiunder varied conditions\. The appendixes carry environment\-specific guidance\. The policy is rendered to the agent at the start of each episode; optimizing it reduces to rewriting text informed by rollout results\. Example prompts for all four benchmarks are provided in Appendix[F](https://arxiv.org/html/2606.14314#A6)\.

![Refer to caption](https://arxiv.org/html/2606.14314v1/x2.png)Figure 2:Communication Policy Evolution \(CPE\)\. Each round: \(1\)Evaluatethe current policyπ\\pion a batchℬr\\mathcal\{B\}\_\{r\}; \(2\)Evolve: the LLM analyzes hybrid\-only signals \(scores, trajectories, task specs, current policy, patch history\) and proposes a JSON patchΔ​π\\Delta\\pi; \(3\)Mutateby applyingΔ​π\\Delta\\pias text overrides to produce candidateπ′\\pi^\{\\prime\}; \(4\)Selectvia two\-stage gating: candidate must beatπ\\pionℬr\\mathcal\{B\}\_\{r\}\(train accept\), then beatπ∗\\pi^\{\*\}on𝒟val\\mathcal\{D\}\_\{\\text\{val\}\}\(val accept\)\. Val\-accepted candidates updateπ∗\\pi^\{\*\}, guaranteeing monotonic improvement\.
### 4\.2Policy Optimization for Communication Policy Evolution

CPE \(Figure[2](https://arxiv.org/html/2606.14314#S4.F2)\) evolves the communication policyπcomm\\pi\_\{\\text\{comm\}\}by iterative refinement against rollout results\. Given a task distribution𝒟\\mathcal\{D\}, an initial communication policyπ0\\pi\_\{0\}, and a total round budgetRR, CPE performs prompt\-level optimization overπcomm\\pi\_\{\\text\{comm\}\}\. It rewrites only the communication\-policy prompt, while leaving the underlying agent and user\-simulator model weights unchanged\.

Objective and regret\.For a task\-execution agent, completing the task remains the primary goal\. We therefore optimize for productivity, defining the CPE objective as the mean productivity score over the task distribution:

J​\(πcomm;𝒟\)=1\|𝒟\|​∑i∈𝒟sprod\(i\)\.J\(\\pi\_\{\\text\{comm\}\};\\mathcal\{D\}\)=\\frac\{1\}\{\|\\mathcal\{D\}\|\}\\sum\_\{i\\in\\mathcal\{D\}\}s\_\{\\text\{prod\}\}^\{\(i\)\}\.\(6\)
Proactivity and personalization are reported throughout to make the trade\-off visible: when communication policy is optimized for productivity, both interaction scores may rise \(better channel choices elicit higher\-quality responses\) or dip \(the policy trades interaction turns for execution speed\)\. We treat this tension as a descriptive finding rather than a multi\-objective optimization problem\.

Giving the agent both channels does not tell it how to use them\. When hybrid underperforms a single\-channel baseline, the agent is worse off than if it had been restricted to one channel\. We define theinstance\-level regretto quantify this gap:

r\(i\)=max⁡\(stext\(i\),sui\(i\)\)−shybrid\(i\),r^\{\(i\)\}=\\max\\bigl\(s\_\{\\text\{text\}\}^\{\(i\)\},\\,s\_\{\\text\{ui\}\}^\{\(i\)\}\\bigr\)\-s\_\{\\text\{hybrid\}\}^\{\(i\)\},\(7\)wherer\(i\)\>0r^\{\(i\)\}\>0means hybrid underperformed a single\-channel baseline, and minimizing regret is equivalent to maximizingJ​\(πcomm;𝒟\)J\(\\pi\_\{\\text\{comm\}\};\\mathcal\{D\}\)\.

Iterative refinement\.Each round of CPE proceeds in four steps:

1. 1\.Evaluate\.The current policyπcomm\\pi\_\{\\text\{comm\}\}is rolled out on a batchℬ⊆𝒟train\\mathcal\{B\}\\subseteq\\mathcal\{D\}\_\{\\text\{train\}\}ofKKepisodes, producing per\-episode scores and interaction trajectoriesτ\(i\)=\{\(at,ot,ct\)\}t=1Ti\\tau^\{\(i\)\}=\\\{\(a\_\{t\},o\_\{t\},c\_\{t\}\)\\\}\_\{t=1\}^\{T\_\{i\}\}, whereat∈\{ask\_question,generate\_ui\}a\_\{t\}\\in\\\{\\texttt\{ask\\\_question\},\\texttt\{generate\\\_ui\}\\\}is the agent’s channel choice at turntt,oto\_\{t\}is the user’s response, andctc\_\{t\}is the Cost incurred \(Eq\.[4](https://arxiv.org/html/2606.14314#S3.E4)\)\.
2. 2\.Evolve\.The same LLM is prompted to self\-evolve its communication policy: it analyzes the batch evaluation results and proposes targeted edits\. The signal provided to the LLM is detailed in §[4\.3](https://arxiv.org/html/2606.14314#S4.SS3)\. From these signals, the LLM produces a structured JSON patch: Δπcomm=Evolve\(\\displaystyle\\Delta\\pi\_\{\\text\{comm\}\}=\\text\{Evolve\}\\bigl\(πcomm,\\displaystyle\\pi\_\{\\text\{comm\}\},\(8\)\{sprod\(i\),spro\(i\),spers\(i\)\}i∈ℬ,\\displaystyle\\\{s\_\{\\text\{prod\}\}^\{\(i\)\},s\_\{\\text\{pro\}\}^\{\(i\)\},s\_\{\\text\{pers\}\}^\{\(i\)\}\\\}\_\{i\\in\\mathcal\{B\}\},\{τ\(i\),z\(i\)\}i∈ℬ,ℋ\),\\displaystyle\\\{\\tau^\{\(i\)\},z^\{\(i\)\}\\\}\_\{i\\in\\mathcal\{B\}\},\\;\\mathcal\{H\}\\bigr\),whereΔ​πcomm\\Delta\\pi\_\{\\text\{comm\}\}is a JSON patch applied over the current policy text, overwriting or appending to its components\.
3. 3\.Mutate\.The patch is applied to produce a candidate policyπcomm′\\pi\_\{\\text\{comm\}\}^\{\\prime\}\.
4. 4\.Select\.πcomm′\\pi\_\{\\text\{comm\}\}^\{\\prime\}is evaluated on the same batchℬ\\mathcal\{B\}\. The candidate is accepted if: J​\(πcomm′;ℬ\)\>J​\(πcomm;ℬ\)\+ϵ\.J\(\\pi\_\{\\text\{comm\}\}^\{\\prime\};\\mathcal\{B\}\)\>J\(\\pi\_\{\\text\{comm\}\};\\mathcal\{B\}\)\+\\epsilon\.\(9\)If rejected, the overrides are rolled back to the pre\-round state and the policy remainsπcomm\\pi\_\{\\text\{comm\}\}\.

Validation gating and monotonicity\.To prevent overfitting to individual batches,𝒟\\mathcal\{D\}is partitioned into𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}and𝒟val\\mathcal\{D\}\_\{\\text\{val\}\}, and a best\-so\-far policyπcomm∗\\pi\_\{\\text\{comm\}\}^\{\*\}with scoreJ∗J^\{\*\}is maintained across rounds\. Batchesℬr⊆𝒟train\\mathcal\{B\}\_\{r\}\\subseteq\\mathcal\{D\}\_\{\\text\{train\}\}of sizeKKare drawn by epoch\-style sampling without replacement: the pool is shuffled, contiguous blocks are drawn sequentially, and the pool is reshuffled upon exhaustion\. When a candidate passes the train accept condition \(Eq\.[9](https://arxiv.org/html/2606.14314#S4.E9)\), it is evaluated on the full𝒟val\\mathcal\{D\}\_\{\\text\{val\}\}:

J​\(πcomm′;𝒟val\)\>J∗\+ϵ\.J\(\\pi\_\{\\text\{comm\}\}^\{\\prime\};\\mathcal\{D\}\_\{\\text\{val\}\}\)\>J^\{\*\}\+\\epsilon\.\(10\)
Only when this holds isπcomm∗\\pi\_\{\\text\{comm\}\}^\{\*\}replaced byπcomm′\\pi\_\{\\text\{comm\}\}^\{\\prime\}andJ∗J^\{\*\}updated\. By construction,J∗J^\{\*\}ismonotonically non\-decreasingover rounds: the best policy on held\-out data never degrades\. The working policyπcomm\\pi\_\{\\text\{comm\}\}may fluctuate as it tracks batch\-level variation, butπcomm∗\\pi\_\{\\text\{comm\}\}^\{\*\}records the strongest generalizing policy encountered\. Figure[2](https://arxiv.org/html/2606.14314#S4.F2)illustrates the procedure; the formal algorithm is provided in algorithm[1](https://arxiv.org/html/2606.14314#algorithm1)\.

Input:Task distribution

𝒟\\mathcal\{D\}, initial policy

π0\\pi\_\{0\}, rounds

RR, batch size

KK, tolerance

ϵ\\epsilon
Output:Best policy

π∗\\pi^\{\*\}
1

π←π0\\pi\\leftarrow\\pi\_\{0\};

π∗←π0\\pi^\{\*\}\\leftarrow\\pi\_\{0\};

J∗←J​\(π0;𝒟val\)J^\{\*\}\\leftarrow J\(\\pi\_\{0\};\\mathcal\{D\}\_\{\\text\{val\}\}\);

ℋ←∅\\mathcal\{H\}\\leftarrow\\emptyset;

2for*r=1r=1toRR*do

3

ℬr←SampleEpoch​\(𝒟train,K\)\\mathcal\{B\}\_\{r\}\\leftarrow\\text\{SampleEpoch\}\(\\mathcal\{D\}\_\{\\text\{train\}\},K\);

Jpre←J​\(π;ℬr\)J\_\{\\text\{pre\}\}\\leftarrow J\(\\pi;\\mathcal\{B\}\_\{r\}\);

//1\. Evaluate

Δ​π←Evolve​\(π,scores​\(ℬr\),trajectories​\(ℬr\),ℋ\)\\Delta\\pi\\leftarrow\\text\{Evolve\}\(\\pi,\\text\{scores\}\(\\mathcal\{B\}\_\{r\}\),\\text\{trajectories\}\(\\mathcal\{B\}\_\{r\}\),\\mathcal\{H\}\);

//2\. Evolve

π′←Apply​\(π,Δ​π\)\\pi^\{\\prime\}\\leftarrow\\text\{Apply\}\(\\pi,\\Delta\\pi\);

//3\. Mutate

Jpost←J​\(π′;ℬr\)J\_\{\\text\{post\}\}\\leftarrow J\(\\pi^\{\\prime\};\\mathcal\{B\}\_\{r\}\);

//4\. Select

4if*J*post*\>J*pre*\+ϵJ\_\{\\text\{post\}\}\>J\_\{\\text\{pre\}\}\+\\epsilon*then

5

π←π′\\pi\\leftarrow\\pi^\{\\prime\};

6

Jval←J​\(π′;𝒟val\)J\_\{\\text\{val\}\}\\leftarrow J\(\\pi^\{\\prime\};\\mathcal\{D\}\_\{\\text\{val\}\}\);

7

ℋ←ℋ∪\{\(Δ​π,Jval,Jval\>J∗\+ϵ\)\}\\mathcal\{H\}\\leftarrow\\mathcal\{H\}\\cup\\\{\(\\Delta\\pi,J\_\{\\text\{val\}\},J\_\{\\text\{val\}\}\>J^\{\*\}\+\\epsilon\)\\\};

8if*J*val*\>J∗\+ϵJ\_\{\\text\{val\}\}\>J^\{\*\}\+\\epsilon*then

9

π∗←π′\\pi^\{\*\}\\leftarrow\\pi^\{\\prime\},

J∗←JvalJ^\{\*\}\\leftarrow J\_\{\\text\{val\}\};

10

11end if

12

13end if

14

15end for

16return*π∗\\pi^\{\*\}*;

Algorithm 1Communication Policy Evolution \(CPE\)
### 4\.3Evolution Signals

The reflect LLM receives five categories of signals, all drawn fromMhybridM\_\{\\text\{hybrid\}\}rollouts:

- •Scores\.For eachi∈ℬi\\in\\mathcal\{B\}, all three per\-dimension scores\{sprod\(i\),spro\(i\),spers\(i\)\}\\\{s\_\{\\text\{prod\}\}^\{\(i\)\},s\_\{\\text\{pro\}\}^\{\(i\)\},s\_\{\\text\{pers\}\}^\{\(i\)\}\\\}underMhybridM\_\{\\text\{hybrid\}\}, withsprod\(i\)s\_\{\\text\{prod\}\}^\{\(i\)\}as the optimization target\.
- •Trajectories\.The full rollouts\{τ\(i\)\}\\\{\\tau^\{\(i\)\}\\\}: each agent communication action tagged by channel type \(ask\_questionorgenerate\_ui\), its content, and the user’s response including Cost annotations\.
- •Task specifications\.The system messages sent to the user, containing the ground\-truthzz\.
- •Current policy\.The complete text ofπcomm\\pi\_\{\\text\{comm\}\}, incorporating all edits accumulated from prior accepted rounds\.
- •Patch history\.The most recent accepted and rejected patches from prior rounds, each annotated with its validation outcome\.

## 5Experimental Setup

BenchmarkAgentUserMtextM\_\{\\text\{text\}\}MuiM\_\{\\text\{ui\}\}MhybridM\_\{\\text\{hybrid\}\}SWE\-benchQwen3\-32BQwen3\-VL\-32B\.035/\.250/\.355\.009/\.161/\.421\.040/\.134/\.454Seed\-OSS\-36BQwen3\-VL\-32B\.135/\.076/\.545\.120/\.156/\.502\.139/\.085/\.520GPT\-5\-miniQwen3\-VL\-32B\.031/\.129/\.472\.030/\.179/\.589\.043/\.138/\.667DeepSeek\-V3\.2Qwen3\-VL\-32B\.135/\.156/\.629\.113/\.196/\.705\.140/\.089/\.594TravelGymDeepSeek\-V3\.2GPT\-4o1\.437/\.045/\.9051\.249/\.260/\.9521\.233/\.145/\.856GPT\-5\-miniGPT\-4o1\.055/\.250/\.695\.998/\.470/\.8901\.055/\.250/\.710DeepSeek\-V3\.2Qwen3\-VL\-32B1\.365/\.220/\.8951\.416/\.135/\.9501\.514/\.120/\.915GPT\-5\-miniQwen3\-VL\-32B1\.013/\.345/\.850\.995/\.285/\.865\.906/\.195/\.766τ2\\tau^\{2\}\-benchDeepSeek\-V3\.2GPT\-4o\.270/\.230/\.650\.275/\.120/\.850\.270/\.015/\.610DeepSeek\-V3\.2Qwen3\-VL\-32B\.280/\.020/\.485\.235/\.035/\.555\.300/\.035/\.495GPT\-5\-miniQwen3\-VL\-32B\.375/\.015/\.347\.375/\.105/\.593\.340/\.135/\.470GPT\-5\-miniGPT\-4o\.425/\.045/\.643\.385/\.065/\.827\.405/\.075/\.563WebArenaDeepSeek\-V3\.2GPT\-4o\.205/\.628/\.865\.180/\.810/\.673\.195/\.695/\.766GPT\-5\-miniGPT\-4o\.200/\.525/\.737\.140/\.668/\.803\.180/\.315/\.752

Table 1:Mode comparison ofMtextM\_\{\\text\{text\}\},MuiM\_\{\\text\{ui\}\}, andMhybridM\_\{\\text\{hybrid\}\}across four benchmarks in the User–Agent setting\. Each cell reports productivity / proactivity / personalization, averaged over four personas\. Best value per metric per row is color\-coded:productivity,proactivity,personalization\.### 5\.1Environments

We evaluate across four environments spanning diverse domains and interaction settings: SWE\-bench \(software repair\)\(Jimenezet al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib84)\), TravelGym \(travel planning\)\(Qianet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib85)\),τ2\\tau^\{2\}\-bench \(customer service\)\(Barreset al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib86)\), and WebArena \(web navigation\)\(Zhouet al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib68)\)\. Together, these environments cover code editing, preference elicitation, conversational task completion, and browser\-based interaction under partial information\. Detailed environment specifications, task formulations, and the exact mapping fromzztoz~\\tilde\{z\}are provided in Appendix[E](https://arxiv.org/html/2606.14314#A5)\(Table[6](https://arxiv.org/html/2606.14314#A5.T6)\)\.

### 5\.2User and Personas

In Setting A \(User–Agent\), the user is simulated by an LLM that holds the full task specificationzz\(per\-benchmark schemas in Appendix[E](https://arxiv.org/html/2606.14314#A5)and[A](https://arxiv.org/html/2606.14314#A1)\), is assigned one of four personas \(Appendix[D](https://arxiv.org/html/2606.14314#A4)\), and replies to agent queries with Cost annotations following the disclosure rule of Eq\.[2](https://arxiv.org/html/2606.14314#S3.E2)\. Each disclosuredtd\_\{t\}is annotated with a Cost label indicating the highest sensitivity level revealed; the user follows a minimal\-disclosure principle, revealing only the lowest\-cost information sufficient to answer the query and refusing when no matching information is available\. In Setting B \(Planner–Executor, §[3\.3](https://arxiv.org/html/2606.14314#S3.SS3)\), the user is replaced by a planner LLM holding the same information\. The planner has no persona\.

### 5\.3Interaction Modes and Models

We evaluate all three communication modes defined in §[3\.2](https://arxiv.org/html/2606.14314#S3.SS2):MtextM\_\{\\text\{text\}\},MuiM\_\{\\text\{ui\}\}, andMhybridM\_\{\\text\{hybrid\}\}\. Agent models are DeepSeek\-V3\.2\(Liuet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib87)\), GPT\-5\-mini, Qwen3\-VL\-32B\-Instruct\(Baiet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib89)\), and Seed\-OSS\-36B\-Instruct\(Team,[2025](https://arxiv.org/html/2606.14314#bib.bib91)\); user\-side models are GPT\-4o\(Achiamet al\.,[2023](https://arxiv.org/html/2606.14314#bib.bib90)\)and Qwen3\-VL\-32B\-Instruct\. Specific pairings vary by benchmark and are reported in each result table\.

### 5\.4Metrics

FollowingSunet al\.\([2025](https://arxiv.org/html/2606.14314#bib.bib92)\)and the facets of communication quality in §[3\.2](https://arxiv.org/html/2606.14314#S3.SS2), each episode yields three scores:

Productivity\.Whether the agent completes the task successfully, measured by each environment’s native task\-success signal \(e\.g\., patch similarity, database\-state match\), with TravelGym max 2\.4\. This is the primary signal: communication is valuable only insofar as it enables task completion\.

Proactivity\.To complete the task, the agent must act in the environment while recovering the missing information through the communication channel\. Each piece of information inzzcarries a*sensitivity level*ℓ∈\{1,…,L\}\\ell\\in\\\{1,\\ldots,L\\\}\. The inverse of interaction cost, computed from the Cost annotations tagged by theUser Simulatorat each turn: a proactive agent achieves high productivity with low cumulative cost∑tct\\sum\_\{t\}c\_\{t\}and few questioning turns\.

Personalization\.Adherence to the assigned persona’s interaction preference\. TheUser Simulatortags each response with a persona\-specific reward label \(0/1\); the score is the negative of the cumulative penalty for deviations\. In Setting B, personas are not used and this dimension is absent\.

## 6Experiments and Analysis

BenchmarkAgentUserMtextM\_\{\\text\{text\}\}MuiM\_\{\\text\{ui\}\}MhybridM\_\{\\text\{hybrid\}\}MCPEM\_\{\\text\{CPE\}\}SWE\-benchDeepSeek\-V3\.2Qwen3\-VL\-32B\.135/\.156/\.629\.113/\.196/\.705\.140/\.089/\.594\.214↑\\uparrow/\.076↓\\downarrow/\.589↓\\downarrowτ2\\tau^\{2\}\-benchDeepSeek\-V3\.2GPT\-4o\.270/\.230/\.650\.275/\.120/\.850\.270/\.015/\.610\.300↑\\uparrow/\.005↓\\downarrow/\.520↓\\downarrowDeepSeek\-V3\.2Qwen3\-VL\-32B\.280/\.020/\.485\.235/\.035/\.555\.300/\.035/\.495\.300−\-/\.055↑\\uparrow/\.575↑\\uparrowGPT\-5\-miniQwen3\-VL\-32B\.375/\.015/\.347\.375/\.105/\.593\.340/\.135/\.470\.380↑\\uparrow/\.145↑\\uparrow/\.603↑\\uparrowGPT\-5\-miniGPT\-4o\.425/\.045/\.643\.385/\.065/\.827\.405/\.075/\.563\.430↑\\uparrow/\.075−\-/\.630↑\\uparrowTravelGymDeepSeek\-V3\.2GPT\-4o1\.437/\.045/\.9051\.249/\.260/\.9521\.233/\.145/\.8561\.533↑\\uparrow/\.170↑\\uparrow/\.920↑\\uparrowGPT\-5\-miniQwen3\-VL\-32B1\.013/\.345/\.850\.995/\.285/\.865\.906/\.195/\.7661\.113↑\\uparrow/\.205↑\\uparrow/\.845↑\\uparrowWebArenaDeepSeek\-V3\.2GPT\-4o\.205/\.628/\.865\.180/\.810/\.673\.195/\.695/\.766\.235↑\\uparrow/\.775↑\\uparrow/\.633↓\\downarrowGPT\-5\-miniGPT\-4o\.200/\.525/\.737\.140/\.668/\.803\.180/\.315/\.752\.275↑\\uparrow/\.520↑\\uparrow/\.835↑\\uparrow

Table 2:CPE\-optimized policy \(MCPEM\_\{\\text\{CPE\}\}\) vs\. baselines\. CPE is applied only whereMhybridM\_\{\\text\{hybrid\}\}underperforms at least one single\-channel baseline in productivity\. Bestproductivityper row is bolded\.↑\\uparrow/↓\\downarrow/−\-indicate direction of change fromMhybridM\_\{\\text\{hybrid\}\}\.We organize experiments around four analyses: \(1\) mode comparison \(MtextM\_\{\\text\{text\}\}vs\.MuiM\_\{\\text\{ui\}\}vs\.MhybridM\_\{\\text\{hybrid\}\}\), \(2\) oracle full\-information upper bound on SWE\-bench, \(3\) CPE\-optimized policy \(MCPEM\_\{\\text\{CPE\}\}\), and \(4\) Planner–Executor setting\.

### 6\.1Mode Comparison

Table[1](https://arxiv.org/html/2606.14314#S5.T1)reports the three modes across four benchmarks in the User–Agent setting, averaged over four personas\. Several patterns stand out:

MhybridM\_\{\\text\{hybrid\}\}leads productivity in the majority of cases, but the gain is task\-dependent\.MhybridM\_\{\\text\{hybrid\}\}achieves the best productivity in 8 of 14 agent–user pairs, confirming that access to both channels generally improves task completion\. The benefit is most consistent in SWE\-bench \(4/4\) and absent in WebArena \(0/2\)\. Hybrid access is broadly useful but not universally cost\-effective\.

MuiM\_\{\\text\{ui\}\}dominates personalization and proactivity\.MuiM\_\{\\text\{ui\}\}achieves the best personalization in 10 of 14 pairs and the best proactivity in 8 of 14\. The effect is strongest inτ2\\tau^\{2\}\-bench and TravelGym, whereMuiM\_\{\\text\{ui\}\}leads personalization in all 8 pairs\. Structured forms reduce response ambiguity by constraining user input to relevant dimensions, which simultaneously improves persona compliance and elicits more information per turn than free\-text replies\.

MtextM\_\{\\text\{text\}\}outperformsMuiM\_\{\\text\{ui\}\}on productivity\.In 9 of 14 pairs,MtextM\_\{\\text\{text\}\}productivity exceedsMuiM\_\{\\text\{ui\}\}\. Free\-text exchange is more efficient for task progress: the agent can ask precisely targeted questions and receive context\-rich replies, whereas UI forms impose a fixed per\-turn overhead that does not always pay off in task completion\. The productivity gap, together withMuiM\_\{\\text\{ui\}\}’s advantage in interaction quality, confirms that no single channel dominates, motivating learned channel\-selection strategies\.

### 6\.2Oracle: Value of Full Information

Table 3:Oracle experiment on SWE\-bench\.MfullM\_\{\\text\{full\}\}injects the complete task description into the agent’s initial prompt\. Theask\_qcolumn reports the proportion of tasks where the agent usedask\_questionat least once\. Bestproductivityper row is bolded\.To isolate the role of information availability, we conduct an oracle experiment on SWE\-bench where the complete task specification is injected into the agent’s initial prompt\. This removes the need for information recovery and isolates task\-execution capability from interaction quality\.

Table[3](https://arxiv.org/html/2606.14314#S6.T3)reports the results\. Full information improves productivity by 2\.3×\\times–10\.6×\\timesoverMtextM\_\{\\text\{text\}\}, confirming thatmissing information, rather than execution ability, is the primary bottleneck\. Correspondingly,ask\_questionusage drops consistently under the oracle setting, indicating that most interaction turns inMtextM\_\{\\text\{text\}\}are spent recovering task information\.

### 6\.3CPE\-Optimized Policy

We apply CPE only to settings where naive hybrid interaction underperforms at least one single\-channel baseline, indicating ineffective channel selection\. Configuration details are in Appendix[J](https://arxiv.org/html/2606.14314#A10)\. Table[2](https://arxiv.org/html/2606.14314#S6.T2)reportsMCPEM\_\{\\text\{CPE\}\}alongside all baselines for completed optimization runs\.

Table 4:Mode comparison in the Planner–Executor setting\. Each cell reports productivity; best per row inbold blue\.Across all 9 experiment configurations,MCPEM\_\{\\text\{CPE\}\}consistently achieves the best productivity across all evaluated settings using only prompt\-level optimization\. The arrows in Table[2](https://arxiv.org/html/2606.14314#S6.T2)show that relative toMhybridM\_\{\\text\{hybrid\}\}, CPE improves proactivity in 6 of 9 cases and personalization in 6 of 9 cases, suggesting that better communication policies can simultaneously improve task completion and interaction effectiveness\. The optimized policies are reproduced in Appendix[I](https://arxiv.org/html/2606.14314#A9)\.

### 6\.4Planner–Executor Setting

In the Planner–Executor setting \(§[3\.3](https://arxiv.org/html/2606.14314#S3.SS3)\), the Planner holds a complete planzzand discloses cooperatively; the Executor receives only a vague summary and must recover missing dimensions through interaction\. This removes persona dynamics and task ambiguity, letting us ask: does hybrid communication help even when the information holder is perfectly cooperative and the information need is well\-defined?

Table[4](https://arxiv.org/html/2606.14314#S6.T4)reports the results on SWE\-bench, TravelGym, andτ2\\tau^\{2\}\-bench\. The answer is clear:MhybridM\_\{\\text\{hybrid\}\}leads in 7 of 10 pairs\. Even when the Planner is fully cooperative and the missing information is deterministic, having both channels beats either alone\. This reinforces the paper’s central claim: communication policy, the meta\-decision of*which channel to use*, carries independent value beyond adapting to user personas\. When hybrid access underperforms, the information need is narrow enough that a single well\-chosen channel suffices; when the need is broader, channel flexibility directly translates to better task outcomes\. The hybrid win rate here \(7/10\) is comparable to the User–Agent setting \(8/14, Table[1](https://arxiv.org/html/2606.14314#S5.T1)\), suggesting that channel selection matters about as much for pure information transmission as it does for navigating human ambiguity\.

## 7Conclusion

In this work, we formalized*Communication Policy*, the prompt\-level strategy for choosing between text and UI channels in hybrid agent interaction, and proposed CPE to optimize this decision automatically\. Experiments across four environments and two interaction settings reveal two key findings: \(i\) text and UI communication are complementary rather than interchangeable, with text often supporting more efficient task progression, while structured UI improves response quality and persona compliance; hybrid interaction can combine advantages from both channels and often achieves stronger overall performance; \(ii\) effective communication strategies can be discovered through prompt\-level self\-evolution without model retraining\. Our results highlight communication behavior as an important and underexplored dimension in LLM agent interaction\. We hope this work encourages the community to treat communication channel selection as a first\-class design concern\.

## Limitations

Our work presents a few limitations that outline directions for future work\. First, our experiments use LLM\-based user simulators, which enable controlled and reproducible evaluation at scale but may not capture the full variability of real human interaction\. Second, our evaluation spans four environments, two interaction settings, and four user personas, but the space of possible communication channel designs remains largely unexplored\. Third, our cost annotation protocol assumes users disclose information following a predefined sensitivity schema; real\-world disclosure behavior is more complex and context\-dependent\.

## References

- A\. Abbas, C\. Wohn, A\. Jagtap, E\. H Rho, Y\. Kim, and S\. W\. Lee \(2026\)" Having lunch now": understanding how users engage with a proactive agent for daily planning and self\-reflection\.InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems,pp\. 1–23\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat,et al\.\(2023\)Gpt\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[§5\.3](https://arxiv.org/html/2606.14314#S5.SS3.p1.3)\.
- L\. A\. Agrawal, S\. Tan, D\. Soylu, N\. Ziems, R\. Khare, K\. Opsahl\-Ong, A\. Singhvi, H\. Shandilya, M\. J\. Ryan, M\. Jiang,et al\.\(2025\)Gepa: reflective prompt evolution can outperform reinforcement learning\.arXiv preprint arXiv:2507\.19457\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px3.p1.1)\.
- S\. K\. Anbalagan, X\. Nei, U\. Mohan, V\. K\. Kanamarlapudi, A\. Kommalapati, and X\. Zhao \(2025\)Bridging ui design and chatbot interactions: applying form\-based principles to conversational agents\.InInternational conference on human\-computer interaction,pp\. 223–231\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p3.1)\.
- S\. Bai, Y\. Cai, R\. Chen, K\. Chen, X\. Chen, Z\. Cheng, L\. Deng, W\. Ding, C\. Gao, C\. Ge,et al\.\(2025\)Qwen3\-vl technical report\.arXiv preprint arXiv:2511\.21631\.Cited by:[§5\.3](https://arxiv.org/html/2606.14314#S5.SS3.p1.3)\.
- V\. Barres, H\. Dong, S\. Ray, X\. Si, and K\. Narasimhan \(2025\)τ2\\tau^\{2\}\-Bench: evaluating conversational agents in a dual\-control environment\.arXiv preprint arXiv:2506\.07982\.Cited by:[§E\.3](https://arxiv.org/html/2606.14314#A5.SS3.p1.1),[§5\.1](https://arxiv.org/html/2606.14314#S5.SS1.p1.3)\.
- A\. Bhattacharjee, J\. Suh, M\. Ershadi, S\. T\. Iqbal, A\. D\. Wilson, and J\. Hernandez \(2024\)Understanding communication preferences of information workers in engagement with text\-based conversational agents\.arXiv preprint arXiv:2410\.20468\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p1.1)\.
- Y\. Cao, P\. Jiang, and H\. Xia \(2025\)Generative and malleable user interfaces with generative and evolving task\-driven data model\.InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems,pp\. 1–20\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p3.1)\.
- J\. Chen, Y\. Zhang, Y\. Zhang, Y\. Shao, and D\. Yang \(2025\)Generative interfaces for language models\.arXiv preprint arXiv:2508\.19227\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p3.1)\.
- Y\. Deng, L\. Liao, L\. Chen, H\. Wang, W\. Lei, and T\. Chua \(2023\)Prompting and evaluating large language models for proactive dialogues: clarification, target\-guided, and non\-collaboration\.InFindings of the Association for Computational Linguistics: EMNLP 2023,pp\. 10602–10621\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p2.1),[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Dou, M\. Galley, B\. Peng, C\. Kedzie, W\. Cai, A\. Ritter, C\. Quirk, W\. Xu, and J\. Gao \(2025\)SimulatorArena: are user simulators reliable proxies for multi\-turn evaluation of ai assistants?\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 35200–35278\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px2.p1.1)\.
- D\. C\. Fang and T\. Ke \(2025\)Information seeking for robust decision making under partial observability\.arXiv preprint arXiv:2510\.01531\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p1.1)\.
- J\. Gromada, A\. Kasicka, E\. Komkowska, L\. Krajewski, N\. Krawczyk, M\. Veyret, B\. Przybył, L\. M\. R\. Barahona, and M\. K\. Szczerbak \(2025\)Evaluating conversational agents with persona\-driven user simulations based on large language models: a sales bot case study\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track,pp\. 230–245\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Hao, P\. Cao, Z\. Jin, H\. Liao, Y\. Chen, K\. Liu, and J\. Zhao \(2025\)Evaluating personalized tool\-augmented llms from the perspectives of personalization and proactivity\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 21897–21935\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Huang, X\. Chen, S\. Mishra, H\. S\. Zheng, A\. Yu, X\. Song, and D\. Zhou \(2024\)Large language models cannot self\-correct reasoning yet\.InInternational conference on learning representations,Vol\.2024,pp\. 32808–32824\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px3.p1.1)\.
- T\. Huang, S\. Chen, M\. Chen, J\. May, L\. Yang, M\. Wan, and P\. Zhou \(2025\)Teaching language models to gather information proactively\.Findings of the Association for Computational Linguistics: EMNLP 2025,pp\. 15588–15599\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p2.1),[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px1.p1.1)\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan \(2024\)Swe\-bench: can language models resolve real\-world github issues?\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 54107–54157\.Cited by:[§E\.1](https://arxiv.org/html/2606.14314#A5.SS1.p1.1),[§1](https://arxiv.org/html/2606.14314#S1.p3.1),[§5\.1](https://arxiv.org/html/2606.14314#S5.SS1.p1.3)\.
- X\. Li, R\. Zhou, Z\. C\. Lipton, and L\. Leqi \(2024\)Personalized language modeling from personalized human feedback\.arXiv preprint arXiv:2402\.05133\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px2.p1.1)\.
- L\. Liao, G\. H\. Yang, and C\. Shah \(2023\)Proactive conversational agents in the post\-chatgpt world\.InProceedings of the 46th international ACM SIGIR conference on research and development in information retrieval,pp\. 3452–3455\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p2.1)\.
- A\. Liu, A\. Mei, B\. Lin, B\. Xue, B\. Wang, B\. Xu, B\. Wu, B\. Zhang, C\. Lin, C\. Dong,et al\.\(2025\)Deepseek\-v3\. 2: pushing the frontier of open large language models\.arXiv preprint arXiv:2512\.02556\.Cited by:[§5\.3](https://arxiv.org/html/2606.14314#S5.SS3.p1.3)\.
- Y\. Lu, S\. Yang, C\. Qian, G\. Chen, Q\. Luo, Y\. Wu, H\. Wang, X\. Cong, Z\. Zhang, Y\. Lin,et al\.\(2025\)Proactive agent: shifting llm agents from reactive responses to active assistance\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 47431–47457\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Nandy, S\. O\. Adalgeirsson, A\. K\. Sinha, T\. Kraljic, M\. Cleron, L\. Shi, A\. Singh, A\. Chaudhary, A\. Ganti, C\. A\. Melancon,et al\.\(2024\)Bespoke: using llm agents to generate just\-in\-time interfaces by reasoning about user intent\.InCompanion Proceedings of the 26th International Conference on Multimodal Interaction,pp\. 78–81\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p3.1)\.
- S\. G\. Patil, T\. Zhang, X\. Wang, and J\. E\. Gonzalez \(2024\)Gorilla: large language model connected with massive apis\.Advances in Neural Information Processing Systems37,pp\. 126544–126565\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p1.1)\.
- C\. Qian, Z\. Liu, A\. Prabhakar, Z\. Liu, J\. Zhang, H\. Chen, H\. Ji, W\. Yao, S\. Heinecke, S\. Savarese,et al\.\(2025\)Userbench: an interactive gym environment for user\-centric agents\.arXiv preprint arXiv:2507\.22034\.Cited by:[§E\.2](https://arxiv.org/html/2606.14314#A5.SS2.p1.1),[§5\.1](https://arxiv.org/html/2606.14314#S5.SS1.p1.3)\.
- Y\. Qin, S\. Liang, Y\. Ye, K\. Zhu, L\. Yan, Y\. Lu, Y\. Lin, X\. Cong, X\. Tang, B\. Qian,et al\.\(2024\)Toolllm: facilitating large language models to master 16000\+ real\-world apis\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 9695–9717\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p1.1)\.
- Y\. Qiu, T\. Shi, X\. Zhao, F\. Zhu, Y\. Zhang, and F\. Feng \(2025\)Latent inter\-user difference modeling for llm personalization\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 10610–10628\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Sachdeva, A\. Kim, and A\. R\. Dennis \(2024\)Taking the chat out of chatbot? collecting user reviews with chatbots and web forms\.Journal of Management Information Systems41\(1\),pp\. 146–177\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p2.1)\.
- R\. Sahay, L\. S\. Tekumalla, P\. Aggarwal, A\. Jain, and A\. Saladi \(2025\)Ask: aspects and retrieval based hybrid clarification in task oriented dialogue systems\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 6: Industry Track\),pp\. 881–895\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px1.p1.1)\.
- V\. Samuel, H\. P\. Zou, Y\. Zhou, S\. Chaudhari, A\. Kalyan, T\. Rajpurohit, A\. Deshpande, K\. Narasimhan, and V\. Murahari \(2024\)Personagym: evaluating persona agents and llms\.arXiv preprint arXiv:2407\.184168\(9\)\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Shi, W\. Xu, Z\. Zeqi, X\. Zi, Q\. Wu, and M\. Xu \(2025\)PersonaX: a recommendation agent\-oriented user modeling framework for long behavior sequence\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 5764–5787\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.Advances in neural information processing systems36,pp\. 8634–8652\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Siro, Y\. Yuan, M\. Aliannejadi, and M\. de Rijke \(2026\)AGENT\-cq: automatic generation and evaluation of clarifying questions for conversational search with large language models\.ACM Transactions on Information Systems\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px1.p1.1)\.
- W\. Sun, X\. Zhou, W\. Du, X\. Wang, S\. Welleck, G\. Neubig, M\. Sap, and Y\. Yang \(2025\)Training proactive and personalized llm agents\.arXiv preprint arXiv:2511\.02208\.Cited by:[§3\.1](https://arxiv.org/html/2606.14314#S3.SS1.p3.6),[§5\.4](https://arxiv.org/html/2606.14314#S5.SS4.p1.1)\.
- M\. Suri, P\. Mathur, N\. Lipka, F\. Dernoncourt, R\. A\. Rossi, and D\. Manocha \(2025\)Structured uncertainty guided clarification for llm agents\.arXiv preprint arXiv:2511\.08798\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p2.1),[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px1.p1.1)\.
- B\. S\. Team \(2025\)Seed\-oss open\-source models\.Note:[https://github\.com/ByteDance\-Seed/seed\-oss](https://github.com/ByteDance-Seed/seed-oss)Cited by:[§5\.3](https://arxiv.org/html/2606.14314#S5.SS3.p1.3)\.
- Y\. Wan, J\. Wu, M\. Abdulhai, L\. Shani, and N\. Jaques \(2026\)Enhancing personalized multi\-turn dialogue with curiosity reward\.Advances in Neural Information Processing Systems38,pp\. 155857–155894\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Wang, X\. Li, S\. Yang, L\. Zhou, F\. Jiang, and H\. Li \(2025a\)Know you first and be you better: modeling human\-like user simulators via implicit profiles\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 21082–21107\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px2.p1.1)\.
- X\. Wang, C\. Li, Z\. Wang, F\. Bai, H\. Luo, J\. Zhang, N\. Jojic, E\. Xing, and Z\. Hu \(2024\)Promptagent: strategic planning with language models enables expert\-level prompt optimization\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 23967–24001\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px3.p1.1)\.
- Z\. Wang, N\. Geng, Z\. Guo, W\. Ma, and M\. Zhang \(2025b\)Human vs\. agent in task\-oriented conversations\.InProceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region,pp\. 133–142\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p1.1),[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p1.1)\.
- J\. Wei, S\. Kim, H\. Jung, and Y\. Kim \(2024\)Leveraging large language models to power chatbots for collecting user self\-reported data\.Proceedings of the ACM on Human\-Computer Interaction8\(CSCW1\),pp\. 1–35\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p3.1)\.
- Y\. Wu and H\. Osawa \(2024\)Navigating communication patterns and personalities in user preference during human\-agent interaction\.InProceedings of the 12th International Conference on Human\-Agent Interaction,pp\. 447–449\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p1.1)\.
- C\. Yang, X\. Wang, Y\. Lu, H\. Liu, Q\. V\. Le, D\. Zhou, and X\. Chen \(2024\)Large language models as optimizers\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 12028–12068\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2022\)React: synergizing reasoning and acting in language models\.arXiv preprint arXiv:2210\.03629\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p1.1)\.
- S\. Yi, M\. Khang, and S\. Park \(2025\)ZERA: zero\-init instruction evolving refinement agent–from zero instructions to structured prompts via principle\-based optimization\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 23334–23348\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px3.p1.1)\.
- M\. Yuksekgonul, F\. Bianchi, J\. Boen, S\. Liu, Z\. Huang, C\. Guestrin, and J\. Zou \(2024\)Textgrad: automatic" differentiation" via text\.arXiv preprint arXiv:2406\.07496\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px3.p1.1)\.
- X\. Zhang, Y\. Deng, Z\. Ren, S\. K\. Ng, and T\. Chua \(2024\)Ask\-before\-plan: proactive language agents for real\-world planning\.InFindings of the Association for Computational Linguistics: EMNLP 2024,pp\. 10836–10863\.Cited by:[§1](https://arxiv.org/html/2606.14314#S1.p2.1),[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Zhang, L\. Ge, H\. Li, W\. Zhu, C\. Zhang, and Y\. Ye \(2026\)MAPRO: recasting multi\-agent prompt optimization as maximum a posteriori inference\.InFindings of the Association for Computational Linguistics: EACL 2026,pp\. 4458–4480\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\. Liu, and G\. Huang \(2024\)Expel: llm agents are experiential learners\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 19632–19642\.Cited by:[§2](https://arxiv.org/html/2606.14314#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried,et al\.\(2024\)Webarena: a realistic web environment for building autonomous agents\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 15585–15606\.Cited by:[§E\.4](https://arxiv.org/html/2606.14314#A5.SS4.p1.1),[§5\.1](https://arxiv.org/html/2606.14314#S5.SS1.p1.3)\.

## Appendix ACost Level Definitions

Each benchmark defines a 3–5 level sensitivity hierarchy for the information dimensions inzz\(§[3\.1](https://arxiv.org/html/2606.14314#S3.SS1)\)\. TheUser Simulatorreferences these levels to decide what information to disclose at each turn\. Table[5](https://arxiv.org/html/2606.14314#A1.T5)lists the definitions as written in theUser Simulatorsystem prompt\.

Table 5:Cost level definitions from theUser Simulatorsystem prompt\. Cost 3 is reserved for refusal across all environments\. SWE\-bench,τ2\\tau^\{2\}\-bench, and WebArena use 5 levels; TravelGym uses 3\.
## Appendix BUser SimulatorPrompt Structure

TheUser Simulatorsystem prompt contains five sections:

1. 1\.Role\.“You are playing the role of a human user responding to an agent’s questions\.” The LLM is instructed to stay in character, sound natural, and never reveal its artificial nature\.
2. 2\.User Preference & Reward\.The assigned persona description \(e\.g\., “The user is an amateur and can only answer very simple and general questions”\) and the corresponding reward rule \(e\.g\., “If the question is very simple, tag \[Reward 1\]; otherwise, respond I don’t know and tag \[Reward 0\]”\)\.
3. 3\.Information & Cost\.The full task specificationzzorganized into environment\-specific sensitivity tiers, each with a defined Cost level\. The simulator must tag each response with the highest Cost among the tiers disclosed\.
4. 4\.Answering Rules\.Default style: concise, casual, natural language; use contractions; single paragraph only; no markdown, bullet points, code blocks, or role explanations; never ask the agent a question\.
5. 5\.Output Format\.Natural\-language reply \(single paragraph\), then\[Cost N\]on a new line, then optionally\[Reward 0/1\]if the reward rule requires it\.

The example below is aτ2\\tau^\{2\}\-benchUser Simulatorprompt with the amateur persona; the structure is analogous for other benchmarks and personas, with environment\-specific cost tiers and persona\-specific preference descriptions\.

τ2\\tau^\{2\}\-benchUser SimulatorPrompt \(amateur persona\)System prompt\.*Role\.*You are playing the role of a human user responding to an agent’s questions\. Your behavior is dictated by a specific user preference, and you will evaluate the agent’s questions based on a reward rule\. You will also assign a cost based on the sensitivity of the information you share\.*User Preference & Reward\.*Your Preference: The user is an amateur and can only answer very simple and general questions\. When a question involves any professional knowledge, the user will respond with ‘‘I don’t know\.’’ Reward Rule: If the question is very simple \(e\.g\., related only to personal preference or common sense\), tag it as \[Reward 1\]\. Otherwise, if it involves any professional knowledge, respond with ‘‘I don’t know\.’’ and tag it as \[Reward 0\]\.*Information & Cost \(airline domain\)\.*Known Info \(Cost 1\): Basic account or booking details the customer knows about themselves\. Task Instructions \(Cost 2\): The customer’s detailed goals, constraints, and behavioural directives\. Key Identifiers \(Cost 4\): Specific IDs \(user\_id, reservation\_id, etc\.\)\. Evaluation Criteria \(Cost 5\): The exact conditions that must be met for the task to succeed\. Cost 3: Information not provided \-\-\- refused to answer or said ‘‘I don’t know\.’’\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- TASK INFORMATION \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- Known Info: You are Emma Kim\. Your user id is emma\_kim\_9957\. Task Instructions: You want to cancel reservation EHGLP3\. If the agent tells you cancellation is not possible, mention you were told insurance was unnecessary\. Do not cancel without a refund\. Key Identifiers: None Evaluation Criteria: \{"actions": \[\], "nl\_assertions": \["Agent should refuse to proceed with the cancellation\."\], "reward\_basis": \["DB", "COMMUNICATE"\]\} *Answering Rules\.*Adhere to your preference above all else\. Default style: keep replies short, casual, and natural\. Use contractions \(‘‘it’s,’’ ‘‘I don’t’’\)\. Always write in a single paragraph\. No markdown, bullet points, code blocks, or role explanations\. Never ask the agent a question\. Escalate to more sensitive information only if the agent is stuck\.*Output Format\.*1\. A single\-paragraph, human\-style, concise reply\. 2\. The cost tag on a new line \(e\.g\., \[Cost 3\]\)\. 3\. The reward tag on a new line \(e\.g\., \[Reward 1\]\), only if required by the reward rule\.Again, your preference is: The user is an amateur and can only answer very simple and general questions\. Now answer the agent’s question\. Be very concise\. Ensure cost and reward predictions are accurate\.

## Appendix CPlanner SimulatorPrompt Structure

In the Planner–Executor setting \(§[3\.3](https://arxiv.org/html/2606.14314#S3.SS3)\), theUser Simulatoris replaced by aPlanner Simulator: an LLM that holds the full task specificationzz\. At the start of each episode, the Planner receives the system prompt \(withzz\) and generates a step\-by\-step execution plan, which is shared with the Executor\. On subsequent turns, the Executor sends clarification queries and the Planner answers using the full context\. Unlike theUser Simulator, the Planner has no persona and no reward tagging; it discloses information cooperatively following the cost schema\. The prompt consists of four sections:

1. 1\.Role\.Frames the LLM as the Planner holding the full ground truth; the Executor reaches the Planner through communication turns\. Answers must draw from the ground\-truth annex only\.
2. 2\.Output Format\.A JSON object withthought\(private reasoning\) andresponse\(what the Executor reads, followed by a Cost tag\)\. No reward tags\.
3. 3\.Information & Cost\.Cost tiers \(Scenario→\\rightarrowCost 1, Dimensions & Latent Preferences→\\rightarrowCost 2, Refusal→\\rightarrowCost 3\), with concrete TASK INFORMATION appended below: the full scenario, the vague initial\_opening \(z~\\tilde\{z\}\), which dimensions have latent preferences, and the structured ground\-truth constraints\.
4. 4\.Initial Execution Plan\.A numbered step\-by\-step plan generated at episode start and shared with the Executor\. The Planner must stay consistent with this plan when answering subsequent clarifications\.

The example below is a TravelGymPlanner Simulatorprompt; the structure is analogous for other benchmarks, with environment\-specific cost tiers and task information\.

TravelGymPlanner SimulatorPromptSystem prompt\.*Role\.*You are the Planner for this TravelGym episode\. You hold the full ground truth in the annex below\. The Executor reaches you through action turns \(plain text or a UI screenshot\)\. Answer from the annex only; do not invent preferences\. Prefer the smallest disclosure that still lets the Executor proceed\.*Output Format\.*Return exactly one JSON object with two keys: \- \`thought\`: brief private reasoning \(may reference strategy\)\. \- \`response\`: what the Executor should read as your reply, then a new line with exactly one of \[Cost 1\], \[Cost 2\], or \[Cost 3\]\. Do not put tags inside \`thought\`\. Do not add \[Reward\] lines\.*Information & Cost \(TravelGym\)\.*Scenario \(Cost 1\): Trip framing and scenario text \-\-\- information already public to the Executor\. Dimensions & Latent Preferences \(Cost 2\): Structured ground truth beyond the public framing \(which dimensions exist, opening line, specific preference lines\)\. Cost 3: Refusal, don’t know, or no substantive travel detail\.\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- TASK INFORMATION \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- Scenario: I am planning a trip to Chicago from November 10th to November 17th\. \[\.\.\.\] I would love to dine at a restaurant that not only serves authentic American cuisine but also provides convenient parking\. Initial opening: I am planning a trip to Chicago from November 10th to November 17th, staying in an apartment\. I would like to dine at a restaurant in Chicago on November 12th\. Dimensions: restaurant, apartment Latent preferences: \[P1\] restaurant cuisine \-\- American \|\|\[P2\] restaurant parking\|\|\[P3\] apartment platform \-\- Airbnb\|\|\[P4\] apartment rooms \-\- at least two bathroomsInitial Execution Plan\.1\. Confirm travel dates and destination: November 10th\-\-17th, Chicago\. 2\. Search for an Airbnb in Chicago with at least two bathrooms\. 3\. Book the selected Airbnb for the specified dates\. 4\. Research restaurants in Chicago serving authentic American cuisine\. 5\. Identify restaurants that offer convenient parking\. 6\. Select a restaurant meeting both cuisine and parking preferences\. 7\. Make a reservation at the chosen restaurant for November 12th\. \[\.\.\.\]

## Appendix DPersona Definitions

Experiments use four personas, each assigned to an equal share of episodes across all environments\. Each persona is defined by a natural\-language preference description injected into theUser Simulatorprompt at the start of an episode, shaping how the user responds to agent questions \(or whether they respond at all\)\.

amateur\.The user is an amateur and can only answer very simple and general questions\. When a question involves any professional knowledge, the user responds with “I don’t know\.”

do\_selection\.The user can only answer selection questions\. The agent’s question must provide options \(e\.g\., A, B, C\); the user responds with the chosen letter only\. If the question is not a selection question, the user responds with “I don’t know\.”

one\_question\.The user prefers the agent to ask only one question at a time\. Multi\-part questions receive a refusal\.

answer\_more\.The user prefers the agent to ask more questions\. The agent should ask a minimum of 3 questions during the episode\.

The same four preference texts are used across SWE\-bench, TravelGym,τ2\\tau^\{2\}\-bench, and WebArena\.

### D\.1Scoring

Each persona defines a compliance condition\. Letreward0\\text\{reward\}\_\{0\}be the number of\[Reward 0\]tags in an episode andask\_turnthe number of agent questioning turns\. The penaltypreference\_rewardand compliance flagpreference\_okare:

- •amateur\.preference\_reward=−0\.1⋅reward0\\text\{preference\\\_reward\}=\-0\.1\\cdot\\text\{reward\}\_\{0\}\(LLM\-tagged\)\. Compliant iff no\[Reward 0\]tags\.
- •do\_selection\.preference\_reward=−0\.5⋅reward0\\text\{preference\\\_reward\}=\-0\.5\\cdot\\text\{reward\}\_\{0\}\(LLM\-tagged\)\. Compliant iff all questions provide A/B/C options\.
- •one\_question\.preference\_reward=−0\.5⋅reward0\\text\{preference\\\_reward\}=\-0\.5\\cdot\\text\{reward\}\_\{0\}\(LLM\-tagged\)\. Compliant iff each message contains a single question\.
- •answer\_more\.preference\_reward=−min⁡\(ask\_turn−3,0\)\\text\{preference\\\_reward\}=\-\\min\(\\text\{ask\\\_turn\}\-3,\\;0\)\(programmatic\)\. Compliant iffask\_turn≥3\\text\{ask\\\_turn\}\\geq 3\.

preference\_ok=1\\text\{preference\\\_ok\}=1whenpreference\_reward=0\\text\{preference\\\_reward\}=0, and 0 otherwise\. The personalization score is the mean over successful episodesℰ\\mathcal\{E\}:

Personalization=1\|ℰ\|​∑e∈ℰpreference\_ok​\(e\)∈\[0,1\]\.\\text\{Personalization\}=\\frac\{1\}\{\|\\mathcal\{E\}\|\}\\sum\_\{e\\in\\mathcal\{E\}\}\\text\{preference\\\_ok\}\(e\)\\in\[0,1\]\.

## Appendix EEnvironment Details

Table 6:Evaluation environments with example vague specifications and task specifications\. The agent sees onlyz~\\tilde\{z\}; the fullzzis held by theUser SimulatororPlanner Simulator\.In all benchmarks, forMuiM\_\{\\text\{ui\}\}andMhybridM\_\{\\text\{hybrid\}\}modes,generate\_uiHTML is rendered via Playwright as a screenshot and sent to the multimodal user LLM\.ask\_questionsends plain text\.

Table[6](https://arxiv.org/html/2606.14314#A5.T6)gives an example vague–full specification pair for each benchmark\. The per\-benchmark vaguification rules are as follows\.

### E\.1SWE\-bench

We use SWE\-bench\(Jimenezet al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib84)\), a repository\-level software repair benchmark\. We evaluate on 56 tasks from SWE\-bench Verified, each paired with 4 personas \(amateur, answer\_more, do\_selection, one\_question\), yielding 224 episodes per experimental configuration\. Max turns is 200\.

The agent debugs and edits code in a Linux sandbox with environment toolsexecute\_bash\(read\-only\) andstr\_replace\_editor\. Communication toolsask\_questionandgenerate\_uiare interleaved with code operations\. Success is measured by the F1 similarity between the agent’s final diff and the oracle patch\.

Vaguification\.zzcontains the full issue description, developer hints, the target file path\(s\), and the target function/class name\(s\)\.z~\\tilde\{z\}is asummarized\_issue: a one\-paragraph abbreviation that drops file paths, function names, and hint details\.

### E\.2TravelGym

We use TravelGym\(Qianet al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib85)\), a multi\-turn travel planning benchmark\. We evaluate on 50 tasks, each paired with 4 personas, yielding 200 episodes per configuration\. Max turns is 20\.

The agent uses a singleinteract\_with\_envtool with sub\-commandssearch,answer,finish, andaction\. Communication is multiplexed throughaction:ask\_questioncorresponds tocontent\_kind=text, andgenerate\_uitocontent\_kind=html\. Success is measured by the environment’s native terminal reward \(max 2\.4\), combining an LLM judge evaluation of the final itinerary with rule\-based answer matching against ground truth\.

Vaguification\.zzcontains the full trip scenario, the list of travel dimensions \(flight, hotel, restaurant\), per\-dimension latent preference constraints \(e\.g\., cabin class, star rating, cuisine type\), and the ground\-truth answer IDs\.z~\\tilde\{z\}is theinitial\_description: a 1–2 sentence user opening \(e\.g\., “Plan a weekend trip to Paris”\) that drops all dimension\-level preferences and constraints\.

### E\.3τ2\\tau^\{2\}\-bench

We useτ2\\tau^\{2\}\-bench\(Barreset al\.,[2025](https://arxiv.org/html/2606.14314#bib.bib86)\)in the airline domain only\. We evaluate on 50 tasks, each paired with 4 personas, yielding 200 episodes per configuration\. Max turns is 100\.

The agent uses domain\-specific function calls \(book\_reservation,search\_direct\_flight,get\_user\_details, etc\.\) to manipulate a simulated airline database\. Communication toolsask\_questionandgenerate\_uiare appended to the domain tool list;ask\_questioncalls are intercepted and routed to theUser Simulator\. Success is binary: the final database state must exactly match the expected state from a reference execution\.

Vaguification\.zzcontains known account/booking facts \(known\_info\), customer task instructions \(task\_instructions\), key identifiers \(user\_id,reservation\_id\), and evaluation criteria\.z~\\tilde\{z\}is thereason\_for\_call: the customer’s opening message \(1–3 sentences, e\.g\., “I need to cancel a booking”\), which omits all account details, identifiers, and success conditions\.

### E\.4WebArena

We use WebArena\(Zhouet al\.,[2024](https://arxiv.org/html/2606.14314#bib.bib68)\)across 3 sites: shopping, Reddit, and GitLab \(map and wiki are excluded\)\. We evaluate on 50 tasks, each paired with 4 personas, yielding 200 episodes per configuration\. Max turns is 50\.

The agent navigates live web environments via browser primitives \(goto,click,fill,scroll,select\_option\)\. Communication toolsask\_questionandgenerate\_uiare added alongside browser actions\. Each LLM call receives a single system message plus a flat task block \(current URL, previous action, page accessibility tree\)\. Success is evaluated by URL match, page element presence, or expected string content on the final page\.

Vaguification\.zzcontains the full intent \(verbatim task description\), target sites, credentials \(username,password,start\_url\), and evaluation criteria \(reference answers, URL checks\)\.z~\\tilde\{z\}is a one\-sentence summary that infers the task category \(e\.g\., purchase, search, post\) from the intent via keyword matching and maps site names to short descriptions \(e\.g\., “I need to purchase something on an e\-commerce shopping website”\)\. All item names, prices, dates, and credential details are dropped\.

## Appendix FCommunication Policy Prompts

The communication policyπcomm\\pi\_\{\\text\{comm\}\}is the prompt text rendered to the agent at the start of each episode\. It consists of three components: \(1\) the system prompt, specifying the agent’s role, available tools, and channel\-selection rules; \(2\) few\-shot examples demonstrating when to use each channel; and \(3\) environment\-specific guidance \(e\.g\., task framing, cost schemas\)\. Below we provide the policy for each benchmark as used in our experiments\.

### F\.1SWE\-bench

SWE\-bench Communication PolicySystem prompt\.*Role\.*You are OpenHands agent, a helpful AI assistant that can interact with a computer to solve tasks\.*Environment tools\.*•execute\_bash•finish•str\_replace\_editor*Communication tools\.*•ask\_question•generate\_ui*Channel\-selection rules\.*•Do not ask the user in free\-form assistant text; use either ask\_question \(plain text\) or generate\_ui \(visual HTML\)\.•Use ask\_question when a short text question is enough\.•Use generate\_ui when a rendered page helps \(forms, options, layout\); html\_code must be a complete valid HTML5 document\.•Adapt to the user’s preference in the task description; be creative with HTML\.Examples\.*Example A — text \(ask\_question\)\.* USER: The app sometimes shows a random error page\. Can you figure out what’s wrong?ASSISTANT: I’ll list the top\-level layout, then ask one focused question if details are missing\. <function=execute\_bash\> <parameter=command\>ls \-1 /testbed</parameter\> </function\> USER: api frontend scripts tests ASSISTANT: <function=ask\_question\> <parameter=query\>What is the exact error message or stack trace when the error page appears?</parameter\> </function\> *Example B — UI \(generate\_ui\)\.* ASSISTANT \(needs more structured input\): <function=generate\_ui\> <parameter=progress\_summary\> I’ve mapped the repo\. I need one concrete detail to reproduce the issue\.</parameter\> <parameter=html\_code\><\!doctype html\> <html\><head\><meta charset="utf\-8"/\><title\>Reproduce the issue</title\> \[\.\.\.\] <label\>Timing</label\> <select\><option\>On first load</option\> \[\.\.\.\] </select\> <label\>What do you see?</label\> <textarea rows="3" placeholder="Error text or steps"\></textarea\> \[\.\.\.\] </html\></parameter\> <parameter=questions\>\["question": "When does the error occur and what appears on screen?"\] </parameter\> </function\>Appendixes\.NEW TASK DESCRIPTION: The app sometimes shows a random error page\. Can you figure out what’s wrong? The user’s preference for the agent is: The user is an amateur and can only answer very simple and general questions\. When a question involves any professional knowledge, the user will respond with "I don’t know\."

### F\.2TravelGym

TravelGym Communication PolicySystem prompt\.*Role\.*You are an agent that actively interacts with a specific environment\.*Environment tools\.*•search•answer•finish*Communication tools\.*•ask\_question•generate\_ui*Channel\-selection rules\.*•Do not ask in free\-form assistant text; use interact\_with\_env \(text\) or interact\_with\_env \(html\)\.•Use content\_kind=text when a short text question is enough\.•Use content\_kind=html when a rendered page helps \(forms, options, layout\)\.•Adapt to the user’s preference in the task description; be creative with HTML\.*Behavior rules\.*The user’s travel needs may be vague or underspecified\. When you need clarification, use interact\_with\_env with choice=action and a content\_kind prefix to select the channel\. Each question should address a specific missing preference dimension \(budget, cuisine, star rating, etc\.\)\.Examples\.*Example A — text \(ask\_question\)\.* USER: I am planning a trip to Bangkok, staying in an apartment from November 10th to November 17th, and visiting a restaurant on November 11th\. Also my budget is limited\.\.\.ASSISTANT \(after searching for apartments\): <function=interact\_with\_env\> <parameter=choice\>action</parameter\> <parameter=content\_kind\>text</parameter\> <parameter=content\>Do you have a preference for a specific area in Bangkok, or is any area fine as long as it’s within budget?</parameter\> </function\> *Example B — UI \(generate\_ui\)\.* ASSISTANT \(needs to narrow down restaurant options\): <function=interact\_with\_env\> <parameter=choice\>action</parameter\> <parameter=content\_kind\>html</parameter\> <parameter=content\><\!doctype html\> <html\><head\><meta charset="utf\-8"/\><title\>Restaurant Preferences</title\> \[\.\.\.\] <label\>Cuisine type</label\> <select\><option\>Thai</option\> \[\.\.\.\] </select\> <label\>Budget per person \(THB\)</label\> <select\><option\>Under 300</option\> \[\.\.\.\] </select\> \[\.\.\.\] </html\></parameter\> </function\>Appendixes\.I am planning a business trip to Austin, Texas, from April 15th to April 20th, staying in an apartment\. I also need a restaurant reservation in Austin on April 18th\.The user’s preference for the agent is: The user is an amateur and can only answer very simple and general questions\. When a question involves any professional knowledge, the user will respond with "I don’t know\."

### F\.3τ2\\tau^\{2\}\-bench

τ2\\tau^\{2\}\-bench Communication PolicySystem prompt\.*Role\.*You are a professional customer service agent for an airline company\. Your job is to resolve the customer’s issue efficiently and accurately, following company policy\.*Environment tools\.*•book\_reservation•cancel\_reservation•search\_direct\_flight•search\_onestop\_flight•get\_user\_details•get\_reservation\_details•get\_flight\_status•list\_all\_airports•send\_certificate•update\_reservation\_flights•update\_reservation\_passengers•update\_reservation\_baggages•calculate•transfer\_to\_human\_agents*Communication tools\.*•ask\_question•generate\_ui*Channel\-selection rules\.*•Do not ask the user in free\-form assistant text; use either ask\_question \(plain text\) or generate\_ui \(visual HTML\)\.•Use ask\_question when a short text question is enough\.•Use generate\_ui when a rendered page helps \(forms, options, layout\); html\_code must be a complete valid HTML5 document\.•Adapt to the user’s preference in the task description; be creative with HTML\.Examples\.*Example A — text \(ask\_question\)\.*CUSTOMER: I need to cancel a booking\.ASSISTANT: <function=ask\_question\> <parameter=query\>Could you provide the reservation ID or the full name on the booking?</parameter\> </function\> *Example B — UI \(generate\_ui\)\.*ASSISTANT \(needs to confirm flight change details\): <function=generate\_ui\> <parameter=progress\_summary\>I found your reservation\. To change your flight, I need a few details\.</parameter\> <parameter=html\_code\><\!doctype html\> <html\><head\><meta charset="utf\-8"/\><title\>Change Flight</title\> \[\.\.\.\] <label\>New departure date</label\> <input type="date" name="new\_date"/\> <label\>Preferred time</label\> <select\><option\>Morning</option\> \[\.\.\.\] </select\> \[\.\.\.\] </html\></parameter\> <parameter=questions\>\["question": "What date and time would you prefer for your new flight?"\]</parameter\> </function\>Appendixes\.NEW TASK DESCRIPTION: A customer has contacted customer service\. Here is their initial message: \-\-\- BEGIN CUSTOMER MESSAGE \-\-\- I recently spoke on the phone with a customer support representative that told me that a service agent will be able to help me cancel my reservation\. \-\-\- END CUSTOMER MESSAGE \-\-\- Resolve this issue\.The customer’s preference: The user can only answer selection questions\. The agent’s question must provide options such as A, B, C, and the user will respond with their choice \(e\.g\., A\)\.

### F\.4WebArena

WebArena Communication PolicySystem prompt\.*Role\.*You are a web automation agent\. You control a real web browser to complete tasks on behalf of the user\.*Environment tools\.*•goto•click•fill•press•scroll•select\_option*Communication tools\.*•ask\_question•generate\_ui*Channel\-selection rules\.*•Do not ask in free\-form assistant text; use ask\_question \(plain text\) or generate\_ui \(visual HTML\)\.•Use ask\_question when a short text question is enough\.•Use generate\_ui when a rendered page helps \(forms, options, layout\); html\_code must be a complete valid HTML5 document\.•Adapt to the user’s preference in the task description; be creative with HTML\.Examples\.*Example A — text \(ask\_question\)\.* TASK: I need to find some information on an e\-commerce admin panel\.ASSISTANT \(after navigating to the admin page\): <function=ask\_question\> <parameter=query\>What specific information are you looking for \-\- sales reports, customer data, or product catalog details?</parameter\> </function\> *Example B — UI \(generate\_ui\)\.* ASSISTANT \(needs search criteria\): <function=generate\_ui\> <parameter=progress\_summary\>I’m on the admin panel\. I need to know what to search for\.</parameter\>\]\] <parameter=html\_code\><\!doctype html\> <html\><head\><meta charset="utf\-8"/\><title\>Search Criteria</title\> \[\.\.\.\] <label\>What are you looking for?</label\> <select\><option\>Sales report</option\> \[\.\.\.\] </select\> <label\>Any specific date range?</label\> <input type="text" placeholder="e\.g\., Last 7 days"/\> \[\.\.\.\] </html\></parameter\> <parameter=questions\>\["question": "What data do you need and for what time period?"\] </parameter\> </function\>Appendixes\.Task: I need to purchase something on an e\-commerce shopping website\.The user’s preference for the agent is: The user can only answer selection questions\. The agent’s question must provide options such as A, B, C, and the user will respond with their choice \(e\.g\., A\)\.

## Appendix GEvolution Reflect Prompts

In each Evolve step \(Section[4\.2](https://arxiv.org/html/2606.14314#S4.SS2)\), an LLM is prompted to analyze rollout results and propose policy edits as a JSON patch\. The prompt consists of a system message followed by seven numbered sections: §1 optimization goals, §2 modification requirements, §3 output JSON schema, §4 per\-task scores, §5 current policy snapshots, §6 reference trajectories, and §7 a final formatting instruction\. Sections 4–6 are populated dynamically from the current round’s rollout data\. Below we show the full template for each benchmark; static sections are reproduced as written, while dynamic sections use\[\.\.\.\]to indicate the content they carry\.

### G\.1SWE\-bench

SWE\-bench Reflect PromptSystem prompt\.You are a senior prompt engineer\. The downstream agent may use ask\_question \(plain text\) or generate\_ui \(HTML\) to obtain user information\. You must respond with one JSON object only \(no markdown fences\), matching the schema in the user message\. The user message supplies schemas, scores, and the prompts to revise\. Improve guidance so the agent chooses ask\_question vs generate\_ui appropriately\.*§1 Optimization goals\.*Your objective is better end\-task performance by editing prompts so the agent chooses appropriately between ask\_question \(quick text clarification\) and generate\_ui \(structured HTML\) when it needs user input\. \- System / example prompts: teach when to use plain text vs a rendered page, how much to inspect first, and how to keep clarifications focused\. Do not contradict the frozen tool catalog / XML template block\. \- generate\_ui: use when multi\-field, branching, or layout clearly helps the user answer\. \- ask\_question: use when a single short answer is enough\. \- SWE: reduce wasted edits before key facts; align with repo exploration norms\.*§2 Modification requirements\.*\- Improve prompts using the score table \(this round’s eval slice\) and the reference trajectories in section 6\. \- Prefer \`system\_delta\` / \`example\_delta\` when small additions suffice; you may still output full \`hybrid\_system\` / \`hybrid\_example\` if needed\. \- \`hybrid\_system\` / \`system\_delta\`: the tool block from \`You have access to the following functions:\` through the first format\-example \`</function\>\` before \`<IMPORTANT\>Reminder:\` must remain byte\-identical to the current snapshot\. \- Do not include: gold answers, hidden test labels, \`preference\_ok\` shortcuts, or dataset leakage\. \- Output valid JSON only \(one top\-level object\), UTF\-8 strings\.*§3 Output JSON schema\.*\{"utils": \{"hybrid\_system": "<optional full replacement string\>", "hybrid\_example": "<optional full replacement string\>", "system\_delta": "<optional suffix appended to current hybrid system\>", "example\_delta": "<optional suffix appended to current hybrid example\>"\}\}*§4 Per\-task scores\.*\[For each taski∈ℬi\\in\\mathcal\{B\}, hybrid\-only scores on productivity\.\]*§5 Current policy snapshots\.*\[Full text of the current \`hybrid\_system\` and \`hybrid\_example\` prompts, incorporating all prior accepted edits\.\]*§6 Reference trajectories\.*\[For each task, hybrid rollout excerpts showing every \`ask\_question\` and \`generate\_ui\` call with the user\-simulator response, preserving turn order\.\]*§7 Final instruction\.*Emit one JSON object as in section 3 \(JSON schema\)\. No markdown outside that JSON\.

### G\.2TravelGym

TravelGym Reflect PromptSystem prompt\.You are a senior prompt engineer\. The downstream agent may use ask\_question \(plain text\) or generate\_ui \(HTML\) to obtain user information\. You must respond with one JSON object only \(no markdown fences\), matching the schema in the user message\. Target environment: TravelGym / preference\-driven travel\. Agent prompts come from the dataset parquet \(\`interact\_with\_env\`, etc\.\); you improve behavior by emitting \`travel\.system\_suffix\` only \(see schema\)\.*§1 Optimization goals\.*Your objective is better end\-task performance by editing \`travel\.system\_suffix\` only \(plain text appended each round to the Travel agent’s first system message\)\. \- Teach when to use ask\_question vs generate\_ui, and how to use \`interact\_with\_env\` \(\`choice=action\`, \`content\_kind=text\|html\`\) appropriately\. \- Suffixes accumulate across rounds; prefer small, non\-contradictory additions\. \- Travel: respect stated amateur/pro tone; avoid over\-proactive suggestions\.*§2 Modification requirements\.*\- Improve behavior using the score table \(this round’s eval slice\) and the reference trajectories in section 6\. \- Output \`travel\.system\_suffix\` only: appended to the first system message\. Omit \`travel\.user\_suffix\` unless you have a rare cross\-task user\-channel policy \(discouraged\)\. \- Prefer short suffixes \(tone, when to use \`ask\_question\` vs \`generate\_ui\`, \`interact\_with\_env\` with \`choice=action\` and \`content\_kind=text\|html\`\)\. \- Do not include: gold answers, hidden test labels, or dataset leakage\. \- Output valid JSON only \(one top\-level object\), UTF\-8 strings\.*§3 Output JSON schema\.*\{"travel": \{"system\_suffix": "<string appended to first agent system\>"\}\}*§4 Per\-task scores\.*\[For each taski∈ℬi\\in\\mathcal\{B\}, hybrid\-only scores on productivity\.\]*§5 Current policy snapshots\.*\[The native Travel agent system prompt as rebuilt by the eval pipeline\.*§6 Reference trajectories\.*\[For each task, hybrid rollout excerpts showing every \`interact\_with\_env\` call \(tagged with \`content\_kind=text\|html\`\) and the user\-simulator response\.\]*§7 Final instruction\.*Emit one JSON object as in section 3 \(JSON schema\)\. No markdown outside that JSON\.

### G\.3τ2\\tau^\{2\}\-bench

τ2\\tau^\{2\}\-bench Reflect PromptSystem prompt\.You are a senior prompt engineer\. The downstream agent may use ask\_question \(plain text\) or generate\_ui \(HTML\) to obtain user information\. You must respond with one JSON object only \(no markdown fences\), matching the schema in the user message\. Target environment: Tau2 / airline customer\-service assistant\. Agent system is rebuilt in\-code per domain; you improve behavior by emitting \`tau2\.system\_suffix\` only \(see schema\)\. Nudge ask\_question vs generate\_ui and keep clarifications focused and domain\-appropriate\.*§1 Optimization goals\.*Your objective is better end\-task performance by editing \`tau2\.system\_suffix\` only \(plain text appended each round to theτ2\\tau^\{2\}agent’s rebuilt system in hybrid mode\)\. \- Teach when to use ask\_question \(short text\) vs generate\_ui \(HTML\), and how to keep clarifications focused\. \- Suffixes accumulate across rounds; prefer small, non\-contradictory additions\. \- Do not attempt to replace the in\-codeτ2\\tau^\{2\}tool catalog or \`<function=\` XML templates via JSON; use suffixes for policy and tone only\. \- Tau2: keep clarifications domain\-neutral unless the task domain is explicit\.*§2 Modification requirements\.*\- Improve behavior using the score table \(this round’s eval slice\) and the reference trajectories in section 6\. \- Output \`tau2\.system\_suffix\` only: a plain string appended to theτ2\\tau^\{2\}agent’s rebuilt system for the next eval\. Omit \`tau2\.user\_suffix\` unless you have a rare cross\-task user\-channel policy \(discouraged\)\. \- Prefer short suffixes; do not try to replace the in\-code tool or XML templates\-\-\-suffixes are for policy and tone only\. \- Do not include: gold answers, hidden test labels, \`preference\_ok\` shortcuts, or dataset leakage\. \- Output valid JSON only \(one top\-level object\), UTF\-8 strings\.*§3 Output JSON schema\.*\{"tau2": \{"system\_suffix": "<string appended to agent system; prefer non\-empty when you have a concrete improvement\>"\}\}*§4 Per\-task scores\.*\[For each taski∈ℬi\\in\\mathcal\{B\}, hybrid\-only scores on productivity\.\]*§5 Current policy snapshots\.*\[Theτ2\\tau^\{2\}agent system prompt as rebuilt by the eval pipeline\.\]*§6 Reference trajectories\.*\[For each task, hybrid rollout excerpts showing every \`ask\_question\` and \`generate\_ui\` call with the user\-simulator response\.\]*§7 Final instruction\.*Emit one JSON object as in section 3 \(JSON schema\)\. No markdown outside that JSON\.

### G\.4WebArena

WebArena Reflect PromptSystem prompt\.You are a senior prompt engineer\. The downstream agent may use ask\_question \(plain text\) or generate\_ui \(HTML\) to obtain user information\. You must respond with one JSON object only \(no markdown fences\), matching the schema in the user message\. Target environment: WebArena / browser\-style tasks\. For\_genui\`; for \`env\_family=webarena\` you emit \`webarena\.system\_suffix\` only \(see schema\)\.*§1 Optimization goals\.*Your objective is better end\-task performance by editing \`webarena\.system\_suffix\` only \(plain text appended each round to the WebArena agent’s rebuilt system in GenUI hybrid mode\)\. \- Teach when to use ask\_question vs generate\_ui after inspecting the page, and how to keep clarifications task\-grounded\. \- Suffixes accumulate across rounds; prefer small, non\-contradictory additions\. \- Do not try to replace the full browser tool XML block or inject OpenHands repo tools via JSON\. \- WebArena: prefer choices grounded in visible page/task state; never leak evaluation labels\.*§2 Modification requirements\.*\- Improve behavior using the score table \(this round’s eval slice\) and the reference trajectories in section 6\. \- Output \`webarena\.system\_suffix\` only: appended to the rebuilt system\. Omit \`webarena\.user\_suffix\` unless you have a rare cross\-task user\-channel policy \(discouraged\)\. \- Prefer short suffixes \(policy, tone, ask vs UI\)\. \- Do not include: gold answers, hidden test labels, or dataset leakage\. \- Output valid JSON only \(one top\-level object\), UTF\-8 strings\.*§3 Output JSON schema\.*\{"webarena": \{"system\_suffix": "<string appended to rebuilt agent system\>"\}\}*§4 Per\-task scores\.*\[For each taski∈ℬi\\in\\mathcal\{B\}, hybrid\-only scores on productivity\.\]*§5 Current policy snapshots\.*\[The WebArena agent system prompt as rebuilt by the eval pipeline\.\]*§6 Reference trajectories\.*\[For each task, hybrid rollout excerpts showing every \`ask\_question\` and \`generate\_ui\` call with the user\-simulator response\.\]*§7 Final instruction\.*Emit one JSON object as in section 3 \(JSON schema\)\. No markdown outside that JSON\.

## Appendix HPersona\-Level Case Studies

We present persona\-level analyses that illustrate how communication mode effects are moderated by user persona\. All values are productivity / proactivity / personalization; best productivity per row inbold blue\.

### H\.1Persona Determines the Best Mode

Table 7:τ2\\tau^\{2\}\-bench DeepSeek\-V3\.2 \+ GPT\-4o: productivity / proactivity / personalization by persona\. The best mode varies with persona—text for amateur, UI/hybrid for answer\_more, text/hybrid for do\_selection, UI for one\_question\.Table[7](https://arxiv.org/html/2606.14314#A8.T7)breaks downτ2\\tau^\{2\}\-bench DeepSeek\-V3\.2 \+ GPT\-4o by persona\. The optimal communication mode differs for every persona: amateur users achieve the highest productivity withMtextM\_\{\\text\{text\}\}\(\.260\), answer\_more with eitherMuiM\_\{\\text\{ui\}\}orMhybridM\_\{\\text\{hybrid\}\}\(\.320\), do\_selection withMtextM\_\{\\text\{text\}\}orMhybridM\_\{\\text\{hybrid\}\}\(\.380\), and one\_question withMuiM\_\{\\text\{ui\}\}\(\.300\)\. No single channel dominates across personas, directly validating the paper’s core claim that optimal communication policy is conditional on*who*the user is\.

### H\.2When Naive Hybrid Access Backfires

Table 8:TravelGym GPT\-5\-mini \+ Qwen3\-VL\-32B: productivity / proactivity / personalization by persona\.MhybridM\_\{\\text\{hybrid\}\}trails at least one single\-channel baseline for amateur, answer\_more, and one\_question\.Table[8](https://arxiv.org/html/2606.14314#A8.T8)breaks down TravelGym GPT\-5\-mini \+ Qwen3\-VL\-32B by persona\.MhybridM\_\{\\text\{hybrid\}\}underperforms at least one single\-channel baseline in three of four personas, with the amateur persona showing the steepest drop \(MuiM\_\{\\text\{ui\}\}1\.012→\\rightarrowMhybridM\_\{\\text\{hybrid\}\}\.780,−23%\-23\\%\)\. When the agent lacks the judgment to choose channels wisely, having both channels is worse than having one, directly motivating the need for CPE\.

### H\.3Cross\-Benchmark Persona–Mode Alignment

The two analyses above illustrate persona\-mode interactions in specific settings\. Here we compare across all 14 agent–user experiment configurations to identify systematic persona–benchmark–mode alignment patterns\.

Case 1: Amateur users\.MtextM\_\{\\text\{text\}\}is the safest choice\.

Table 9:Amateur persona across benchmarks \(productivity / proactivity / personalization, means\)\. Best per column in each row:productivity\(blue\),proactivity\(orange\),personalization\(teal\)\. Bottom row: number of model combos \(tau2/WA/TG/SWE\) where that mode was best or tied for best in productivity\.Table[9](https://arxiv.org/html/2606.14314#A8.T9)breaks down amateur\-persona productivity across the four benchmarks\.MtextM\_\{\\text\{text\}\}is the best or tied\-for\-best mode in 4 of 4τ2\\tau^\{2\}\-bench combos, 2 of 2 WebArena combos, and 2 of 4 TravelGym combos\. Only SWE\-bench deviates, where all modes score near zero andMhybridM\_\{\\text\{hybrid\}\}holds a marginal edge \(\.028 vs\. \.019\)\. The explanation is consistent with the persona design: amateur users refuse complex questions and reward only very simple inquiries\.MuiM\_\{\\text\{ui\}\}andMhybridM\_\{\\text\{hybrid\}\}encourage richer interaction, but against an amateur persona, the extra communication turns yield mostly “I don’t know” responses and accumulate cost without recovering useful information\.MtextM\_\{\\text\{text\}\}, by asking fewer and simpler questions, avoids wasted turns\.

Case 2: one\_question persona\.MuiM\_\{\\text\{ui\}\}excels at personalization but sacrifices productivity\.

Table 10:one\_question persona across benchmarks \(productivity / proactivity / personalization\)\. Best per column in each row:productivity\(blue\),proactivity\(orange\),personalization\(teal\)\.Table[10](https://arxiv.org/html/2606.14314#A8.T10)shows a consistent trade\-off for one\_question users:MuiM\_\{\\text\{ui\}\}achieves the highest personalization in every benchmark \(by margins of 0\.16–0\.76 overMhybridM\_\{\\text\{hybrid\}\}\), but trails in productivity in three of four benchmarks\. The persona’s interaction preference \(answer exactly one question per turn\) maps naturally onto UI\-based communication, where the agent can present a single focused selection widget\. However, the structured UI interaction is slower than asking one targeted text question, leading to lower throughput\. This persona illustrates the*personalization–productivity tension*: the mode that best satisfies the user’s interaction style is not always the mode that completes the task fastest\.

## Appendix ICPE\-Optimized Policies

Below are the best communication policy patches discovered by CPE for each benchmark and model pairing\. For SWE\-bench, the patch rewrites the full hybrid system prompt and examples; forτ2\\tau^\{2\}\-bench, TravelGym, and WebArena, it appends a suffix to the agent’s system prompt\. We reproduce the suffix and example portions\.

### I\.1SWE\-bench

SWE\-bench: DeepSeek\-V3\.2 \+ Qwen3\-VL\-32B \(hybrid example\)——————— EXAMPLES \(hybrid: choose text or UI per turn\) ———————For each clarification, pickonetool:ask\_question\(user reads plain text in chat\) orgenerate\_ui\(user sees a rendered HTML page\)\. Useask\_questionwhen a short sentence is enough; usegenerate\_uiwhen layout, choices, or structured fields help\.— Example A — text \(ask\_question\) after inspecting the repo —USER: The app sometimes shows a random error page\. Can you figure out what’s wrong?ASSISTANT: I’ll list the top\-level layout, then ask one focused text question if details are missing\.execute\_bash: ls \-1 /testbedUSER: api / frontend / scripts / testsASSISTANT:ask\_question: What is the exact error message or stack trace when the error page appears?— Example B — visual \(generate\_ui\) when structured input helps —ASSISTANT \(needs options or a clearer form\):generate\_ui:progress\_summary: I’ve mapped the repo\. I need one concrete detail to reproduce the issue\.html\_code:A minimal HTML form with a<select\>for timing and a<textarea\>for additional notes, styled with sans\-serif font\.

### I\.2τ2\\tau^\{2\}\-bench

τ2\\tau^\{2\}\-bench: DeepSeek\-V3\.2 \+ Qwen3\-VL\-32BWhen a user provides their user ID, immediately use it to look up their account and reservations without asking for it again\. For users who prefer one question at a time, ensure eachask\_questioncontains only a single, clear query\. For users who can only answer selection questions, always phrase questions with explicit lettered options \(A, B, C, etc\.\) and useask\_question\. Usegenerate\_uionly when presenting complex visual choices or forms, not for simple selection or yes/no questions\. For amateur users, avoid asking for specific dates, times, or policy details; ask only general, non\-technical questions\. If a user’s constraints \(e\.g\., ‘do not transfer’\) are known, respect them and avoid suggesting options that violate them\.

τ2\\tau^\{2\}\-bench: DeepSeek\-V3\.2 \+ GPT\-4oClarification policy \(append\-only\): Preferask\_questionfor single short factual checks \(reservation/user/payment ID, yes/no, a single\-letter or A/B/C choice\) and whenever the user explicitly prefers one\-question\-at\-a\-time\. Usegenerate\_uiwhen you need structured input, when presenting more than three detailed options, or when a visual layout/form will make selection or entry unambiguous — always include a 1–3 sentence progress\_summary and exactly one question object in the UI\. Ask only the minimum data required to take the next action and verify identity once at the start of the session \(don’t repeatedly re\-request the same credential unless the session expired\)\. If anask\_questionis answered “I don’t know” twice or the UI times out, switch strategy: either present a shortgenerate\_ui\(single question\) or rephrase as a concise A/B/Cask\_question\(≤\\leq3 choices\)\. Never repeat an identical question; if clarification fails, offer an alternate distinguishing question or a short selection UI\. One function call per turn; keep clarifications focused, short, and directly tied to the next system action\.

τ2\\tau^\{2\}\-bench: GPT\-5\-mini \+ GPT\-4oAllow at most 2 consecutiveask\_questionprobes for the same verification goal; if the user replies “I don’t know” twice in a row about that item, stop repeating low\-yield probes and immediately present a single escalation step \(one shortask\_question\) offering three clear choices \(e\.g\. A\) Transfer to human, B\) Send OTP to contact on file, C\) Search by passenger name\+date\)\. Useask\_questiononly for single short answers or a single\-letter/number selection and honor pref\_one\_question/pref\_do\_selection by asking one concise selection; usegenerate\_uionly when you need the user to enter 3\+ structured fields, sensitive exact values, or the user explicitly agrees to complete a form\. When callinggenerate\_uiinclude a 1–2 sentence progress\_summary, render exactly one question in the questions JSON that matches the HTML, and include only the fields strictly necessary\.

τ2\\tau^\{2\}\-bench: GPT\-5\-mini \+ Qwen3\-VL\-32BAttemptget\_reservation\_detailsorget\_user\_detailsbefore asking anything; if all required fields are present, call the action immediately\. Useask\_questiononly for exactly one missing fact or a single\-choice decision \(prefer A/B/C\); keep the prompt concise \(≤\\leq20 words\), wrap it as<parameter=query\>\.\.\.</parameter\>, and ask exactly one question per call\. Limit consecutiveask\_questioncalls to 2; if the user replies “I don’t know” or times out twice, stop asking and calltransfer\_to\_human\_agentswith a one\-sentence summary\. Usegenerate\_uionly when you must collect two\-or\-more structured fields and the user is likely able to complete a small form; include a 1–3 sentence progress\_summary, a complete HTML5 document, and a questions parameter that exactly matches the page question\.

### I\.3TravelGym

TravelGym: DeepSeek\-V3\.2 \+ GPT\-4oUI/question policy: When the user is vague, prefer one focused database search \(interact\_with\_env choice=search\) to gather candidates before asking clarifying questions\. Useinteract\_with\_envwithchoice=actionandcontent\_kind=textfor a single, concise clarification \(ask one thing only\)\. Usecontent\_kind=htmlonly to present a small selection UI \(roughly 2–6 labeled options\) — the HTML must be a complete valid HTML5 document, include<meta charset="utf\-8"/\>, and present exactly one clear travel\-related question\. After the user answers, if you can confidently recommend, callinteract\_with\_envwithchoice=answerand return exactly one option ID\. Conserve interaction rounds, keep questions short and targeted, match the user’s stated amateur/pro tone, and avoid unnecessary proactive suggestions\.

TravelGym: GPT\-5\-mini \+ Qwen3\-VL\-32BHybrid interaction rules: Useinteract\_with\_env choice=actionwhen you need to ask the user something\. For short clarifications \(dates, budget, party size, location, one yes/no or single preference\) usecontent\_kind=textand ask exactly one focused question per action\. Usecontent\_kind=htmlonly to present a structured selection \(at most 3 clearly labeled options\) — supply a complete valid HTML5 document with<meta charset="utf\-8"/\>and make the page ask exactly one travel\-related question\. Usechoice=searchto query the database once you have the necessary constraints\. Usechoice=answeronly to return a single option ID after preferences are confirmed\. Avoid multi\-part compound questions; ask sequentially to gather required details efficiently\.

### I\.4WebArena

WebArena: DeepSeek\-V3\.2 \+ GPT\-4oClarification Strategy\.Before usingask\_questionorgenerate\_ui, first explore the page thoroughly using goto, click, scroll, and other navigation actions\. Useask\_questionfor simple, text\-based clarifications about the task \(e\.g\., ‘Which category should I search in?’\)\. Usegenerate\_uionly when you need to present multiple options visually or gather structured input that benefits from a form\. Keep questions concise and task\-focused; avoid asking for information already visible on the page or implied by the task description\. If the task is ambiguous after exploration, ask one clarifying question to resolve the ambiguity, then proceed\.Action Priority\.Prioritize direct navigation and interaction over asking questions\. Only useask\_questionorgenerate\_uiwhen you’ve exhausted visible page information and cannot proceed without clarification\. When you do ask, make your question specific and actionable — reference what you’ve already tried or observed on the page\.

WebArena: GPT\-5\-mini \+ GPT\-4oBecause productivity is the primary objective, prefer completing the task by interacting with visible page controls \(click/fill/select/scroll\) and by exploring the AXTree/URL before asking the user anything\. Useask\_questiononly when a single, specific missing fact truly blocks progress and cannot be discovered by on\-page actions; phrase it as one short sentence \(≤\\leq20 words\) that names the missing fact and why it is needed, and cite a page element or AXTree id if applicable\. Usegenerate\_uionly when the user must choose among multiple explicit options or complete a small structured form that cannot be captured in one short question; include a 1–2 sentence progress\_summary tied to the current AXTree/URL, keep the html minimal, and include exactly one question\. Never ask about information obtainable by simple exploration; avoid repeated clarification turns \(one clarifying question per blocking ambiguity\)\.

### I\.5Analysis of Optimized Policies

Several patterns recur across the CPE\-discovered policies, reflecting convergent strategies for effective channel selection:

1. 1\.Environment\-first, then clarify\.SWE\-bench and WebArena policies instruct the agent to explore the repository or page before asking questions\. TravelGym policies recommend a database search before questioning\. This avoids wasting communication turns on information the agent can discover autonomously\.
2. 2\.Text for simple, UI for structured\.All policies converge on a consistent channel heuristic:ask\_questionfor single factual queries \(IDs, yes/no, A/B/C choices\),generate\_uifor structured input \(3\+ fields, visual layouts, option comparisons\)\. This aligns with the low/high\-constraint design of the two primitives\.
3. 3\.Fail\-fast escalation\.τ2\\tau^\{2\}\-bench policies independently discover a “2\-strikes” rule: after two consecutive “I don’t know” or timeout responses, stop probing and escalate \(transfer to human, offer structured alternatives\)\. This prevents unproductive clarification loops\.
4. 4\.Persona\-aware adaptation\.Theτ2\\tau^\{2\}\-bench DeepSeek\-V3\.2 \+ Qwen3\-VL\-32B policy explicitly adapts to all four personas \(amateur: no professional questions; do\_selection: A/B/C only; one\_question: single query; answer\_more: forms allowed\)\. Other policies encode this implicitly through question constraints\.

These patterns emerge purely from productivity\-driven prompt optimization without human specification, confirming that CPE discovers effective and interpretable communication heuristics\.

## Appendix JCPE Configuration

All CPE runs share the same optimization setup, differing only in the candidate dataset and model pairing\. We optimize the communication policy prompt \(system prompt, examples, or environment\-specific suffix\); agent and user\-simulator weights are frozen\. Table[11](https://arxiv.org/html/2606.14314#A10.T11)lists the key parameters\.KKis the number of episodes evaluated per round; the reflect LLM receives up to 5 trajectory excerpts \(onlyask\_question/generate\_uiturns\) from theKKrollouts\.RRis the maximum number of rounds; in practice, the best policy was found within 25 rounds across all runs\.

Table 11:CPE configuration across benchmarks\.
## Appendix KLLM Usage

LLMs were used only for light editing \(grammar and clarity of writing\) and minor code completion \(data processing scripts\)\. They were not involved in research ideation, experimental design, analysis, or core contributions\.

Similar Articles

CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

arXiv cs.CL

CoEvolve proposes an agent-data mutual evolution framework for training LLM agents through closed-loop, interaction-driven learning that adapts both the agent and its training data distribution. The method extracts feedback signals from rollout trajectories to guide LLM-based task synthesis, demonstrating significant improvements (15-19% absolute gains) across multiple Qwen models on AppWorld and BFCL benchmarks.

PolicyBank: Evolving Policy Understanding for LLM Agents

arXiv cs.CL

PolicyBank proposes a memory mechanism that enables LLM agents to autonomously refine their understanding of organizational policies through iterative interaction and corrective feedback, closing specification gaps that cause systematic behavioral divergence from true requirements. The work introduces a systematic testbed and demonstrates PolicyBank can close up to 82% of policy-gap alignment failures, significantly outperforming existing memory mechanisms.

Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

arXiv cs.CL

This paper investigates why LLM agents suffer from progressive capability collapse under multi-iteration experience internalization and proposes a robust recipe addressing experience granularity, injection patterns, and training regime. Key findings include that principle-level experience, step-wise injection, and off-policy context-distillation yield more stable and sustainable continual learning.