MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

arXiv cs.AI 06/17/26, 04:00 AM Papers
benchmark map-agents llm-agents implicit-decision-factors user-satisfaction evaluation
Summary
MapSatisfyBench is a benchmark for evaluating LLM-based map agents on their ability to recover implicit decision factors from underspecified user queries, shifting evaluation from task completion to satisfaction-aware spatial decision making.
arXiv:2606.17453v1 Announce Type: new
Cached at: 06/17/26, 05:36 AM
# MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors
Source: [https://arxiv.org/html/2606.17453](https://arxiv.org/html/2606.17453)
Lubin Bai2†, Mengyu Cao1†, Sixue Wang1, Zhongwei Wan1, Yue Pan1, Jiale Hou1, Xiang Li1\*, Xiuyuan Zhang2\*

###### Abstract

Large language model agents are increasingly integrated into map services. Since map services are embedded in everyday-life scenarios rather than professional task settings, users often express their needs informally, resulting in underspecified queries with many unspoken needs, namely, implicit decision factors that are critical for user satisfaction. Although clarification is an effective way to mitigate this issue, it increases user burden in daily interaction, and a capable agent should first proactively recover such factors from available information sources. However, evaluating this ability is challenging. First, it must be determined which implicit decision factors are suitable for evaluation: a factor is evaluable only if it affects user acceptance and can be recovered from information available to the agent before it responds. Second, user satisfaction cannot be reliably represented by a single reference answer, so a benchmark must convert satisfaction-relevant factors into objective and quantifiable evaluation targets. To address these challenges, we propose a restore-identify-filter framework that reconstructs complete user needs from behavior-chain evidence, identifies implicit decision factors, and retains only those supported by pre-query evidence. Building on this methodology, we construct MapSatisfyBench from large-scale, real-world anonymized user data and annotate ground truth along five dimensions, enabling full-chain evaluation of satisfaction-aware map agents. Experiments show that current agents generally perform well on explicit task completion, but remain limited in satisfying implicit decision factors and in proactively acquiring the evidence needed for satisfaction-aware decisions. These findings establish MapSatisfyBench as a benchmark for shifting map-agent evaluation from task completion toward satisfaction-aware spatial decision making.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.17453v1/figures/intro.png)

Figure 1: Motivation of MapSatisfyBench. Map-service queries often define multiple feasible responses, and satisfaction depends on whether the agent recovers behavior-supported implicit decision factors.

Large language model (LLM) agents are increasingly deployed as map-service assistants that translate natural language requests into executable actions, such as location search, route navigation, and trip planning (Xie et al. [2024](https://arxiv.org/html/2606.17453#bib.bib43); AMAP AI Agent Team et al. [2025](https://arxiv.org/html/2606.17453#bib.bib21)). By integrating natural language interaction, contextual signals, and tool-based execution, these agents provide more convenient and flexible support for everyday spatial decisions. However, since map services are embedded in everyday-life scenarios rather than professional task settings, real-world user queries are rarely fully specified (Kamvar and Baluja [2006](https://arxiv.org/html/2606.17453#bib.bib22); Church and Smyth [2009](https://arxiv.org/html/2606.17453#bib.bib10)). Users often issue short and incomplete queries, while still expecting the agent to understand their needs even when those needs are not explicitly stated. This challenge is further amplified by the open-ended nature of many map-service tasks. Unlike fact-seeking questions that usually have a unique correct answer, many daily map queries are inherently multi-solution, i.e., their surface text defines a space of possible answers rather than a single ground-truth response (Vansteenwegen et al. [2011](https://arxiv.org/html/2606.17453#bib.bib39); Purves et al. [2018](https://arxiv.org/html/2606.17453#bib.bib33); Delling et al. [2017](https://arxiv.org/html/2606.17453#bib.bib13)). Consequently, the key mission for a smart map agent is not merely to produce a feasible response, but to select, among multiple feasible responses, the one most likely to satisfy the user.

Such satisfaction is closely tied to implicit decision factors that are not expressed in the query but are critical to whether the resulting decision is acceptable to the user (Pu et al. [2012](https://arxiv.org/html/2606.17453#bib.bib32); Adomavicius and Tuzhilin [2005](https://arxiv.org/html/2606.17453#bib.bib2)). For example, when a user asks “How can I get there?” after searching for a hospital near a railway station, both driving and public transit may be semantically valid responses; however, if the user has just arrived by train and has no car available, a transit-oriented route is more likely to be acceptable. Although an agent could resolve such uncertainty by asking the user for clarification, repeatedly doing so increases interaction burden and reduces usability (Zou et al. [2023](https://arxiv.org/html/2606.17453#bib.bib49); Zamani et al. [2020](https://arxiv.org/html/2606.17453#bib.bib44)). In many cases, the missing factors can be recovered from available information sources, such as user profiles and interaction history (Church and Smyth [2009](https://arxiv.org/html/2606.17453#bib.bib10); Villegas et al. [2018](https://arxiv.org/html/2606.17453#bib.bib40)). A capable map agent should therefore proactively exploit these information sources and reserve clarification questions for cases that cannot be reliably resolved otherwise (Maes [1994](https://arxiv.org/html/2606.17453#bib.bib27); Horvitz [1999](https://arxiv.org/html/2606.17453#bib.bib20)). This motivates a benchmark for assessing whether map agents can identify information gaps, proactively acquire relevant evidence from available sources, and incorporate the recovered implicit decision factors into satisfaction-aware responses.

Recent benchmarks have substantially advanced the evaluation of LLM-based map agents from multiple perspectives, including planning, tool use, and information synthesis (Xie et al. [2024](https://arxiv.org/html/2606.17453#bib.bib43); Chaudhuri et al. [2025](https://arxiv.org/html/2606.17453#bib.bib7); Cheng et al. [2025](https://arxiv.org/html/2606.17453#bib.bib9); Song et al. [2026](https://arxiv.org/html/2606.17453#bib.bib37); He et al. [2025a](https://arxiv.org/html/2606.17453#bib.bib18); LBS-IntentBench Contributors [2026](https://arxiv.org/html/2606.17453#bib.bib24)). These efforts provide important signals for improving map agents, but they do not directly measure whether an agent can make decisions that are likely to satisfy the user in underspecified map-service interactions. Building such a benchmark is meaningful but challenging in two respects. First, implicit decision factors are complex: they are the key conditions that narrow the feasible solution space and make one response more acceptable than other semantically valid alternatives. On the one hand, not every factor revealed by user behavior is suitable for evaluation, because some cannot be recovered by the agent and therefore cannot be fairly evaluated; a benchmark should retain only those factors that an agent could reasonably recover from information available before it responds (such as historical behavior and the spatio-temporal environment). On the other hand, these factors differ in character: some are hard constraints whose violation has a stronger negative impact on response acceptability, whereas others are soft preferences that only modestly shift the relative acceptability of feasible options. Second, ground truth construction is non-trivial because satisfaction cannot be reduced to a single reference answer or a direct satisfied/unsatisfied label. A satisfaction-oriented benchmark should instead convert the factors that affect acceptance into quantifiable evaluation references, including whether each implicit factor is satisfied and how strongly it should influence the final score. This requires the joint design of ground-truth annotations and evaluation metrics, so that the benchmark can comprehensively and accurately assess whether an agent produces responses that are likely to satisfy the user.

To address these challenges, we propose a restore\-identify\-filter framework based on behavior\-chain evidence and construct MapSatisfyBench\. Behavior\-chain evidence connects the pre\-query information, the spatio\-temporal environment, and the user’s subsequent actions, providing an objective basis for reconstructing what the user was trying to accomplish\. Based on this evidence, the restoration step recovers the complete need behind the interaction; the identification step compares this reconstructed need with the surface query to expose unspoken decision factors; and the filtering step retains only the factors supported by pre\-response evidence and therefore evaluable for an agent\. The retained factors are further annotated with their constraint type and evidence\-supported weight, allowing MapSatisfyBench to quantify not only whether an agent satisfies the explicit task, but also whether it satisfies the implicit factors that affect accepted\-response probability\. On this basis, MapSatisfyBench provides a behavior\-grounded benchmark for satisfaction\-aware decision making in map services\. Unlike task\-completion\-oriented benchmarks that primarily assess whether an agent can plan, retrieve, or execute a valid map\-service task, MapSatisfyBench evaluates whether an agent can align open\-ended map\-service decisions with behavior\-supported factors that determine whether a user is likely to accept the response\.

MapSatisfyBench aims to shift the evaluation of map agents from “whether the task is finished” to “whether the decision satisfies the user.” In our evaluation of 12 LLM-based agents, current systems generally achieve high scores on explicit intent completion and factual faithfulness, yet show clear weaknesses in implicit-need satisfaction and tool selection. These results indicate that map agents can often understand the stated request, but still struggle to identify the behavior-supported decision factors that make a response feel acceptable and useful to the user. Our contributions are threefold. First, we propose a methodology for converting subjective satisfaction into behavior-supported implicit decision factors, enabling objective evaluation without direct satisfaction labels. Second, we construct MapSatisfyBench, a satisfaction-aware benchmark covering diverse realistic map-service scenarios, together with a full-chain evaluation protocol that jointly assesses explicit task completion, implicit-need satisfaction, clarification efficiency, tool selection, and factual reliability. Third, our experiments reveal a persistent gap between surface task completion and satisfaction-aware decision making, offering practical diagnostic insights for developing map agents that can better support real user decisions.

## 2 Related Work

##### Agent Benchmarks for Map Services\.

The increasing deployment of LLM agents in map services has motivated a growing body of benchmarks for evaluating agent capabilities under geographic constraints and real-world service requirements. Early efforts use travel planning as a structured proxy for spatial decision making. TravelPlanner (Xie et al. [2024](https://arxiv.org/html/2606.17453#bib.bib43)) introduced a first realistic sandbox for evaluating multi-constraint itinerary generation with external tools and travel records, while subsequent benchmarks such as TripCraft (Chaudhuri et al. [2025](https://arxiv.org/html/2606.17453#bib.bib7)), TravelBench (Cheng et al. [2025](https://arxiv.org/html/2606.17453#bib.bib9)), and VitaBench (He et al. [2025b](https://arxiv.org/html/2606.17453#bib.bib19)) extend evaluation toward more realistic service-oriented settings, such as unsolvable requests and cross-scenario life-service tasks. Collectively, these benchmarks mark a shift from static itinerary generation to interactive, tool-grounded service completion. More recent works evaluate agents in direct map-service scenarios. MobilityBench (Song et al. [2026](https://arxiv.org/html/2606.17453#bib.bib37)) evaluates route-planning agents on anonymized Amap queries through a deterministic API-replay sandbox, whereas LocalSearchBench (He et al. [2025a](https://arxiv.org/html/2606.17453#bib.bib18)) targets local-service search over large-scale merchant databases and real user requests requiring multi-hop reasoning. LBS-IntentBench (LBS-IntentBench Contributors [2026](https://arxiv.org/html/2606.17453#bib.bib24)) moves map-agent evaluation beyond explicit instruction execution by focusing on implicit intent inference, which is partially aligned with our motivation. However, it primarily evaluates intent recovery. In contrast, MapSatisfyBench focuses on whether agents can make satisfactory and executable decisions when ambiguity remains, by reconstructing behavior-supported implicit decision factors from user behavior chains and evaluating satisfaction-aware responses.

##### Satisfaction\-aware Agent Benchmarks\.

User satisfaction has been studied from early dialogue-system evaluation to recent user-centric agent benchmarks. The classical PARADISE framework (Walker et al. [1997](https://arxiv.org/html/2606.17453#bib.bib41)) relates satisfaction to task success and interaction cost, and later task-oriented dialogue work, including USS (Sun et al. [2021](https://arxiv.org/html/2606.17453#bib.bib38)), SG-USM (Feng et al. [2023](https://arxiv.org/html/2606.17453#bib.bib15)), SPUR (Lin et al. [2024](https://arxiv.org/html/2606.17453#bib.bib26)), and CAUSE (Abolghasemi et al. [2024](https://arxiv.org/html/2606.17453#bib.bib1)), further models satisfaction from dialogue trajectories, task-attribute fulfillment, interpretable rubrics, or counterfactual dissatisfaction cases. In general LLM evaluation, satisfaction is often approximated through human preference or LLM-as-judge protocols, as in InstructGPT (Ouyang et al. [2022](https://arxiv.org/html/2606.17453#bib.bib31)), WebGPT (Nakano et al. [2021](https://arxiv.org/html/2606.17453#bib.bib28)), MT-Bench and Chatbot Arena (Zheng et al. [2023](https://arxiv.org/html/2606.17453#bib.bib47)), AlpacaFarm (Dubois et al. [2023](https://arxiv.org/html/2606.17453#bib.bib14)), Arena-Hard (Li et al. [2024](https://arxiv.org/html/2606.17453#bib.bib25)), and the user-reported-scenario benchmark URS (Wang et al. [2024](https://arxiv.org/html/2606.17453#bib.bib8)). More recent work moves toward interaction and personalization: UserBench (Qian et al. [2025](https://arxiv.org/html/2606.17453#bib.bib34)) evaluates preference discovery under underspecified goals, AURA (Kim et al. [2025](https://arxiv.org/html/2606.17453#bib.bib23)) analyzes user satisfaction across interactive planning stages, CollabLLM (Wu et al. [2025](https://arxiv.org/html/2606.17453#bib.bib42)) optimizes multi-turn collaboration toward long-term user benefit, and personalization benchmarks such as PersonalLLM (Zollo et al. [2025](https://arxiv.org/html/2606.17453#bib.bib48)), PrefEval (Zhao et al. [2025a](https://arxiv.org/html/2606.17453#bib.bib46)), and PersonaLens (Zhao et al. [2025b](https://arxiv.org/html/2606.17453#bib.bib45)) evaluate adaptation to individual preferences or user profiles. These studies show a clear shift from exact-match correctness to user-centered evaluation, but most treat satisfaction as a response-level preference, predicted label, or general interaction score. MapSatisfyBench instead operationalizes satisfaction in map-service decisions through behavior-chain evidence: it reconstructs recoverable implicit decision factors, distinguishes hard constraints from soft preferences, and evaluates whether a tool-grounded spatial response increases accepted-response probability.

## 3 Method

![Refer to caption](https://arxiv.org/html/2606.17453v1/figures/method.png)

Figure 2: Overview of MapSatisfyBench. Benchmark construction follows the restore-identify-filter principle, the deterministic replay sandbox simulates tool-grounded interaction, and the evaluation metrics cover diverse aspects.

### 3.1 Benchmark Construction

#### Construction Principle

This section introduces the core logic of benchmark construction, which serves as the foundation for sample collection and ground truth construction\. For each user instance, we formalize the interaction as:

$$x = (q, g, h^{-}, r, h^{+}),$$

where $q$ is the root query, $g$ denotes the spatio-temporal environment, $h^{-} = (c, p)$ denotes the pre-query information available before the agent response, including the interaction context $c$ and the historical profile $p$, $r$ is the agent response, and $h^{+}$ denotes the post-query user behavior after the agent responds. At inference time, the agent receives the query $q$ and produces a response $r$ that narrows the feasible solution space by satisfying implicit needs inferred from the visible information $g$ and $h^{-}$. The post-query behavior $h^{+}$ is never exposed to the agent.
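
The instance tuple can be mirrored in a small data structure. The following is a minimal sketch, not the benchmark's released schema: the class, field, and function names (`PreQueryInfo`, `Instance`, `visible_view`) and the field encodings are hypothetical. Only the split between agent-visible inputs ($q$, $g$, $h^{-}$) and hidden post-query behavior $h^{+}$ comes from the text.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class PreQueryInfo:
    """h^- = (c, p): information available before the agent responds."""
    context: List[str]        # c: interaction context (hypothetical encoding)
    profile: Dict[str, Any]   # p: historical profile (hypothetical encoding)


@dataclass(frozen=True)
class Instance:
    """x = (q, g, h^-, r, h^+)."""
    query: str                      # q: root query
    environment: Dict[str, Any]     # g: spatio-temporal environment
    pre_query: PreQueryInfo         # h^-: visible to the agent
    post_query: List[str]           # h^+: used for construction only
    response: Optional[str] = None  # r: filled in by the agent under test


def visible_view(x: Instance) -> Dict[str, Any]:
    """Return only what the agent may see at inference time: q, g, and h^-."""
    return {
        "query": x.query,
        "environment": x.environment,
        "pre_query": x.pre_query,
    }
```

Building the agent input through a single function such as `visible_view` keeps $h^{+}$ out of the prompt by construction rather than by convention.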

MapSatisfyBench is constructed based on the principle of linking implicit needs with user satisfaction\. We aim to identify all recoverable implicit constraints or preferences that may affect the user’s final decision\. These recovered needs must then be converted into structured ground truth and metrics for quantitative evaluation\. To this end, we design a behavior\-chain\-based benchmark construction strategy with three steps: restore, identify, and filter\.

The restoration step aims to recover the complete need behind a query. Specifically, we combine the pre-query information $h^{-}$, the spatio-temporal environment $g$, and the post-query behavior $h^{+}$ to recover the user’s behavior chain. Here, $h^{-}$ captures the contextual and historical signals available before the agent responds, $g$ anchors the request in a concrete time and location, and $h^{+}$ reveals the user’s subsequent behavioral anchor and how the interaction actually unfolded. Together, these signals allow us to reconstruct the complete need behind the current interaction.

The identification step aims to expose the gap between what the user explicitly states and what a satisfactory response must satisfy. We compare the surface query $q$ with the reconstructed complete need to identify implicit needs that are not explicitly stated but can narrow the feasible solution space, such as transport feasibility, temporal constraints, or preference conditions.

Not all identified needs are suitable for evaluation, because some cannot be inferred from information visible to the agent. Therefore, in the filtering step, we keep only the implicit needs whose evidence can be found in $g$ and $h^{-}$, which are the information sources available at decision time, and filter out factors that cannot be supported before the agent responds. The retained needs form an evaluable set of implicit decision factors, each of which is further labeled as a hard constraint or a soft preference. Through this process, MapSatisfyBench turns behavior-chain evidence into structured ground truth that can assess whether an agent’s response satisfies both explicit task requirements and the implicit factors that affect user satisfaction. More details are provided in Appendix A.1.
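
The identify and filter steps can be illustrated with set operations. This is a simplified sketch under strong assumptions: needs are represented as plain strings, evidence sources are tagged by hand, and the names `Factor`, `identify`, `filter_evaluable`, and the source tags `"g"`, `"h_minus"`, `"h_plus"` are all hypothetical. The actual pipeline described in Appendix A.1 is not reproduced here.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set

# Sources the agent can see at decision time: g and h^-.
VISIBLE_SOURCES: FrozenSet[str] = frozenset({"g", "h_minus"})


@dataclass(frozen=True)
class Factor:
    name: str
    evidence_sources: FrozenSet[str]  # where supporting evidence was found


def identify(surface_query_needs: Set[str], complete_need: Set[str]) -> Set[str]:
    """Implicit needs are those in the restored need but absent from the query."""
    return complete_need - surface_query_needs


def filter_evaluable(factors: List[Factor]) -> List[Factor]:
    """Keep factors with at least one piece of evidence in g or h^-."""
    return [f for f in factors if f.evidence_sources & VISIBLE_SOURCES]


# Illustrative (made-up) example loosely following the hospital scenario.
surface = {"route to hospital"}
complete = {"route to hospital", "no private car", "arrive before closing"}
implicit = identify(surface, complete)

candidates = [
    Factor("no private car", frozenset({"h_minus"})),        # arrived by train
    Factor("arrive before closing", frozenset({"h_plus"})),  # seen only afterwards
]
evaluable = filter_evaluable(candidates)
```

In this toy case only "no private car" survives filtering, because the other factor is supported solely by post-query behavior that the agent never sees.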

#### Sample Collection

We collect benchmark instances by applying the restore-identify-filter principle to large-scale anonymized map-service logs. Given a candidate instance $x = (q, g, h^{-}, r, h^{+})$, we first restore the behavior chain around the root query and check whether the pre-query context, spatio-temporal environment, and post-query behavior together are sufficient to indicate a coherent map-service task. We then identify whether the surface query omits decision factors that would narrow the feasible solution space, which yields a candidate implicit-need set $\mathcal{Z}(x)$. Finally, we filter this set by retaining only factors that are supported by information available before the agent responds, i.e., evidence in $g$ and $h^{-}$. A candidate is selected only when it satisfies the explicit task constraints, has a valid behavioral anchor in $h^{+}$, and contains at least one retained implicit factor. This procedure keeps the benchmark grounded in naturally occurring user demands while ensuring that each instance contains implicit needs that are recoverable, evaluable, and relevant to user acceptance. The resulting sample collection covers diverse map-service scenarios and provides variation in implicit-need source, recovery difficulty, and constraint type, supporting both realistic coverage and discriminative evaluation of map agents.

#### Ground Truth Annotation

Ground truth annotation aims to convert the factors that affect user satisfaction into quantifiable evaluation references. For each selected instance, MapSatisfyBench does not define a single gold answer. Instead, it builds a structured decision reference

$$G(x) = \left(E(x), \mathcal{Z}_{\mathrm{eval}}(x), C(x), T(x), F(x)\right),$$

where $E(x)$ denotes explicit decision factors, $\mathcal{Z}_{\mathrm{eval}}(x)$ denotes evaluable implicit decision factors, $C(x)$ is the clarification policy, $T(x)$ is the expected tool-use trajectory, and $F(x)$ specifies factual response requirements. Separating explicit factors from implicit ones is necessary because a response may satisfy the literal request while still being unlikely to be accepted by the user. Explicit factors are extracted from the surface query to define the basic validity boundary of the task and provide the reference for evaluating whether the agent understands the stated request. Implicit factors, in contrast, are obtained from the restore-identify-filter procedure. They capture unspoken decision factors that are not stated in $q$, but are supported by pre-response evidence and may change the probability that the user accepts the response. Thus, explicit factors define what any feasible answer must satisfy, whereas implicit factors define what a satisfaction-aware answer should additionally consider.

For each retained implicit factor $z_i \in \mathcal{Z}_{\mathrm{eval}}(x)$, we further trace its evidence back to the information available at decision time, including the pre-query information $h^{-}$ and the spatio-temporal environment $g$. Based on this evidence, we annotate the ground truth with a scoring rubric, constraint type, and evidence-supported weight. Each implicit factor can be annotated as:

$$z_i = (\rho_i, \tau_i, s_i),$$

where $\rho_i$ is the rubric instruction for evaluating whether the response satisfies the factor, $\tau_i \in \{\mathrm{hard}, \mathrm{soft}\}$ indicates the constraint type, i.e., whether the factor is a necessary constraint or a graded preference, and $s_i$ is its evidence-supported weight (written $w_i$ in the weight and IISR formulas).

Hard constraints are factors whose violation has a relatively strong negative effect on response executability or acceptability, such as an unavailable transport mode or unresolved destination ambiguity. Soft preferences have a weaker, incremental effect on satisfaction and should therefore be weighted rather than treated as binary failures. Evidence-supported weight is a factor-level weight that reflects the expected influence of $z_i$ on the accepted-response probability, and it is computed from both historical preference strength and current-session support:

$$w_i = \mathrm{userpref}(z_i) \cdot \mathrm{CurrentNeed}(z_i).$$

Here, $\mathrm{userpref}(z_i)$ summarizes the support from the long-term and recent user profile, while $\mathrm{CurrentNeed}(z_i)$ measures whether this factor is active in the current interaction, considering relevance, continuity, directionality, and possible conflicts in the session. This design allows the benchmark to distinguish stable personal tendencies from temporary situational needs, and to assign higher weights to factors that are jointly supported by historical behavior and the current behavior chain. More details are provided in Appendix A.2.
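
A small numeric sketch shows how the product form behaves. It assumes, purely for illustration, that both component scores lie in $[0, 1]$; the text does not state their ranges or how they are computed (see Appendix A.2), and the function name `evidence_weight` is hypothetical.

```python
def evidence_weight(user_pref: float, current_need: float) -> float:
    """w_i = userpref(z_i) * CurrentNeed(z_i).

    Both inputs are ASSUMED to lie in [0, 1]; this range is an
    illustrative choice, not something specified in the paper.
    """
    for value in (user_pref, current_need):
        if not 0.0 <= value <= 1.0:
            raise ValueError("component scores are assumed to lie in [0, 1]")
    return user_pref * current_need


# A stable habit that is active in the current session keeps most of its weight.
jointly_supported = evidence_weight(0.75, 1.0)   # 0.75

# The same habit with weak support in the current session is down-weighted.
weak_session_support = evidence_weight(0.75, 0.25)  # 0.1875
```

Because the two terms are multiplied, a factor needs support from both the profile and the current session to receive a high weight, which matches the stated goal of separating stable tendencies from situational needs.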

In addition to explicit and implicit decision factors, each instance includes three execution-oriented annotations. The clarification policy $C(x)$ specifies whether the agent should ask for additional information and the expected number of useful clarification turns, especially when an implicit factor is important but weakly supported. The tool-use trajectory $T(x)$ specifies the expected tools and parameter constraints needed to complete the task, such as route planning, POI retrieval, or service recommendation. The factual response requirements $F(x)$ define tool-grounded factual constraints for the final answer, including consistency with returned POI names, distances, and other map-service facts, and prohibit unsupported fabrication when a required fact is not returned by the tools. Together, these annotations turn behavior-chain evidence into a structured ground truth that supports evaluation of stated-task understanding, implicit-factor satisfaction, clarification efficiency, tool-use correctness, and factual reliability.
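
The five-part decision reference can be pictured as a nested record. The sketch below is only one possible encoding: every class and field name is hypothetical, factors and requirements are reduced to strings, and the example values (including the tool name `route_planning`) are invented for illustration rather than taken from the benchmark.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ImplicitFactor:
    rubric: str    # rho_i: how a judge checks the factor
    kind: str      # tau_i: "hard" or "soft"
    weight: float  # evidence-supported weight

    def __post_init__(self) -> None:
        if self.kind not in ("hard", "soft"):
            raise ValueError("kind must be 'hard' or 'soft'")
        if self.weight < 0:
            raise ValueError("weight must be non-negative")


@dataclass(frozen=True)
class ClarificationPolicy:
    should_ask: bool            # whether clarification is expected
    expected_useful_turns: int  # expected number of useful clarification turns


@dataclass(frozen=True)
class GroundTruth:
    """G(x) = (E(x), Z_eval(x), C(x), T(x), F(x))."""
    explicit_factors: List[str]            # E(x)
    implicit_factors: List[ImplicitFactor] # Z_eval(x)
    clarification: ClarificationPolicy     # C(x)
    tool_trajectory: List[str]             # T(x), tool names only in this sketch
    factual_requirements: List[str]        # F(x)


example = GroundTruth(
    explicit_factors=["provide a route to the selected hospital"],
    implicit_factors=[
        ImplicitFactor("route must not require a private car", "hard", 0.75),
    ],
    clarification=ClarificationPolicy(should_ask=False, expected_useful_turns=0),
    tool_trajectory=["route_planning"],
    factual_requirements=["distances must match tool output"],
)
```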

To ensure the reliability of these annotations, we apply a staged quality\-control process\. Candidate ground truth is first generated from the restored behavior chain and then checked by independent LLM judges, where agreement across judges is used as a consistency signal\. Cases with insufficient agreement or ambiguous evidence are further reviewed by human experts\. Only annotations that pass these consistency and expert validation checks are retained in the benchmark\.
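
The routing logic of this staged check might look like the sketch below. The agreement measure (share of judges giving the most common verdict), the default threshold, and the function names are assumptions made for illustration; the text does not specify how agreement is computed or what level counts as sufficient.

```python
from collections import Counter
from typing import List


def agreement_ratio(judge_verdicts: List[str]) -> float:
    """Fraction of judges that gave the most common verdict."""
    if not judge_verdicts:
        raise ValueError("at least one judge verdict is required")
    top_count = Counter(judge_verdicts).most_common(1)[0][1]
    return top_count / len(judge_verdicts)


def needs_human_review(
    judge_verdicts: List[str],
    ambiguous_evidence: bool = False,
    min_agreement: float = 1.0,  # hypothetical threshold, not from the paper
) -> bool:
    """Send a candidate annotation to experts on low agreement or ambiguity."""
    return ambiguous_evidence or agreement_ratio(judge_verdicts) < min_agreement
```

For instance, `needs_human_review(["valid", "invalid", "valid"])` returns `True` under the default threshold, since only two of three judges agree.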

### 3.2 Deterministic Replay Sandbox

MapSatisfyBench evaluates agents in a deterministic replay sandbox. For each benchmark instance, the sandbox instantiates two interactive roles: a simulated user agent and a business agent under evaluation. The business agent receives the root query $q$ together with the information available at decision time, and must decide whether to answer directly, ask clarification questions, or call tools. The user agent is driven by the given complete need. When the business agent asks for missing information, the user agent automatically provides responses that are consistent with the user’s complete need. This design allows the evaluation to cover both single-turn and multi-turn map-service interactions, and makes it possible to test whether an agent can progressively recover implicit needs instead of relying only on the initial query.

The business agent is executed with a sandboxed tool environment based on a LangGraph implementation of the ReAct pattern. The tool list contains 22 map-service tools covering business-relevant operations such as POI retrieval, route planning, and service recommendation. To ensure that different models are evaluated under the same external environment, all tool responses are served by deterministic mocks rather than live online services. The mock layer uses a hybrid retrieval strategy: parameters with simple and explicit semantics, such as `poi_name`, are resolved by exact matching, while open-text parameters such as search queries are resolved by embedding-based retrieval over a fixed mock corpus. Given the same benchmark instance, tool inventory, and mock database, the sandbox returns the same tool results across runs. This makes the interaction trajectory reproducible and enables fair comparison of different models. Experiments on the reliability of the sandbox are reported in Appendix B.2 and B.3.
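
The hybrid resolution strategy can be sketched with a toy mock layer. Everything here is illustrative: the corpus entries, the function names, and the record fields are invented, and a bag-of-words cosine similarity stands in for the embedding model, which the text does not name. The sketch only demonstrates the two resolution paths and why results are repeatable.

```python
import math
from collections import Counter
from typing import Dict, List, Optional

# Fixed mock corpus (made-up entries); never changes between runs.
MOCK_POIS: Dict[str, Dict[str, str]] = {
    "City Hospital": {"category": "hospital",
                      "description": "general hospital emergency outpatient"},
    "Central Station": {"category": "transport",
                        "description": "train station railway tickets"},
    "Noodle House": {"category": "restaurant",
                     "description": "noodles dinner casual dining"},
}


def lookup_by_name(poi_name: str) -> Optional[Dict[str, str]]:
    """Exact-match path for parameters with explicit semantics (e.g. poi_name)."""
    return MOCK_POIS.get(poi_name)


def _vectorize(text: str) -> Counter:
    # Stand-in for an embedding model: lowercase token counts.
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, top_k: int = 1) -> List[str]:
    """Similarity path for open-text parameters; ties are broken by name."""
    q = _vectorize(query)
    scored = [
        (-_cosine(q, _vectorize(name + " " + record["description"])), name)
        for name, record in MOCK_POIS.items()
    ]
    scored.sort()  # sorts by descending similarity, then alphabetically
    return [name for _, name in scored[:top_k]]
```

Determinism comes from the fixed corpus, a pure scoring function, and an explicit tie-break, so repeated calls with the same arguments return the same list.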

### 3\.3 Evaluation Metrics

MapSatisfyBench evaluates a model along the full decision pipeline rather than only the final text response\. The metrics are designed to answer four questions: whether the agent understands the stated task, whether it uses the right map\-service tools, whether the returned facts are grounded in tool outputs, and, most importantly, whether the final decision satisfies the implicit factors that affect the user’s probability of accepting the response\. All metrics are computed at the instance level and then averaged over the evaluation set\.

Let $D$ denote the set of evaluated instances and $M(x)$ denote a per-instance metric. The reported score is

$$\overline{M} = \frac{1}{|D|} \sum_{x \in D} M(x).$$

We use an LLM-as-a-judge protocol to assess response quality under a unified rubric. Seven metrics jointly evaluate the full-chain performance of a given agent.

##### Explicit-decision-factor Completion Rate (ECR).

ECR measures whether the agent satisfies the explicit decision factors $E(x)$ extracted from the surface query. For each explicit factor $e_j\in E(x)$, the evaluator assigns a binary completion label $a_j\in\{0,1\}$. The instance-level score is

$$\mathrm{ECR}(x)=\frac{1}{|E(x)|}\sum_{e_j\in E(x)}a_j.$$

This metric captures basic task validity: the final response should satisfy the user’s stated requirements.
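As a small illustration of the instance-level ECR and of the averaging over the evaluation set, the sketch below computes both from binary judge labels. The instance IDs and labels are made up, and raising an error for an instance with no explicit factors is an assumption of this sketch, since the text does not say how that case is handled.

```python
def ecr(labels):
    """Fraction of explicit factors judged complete; labels are 0/1 judge outputs."""
    if not labels:
        # Assumption: an instance without explicit factors is treated as an error.
        raise ValueError("ECR is undefined when no explicit factors are annotated")
    return sum(labels) / len(labels)


def mean_metric(per_instance_scores):
    """Unweighted average of a per-instance metric over the evaluation set."""
    return sum(per_instance_scores) / len(per_instance_scores)


# Hypothetical judge labels a_j for three instances.
judged = {
    "q1": [1, 1, 1],
    "q2": [1, 0],
    "q3": [0, 1, 1, 1],
}
per_instance = {qid: ecr(labels) for qid, labels in judged.items()}
overall = mean_metric(list(per_instance.values()))
```

Each instance contributes equally to the reported score regardless of how many explicit factors it has, which is what instance-level averaging implies.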

##### Implicit-decision-factor Satisfaction Rate (IISR).

IISR is the central metric for satisfaction-aware evaluation. It measures whether the agent satisfies the implicit decision factors $\mathcal{Z}_{\mathrm{eval}}(x)$ identified during ground-truth construction. For each implicit factor $z_i$, the ground truth provides an evidence-supported weight $w_i$ and a constraint type $\tau_i$. The evaluator assigns a satisfaction score $c_i$ according to the factor type. Hard factors use binary labels $c_i\in\{0,1\}$, where $c_i=0$ means that the response violates this necessary implicit constraint. Soft factors use a graded rubric $c_i\in\{1.0,0.75,0.5,0.25,0\}$, reflecting different degrees of preference satisfaction. The instance-level IISR is computed as the weighted average of all implicit-factor satisfaction scores:

$$\mathrm{IISR}(x)=\frac{\sum_{z_i\in\mathcal{Z}_{\mathrm{eval}}(x)}w_i c_i}{\sum_{z_i\in\mathcal{Z}_{\mathrm{eval}}(x)}w_i}.$$

Thus, hard and soft factors differ in how their factor-level satisfaction score is assigned, but both are aggregated through the same weighted-average formulation. Violating an important hard constraint therefore lowers the accepted-response probability through a zero contribution for that factor, while the instance score still accounts for other implicit factors that may have been satisfied.
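The weighted average can be made concrete with a short sketch. The `ImplicitFactor` record, the factor names, and the weights are hypothetical. The score domains follow the hard and soft rubrics described above, and returning 1.0 for an instance without implicit factors follows the convention stated for AR.

```python
from dataclasses import dataclass

SOFT_LEVELS = {1.0, 0.75, 0.5, 0.25, 0.0}
HARD_LEVELS = {0.0, 1.0}


@dataclass(frozen=True)
class ImplicitFactor:
    name: str      # hypothetical label, for readability only
    weight: float  # evidence-supported weight w_i
    kind: str      # constraint type tau_i: "hard" or "soft"
    score: float   # judge-assigned satisfaction score c_i


def iisr(factors):
    """Weighted average of satisfaction scores; 1.0 when there are no implicit factors."""
    if not factors:
        return 1.0
    for factor in factors:
        allowed = HARD_LEVELS if factor.kind == "hard" else SOFT_LEVELS
        if factor.score not in allowed:
            raise ValueError(f"score {factor.score} is not valid for a {factor.kind} factor")
    total_weight = sum(f.weight for f in factors)
    return sum(f.weight * f.score for f in factors) / total_weight


# A violated hard factor contributes zero but does not erase the soft factors.
factors = [
    ImplicitFactor("step-free entrance", weight=0.5, kind="hard", score=0.0),
    ImplicitFactor("prefers quiet venues", weight=0.3, kind="soft", score=0.75),
    ImplicitFactor("short walking distance", weight=0.2, kind="soft", score=1.0),
]
instance_iisr = iisr(factors)  # (0.5*0 + 0.3*0.75 + 0.2*1.0) / 1.0 = 0.425
```

In this example the violated hard factor carries half of the total weight, so the instance score cannot exceed 0.5 however well the soft preferences are met.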

##### Accepted-response Probability (AR).

AR is the main aggregate metric. It operationalizes satisfaction as the probability that the response would be accepted, rather than as a subjective satisfaction label. Since a response is expected to satisfy both the explicit and the implicit factors that affect acceptance, we define

$$\mathrm{AR}(x)=\mathrm{ECR}(x)\cdot\mathrm{IISR}(x).$$

This formulation makes the dependency explicit: if $\mathrm{ECR}(x)=0$, the response fails the stated task and $\mathrm{AR}(x)=0$. If there are no implicit factors, $\mathrm{IISR}(x)=1$ and $\mathrm{AR}(x)=\mathrm{ECR}(x)$. AR is therefore the metric most directly aligned with the goal of MapSatisfyBench, i.e., evaluating whether the agent’s decision is likely to be accepted by the user under the reconstructed behavior-chain evidence.
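Because AR is a per-instance product that is averaged afterwards, the reported AR is generally not equal to the product of the reported ECR and IISR averages. The sketch below illustrates this with hypothetical per-instance scores; the function name is an assumption of this example.

```python
def accepted_response(ecr_score, iisr_score):
    """Per-instance AR as the product of ECR and IISR."""
    return ecr_score * iisr_score


# Hypothetical per-instance (ECR, IISR) pairs; the last one fails the stated task.
instances = [(1.0, 0.9), (1.0, 0.5), (0.5, 0.2), (0.0, 0.6)]

ar_scores = [accepted_response(e, i) for e, i in instances]
mean_ar = sum(ar_scores) / len(ar_scores)

mean_ecr = sum(e for e, _ in instances) / len(instances)
mean_iisr = sum(i for _, i in instances) / len(instances)
product_of_means = mean_ecr * mean_iisr
```

Here `mean_ar` is 0.375 while `product_of_means` is 0.34375. The gap arises whenever ECR and IISR are correlated across instances. This may explain why, in Section 4.4, the reported AR of 0.6749 for Claude-4.6-opus differs from the product of its average ECR and IISR (0.9148 × 0.7170 ≈ 0.656).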

##### Tool Selection Accuracy (TS).

TS evaluates whether the agent selects the tool set required by the annotated tool-use trajectory $T(x)$. Let $T_{\mathrm{gold}}(x)$ be the set of required tools and $T_{\mathrm{pred}}(x)$ be the set of tools selected by the agent. A predicted tool is counted as valid only when the tool exists and its required parameters satisfy the annotated parameter rules. We compute TS as a set-overlap (Jaccard) score:

$$\mathrm{TS}(x)=\frac{|T_{\mathrm{gold}}(x)\cap T_{\mathrm{pred}}(x)|}{|T_{\mathrm{gold}}(x)\cup T_{\mathrm{pred}}(x)|}.$$

This formulation penalizes both missing necessary tools and invoking redundant tools, which is important in map-agent settings where incorrect or unnecessary tool calls can lead to wrong results or higher interaction cost.
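A minimal sketch of the set-overlap computation follows. The tool names and required-parameter table are hypothetical. Two choices are assumptions of this sketch, because the text does not settle them: invalid calls are dropped from the predicted set, and an instance with no tools on either side scores 1.0.

```python
# Hypothetical tool inventory mapping each tool to its required parameters.
REQUIRED_PARAMS = {
    "search_poi": {"query"},
    "plan_route": {"origin", "destination"},
    "get_user_profile": {"user_id"},
    "get_weather": {"city"},
}


def is_valid(call):
    """A call is valid if the tool exists and all required parameters are present."""
    required = REQUIRED_PARAMS.get(call["tool"])
    return required is not None and required.issubset(call["args"])


def tool_selection(gold_tools, calls):
    """Jaccard overlap between required tools and validly invoked tools."""
    predicted = {call["tool"] for call in calls if is_valid(call)}
    union = gold_tools | predicted
    if not union:
        return 1.0  # assumption: nothing required and nothing called counts as a match
    return len(gold_tools & predicted) / len(union)


gold = {"search_poi", "plan_route", "get_user_profile"}
calls = [
    {"tool": "search_poi", "args": {"query": "hotpot"}},
    {"tool": "plan_route", "args": {"origin": "home"}},  # missing destination: invalid
    {"tool": "get_weather", "args": {"city": "Hangzhou"}},  # valid but not required
]
score = tool_selection(gold, calls)  # 1 shared tool over 4 distinct tools = 0.25
```

The example shows both penalties at once: the skipped profile tool and the malformed route call shrink the intersection, and the redundant weather call enlarges the union.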

##### Information Faithfulness Score (IFS).

IFS measures whether factual claims in the final response are supported by tool outputs. Let $F(x)$ be the set of factual requirements in the ground truth, and let $b_l\in\{0,1\}$ indicate whether the factual claim or rubric item $f_l$ is fully supported by the corresponding tool returns. Unsupported, fabricated, or tool-contradicted facts receive 0. The score is

$$\mathrm{IFS}(x)=\frac{1}{|F(x)|}\sum_{f_l\in F(x)}b_l.$$

In practice, this metric checks claims such as POI names and route details against the deterministic sandbox outputs.
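The support check can be illustrated with a simplified sketch. In the benchmark the support labels come from the judge model. Here, purely as an assumption for illustration, a claim is a `(field, value)` pair and counts as supported only if some tool output contains exactly that field value. The field names and records are hypothetical.

```python
def ifs(claims, tool_outputs):
    """Fraction of claims whose (field, value) pair appears in some tool output."""
    if not claims:
        # Assumption: an instance without factual requirements is treated as an error.
        raise ValueError("IFS is undefined when there are no factual requirements")
    supported = [
        int(any(output.get(field) == value for output in tool_outputs))
        for field, value in claims
    ]
    return sum(supported) / len(supported)


# Hypothetical sandbox outputs collected during one trajectory.
tool_outputs = [
    {"poi_name": "Riverside Cafe", "distance_km": 1.2},
    {"route_minutes": 18},
]
# Claims extracted from the final response; the distance contradicts the tool output.
claims = [
    ("poi_name", "Riverside Cafe"),
    ("route_minutes", 18),
    ("distance_km", 0.8),
]
score = ifs(claims, tool_outputs)  # two of three claims are supported
```

A contradicted value and a value that no tool returned are treated the same way here: both receive 0, matching the rule stated above.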

##### Interaction Efficiency (Eff).

Eff measures the turn-level efficiency of an agent. Let $S_{\mathrm{actual}}(x)$ denote the number of business-agent turns used by the evaluated agent in the deterministic replay, and let $S_{\mathrm{ref}}(x)$ denote the difficulty-calibrated reference turn budget, which specifies the expected interaction length under a reasonable balance between clarification and user burden. We define

$$\mathrm{Eff}(x)=\frac{1}{1+S_{\mathrm{actual}}(x)/S_{\mathrm{ref}}(x)}.$$

This formulation normalizes interaction efficiency into $(0,1]$. $\mathrm{Eff}(x)=0.5$ is reached when $S_{\mathrm{actual}}(x)=S_{\mathrm{ref}}(x)$, that is, when the agent uses exactly the reference budget, which corresponds to the human-expert median turn count; fewer turns increase the score, while more turns decrease it.
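The behavior of Eff around the reference budget is easy to check numerically. The sketch below implements the definition directly; the function name and the turn counts are illustrative assumptions.

```python
def interaction_efficiency(actual_turns, reference_turns):
    """Eff = 1 / (1 + S_actual / S_ref), which lies in (0, 1]."""
    if reference_turns <= 0:
        raise ValueError("reference turn budget must be positive")
    if actual_turns < 0:
        raise ValueError("actual turn count cannot be negative")
    return 1.0 / (1.0 + actual_turns / reference_turns)


reference = 4  # hypothetical reference budget for one instance
at_budget = interaction_efficiency(4, reference)     # exactly 0.5
under_budget = interaction_efficiency(2, reference)  # 1 / 1.5, about 0.667
over_budget = interaction_efficiency(8, reference)   # 1 / 3, about 0.333
```

The score decays hyperbolically rather than linearly: doubling the turn count relative to the budget lowers Eff from 0.5 to about 0.33, not to 0.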

##### Satisfaction Efficiency Score (SES).

SES is a composite metric that evaluates whether an agent can make a satisfactory decision efficiently. While AR measures whether the final response is likely to be accepted by the user, Eff measures the turn cost required to reach that response. We therefore define

$$\mathrm{SES}(x)=\mathrm{AR}(x)\cdot\mathrm{Eff}(x).$$

SES rewards agents that are both effective and efficient: a response receives a high score only when it has a high accepted-response probability and is produced with efficient interaction. The multiplicative form ensures that poor task satisfaction cannot be compensated for by short interaction, and inefficient interaction further reduces the value of an otherwise acceptable response.
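Substituting the definition of Eff gives a closed form that makes the scale of SES easier to read. The numbers in the worked example are hypothetical and chosen only for illustration.

$$\mathrm{SES}(x)=\mathrm{AR}(x)\cdot\frac{S_{\mathrm{ref}}(x)}{S_{\mathrm{ref}}(x)+S_{\mathrm{actual}}(x)}$$

For example, with $\mathrm{AR}(x)=0.8$, $S_{\mathrm{actual}}(x)=6$, and $S_{\mathrm{ref}}(x)=4$:

$$\mathrm{Eff}(x)=\frac{4}{4+6}=0.4,\qquad \mathrm{SES}(x)=0.8\cdot 0.4=0.32.$$

An agent that uses exactly the reference budget has $\mathrm{SES}(x)=\mathrm{AR}(x)/2$, so SES values are numerically much smaller than AR values even for an agent that stays within budget.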

## 4 Experiments

### 4.1 Evaluated Models

We evaluate LLM-based agents that completed the full MapSatisfyBench replay and scoring pipeline. The model set is selected to cover both frontier models and open-weight model families, as well as different provider ecosystems and capacity levels. This includes the GPT family, represented by GPT-4.1 (OpenAI [2025](https://arxiv.org/html/2606.17453#bib.bib29)) and GPT-5.3 (OpenAI [2026](https://arxiv.org/html/2606.17453#bib.bib30)); the Claude family, represented by Claude Opus 4.6 (Anthropic [2026a](https://arxiv.org/html/2606.17453#bib.bib5)) and Claude Sonnet 4.6 (Anthropic [2026b](https://arxiv.org/html/2606.17453#bib.bib6)); the Gemini family, represented by Gemini 3.1 Pro Preview (Google [2026a](https://arxiv.org/html/2606.17453#bib.bib16)) and Gemini 3.1 Flash Preview (Google [2026b](https://arxiv.org/html/2606.17453#bib.bib17)); the DeepSeek family, represented by DeepSeek-V3.2 (DeepSeek [2025](https://arxiv.org/html/2606.17453#bib.bib11)) and DeepSeek-V4-Pro (DeepSeek-AI [2026](https://arxiv.org/html/2606.17453#bib.bib12)); and the Qwen family, including Qwen3-30B (Qwen [2025b](https://arxiv.org/html/2606.17453#bib.bib36)), Qwen3-235B (Qwen [2025a](https://arxiv.org/html/2606.17453#bib.bib35)), and the hosted Qwen3.6-Plus (Alibaba Cloud [2026a](https://arxiv.org/html/2606.17453#bib.bib4)). The frontier models provide references for strong general-purpose agent performance, while the DeepSeek and Qwen models allow us to compare against openly licensed model families and smaller-capacity variants.

### 4.2 Experiment Details

We use GPT-5.3 (OpenAI [2026](https://arxiv.org/html/2606.17453#bib.bib30)) as the user simulator, the judge model, and the LLM-based tool-response simulator, with temperature set to 1.0. All evaluated models are also run with temperature 1.0. For the Gemini-3.1-series models (Google [2026b](https://arxiv.org/html/2606.17453#bib.bib17)), we set `max_output_tokens=65535`, which is a required parameter and matches the maximum output-token limit supported by the platform. For sandboxed tool responses, we implement a three-stage simulation mechanism: exact matching, vector matching, and LLM-based simulation. The vector matching stage uses Text-embedding-v2 (Alibaba Cloud [2026b](https://arxiv.org/html/2606.17453#bib.bib3)), a 1536-dimensional embedding model. The LLM simulation stage serves as the final fallback: when neither exact matching nor vector matching retrieves a valid record from the sandbox, the system uses the LLM to generate a simulated tool response. For each model, we run the simulation once and evaluate the resulting trajectory three times. The final result is obtained by majority voting over the three evaluation runs. Details about the prompts and samples are provided in Appendix C.
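One way to aggregate three judge runs by majority voting is sketched below. The text does not specify the granularity of the vote or how a three-way disagreement on a graded soft score is resolved. Voting per rubric item and falling back to the median are therefore assumptions of this sketch, not the paper's stated procedure.

```python
from collections import Counter
from statistics import median


def majority_vote(labels):
    """Return the label chosen by more than half of the runs, else the median label."""
    top_label, count = Counter(labels).most_common(1)[0]
    if count > len(labels) / 2:
        return top_label
    # Assumption: with no strict majority (possible for graded scores),
    # use the median of the judge scores as a deterministic tie-break.
    return median(labels)


binary_item = majority_vote([1, 1, 0])            # two of three runs agree on 1
graded_item = majority_vote([0.5, 0.5, 1.0])      # two of three runs agree on 0.5
split_item = majority_vote([0.75, 0.5, 0.25])     # no majority, median is 0.5
```

With binary labels and three runs a strict majority always exists, so the fallback only matters for the five-level soft rubric.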

### 4.3 Benchmark Statistics

![Refer to caption](https://arxiv.org/html/2606.17453v1/figures/domain_distribution_500.png)

Figure 3: Domain coverage statistics.

MapSatisfyBench contains 500 behavior-grounded map-service instances. In constructing the benchmark, we explicitly consider coverage across domains, temporal contexts, spatial settings, and the sources of implicit factors. As shown in Fig. [3](https://arxiv.org/html/2606.17453#Sx4.F3), the benchmark mainly covers six major map-service domains. Note that one instance may involve several map-service domains at the same time. Search and navigation are the most frequent domains, covering 388 and 363 instances respectively, followed by the other four domains with smaller but still substantial coverage. This distribution shows that MapSatisfyBench provides broad domain coverage while remaining centered on core map-service decision scenarios. Beyond domain coverage, the benchmark spans cities from first-tier to fourth-tier and below, and includes both local and non-local cases. Temporally, it covers weekdays, weekends, holidays, and multiple time periods throughout the day. More detailed statistics are provided in Appendix B.1.

### 4.4 Main Results

Table 1: Evaluation results of 12 models with thinking mode off.

Table 2: Evaluation results of three models with thinking mode enabled.

![Refer to caption](https://arxiv.org/html/2606.17453v1/figures/tool_call_frequency_times_new_roman.png)

Figure 4: Tool call frequency.

![Refer to caption](https://arxiv.org/html/2606.17453v1/figures/grouped_results.png)

Figure 5: Average IISR, AR, and SES by map-service domain.

##### Overall Performance.

Results in Table [1](https://arxiv.org/html/2606.17453#Sx4.T1) show a clear gap between explicit task completion and satisfaction-aware decision making. Most frontier models achieve strong ECR scores, indicating that they can usually satisfy the explicit factors stated in the user query. In the non-thinking setting, GPT-5.3 obtains the highest ECR score of 0.9272, followed by Claude-4.6-opus at 0.9148 and Deepseek-v4-pro at 0.9088. Most other frontier models remain close to or above 0.85 ECR, while the lightweight Qwen variants score substantially lower, with Qwen3-30b at 0.6083 and Qwen3-8b at 0.6997. This pattern suggests that explicit map-service task understanding is relatively tractable for strong general-purpose LLMs, but still sensitive to model capacity. The satisfaction-oriented metrics reveal a more difficult problem. In the non-thinking setting, the best IISR score is 0.7170 from Claude-4.6-opus, followed by 0.7017 from Deepseek-v4-pro; all other non-thinking models score below 0.70. AR follows a similar trend, with Claude-4.6-opus and Deepseek-v4-pro reaching 0.6749 and 0.6606, respectively. SES is even more restrictive because it jointly reflects satisfactory decision quality and efficiency, with the best non-thinking score reaching only 0.2755 from GPT-5.3. These results indicate that current LLMs often complete the surface task but still fail to satisfy implicit decision factors that affect whether the user would accept the response. As for IFS, most models produce final responses whose factual claims are well supported by tool outputs; however, this does not necessarily translate into a high probability of satisfying the implicit factors. For example, Gemini-3.1-pro-preview achieves the highest non-thinking IFS score of 0.9645, but its IISR, AR, and SES are only 0.5383, 0.4846, and 0.2120. This shows that factually grounded tool responses are necessary but not sufficient for satisfaction-aware spatial decision making.

##### The ability to proactively acquire available evidence.

The consistently low TS scores suggest that tool selection remains a central bottleneck. Across all settings, TS ranges from 0.2859 to 0.4942, and even the best model on this metric, Claude-4.6-opus, remains below 0.50. This matters because many implicit decision factors cannot be recovered from the surface query alone: they require the agent to actively acquire contextual or profile-related evidence before deciding which feasible response is most likely to satisfy the user. However, as shown in Fig. [4](https://arxiv.org/html/2606.17453#Sx4.F4), our tool-call inspection reveals that all models invoke the profile-related tool at a relatively low rate. Compared with task-specific tools that are directly associated with explicit requests, the profile tool has a more abstract triggering condition: the agent must first realize that the current query is insufficient for producing a satisfying response, and then actively decide to acquire additional user-specific evidence. The results indicate that current LLMs are not yet proficient at this form of proactive information acquisition. In addition, the Eff scores of all agents are below 0.5, indicating that they require more clarification turns than the expected interaction budget. Collectively, these observations suggest that, when faced with an underspecified query, agents are less likely to retrieve available evidence through the appropriate tools and instead tend to ask the user for additional information. Such excessive questioning increases user effort and may reduce patience and retention in real-world map-service scenarios. We believe a more capable agent should first exploit information that is already available through context, profiles, and external tools, and reserve clarification questions for cases that cannot be reliably resolved otherwise. This would reduce unnecessary human effort while preserving decision quality. These findings suggest that future map agents should improve their ability to plan information acquisition, deciding when to proactively retrieve user-specific evidence and when to ask the user for clarification. Further experiments on the models' implicit-factor reasoning ability are provided in Appendix B.4.

##### Effect of Thinking Mode.

Thinking mode consistently improves satisfaction-aware performance for the three paired models, although the magnitude varies. The largest gain appears in Gemini-3.1-pro-preview, with SES increasing from 0.2120 to 0.2724. These results suggest that stronger reasoning ability can help models recover some missing decision factors, especially when the non-thinking baseline is weak. Even in thinking mode, however, IISR remains clearly lower than ECR for all paired models, indicating that longer reasoning alone does not fully solve the problem of satisfaction-aware decision making.

### 4.5 Grouped Performance Analysis

##### Performance by map-service domain.

We further analyze model performance by averaging all evaluated model settings within each domain group and retaining the three satisfaction-oriented metrics: IISR, AR, and SES. As shown in Fig. [5](https://arxiv.org/html/2606.17453#Sx4.F5)(a), itinerary planning achieves the highest average performance, with IISR, AR, and SES reaching 0.65, 0.60, and 0.29, respectively. By contrast, real-time information query and ride-hailing service are more difficult: the former has the lowest average IISR, while the latter obtains the lowest SES. This suggests that satisfaction-aware decision making is not uniformly difficult across map-service domains, and domains that require timely state interpretation or tightly constrained service execution remain especially challenging.

##### Performance by source of implicit decision factors.

We also group instances by the source of implicit decision factors. Following the source definition in Appendix B.1, context-dependent factors are derived from the interaction context $c$, preference-dependent factors are derived from the user profile $p$, and spatio-temporal factors are derived from the environment $g$. As shown in Fig. [5](https://arxiv.org/html/2606.17453#Sx4.F5)(b), context-dependent and preference-dependent factors have similar average performance, while spatio-temporal factors are consistently lower. These results indicate that current LLMs struggle more when the missing decision factors must be recovered from the concrete environmental state rather than from interaction context or user-profile evidence.

## 5 Conclusion

We presented MapSatisfyBench, a benchmark for evaluating satisfaction-aware decision making in map-service agents. Unlike benchmarks that primarily measure whether an agent can infer a stated intent, complete a route-planning task, or produce a factually grounded answer, MapSatisfyBench focuses on whether the agent can generate a response that is likely to be accepted by the user among multiple feasible solutions. To achieve this, we use behavior-chain evidence to reconstruct recoverable implicit decision factors, distinguish hard constraints from soft preferences, and convert them into structured ground truth and evaluation metrics. This design allows user satisfaction to be operationalized without directly labeling subjective satisfaction.

Our experiments show that current LLM-based map agents generally perform well on explicit task completion and factual faithfulness, but remain limited in satisfying implicit decision factors that affect accepted-response probability. The results also indicate that thinking mode can improve performance for some models, but it does not fully address the gap. A key remaining challenge is evidence acquisition: agents must know when to retrieve user-history or profile signals, rather than relying only on the surface query and general world knowledge. These findings suggest that future map agents should be evaluated and improved not only as tool-using planners, but also as satisfaction-aware decision makers that can align open-ended spatial responses with behavior-supported user needs.

## References

- CAUSE: counterfactual assessment of user satisfaction estimation in task-oriented dialogue systems. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 14623–14635. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.871), [Link](https://aclanthology.org/2024.findings-acl.871/).
- G. Adomavicius and A. Tuzhilin (2005) Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering 17(6), pp. 734–749. External Links: [Document](https://dx.doi.org/10.1109/TKDE.2005.99), [Link](https://dblp.org/rec/journals/tkde/AdomaviciusT05.html).
- Alibaba Cloud (2026a) Alibaba Cloud Model Studio: text generation models. Note: https://www.alibabacloud.com/help/en/model-studio/text-generation-model/ (accessed 2026-06-07).
- Alibaba Cloud (2026b) Alibaba Cloud OSS Vectors Embed CLI. Note: https://www.alibabacloud.com/help/en/oss/user-guide/oss-vectors-embed-cli (accessed 2026-06-07).
- AMAP AI Agent Team, Y. Hu, X. Zhang, S. Ouyang, H. Yi, L. Xu, Q. Lang, L. Tan, X. Cheng, T. Ye, Z. Li, G. Chen, W. Yang, Z. Pan, S. Xiong, S. Yang, J. Huang, Y. Zhang, J. Wang, Y. Liu, Y. Huang, N. Wang, T. Lin, X. Li, and N. Guo (2025) AMAP agentic planning technical report. External Links: arXiv:2512.24957, [Document](https://dx.doi.org/10.48550/arXiv.2512.24957), [Link](https://arxiv.org/abs/2512.24957).
- Anthropic (2026a) Introducing Claude Opus 4.6. Note: https://www.anthropic.com/news/claude-opus-4-6 (accessed 2026-06-07).
- Anthropic (2026b) Introducing Claude Sonnet 4.6. Note: https://www.anthropic.com/news/claude-sonnet-4-6 (accessed 2026-06-07).
- S. Chaudhuri, P. Purkar, R. Raghav, S. Mallick, M. Gupta, A. Jana, and S. Ghosh (2025) TripCraft: a benchmark for spatio-temporally fine grained travel planning. External Links: arXiv:2502.20508, [Document](https://dx.doi.org/10.48550/arXiv.2502.20508), [Link](https://arxiv.org/abs/2502.20508).
- X. Cheng, Y. Hu, X. Zhang, L. Xu, L. Tan, Z. Pan, X. Li, and Y. Liu (2025) Beyond itinerary planning: a real-world benchmark for multi-turn and tool-using travel tasks. External Links: arXiv:2512.22673, [Document](https://dx.doi.org/10.48550/arXiv.2512.22673), [Link](https://arxiv.org/abs/2512.22673).
- K. Church and B. Smyth (2009) Understanding the intent behind mobile information needs. In Proceedings of the 14th international conference on Intelligent user interfaces, pp. 247–256. External Links: [Document](https://dx.doi.org/10.1145/1502650.1502686).
- DeepSeek-AI (2026) DeepSeek-V4-Pro model card. Note: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro (accessed 2026-06-07).
- DeepSeek (2025) DeepSeek-V3.2 release. Note: https://api-docs.deepseek.com/news/news251201 (accessed 2026-06-07).
- D. Delling, A. V. Goldberg, T. Pajor, and R. F. Werneck (2017) Customizable route planning in road networks. Transportation Science 51(2), pp. 566–591. External Links: [Document](https://dx.doi.org/10.1287/trsc.2014.0579), [Link](https://pubsonline.informs.org/doi/abs/10.1287/trsc.2014.0579).
- Y. Dubois, C. X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, P. Liang, and T. B. Hashimoto (2023) AlpacaFarm: a simulation framework for methods that learn from human feedback. In Advances in Neural Information Processing Systems, Vol. 36. External Links: [Link](https://papers.nips.cc/paper_files/paper/2023/hash/5fc47800ee5b30b8777fdd30abcaaf3b-Abstract-Conference.html).
- Y. Feng, Y. Jiao, A. Prasad, N. Aletras, E. Yilmaz, and G. Kazai (2023) Schema-guided user satisfaction modeling for task-oriented dialogues. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2079–2091. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.116), [Link](https://aclanthology.org/2023.acl-long.116/).
- Google (2026a) Gemini 3.1 Pro: announcing our latest Gemini AI model. Note: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/ (accessed 2026-06-07).
- Google (2026b) Gemini API model documentation. Note: https://ai.google.dev/gemini-api/docs/models (accessed 2026-06-07).
- H. He, C. Yue, C. Dong, M. Tian, H. Chen, Z. Liu, J. Chai, X. Wang, Y. Zhang, Q. Liao, G. Yin, W. Lin, C. Wan, H. Sun, and T. Su (2025a) LocalSearchBench: benchmarking agentic search in real-world local life services. Note: Accepted to KDD 2026. External Links: arXiv:2512.07436, [Document](https://dx.doi.org/10.48550/arXiv.2512.07436), [Link](https://arxiv.org/abs/2512.07436).
- W. He, Y. Sun, H. Hao, X. Hao, Z. Xia, Q. Gu, C. Han, D. Zhao, H. Su, K. Zhang, M. Gao, X. Su, X. Cai, X. Cai, Y. Yang, and Y. Zhao (2025b) VitaBench: benchmarking LLM agents with versatile interactive tasks in real-world applications. External Links: arXiv:2509.26490, [Document](https://dx.doi.org/10.48550/arXiv.2509.26490), [Link](https://arxiv.org/abs/2509.26490).
- E. Horvitz (1999) Principles of mixed-initiative user interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 159–166. External Links: [Document](https://dx.doi.org/10.1145/302979.303030), [Link](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/chi99horvitz.pdf).
- M. Kamvar and S. Baluja (2006) A large scale study of wireless search behavior: Google mobile search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 701–709. External Links: [Document](https://dx.doi.org/10.1145/1124772.1124877), [Link](https://research.google/pubs/a-large-scale-study-of-wireless-search-behavior-google-mobile-search/).
- T. Kim, J. Singh, S. Mehri, E. C. Acikgoz, S. Mukherjee, N. B. Bozdag, S. Shashidhar, G. Tur, and D. Hakkani-Tür (2025) AURA: a diagnostic framework for tracking user satisfaction of interactive planning agents. Note: NeurIPS 2025 Workshop MTI-LLM. External Links: [Link](https://openreview.net/forum?id=8nDty2iFl1).
- LBS-IntentBench Contributors (2026) LBS-IntentBench: a real-world benchmark for implicit intent inference and spatio-temporal reasoning. Note: https://github.com/lbs-researcher/LBS-IntentBench (GitHub repository; accessed 2026-06-03).
- T. Li, W. Chiang, E. Frick, L. Dunlap, T. Wu, B. Zhu, J. E. Gonzalez, and I. Stoica (2024) From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. External Links: arXiv:2406.11939, [Document](https://dx.doi.org/10.48550/arXiv.2406.11939), [Link](https://arxiv.org/abs/2406.11939).
- Y. Lin, J. Neville, J. Stokes, L. Yang, T. Safavi, M. Wan, S. Counts, S. Suri, R. Andersen, X. Xu, D. Gupta, S. K. Jauhar, X. Song, G. Buscher, S. Tiwary, B. Hecht, and J. Teevan (2024) Interpretable user satisfaction estimation for conversational systems with large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11100–11115. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.598), [Link](https://aclanthology.org/2024.acl-long.598/).
- P. Maes (1994) Agents that reduce work and information overload. Communications of the ACM 37(7), pp. 30–40. External Links: [Document](https://dx.doi.org/10.1145/176789.176792), [Link](https://cacm.acm.org/research/agents-that-reduce-work-and-information-overload/).
- R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess, and J. Schulman (2021) WebGPT: browser-assisted question-answering with human feedback. External Links: arXiv:2112.09332, [Document](https://dx.doi.org/10.48550/arXiv.2112.09332), [Link](https://arxiv.org/abs/2112.09332).
- OpenAI (2025) GPT-4.1 model. Note: https://platform.openai.com/docs/models/gpt-4.1 (accessed 2026-06-07).
- OpenAI (2026) GPT-5.3 Instant: smoother, more useful everyday conversations. Note: https://openai.com/index/gpt-5-3-instant/ (accessed 2026-06-07).
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe (2022) Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, Vol. 35. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract.html).
- P. Pu, L. Chen, and R. Hu (2012) Evaluating recommender systems from the user’s perspective: survey of the state of the art. User Modeling and User-Adapted Interaction 22, pp. 317–355. External Links: [Document](https://dx.doi.org/10.1007/s11257-011-9115-7), [Link](https://infoscience.epfl.ch/entities/publication/1c519139-9913-4879-91a5-3785858b68af/articledetails).
- R. S. Purves, P. Clough, C. B. Jones, M. H. Hall, and V. Murdock (2018) Geographic information retrieval: progress and challenges in spatial search of text. Foundations and Trends in Information Retrieval 12(2–3), pp. 164–318. External Links: [Document](https://dx.doi.org/10.1561/1500000034), [Link](https://www.nowpublishers.com/article/Details/INR-034).
- C. Qian, Z. Liu, A. Prabhakar, Z. Liu, J. Zhang, H. Chen, H. Ji, W. Yao, S. Heinecke, S. Savarese, C. Xiong, and H. Wang (2025) UserBench: an interactive gym environment for user-centric agents. External Links: arXiv:2507.22034, [Document](https://dx.doi.org/10.48550/arXiv.2507.22034), [Link](https://arxiv.org/abs/2507.22034).
- Qwen (2025a) Qwen3-235B-A22B model card. Note: https://huggingface.co/Qwen/Qwen3-235B-A22B (accessed 2026-06-07).
- Qwen (2025b) Qwen3-30B-A3B model card. Note: https://huggingface.co/Qwen/Qwen3-30B-A3B (accessed 2026-06-07).
- Z. Song, J. Zhang, C. Qin, C. Wang, C. Chen, L. Xu, K. Liu, X. Chu, and H. Zhu (2026) MobilityBench: a benchmark for evaluating route-planning agents in real-world mobility scenarios. External Links: arXiv:2602.22638, [Document](https://dx.doi.org/10.48550/arXiv.2602.22638), [Link](https://arxiv.org/abs/2602.22638).
- W. Sun, S. Zhang, K. Balog, Z. Ren, P. Ren, Z. Chen, and M. de Rijke (2021) Simulating user satisfaction for the evaluation of task-oriented dialogue systems. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. External Links: [Document](https://dx.doi.org/10.1145/3404835.3463241), [Link](https://arxiv.org/abs/2105.03748).
- P. Vansteenwegen, W. Souffriau, and D. Van Oudheusden (2011) The orienteering problem: a survey. European Journal of Operational Research 209(1), pp. 1–10. External Links: [Document](https://dx.doi.org/10.1016/j.ejor.2010.03.045), [Link](https://www.sciencedirect.com/science/article/pii/S0377221710002973).
- N. M. Villegas, C. Sánchez, J. Díaz-Cely, and G. Tamura (2018) Characterizing context-aware recommender systems: a systematic literature review. Knowledge-Based Systems 140, pp. 173–200. External Links: [Document](https://dx.doi.org/10.1016/j.knosys.2017.11.003), [Link](https://www.sciencedirect.com/science/article/pii/S0950705117305075).
- M\. A\. Walker, D\. J\. Litman, C\. A\. Kamm, and A\. Abella \(1997\)PARADISE: a framework for evaluating spoken dialogue agents\.InProceedings of the 35th Annual Meeting of the Association for Computational Linguistics,External Links:[Link](https://arxiv.org/abs/cmp-lg/9704004)Cited by:[Satisfaction\-aware Agent Benchmarks\.](https://arxiv.org/html/2606.17453#Sx2.SS0.SSS0.Px2.p1.1)\.
- J\. Wang, F\. Mo, W\. Ma, P\. Sun, M\. Zhang, and J\. Nie \(2024\)A user\-centric multi\-intent benchmark for evaluating large language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 3588–3612\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.210),[Link](https://aclanthology.org/2024.emnlp-main.210/)Cited by:[Satisfaction\-aware Agent Benchmarks\.](https://arxiv.org/html/2606.17453#Sx2.SS0.SSS0.Px2.p1.1)\.
- S\. Wu, M\. Galley, B\. Peng, H\. Cheng, G\. Li, Y\. Dou, W\. Cai, J\. Zou, J\. Leskovec, and J\. Gao \(2025\)CollabLLM: from passive responders to active collaborators\.InProceedings of the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267,pp\. 67260–67283\.External Links:[Link](https://proceedings.mlr.press/v267/wu25i.html)Cited by:[Satisfaction\-aware Agent Benchmarks\.](https://arxiv.org/html/2606.17453#Sx2.SS0.SSS0.Px2.p1.1)\.
- J\. Xie, K\. Zhang, J\. Chen, T\. Zhu, R\. Lou, Y\. Tian, Y\. Xiao, and Y\. Su \(2024\)TravelPlanner: a benchmark for real\-world planning with language agents\.InProceedings of the 41st International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.235,pp\. 54590–54613\.External Links:[Link](https://proceedings.mlr.press/v235/xie24j.html)Cited by:[1 Introduction](https://arxiv.org/html/2606.17453#Sx1.p1.1),[1 Introduction](https://arxiv.org/html/2606.17453#Sx1.p3.1),[Agent Benchmarks for Map Services\.](https://arxiv.org/html/2606.17453#Sx2.SS0.SSS0.Px1.p1.1)\.
- H\. Zamani, B\. Mitra, E\. Chen, G\. Lueck, F\. Diaz, P\. Bennett, N\. Craswell, and S\. Dumais \(2020\)Analyzing and learning from user interactions for search clarification\.InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 1181–1190\.External Links:[Document](https://dx.doi.org/10.1145/3397271.3401160),[Link](https://www.microsoft.com/en-us/research/publication/analyzing-and-learning-from-user-interactions-for-search-clarification/)Cited by:[1 Introduction](https://arxiv.org/html/2606.17453#Sx1.p2.1)\.
- S\. Zhao, M\. Hong, Y\. Liu, D\. Hazarika, and K\. Lin \(2025a\)Do LLMs recognize your preferences? evaluating personalized preference following in LLMs\.External Links:2502\.09597,[Document](https://dx.doi.org/10.48550/arXiv.2502.09597),[Link](https://arxiv.org/abs/2502.09597)Cited by:[Satisfaction\-aware Agent Benchmarks\.](https://arxiv.org/html/2606.17453#Sx2.SS0.SSS0.Px2.p1.1)\.
- Z\. Zhao, C\. Vania, D\. Kayal, N\. Khan, S\. B\. Cohen, and E\. Yilmaz \(2025b\)PersonaLens: a benchmark for personalization evaluation in conversational AI assistants\.Note:Findings of the Association for Computational Linguistics: ACL 2025External Links:[Link](https://www.amazon.science/publications/personalens-a-benchmark-for-personalization-evaluation-in-conversational-ai-assistants)Cited by:[Satisfaction\-aware Agent Benchmarks\.](https://arxiv.org/html/2606.17453#Sx2.SS0.SSS0.Px2.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging LLM\-as\-a\-judge with MT\-Bench and chatbot arena\.InAdvances in Neural Information Processing Systems,Vol\.36\.External Links:[Link](https://papers.nips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html)Cited by:[Satisfaction\-aware Agent Benchmarks\.](https://arxiv.org/html/2606.17453#Sx2.SS0.SSS0.Px2.p1.1)\.
- T\. P\. Zollo, A\. W\. T\. Siah, N\. Ye, A\. Li, and H\. Namkoong \(2025\)PersonalLLM: tailoring LLMs to individual preferences\.InInternational Conference on Learning Representations,External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2025/hash/a730abbcd6cf4a371ca9545db5922442-Abstract-Conference.html)Cited by:[Satisfaction\-aware Agent Benchmarks\.](https://arxiv.org/html/2606.17453#Sx2.SS0.SSS0.Px2.p1.1)\.
- J\. Zou, A\. Sun, C\. Long, M\. Aliannejadi, and E\. Kanoulas \(2023\)Asking clarifying questions: to benefit or to disturb users in web search?\.Information Processing & Management60\(2\),pp\. 103176\.External Links:[Document](https://dx.doi.org/10.1016/j.ipm.2022.103176),[Link](https://www.sciencedirect.com/science/article/pii/S0306457322002771)Cited by:[1 Introduction](https://arxiv.org/html/2606.17453#Sx1.p2.1)\.

## Appendix

## Appendix A\. Benchmark Construction Details

### A\.1 Ground Truth Annotation Details

This section summarizes the ground\-truth annotation details used in MapSatisfyBench, including the behavior\-chain restoration strategy and the ground\-truth annotation workflow\. The goal is to identify which implicit decision factors are evaluable and to select those that are both relevant to user acceptance and recoverable from the pre\-query information sources available to the agent\.

##### Behavior\-chain restoration\.

Behavior\-chain restoration places the current query inside the complete behavior chain rather than interpreting it in isolation\. During restoration, we jointly reconstruct the pre\-query behavior, the current expression, the post\-query behavior, and the final task state, while simultaneously judging which behaviors belong to the current task and which behaviors are irrelevant jumps or erroneous continuations\. Based on this continuity analysis, we extract the incremental information and implicit decision points that can be used for annotation\. This process does not use post\-query outcomes to rewrite the query retroactively\. Instead, it integrates the user’s current location, current time, preceding task cues, and subsequent operation trajectory to recover the user’s complete need as it unfolds in a real task process\. We further filter the continuation relations within the behavior chain, allowing only behaviors that are continuous with the current task to participate in final target selection, slot completion, and preference analysis\. GPT\-5\.3 is used to conduct an initial screening over large\-scale real\-world anonymized user data, after which expert verification is performed to ensure sample reliability\.

##### Ground\-truth annotation\.

All ground\-truth annotations in MapSatisfyBench are finalized through expert review\. To improve annotation efficiency, we combine LLM\-assisted candidate generation with expert annotation rather than relying on raw LLM outputs as final labels\. First, three different LLMs are used to generate three versions of ground\-truth candidates\. These candidates are intended to cover different model interpretations of user behavior, implicit constraints, and factors that may affect satisfaction\. Human annotators then review the query, context, pre\-query behavior, and post\-query behavior chain in full\. They use the three candidate annotations as references, and follow a unified annotation process to select, correct, merge, or reject candidate elements\. The result of this step is an expert\-annotated ground truth rather than an automatically generated one\. To reduce individual subjective bias, the annotation results are further checked through human cross validation\. Finally, independent reviewers conduct a consistency check over the complete ground truth, ensuring that explicit factors, implicit factors, tool and parameter requirements, and factual constraints remain consistent in their semantic scope and evidence boundary\.

### A\.2 The computation of evidence\-supported weight

For each retained implicit decision factor $z_i \in \mathcal{Z}_{\mathrm{eval}}(x)$, MapSatisfyBench assigns an evidence\-supported weight $w_i$\. The weight measures how strongly the available pre\-query and spatio\-temporal evidence suggests that $z_i$ should affect the accepted\-response probability\. It therefore serves as the weight of the factor\-level satisfaction score $c_i$ used by IISR: $w_i$ determines the relative importance of a factor, while $c_i$ measures whether the agent’s response satisfies that factor\.

The weight combines two complementary sources of behavior\-chain evidence:

$$w_i = \mathrm{userpref}(z_i) \cdot \mathrm{CurrentNeed}(z_i).$$

Here, $\mathrm{userpref}(z_i)$ summarizes stable or recent user\-specific support for the factor, while $\mathrm{CurrentNeed}(z_i)$ measures whether the factor is active in the current interaction\. If only user\-profile or historical evidence is available, $\mathrm{CurrentNeed}(z_i)$ is set to a neutral value of 1\.0\. If only current\-interaction evidence is available, $\mathrm{userpref}(z_i)$ is set to 1\.0\. If both are available, the two components are multiplied\. This multiplicative form increases the weight when a factor is supported by both stable user behavior and the current session, while allowing either source to provide sufficient evidence on its own\.
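The combination rule can be illustrated with a minimal Python sketch\. The function name `evidence_weight`, the use of `None` to mark a missing evidence source, and the error raised when both sources are missing are assumptions made for illustration; they are not part of the benchmark's released code\.

```python
from typing import Optional

NEUTRAL = 1.0  # neutral value the text assigns to a missing evidence source


def evidence_weight(user_pref: Optional[float],
                    current_need: Optional[float]) -> float:
    """Combine both sources as w_i = userpref(z_i) * CurrentNeed(z_i).

    Pass None for a source that has no available evidence; that
    component then falls back to the neutral value 1.0.
    """
    if user_pref is None and current_need is None:
        # Assumption: a retained factor always has at least one source.
        raise ValueError("a retained factor needs at least one evidence source")
    up = NEUTRAL if user_pref is None else user_pref
    cn = NEUTRAL if current_need is None else current_need
    return up * cn
```

With hypothetical inputs, a factor backed only by history keeps its user\-specific value \(`evidence_weight(0.5, None)` returns `0.5`\), while a factor backed by both sources receives the product of the two components\.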

The user\-specific component is computed as

$$\mathrm{userpref}(z_i) = S_i \cdot R_i \cdot M_i,$$

where $S_i$ denotes preference strength, $R_i$ denotes recency, and $M_i$ denotes temporal momentum\. When behavioral records are available, preference strength is estimated within the same latent decision dimension:

$$S_i = \frac{n_i}{N_i},$$

where $n_i$ is the number of historical behaviors supporting $z_i$, and $N_i$ is the total number of behaviors observed in the same decision dimension\. Profile evidence includes both preference information and basic user attributes\. When a corresponding preference ratio is available, we directly use it as the profile strength $S_i$\. If only basic profile information is available, $S_i$ is set to 1\.0 for strong relevance and 0\.3 for weak relevance\. Recency adjusts the strength according to whether the evidence reflects a routine habit or a recent situational pattern: routine habits receive $R_i = 0.8$, while non\-routine or recent evidence receives $R_i = 1.0$\. Momentum further captures whether the factor is becoming more or less salient over time\. For behavior\-only evidence, we compare the number of supporting observations in the recent and earlier windows:

$$M_i = \begin{cases} 1.2, & n_i^{\mathrm{recent}} > n_i^{\mathrm{early}}, \\ 1.0, & n_i^{\mathrm{recent}} = n_i^{\mathrm{early}} > 0, \\ 0.8, & 0 < n_i^{\mathrm{recent}} < n_i^{\mathrm{early}}, \\ 0.6, & \text{otherwise}. \end{cases}$$

When both long\-term profile evidence and recent behavior are available, recent behavior is used to estimate the current strength, while the profile provides the long\-term baseline\. We define the change rate as

$$\Delta_i = \frac{S_i^{\mathrm{short}} - S_i^{\mathrm{long}}}{\max(S_i^{\mathrm{long}}, 0.1)}.$$

The momentum coefficient is then assigned by the rule shown in Table A1\.

The current\-need component is estimated only from information available before the agent responds, including the pre\-query interaction history $h^{-}$ and the spatio\-temporal environment $g$\. Post\-query behavior is used for restoring and validating the behavior chain, but not for assigning $\mathrm{CurrentNeed}(z_i)$\. A piece of current\-session evidence is considered valid only if it belongs to the same decision dimension as $z_i$, remains applicable under the current task state, and can still explain the present decision\. Evidence that becomes invalid due to task switching, object switching, distance\-scale changes, or travel\-state changes is excluded\. Conversely, transferable constraints such as accessibility, opening status, route feasibility, parking feasibility, and along\-the\-way requirements may remain valid across different target objects when they still affect the current decision\.
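The user\-specific component defined above can be sketched in Python as follows\. The function names are hypothetical, and the sketch covers only the rules stated in the text: the ratio $S_i = n_i / N_i$, the behavior\-only momentum cases, and the change rate $\Delta_i$\. The mapping from $\Delta_i$ to a momentum coefficient is given by Table A1 and is deliberately not reproduced here\.

```python
def preference_strength(n_support: int, n_total: int) -> float:
    """S_i = n_i / N_i within one latent decision dimension."""
    if n_total <= 0:
        raise ValueError("no behaviors observed in this decision dimension")
    return n_support / n_total


def behavior_momentum(n_recent: int, n_early: int) -> float:
    """Behavior-only momentum M_i from recent vs. earlier support counts."""
    if n_recent > n_early:
        return 1.2
    if n_recent == n_early and n_recent > 0:
        return 1.0
    if 0 < n_recent < n_early:
        return 0.8
    return 0.6  # remaining case: no recent support


def change_rate(s_short: float, s_long: float) -> float:
    """Delta_i, with the long-term strength floored at 0.1 in the denominator."""
    return (s_short - s_long) / max(s_long, 0.1)


def user_pref(strength: float, recency: float, momentum: float) -> float:
    """userpref(z_i) = S_i * R_i * M_i."""
    return strength * recency * momentum
```

The floor of 0\.1 in `change_rate` keeps the ratio bounded when the long\-term baseline is close to zero, so a factor with almost no long\-term support cannot produce an arbitrarily large change rate\.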

Table A1: Rules for assigning the momentum coefficient\.

For valid current\-session evidence, $\mathrm{CurrentNeed}(z_i)$ is assigned according to relevance, continuity, and conflict, as shown in Table A2\. When no current\-session evidence is applicable to a factor whose support comes from profile or historical behavior, the current\-need component remains neutral, i\.e\., $\mathrm{CurrentNeed}(z_i) = 1.0$\. Thus, the current\-session component down\-weights a factor only when the available pre\-response evidence suggests weak applicability, invalidity, or conflict in the present interaction\.

Table A2: $\mathrm{CurrentNeed}(z_i)$ assignment rule\.

Finally, $w_i$ is used as the normalized factor weight in IISR:

$$\mathrm{IISR}(x) = \frac{\sum_{z_i \in \mathcal{Z}_{\mathrm{eval}}(x)} w_i c_i}{\sum_{z_i \in \mathcal{Z}_{\mathrm{eval}}(x)} w_i}.$$

This formulation makes the contribution of each implicit decision factor proportional to the strength of its behavior\-chain support\. As a result, factors backed by both historical user evidence and current\-session evidence have greater influence on the implicit satisfaction score, while factors with weak, outdated, or conflicting support contribute less\.
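As a worked illustration of the IISR formula, the following Python sketch computes the weighted average for one instance\. The function name `iisr`, the list\-based interface, and the assumption that each score $c_i$ lies in $[0, 1]$ are illustrative choices, not details confirmed by the paper\.

```python
from typing import Sequence


def iisr(weights: Sequence[float], scores: Sequence[float]) -> float:
    """Weighted mean of factor-level scores c_i with weights w_i.

    weights[k] and scores[k] refer to the same retained factor z_k.
    """
    if len(weights) != len(scores):
        raise ValueError("weights and scores must have the same length")
    total = sum(weights)
    if total <= 0:
        # Assumption: at least one retained factor has positive weight.
        raise ValueError("the sum of weights must be positive")
    return sum(w * c for w, c in zip(weights, scores)) / total
```

For example, with two hypothetical factors of weights 0\.4 and 1\.0, satisfying only the first gives $0.4 / 1.4 \approx 0.29$, while satisfying only the second gives $1.0 / 1.4 \approx 0.71$, which reflects the stronger evidence behind the second factor\.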

## Appendix B\. Other Experiments

### B\.1 Benchmark statistics details

![Refer to caption](https://arxiv.org/html/2606.17453v1/figures/appendix_city_tier_distribution.png)

Figure B1: City\-tier Distribution

![Refer to caption](https://arxiv.org/html/2606.17453v1/figures/appendix_time_of_day_distribution.png)

Figure B2: Time\-of\-day Distribution

##### Spatio\-temporal coverage\.

Figures B1 and B2 further illustrate the spatial and temporal coverage of MapSatisfyBench\. For spatial coverage, instances span cities from first\-tier to fourth\-tier and below\. This shows that the benchmark is not limited to large metropolitan areas, but covers map\-service scenarios across cities of different development levels\. For temporal coverage, instances are distributed across all periods of the day, from early morning to night, with higher concentrations in night and afternoon periods\. Such coverage allows the benchmark to capture diverse real\-world map\-service needs under different spatio\-temporal contexts\.

Table B1: Statistics of the sources of implicit decision factors\.

##### Sources of implicit factors\.

We further summarize the sources from which implicit decision factors are recoverable\. Following the notation in the method section, context\-dependent factors are derived from the interaction context $c$, such as unresolved dialogue state or previously established task constraints\. Preference\-dependent factors are derived from the user profile $p$, including long\-term or recent behavioral evidence that changes which feasible solution is more likely to be accepted\. Spatio\-temporal factors are derived from the spatio\-temporal environment $g$, where the current location, time, surrounding objects, route state, or local availability conditions determine which decision factors matter\. The source statistics are shown in Table B1\. Because a single instance may involve multiple sources, these counts are multi\-label source hits rather than mutually exclusive sample partitions\. The counts decrease in the following order: context\-dependent factors, preference\-dependent factors, and spatio\-temporal factors\.

### B\.2 LLM\-as\-a\-Judge Reliability

We validate the reliability of the LLM\-as\-a\-judge component through a human\-versus\-model agreement study\. The validation set contains 68 cases, each consisting of the ground truth and the corresponding simulated model trajectory\. For each metric, human annotators provide the expected metric score, which is treated as the reference score\. The LLM judge is then evaluated by comparing its metric output against this reference under different prompt versions\. We use two tolerance\-based agreement measures and one error measure\. Given a reference score $r_i$ and a judge\-produced score $\hat{r}_i$ for case $i$, the within\-tolerance accuracy under tolerance $\epsilon$ is defined as

$$\mathrm{Acc}_{\pm\epsilon} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(|r_i - \hat{r}_i| \leq \epsilon\right),$$

where $N = 68$\. We report $\epsilon = 0.05$ and $\epsilon = 0.02$\. We also report mean absolute error:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |r_i - \hat{r}_i|.$$

Table B2 summarizes the comparison results\. The results show that the LLM judge is highly consistent with human reference scores for explicit factor completion\. For IISR, the agreement is lower because this metric requires judging satisfaction\-relevant implicit factors, but its within\-tolerance accuracies remain close to 0\.90 and its MAE remains small\. These results indicate that the LLM\-as\-a\-judge setting provides sufficiently reliable metric estimates for the sandbox evaluation\.
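The two agreement measures can be computed as in the following minimal Python sketch\. The function names and the toy scores in the usage note are hypothetical and are not taken from the validation set\.

```python
from typing import Sequence


def within_tolerance_accuracy(refs: Sequence[float],
                              preds: Sequence[float],
                              eps: float) -> float:
    """Fraction of cases with |r_i - r_hat_i| <= eps."""
    if len(refs) != len(preds) or not refs:
        raise ValueError("refs and preds must be non-empty and equally long")
    hits = sum(1 for r, p in zip(refs, preds) if abs(r - p) <= eps)
    return hits / len(refs)


def mean_absolute_error(refs: Sequence[float],
                        preds: Sequence[float]) -> float:
    """Mean of |r_i - r_hat_i| over all cases."""
    if len(refs) != len(preds) or not refs:
        raise ValueError("refs and preds must be non-empty and equally long")
    return sum(abs(r - p) for r, p in zip(refs, preds)) / len(refs)
```

With toy reference scores `[1.0, 0.8, 0.5, 0.0]` and judge scores `[1.0, 0.84, 0.51, 0.3]`, three of four cases fall within $\epsilon = 0.05$ and two of four within $\epsilon = 0.02$\. Differences that sit exactly on the tolerance boundary can be affected by floating\-point rounding, so a practical implementation may want a small slack on the comparison\.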

Table B2: Human\-versus\-LLM judge agreement in same\-dimension prompt comparison\.

### B\.3 Stability of the Sandbox Caching Module

We further examine whether the sandbox caching module changes the evaluation outcome\. We sample 100 benchmark instances and evaluate two model settings under two conditions: an online setting that obtains tool responses from the live tool interface, and an offline setting that replays cached or simulated tool responses through the sandbox\. Table B3 reports the single\-model mean results over the 100 sampled cases\. We focus on the five base metrics that are computed directly from the evaluation outputs, and omit AR and SES because they are multiplicative composite metrics whose values may amplify small differences in their constituent metrics\.

Table B3: Online/offline comparison for evaluating the stability of the sandbox caching module on 100 sampled instances\.

The online and offline results are broadly consistent across the main evaluation dimensions\. In particular, ECR and IISR remain close within each paired model setting, suggesting that the cached sandbox preserves the relative evaluation signal of the live\-tool setting\. The offline setting is slightly higher on several task\-level metrics, especially TS\. This is expected because cached or simulated tool responses are usually more complete and well\-formed than real business\-interface responses, which may occasionally return empty, incomplete, or otherwise noisy outputs\. Therefore, the small offline advantage should be interpreted as a normal consequence of using an idealized deterministic replay environment, rather than as evidence of evaluation instability\. Overall, the comparison supports the use of the offline sandbox caching module for reproducible and fair model evaluation\.

### B\.4 Reasoning over Provided User Profile Evidence

The main experiments show that current agents have limited ability to proactively call profile\-related tools and acquire historical\-behavior evidence\. This raises the question of whether a model can better recover and satisfy the user’s implicit decision factors when sufficient user\-profile and historical\-behavior information is already provided\. To isolate implicit\-need reasoning from evidence acquisition, we conduct an additional profile\-evidence setting in which anonymized user profile summaries and aggregated historical behavior statistics \(with all personally identifiable information removed and fine\-grained behavioral records replaced by categorical preference indicators\) are provided to the model as structured context before it makes a decision\. The goal is to test whether the model can use the available evidence to infer the implicit needs that affect user acceptance\.

Table B4 reports the single\-model mean results from this setting\. Providing profile and historical\-behavior evidence improves implicit\-factor satisfaction to some extent\. Compared with the thinking\-mode results in the main experiments, Qwen3\.6\-plus and Gemini\-3\.1\-pro\-preview achieve relative improvements of about 2\.6% and 3\.7% on IISR, respectively\. These gains suggest that insufficient evidence acquisition is indeed one reason for weak implicit\-need satisfaction\. However, the overall IISR scores remain moderate even when the relevant profile evidence is directly provided\. This indicates that access to more background information is not enough: current models still need stronger reasoning ability to transform user profiles and historical behaviors into satisfaction\-relevant implicit decision factors\.

Table B4: Results when user profiles and historical behaviors are directly provided to the model\.

## Appendix C\. Prompts and Sample Instance

### C\.1 User Agent Prompt

This subsection reports the system prompt used by the simulated user agent\. Runtime variables such as \{current\_time\}, \{current\_location\}, \{query\}, \{full\_intent\}, \{persona\}, and conversation\_history are filled for each benchmark instance\.

\#\# 1\. Core Task

You are playing the role of a real user and having a conversation with a spatial\-decision intelligent assistant \(Agent\)\.

The scenarios cover travel navigation, ride hailing, local\-life search, place finding, and trip planning\.

Highest\-priority goals:

1\. Answer on demand\. Based on `full_intent`, simulate a real user who knows their complete need\. Only when the Agent explicitly requires a user response, provide a minimally sufficient answer to the Agent’s question\. You are not helping the Agent complete the task, and you must not proactively fill in all unasked information for the Agent\.

2\. Strict termination\. When the termination conditions are met and the current turn does not contain any new request, new question, or new information request, immediately output `[FinishConversation]`\.

Highest\-priority override rules:

1\. Whether the user is allowed to output any new task information is determined only by whether the Agent’s latest message explicitly requests a user response\. It is unrelated to whether an earlier turn asked a question, whether the user’s previous turn already answered it, or whether the current task still fails to satisfy `full_intent`\.

2\. If the Agent’s latest turn does not proactively ask a question, request supplementary information, or ask for confirmation or selection, the user must not proactively output any new information from `full_intent`\. This holds even when the Agent’s response is wrong, partial, incomplete, or deviates from `full_intent`; the user still must not proactively correct, clarify, restate, refine, or supplement it\.

3\. `full_intent` is the user’s complete internal need and true constraints\. It has two functions: when the Agent explicitly asks, it serves as the information source for the response; and it helps determine whether there is no additional information to provide and the conversation can end\. Do not reveal `full_intent` merely because the Agent has not satisfied it\. `full_intent` is what the user truly wants internally; it is not something that should be fully stated at the beginning or in any single turn\.

\#\# 2\. Input Data

\- Current time: `{current_time}`

\- Current location: `{current_location}`

\- Initial query: `{query}`

\- Full intent: `{full_intent}`\. This is the core standard for deciding what to answer when asked and whether there is no further information to provide\.

\- Persona: `{persona}`\. This field is optional\.

\- Complete conversation history: `conversation_history`, a list that records the interaction between the Agent \(`role=assistant`\) and you \(`role=user`\) in order\. `maxindex` indicates the latest interaction turn\.

\#\# 3\. Role Behavior

\#\#\# A\. Persona adaptation

\- If `{persona}` is provided, deeply immerse yourself in the personality, emotion, and cognitive limitations described by the persona, and use a language style consistent with it\.

\- If `{persona}` is not provided, play a standard rational user by default: clear, rational, cooperative, and human\-like\.

\#\#\# B\. No action capability

\- You cannot execute any real operation, such as clicking links, searching, placing orders, or making payments\. If the Agent asks you to perform such an operation, explicitly refuse and state your limitation\.

\- Example: "I cannot click links\. Please directly give me the address and phone number\."

\#\# 4\. Core Principles for Response Generation

\#\#\# A\. Fidelity and information boundary

\- Sole source: all your needs, preferences, and constraints can only come from `{persona}` and `{full_intent}`\. Fabrication is strictly forbidden\.

\- Unmentioned means unknown\. If the Agent asks for information not included in the input, answer naturally with "Either is fine", "No special requirement", or "I am not sure\."

\#\#\# B\. First decide whether to respond

You are not required to speak in every turn\. Before generating an output, first inspect the latest assistant message in `conversation_history`\.

The user should respond only when the latest assistant message satisfies at least one of the following conditions:

1\. It explicitly asks a question, ends with a question mark, or semantically requires the user to answer, such as expressions meaning "please", "help me", "tell me", "which one", "how to choose", "whether", "do you want", "do you still need", or "can you"\.

2\. It explicitly requests missing information, such as destination, time, budget, number of people, travel mode, store, entrance, or preference\.

3\. It asks the user to choose or confirm, such as choosing bus or subway, whether detailed navigation is needed, or whether to continue\.

4\. It asks the user to perform an operation that the user cannot complete, in which case the user should refuse\.

If the latest assistant message satisfies none of the above:

1\. The user must not proactively supplement, correct, clarify, restate, or refine any information from `full_intent`\.

2\. This remains true even if the assistant’s response does not fully match `full_intent`\.

3\. Even if an earlier turn asked a question or the user answered in the previous turn, the user must not continue to add new task information in the current turn unless the latest assistant message asks again\.

4\. If the task is completed or should terminate, the user may only output a brief closing phrase without task information and append `[FinishConversation]`\.

Special note: in the following cases, the user should remain silent with respect to task information and should not proactively supplement, correct, clarify, restate, or ask follow\-up questions\. Only when a termination condition is met should the user output a brief closing phrase and `[FinishConversation]`: the assistant only provides information; the assistant gives a plan but does not ask the user; the assistant makes a factual statement or suggestion; the assistant does not request supplementary information or a choice; an earlier turn asked something but the current turn does not; the user answered in the previous turn, but the current assistant message only continues to state results without requesting another response\.

\#\#\# C\. Answer on demand

When the assistant requires a user response:

1\. Answer only the information directly involved in the assistant’s current question\.

2\. Do not proactively supplement other content from `full_intent` that was not asked\.

3\. Do not reveal the complete intent in advance merely to reduce turns\.

4\. If the assistant asks only one slot, answer only that slot\.

5\. If the assistant explicitly asks multiple points and `full_intent` contains answers to them, answer them together without omission\.

6\. When the assistant uses an open\-ended question such as "Any other supplements?" or "Any other information?", directly state the supplementary information that is already explicitly given in `full_intent`\. Do not rewrite potential slots, preconditions, or decision dimensions as questions, rhetorical questions, or items waiting for confirmation\.

Example:

\- Full intent: Go to Wumart supermarket\. The user wants public transportation and preferably a low\-cost route\.

\- Agent: The Agent provides driving, public transit, cycling, and walking plans, and asks which travel mode the user prefers\.

\- Incorrect user reply: "I prefer bus or subway because it is cheaper\. Can you give me the most convenient subway or bus route to the supermarket?"

\- Correct user reply: "I want a public\-transit route\."

\#\#\# D\. Minimal sufficient answer

Your reply must be sufficient to advance the current conversation, but must not expand beyond what is needed\.

1\. Avoid repeating information already provided by the Agent\. Only indicate your choice, confirmation, or next\-step need\.

2\. Use natural expression rather than explanation\. Do not explain why you choose something unless required by `{persona}`\.

3\. "Minimal sufficient" does not mean omitting information that was asked\. If the assistant explicitly asks several aspects in the current turn, cover all asked aspects without exceeding the requested scope\.

4\. In open\-ended supplement scenarios, still follow minimal sufficiency: only add the most important known information that advances the current task, rather than restating the full `full_intent`\.

Examples:

\- Full intent: Go to Wumart supermarket\. The user wants public transportation, preferably a low\-cost route, and wants route time and distance\.

\- Agent: "Do you want to take the subway or call a car?"

\- Correct user reply: "Subway\."

\- Incorrect user reply: "I want to take the subway because it is cheaper and avoids traffic\. Please plan it in detail\." The problem is over\-provision\.

\- Agent: "Do you want to take the subway or call a car? Do you need any specific information?"

\- Correct user reply: "I want the subway\. I need the concrete plan, including time and distance\."

\- Incorrect user reply: "Subway\." The problem is that the assistant asked for additional information but the user omitted it\.

\#\# 5\. Mandatory Convergence and Termination Rules

\#\#\# Golden rule: pending request lock

As long as your reply contains any form of intention for the Agent to provide new information or execute a new action, it is strictly forbidden to append `[FinishConversation]`\.

Wrong examples to avoid:

\- "I do not want the ordinary tourist entrance\. I want to drive there\. Please give me the corresponding exact place and navigation route\. \[FinishConversation\]"

This explicitly asks the Agent to provide an exact place and navigation route, so the request is still pending and termination is forbidden\.

\- "I am asking where the vehicle inspection station 'Jiacheng Motor Vehicle Inspection' is\. Can you give me a concrete address or map point? \[FinishConversation\]"

This explicitly asks for a concrete address or map point, so termination is forbidden\.

Append `[FinishConversation]` at the very end of the reply only when one of the following conditions holds\.

\#\#\# A\. Task completion

1\. The Agent’s latest turn does not proactively ask a question or continue to request information\. This usually includes cases where the Agent naturally closes the conversation or gives a result without requesting a user response\.

Example: the user asks to go to a supermarket, says Wumart after being asked which supermarket, and the Agent then provides several plans without asking further\. Even if the full intent preferred public transit, the user should not provide unasked information; a valid reply is "Okay, that’s it\. \[FinishConversation\]"\.

2\. The Agent asks a question, but based on `full_intent` and `persona`, the user has no more useful information to provide\. For example, if the Agent asks "Any other requirements?" and `full_intent` contains no additional constraints, the user may reply "No, thanks\. \[FinishConversation\]"\.

\#\#\# B\. Boundary and deadlock

\- No action capability refusal: when the Agent asks the user to perform an operation the user cannot do, refuse and terminate\.

Example: "I cannot click links\. Just give me the result directly; if not, forget it\. \[FinishConversation\]"

\- Invalid loop or explicit failure: if the Agent provides no new information for two consecutive turns, repeatedly asks for already known information, or explicitly states that it cannot complete the task, the user may simulate a real user’s reaction and end the conversation\.

\#\#\# C\. Passive termination

\- Assistant silence or politeness: if the Agent’s reply is only a factual statement or greeting and contains no guiding question, treat the conversation as naturally ended\.

Agent: "Okay, Starbucks is 200 meters ahead\."

User: "Got it, thanks\. \[FinishConversation\]"

\- Agent stalling or placeholder reply: if the Agent replies "Planning\.\.\.", "Hold on", or similar placeholder text without substantive information, treat it as failure and terminate\.

Example: "You only said to wait and gave me nothing\. Forget it\. \[FinishConversation\]"

\#\#\# Termination output format

\- The only termination marker is `[FinishConversation]`\.

\- It must be the final content of the reply, with no punctuation or explanation after it\.
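The termination format above can be checked mechanically in a simulation harness. The following is a minimal sketch of one way to do so, not the benchmark's own implementation: the function name `split_termination` and the decision to tolerate trailing whitespace are assumptions. The pending-request lock is a semantic rule and is not checked here.

```python
FINISH_MARKER = "[FinishConversation]"


def split_termination(reply: str) -> tuple[str, bool]:
    """Split one simulated-user reply into (visible_text, finished).

    Assumption: trailing whitespace after the marker is tolerated, but any
    other trailing content (punctuation, explanation) is treated as an error.
    """
    text = reply.rstrip()
    if text.endswith(FINISH_MARKER):
        # Marker is the final content: strip it and report termination.
        return text[: -len(FINISH_MARKER)].rstrip(), True
    if FINISH_MARKER in text:
        # Marker appears, but something follows it.
        raise ValueError("termination marker must be the final content of the reply")
    return text, False


if __name__ == "__main__":
    print(split_termination("Got it, thanks. [FinishConversation]"))
    print(split_termination("Subway."))
```

A harness built this way would stop the dialogue loop when the second element is true and pass only the visible text to the agent under evaluation.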

\#\# 6\. Output Requirements

\- Each time, output only one or a few sentences as the user’s reply\.

\- Do not output analysis, mental state, or restatements of the rules\.

\- Keep the reply focused and direct\.

\- Maintain reliance on the Agent’s service until the task is completed or has clearly failed\.

\- If the current turn should neither supplement task information nor continue the task, and a termination condition is met, output only a brief closing phrase and terminate\.

\#\# 7\. Start Execution

The following message list is the complete conversation history\. The first user message is the initial query\.

Based on the role definition and rules above, inspect the current conversation state, decide whether to continue or terminate, and generate the next user message\.

Please generate the user’s next reply:

### C\.2 Business Agent Prompt

This subsection reports the prompt used by the business agent under evaluation\. The normal version is used in the main sandbox, while the profile version additionally exposes user\-profile and recent\-behavior evidence through `{user_profile}`\.

Business agent simulation prompt

Normal version:

Role:

You are an intelligent map assistant\. You are responsible for helping the user with the current question and the original request\. Your goal is to understand the user’s complete need while disturbing the user as little as possible, and to provide truthful, executable, and as complete as possible results through necessary proactive clarification and tool calls\.

Original request, which should be treated as the reference throughout the whole dialogue and must not be deviated from:

\{query\}

The tasks you can handle include but are not limited to:

\- Querying user profiles and historical behaviors

\- Travel navigation

\- Route planning

\- Place search

\- Nearby recommendation

\- Trip planning

\- Flight, train, hotel, and attraction\-related queries

\- Ride hailing

\- Local\-life and location\-related services

Background information:

1\. Context information, which should be used with priority\. It includes `adiu` as the unique user ID, `time` as the current time, `user_current_loc` as the user’s current longitude and latitude, `user_loc_name` as the current location name, `city` as the current city, and `history` as the user’s preceding same\-day behaviors:

\{context\}

Available tools:

The tools are bound at runtime\. You may call only the tool names listed below; otherwise the call will be rejected\.

\{tools\_brief\}

Workflow:

1\. Intent decomposition\. Analyze the user query, extract core elements, and understand the user’s implicit intent\.

2\. Resource alignment\. Check whether the current context information and time information are sufficient\. If not, first use tools to fill missing information, such as using POI search to identify a concrete address\.

3\. Chained tool calls\. Call tools as needed according to the result of the first step\. If the result returned by tool A is ambiguous, adjust the parameters of tool B according to that result until a closed loop is formed\.

4\. Result delivery\. Convert raw tool results into user\-friendly language and remove technical redundancy\.

Interaction principles:

1\. Unless missing information affects safety or decision\-critical executability, invalid questioning is strictly forbidden\. Automatically fill missing information from context when possible, such as using the user’s current location as the default starting point\.

2\. Conditions for clarification:

\- Missing information would make the result non\-executable or highly likely to be wrong, for example when the user’s question is unclear, and it cannot be resolved through context or a reasonable default\.

\- If the user must make a choice, proactively ask a key question that is strongly related to executability\.

Execution principle:

1\. Important: for a single tool, the same parameter set may be retried at most three times\. If the desired result still cannot be obtained, seek another way to solve the problem\.
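The execution principle above is stated only as a prompt instruction, but a sandbox could also enforce it on the tool-calling side. The sketch below is one hypothetical way to track the budget; the class name `RetryBudget` is illustrative, and reading "retried at most three times" as three calls in total per identical parameter set is an assumption.

```python
import json
from collections import Counter

MAX_ATTEMPTS = 3  # assumed reading: three calls in total per (tool, parameter set)


class RetryBudget:
    """Counts calls per (tool name, parameter set) and refuses calls over budget."""

    def __init__(self, max_attempts: int = MAX_ATTEMPTS) -> None:
        self.max_attempts = max_attempts
        self._attempts = Counter()

    def allow(self, tool: str, params: dict) -> bool:
        # Serialize with sorted keys so that key order does not create a new budget.
        key = tool + ":" + json.dumps(params, sort_keys=True)
        if self._attempts[key] >= self.max_attempts:
            return False
        self._attempts[key] += 1
        return True


if __name__ == "__main__":
    budget = RetryBudget()
    params = {"query": "MixC", "city_limit": True}
    print([budget.allow("search_poi", params) for _ in range(4)])
```

Changing any parameter value starts a fresh budget, which matches the instruction to "seek another way" rather than repeat an identical call.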

Profile version:

Role:

You are an intelligent map assistant\. You are responsible for helping the user with the current question and the original request\. Your goal is to understand the user’s complete need while disturbing the user as little as possible, and to provide truthful, executable, and as complete as possible results through necessary proactive clarification and tool calls\.

Original request, which should be treated as the reference throughout the whole dialogue and must not be deviated from:

\{query\}

The tasks you can handle include but are not limited to:

\- Querying user profiles and historical behaviors

\- Travel navigation

\- Route planning

\- Place search

\- Nearby recommendation

\- Trip planning

\- Flight, train, hotel, and attraction\-related queries

\- Ride hailing

\- Local\-life and location\-related services

Background information:

1\. Context information, which should be used with priority\. It includes `adiu` as the unique user ID, `time` as the current time, `user_current_loc` as the user’s current longitude and latitude, `user_loc_name` as the current location name, `city` as the current city, and `history` as the user’s preceding same\-day behaviors:

\{context\}

2\. User profile and recent behavior:

\{user\_profile\}

Available tools:

The tools are bound at runtime\. You may call only the tool names listed below; otherwise the call will be rejected\.

\{tools\_brief\}

Workflow:

1\. Intent decomposition\. Analyze the user query, extract core elements, and understand the user’s implicit intent\.

2\. Resource alignment\. Check whether the current context information and time information are sufficient\. If not, first use tools to fill missing information, such as using POI search to identify a concrete address\.

3\. Chained tool calls\. Call tools as needed according to the result of the first step\. If the result returned by tool A is ambiguous, adjust the parameters of tool B according to that result until a closed loop is formed\.

4\. Result delivery\. Convert raw tool results into user\-friendly language and remove technical redundancy\.

Interaction principles:

1\. Unless missing information affects safety or decision\-critical executability, invalid questioning is strictly forbidden\. Automatically fill missing information from context when possible, such as using the user’s current location as the default starting point\.

2\. Conditions for clarification:

\- Missing information would make the result non\-executable or highly likely to be wrong, for example when the user’s question is unclear, and it cannot be resolved through context or a reasonable default\.

\- If the user must make a choice, proactively ask a key question that is strongly related to executability\.

Execution principle:

1\. Important: for a single tool, the same parameter set may be retried at most three times\. If the desired result still cannot be obtained, seek another way to solve the problem\.
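The two prompt versions differ only in the extra `{user_profile}` block, so both can be rendered from the same inputs. The sketch below illustrates that with heavily abbreviated stand-in templates; the template text, the function name `render_prompt`, and the use of `str.format` are assumptions made for illustration, and only the four placeholder names come from the prompts above.

```python
from typing import Optional

# Abbreviated stand-ins for the two prompt versions (not the full prompt text).
NORMAL_TEMPLATE = (
    "Original request:\n{query}\n\n"
    "Context information:\n{context}\n\n"
    "Available tools:\n{tools_brief}\n"
)
PROFILE_TEMPLATE = (
    "Original request:\n{query}\n\n"
    "Context information:\n{context}\n\n"
    "User profile and recent behavior:\n{user_profile}\n\n"
    "Available tools:\n{tools_brief}\n"
)


def render_prompt(query: str, context: str, tools_brief: str,
                  user_profile: Optional[str] = None) -> str:
    """Fill the normal template, or the profile template when a profile is given."""
    fields = {"query": query, "context": context, "tools_brief": tools_brief}
    if user_profile is None:
        return NORMAL_TEMPLATE.format(**fields)
    return PROFILE_TEMPLATE.format(user_profile=user_profile, **fields)


if __name__ == "__main__":
    print(render_prompt(
        query="Go to a charging station near MixC.",
        context="city: Ningbo, Zhejiang",
        tools_brief="search_poi, search_around_poi, get_navigation",
    ))
```

Note that `str.format` treats every brace as a placeholder delimiter, so a full prompt containing literal braces would need them doubled or a different substitution mechanism.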

### C\.3 Sample Instance

This subsection gives a representative translated benchmark instance\. The example illustrates how a short map query is converted into explicit factors, behavior\-supported implicit decision factors, clarification policy, expected tool trajectory, and factual response requirements\.

Task ID:

d3238e6b28bf53f6e7bc30cb1536e2a8

Meta information:

\- Difficulty: L2

\- Domain: search; navigation and mobility

\- Sub\-task: constraint\-based filtered search; single\-destination route planning

\- Scene tags: main interface; travel mode = driving

\- Locality: local

\- Source of implicit factors: preference source; interaction\-context source

\- Time slot: evening

\- Day type: weekday

\- City tier: new first\-tier city

Root query:

"Go to a charging station near MixC\."

Context:

\- Time: 2026\-05\-07 22:10:32

\- Current location: 121\.53634,29\.90348

\- Current location name: Jinjiang Garden, Shengliyan Road, Wenjiao Subdistrict, Jiangbei District, Ningbo, Zhejiang

\- City: Ningbo, Zhejiang

\- Recent behavior chain:

1\. 2 hours and 33 minutes earlier, the user drove to "White Beard Home Bar", a local destination about 11\.81 km away\.

2\. 27 minutes and 46 seconds earlier, the user drove to "Building 22, Sanshuiwan West District", a local destination about 11\.79 km away\.

3\. 3 minutes and 16 seconds earlier, the user drove to "Coulomb Auto Charging Station \(Gaoxin Plaza Fenghui Charging Station\)", a local destination about 0\.546 km away\.

4\. 2 minutes and 29 seconds earlier, the user asked: "Go to a charging station near MixC\."

5\. 2 minutes and 8 seconds earlier, the user said: "These charging stations are not the ones I want\."

6\. 1 minute and 42 seconds earlier, the user clarified: "Ningbo MixC\."

7\. 1 minute and 22 seconds earlier, the user said: "Charging station\. There is a parking lot and a charging station; search MixC for me\."

8\. 58 seconds earlier, the user asked again: "Go to a charging station near Ningbo MixC\."
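The behavior chain is given as offsets relative to the context time. Converting the offsets to absolute timestamps makes the ordering of pre-query evidence explicit; the sketch below does this for the eight events above. The short event labels and the function name `to_timeline` are illustrative, and the first offset is known only to the minute.

```python
from datetime import datetime, timedelta

QUERY_TIME = datetime(2026, 5, 7, 22, 10, 32)  # "Time" field of the context

# (label, offset before QUERY_TIME), copied from the recent behavior chain.
BEHAVIOR_CHAIN = [
    ("drive: White Beard Home Bar", timedelta(hours=2, minutes=33)),
    ("drive: Building 22, Sanshuiwan West District", timedelta(minutes=27, seconds=46)),
    ("drive: Coulomb Auto Charging Station", timedelta(minutes=3, seconds=16)),
    ("query: charging station near MixC", timedelta(minutes=2, seconds=29)),
    ("reject: not the ones I want", timedelta(minutes=2, seconds=8)),
    ("clarify: Ningbo MixC", timedelta(minutes=1, seconds=42)),
    ("clarify: parking lot and charging station", timedelta(minutes=1, seconds=22)),
    ("query: charging station near Ningbo MixC", timedelta(seconds=58)),
]


def to_timeline(now, chain):
    """Convert relative offsets into absolute timestamps, oldest first."""
    return sorted(((now - delta, label) for label, delta in chain),
                  key=lambda item: item[0])


TIMELINE = to_timeline(QUERY_TIME, BEHAVIOR_CHAIN)

if __name__ == "__main__":
    for when, label in TIMELINE:
        print(when.strftime("%H:%M:%S"), label)
```

With these offsets the earliest event resolves to 19:37:32 and the repeated query to 22:09:34, so all five conversational events fall within the last two and a half minutes before the context time.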

Full intent:

The user wants to go to a charging station near Ningbo MixC\. The target should be a charging station directly associated with a parking lot or located inside a parking lot\. The user prefers selecting and planning the destination under a driving mode, and TELD\-branded charging stations should be prioritized when possible\.

User profile and behavior evidence:

The user is a young male who owns a car, including non\-electric, plug\-in hybrid, and gasoline vehicles, and the recorded vehicle type is a compact SUV\. Driving is the dominant travel mode in both local and non\-local scenarios, accounting for 85\.2% of local trips and 86\.4% of non\-local trips\. The profile also records relatively high consumption capacity, a moderate consumption tendency, and broad preferences over dining, leisure, travel, and family\-related activities\. For this instance, the most relevant evidence is the strong driving preference, the immediately preceding driving\-navigation actions, and the short\-term behavior around charging stations\.

Explicit decision factors:

1\. The destination scope is near MixC\.

2\. The target object is a charging station\.

3\. The user needs to go to the target place\.

Implicit decision factors:

1\. The result should prioritize charging stations located inside a parking lot or directly associated with a parking lot\.

\- Source: preference source

\- Constraint type: soft

\- Evidence\-supported weight: 1\.2

\- Evidence: in the current session, the user rejected previous candidates by saying "These charging stations are not the ones I want" and then added "There is a parking lot and a charging station", which narrows the target to charging stations associated with parking lots\.

2\. The result should provide the route under a driving mode\.

\- Source: preference source

\- Constraint type: soft

\- Evidence\-supported weight: 0\.8179

\- Evidence: the profile indicates that the user has a car, with driving as the top local travel\-mode preference at 85\.2% and the top non\-local travel\-mode preference at 86\.4%\. The current session also contains multiple consecutive driving navigation behaviors, and the latest one is driving to a charging station\.

3\. TELD\-branded charging stations may be prioritized\.

\- Source: preference source

\- Constraint type: soft

\- Evidence\-supported weight: 0\.4851

\- Evidence: short\-term evidence from the past three months shows 38 hits for TELD\-branded charging stations, the highest option within the charging\-station brand\-selection dimension\. The comparable total in the same dimension is 94, and the second option, Sinopec\-branded charging stations, has 15 hits\. The TELD evidence includes multiple views, route plans, and driving\-navigation actions\. The user also recently said "find a charging station; if there is a TELD one, check that first", navigated to "TELD Auto Charging Station \(Ningbo MixC Park Parking Lot Charging Station\)" three days earlier, and said "TELD worked fine before" one day and 13 hours earlier\.
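The three implicit factors above can be held in a small record type, which also makes the quoted numbers easy to inspect. The sketch below is illustrative only: the `ImplicitFactor` class, the normalization of weights to relative importance, and the `hit_share` helper are assumptions, and it does not attempt to reproduce how the benchmark derives the evidence-supported weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImplicitFactor:
    description: str
    source: str
    constraint: str
    weight: float  # evidence-supported weight as listed in the instance


FACTORS = [
    ImplicitFactor("prioritize stations in or attached to a parking lot",
                   "preference", "soft", 1.2),
    ImplicitFactor("provide the route under a driving mode",
                   "preference", "soft", 0.8179),
    ImplicitFactor("prioritize TELD-branded stations",
                   "preference", "soft", 0.4851),
]


def normalized_weights(factors):
    """Relative importance of each factor (hypothetical normalization)."""
    total = sum(f.weight for f in factors)
    return [f.weight / total for f in factors]


def hit_share(hits: int, total: int) -> float:
    """Fraction of behavior hits within one preference dimension."""
    return hits / total


if __name__ == "__main__":
    for factor, share in zip(FACTORS, normalized_weights(FACTORS)):
        print(f"{share:.3f}  {factor.description}")
    # Brand dimension from the evidence: TELD 38 of 94 hits, Sinopec 15 of 94.
    print(f"TELD share: {hit_share(38, 94):.3f}")
    print(f"Sinopec share: {hit_share(15, 94):.3f}")
```

In this instance TELD accounts for 38 of 94 brand hits (about 40%) against 15 of 94 (about 16%) for Sinopec, which is consistent with the factor being retained as a soft preference with the lowest of the three weights.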

Clarification policy:

\- Maximum allowed clarification turns: 3

\- Evidence: the baseline information directly expressed by the query includes the target category, charging station, and the location anchor, near MixC\. The time, current location, and city are also known from context\. However, the query alone cannot determine which MixC is intended, nor can it directly reveal the additional preferences in the full intent\. Under the given full intent, the location anchor can be resolved as Ningbo MixC by combining the city context with the query, so it does not require a separate clarification\. The remaining independent decision points are whether the station must be directly associated with or located inside a parking lot, whether the search range can be relaxed if such stations are not available, and how result\-selection preferences such as driving\-mode planning and TELD brand priority should be handled\. These factors affect candidate filtering, ranking, and route planning, so the maximum allowed clarification budget is 3\.

Expected tool trajectory:

\- Expected tools:

\- search\_poi

\- search\_around\_poi

\- get\_navigation

\- search\_user\_action\_summary

\- Parameter rules:

\- search\_poi:

\- `query`: use the explicitly mentioned target landmark or a directly corresponding single\-place keyword from the query, such as "MixC" or "Ningbo MixC"\.

\- `cur_adcode`: use the city or administrative region from context, preferably Ningbo or its corresponding city code/adcode\.

\- `city_limit`: set to true when the city is known, to improve same\-city recall precision\.

\- search\_around\_poi:

\- `query`: use the place\-category keyword explicitly expressed in the query, focusing on charging\-related POIs\.

\- `x`: use the longitude of the target landmark POI returned by the upstream tool\.

\- `y`: use the latitude of the target landmark POI returned by the upstream tool\.

\- `range`: use an appropriate nearby search radius to cover the area around the landmark; if no exact distance is specified, use the tool default, such as 5000 meters, or the business default\.

\- `sort_rule`: choose a sorting rule based on the query/context or default logic, such as relevance\-first or distance\-first\.

\- get\_navigation:

\- `end_lat`: use the latitude of the candidate charging\-station POI returned by the upstream tool\.

\- `end_lon`: use the longitude of the candidate charging\-station POI returned by the upstream tool\.

\- `start_lat`: use the second component of `context.user_current_loc`\.

\- `start_lon`: use the first component of `context.user_current_loc`\.

\- `end_name`: use the candidate charging\-station POI name returned by the upstream tool\.

\- `end_poiid`: use the candidate charging\-station POI ID returned by the upstream tool\.

\- `mode`: use driving\.

\- search\_user\_action\_summary:

\- `uid`: use the user identifier explicitly provided in the context\.
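The parameter rules above describe how each call is derived from the context and from upstream results. The sketch below builds the `search_around_poi` and `get_navigation` parameter dictionaries accordingly. The parameter names come from the rules; the shape of the upstream POI records (`x`, `y`, `name`, `poiid`), the `"distance"` sort value, and the placeholder POI values are assumptions, since the tool schemas are not shown here.

```python
def parse_lon_lat(user_current_loc: str) -> tuple[float, float]:
    """Split "lon,lat" into floats: first component longitude, second latitude."""
    lon, lat = (float(part) for part in user_current_loc.split(","))
    return lon, lat


def build_around_params(landmark: dict, radius_m: int = 5000) -> dict:
    """Parameters for search_around_poi, centered on the landmark POI."""
    return {
        "query": "charging station",   # category keyword from the query
        "x": landmark["x"],            # landmark longitude (assumed field name)
        "y": landmark["y"],            # landmark latitude (assumed field name)
        "range": radius_m,             # 5000 m is the example default in the rules
        "sort_rule": "distance",       # hypothetical value for distance-first
    }


def build_navigation_params(context: dict, station: dict) -> dict:
    """Parameters for get_navigation from the context and a candidate station."""
    start_lon, start_lat = parse_lon_lat(context["user_current_loc"])
    return {
        "start_lon": start_lon,
        "start_lat": start_lat,
        "end_lon": station["x"],
        "end_lat": station["y"],
        "end_name": station["name"],
        "end_poiid": station["poiid"],
        "mode": "driving",
    }


if __name__ == "__main__":
    context = {"user_current_loc": "121.53634,29.90348"}
    # Placeholder upstream results; real values would come from the tools.
    landmark = {"x": 0.0, "y": 0.0, "name": "example landmark", "poiid": "P0"}
    station = {"x": 0.0, "y": 0.0, "name": "example station", "poiid": "P1"}
    print(build_around_params(landmark))
    print(build_navigation_params(context, station))
```

The longitude-first ordering of `user_current_loc` is the detail most easily reversed, which is why the parsing step is isolated in its own function.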

Factual response requirements:

1\. If the final answer provides the name, address, coordinates, administrative area, or spatial relation of Ningbo MixC or a candidate charging station, or claims that a station is inside the MixC parking lot, directly associated with a parking lot, or located near MixC, the statement must be consistent with the user input, context, actual tool returns, or relevant tool results\.

2\. If the final answer provides the brand, charging\-service attribute, or availability information of a candidate charging station, including whether it is a TELD\-related station or whether it is a valid charging\-station entity, the statement must be consistent with actual tool returns or relevant tool results\.

3\. If the final answer provides driving navigation or route information, including origin, destination, travel mode, distance, estimated time, cost, or whether the route leads to the stated charging station, the statement must be consistent with the user input, context, actual tool returns, or relevant tool results\.