Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use

arXiv cs.AI 07/02/26, 04:00 AM Papers
llm-agents tool-use generalization open-world fine-tuning reinforcement-learning icml
Summary
This paper introduces OpenAgent, a problem setting for tool-use agents in open-world scenarios with distributional shifts, and proposes Perturbation-Augmented Fine-Tuning to improve robustness. Experiments reveal that both SFT and RL agents degrade under environmental shifts.
arXiv:2607.01084v1 Announce Type: new Abstract: While Large Language Model (LLM) agents demonstrate proficiency in static benchmarks, their deployment in real-world scenarios is hindered by the dynamic nature of user queries, tool sets, and interaction dynamics. To address this generalization gap, we formalize OpenAgent (Tool-Use Agent in Open-World), a problem setting characterized by distributional shifts across query, action, observation, and domain dimensions. To systematically diagnose its impact, we construct a controlled sandbox environment where we define fine-grained environmental shifts across a four-tier hierarchy, Perception, Interaction, Reasoning, and Internalization, and conduct a comprehensive series of experiments. Our analysis yields a series of key insights, demonstrating that agents trained via both Supervised Fine-Tuning (SFT) and Reinforcement Learning suffer from varying degrees of performance degradation when confronting open environmental shifts. Building on these insights, we propose Perturbation-Augmented Fine-Tuning, a disturbance-based intervention strategy for SFT that lays the foundation for enhancing agent robustness and utility in realistic environments. Our code will be released at: https://github.com/LAMDA-NeSy/OpenAgent.
# Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use
Source: [https://arxiv.org/html/2607.01084](https://arxiv.org/html/2607.01084)
###### Abstract

While Large Language Model \(LLM\) agents demonstrate proficiency in static benchmarks, their deployment in real\-world scenarios is hindered by the dynamic nature of user queries, tool sets, and interaction dynamics\. To address this generalization gap, we formalize OpenAgent \(Tool\-Use Agent in Open\-World\), a problem setting characterized by distributional shifts across query, action, observation, and domain dimensions\. To systematically diagnose its impact, we construct a controlled sandbox environment where we define fine\-grained environmental shifts across a four\-tier hierarchy, *Perception*, *Interaction*, *Reasoning*, and *Internalization*, and conduct a comprehensive series of experiments\. Our analysis yields a series of key insights, demonstrating that agents trained via both Supervised Fine\-Tuning \(SFT\) and Reinforcement Learning suffer from varying degrees of performance degradation when confronting open environmental shifts\. Building on these insights, we propose Perturbation\-Augmented Fine\-Tuning, a disturbance\-based intervention strategy for SFT that lays the foundation for enhancing agent robustness and utility in realistic environments\. Our code will be released at: [https://github\.com/LAMDA\-NeSy/OpenAgent](https://github.com/LAMDA-NeSy/OpenAgent)\.

Machine Learning, ICML

## 1 Introduction

The integration of Tool Learning and the Model Context Protocol \(MCP\) \(Anthropic, [2024](https://arxiv.org/html/2607.01084#bib.bib46)\) has catalyzed a paradigm shift in Large Language Model \(LLM\) agents, enabling them to navigate external environments to solve complex, multi\-step tasks \(Guo et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib35); Wang et al\., [2024b](https://arxiv.org/html/2607.01084#bib.bib60); Qu et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib61)\)\. Optimized via Supervised Fine\-Tuning \(SFT\) and Reinforcement Learning \(RL\), recent open\-source models \(Hsieh et al\., [2023](https://arxiv.org/html/2607.01084#bib.bib13); Qu et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib15); Qwen et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib2); Bai et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib1)\) have achieved tool invocation proficiency that rivals proprietary frontiers like GPT\-4 \(Achiam et al\., [2023](https://arxiv.org/html/2607.01084#bib.bib57)\) and Gemini \(Team et al\., [2023](https://arxiv.org/html/2607.01084#bib.bib4)\)\. As illustrated in Figure [1](https://arxiv.org/html/2607.01084#S2.F1), under the prevailing static world assumption, where the distribution of tools, schemas, and interaction logic remains consistent between training and inference \(Shen et al\., [2023](https://arxiv.org/html/2607.01084#bib.bib29); Yin et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib30); He et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib37)\), both SFT and RL paradigms demonstrate stable and continuous performance gains, eventually converging on near\-perfect success rates\.

However, this stability is often an artifact of the closed\-set nature of current benchmarks\. Real\-world deployment is fundamentally non\-stationary: APIs are deprecated, tool schemas evolve, and user instructions are frequently ambiguous\. When these environmental dynamics diverge from training priors, the apparent mastery observed in static benchmarks often proves fragile\. This discrepancy raises a fundamental research question: Do current training paradigms enable agents to generalize to the open world?

To rigorously address this, we formally define OpenAgent \(Tool\-Use Agent in Open\-World\), a problem setting characterizing shifts across four dimensions: User Queries \($\Delta\mathcal{Q}$\), Tool Sets \($\Delta\mathcal{A}$\), Interaction Dynamics \($\Delta\mathcal{O}$\), and Task Domain \($\Delta\mathcal{D}$\)\. To isolate the impact of these shifts from the inherent instability and transient noise of real\-world APIs, we establish a controlled sandbox environment\. This setup allows for controlled probing, enabling us to maintain a pure closed\-set baseline while systematically injecting open\-world perturbations across a four\-tier diagnostic framework: Perception, Interaction, Reasoning, and Internalization\.

Our comprehensive evaluations reveal varying degrees of generalization and adaptability in SFT and RL models under open\-world settings, while further identifying their respective structural weaknesses\. We find that both paradigms suffer from performance degradation in open environments, albeit through distinct failure modes\. While SFT agents are prone to trajectory overfitting and brittle symbolic anchoring, RL agents, despite demonstrating better semantic grounding, remain vulnerable to boundary blindness due to a teleological bias in their reward structures\.

Building on these insights, we introduce Perturbation\-Augmented Fine\-Tuning \(PAFT\), a data\-centric intervention strategy for SFT\. By injecting controlled observation anomalies and symbolic noise into training trajectories, PAFT decouples models from shallow pattern memorization and restores robust generalization in open worlds\.

In summary, the contributions of this work are threefold:

- Formalization of Open Environment in Tool Use: We formally define the OpenAgent setting, characterizing the shifts in queries, tool sets, interaction dynamics, and task domain inherent to real\-world deployments\.
- Systematic Evaluation in Open Environments: We establish a controlled sandbox environment and conduct a comprehensive series of experiments to evaluate the robustness of SFT and RL agents across varying levels of environmental shifts\.
- Key Insights and Methods: We identify critical failure modes in current paradigms, specifically symbolic anchoring in SFT and boundary blindness in RL, and propose PAFT to effectively mitigate performance degradation\.

##### Conflict of Interest Disclosure\.

The authors declare no conflicts of interest\.

## 2 Related Works

![Refer to caption](https://arxiv.org/html/2607.01084v1/x1.png)

Figure 1: Performance evolution of SFT and RL agents in the closed setting\. Both approaches show stable and continuous improvement throughout the training process\.

![Refer to caption](https://arxiv.org/html/2607.01084v1/x2.png)

Figure 2: Challenges in OpenAgent setting, including query, action, observation and domain shifts\.

![Refer to caption](https://arxiv.org/html/2607.01084v1/x3.png)

Figure 3: Architecture Diagram of the Evaluation Task\. We partition this evaluation task architecture diagram into four levels from shallow to deep: Perception, Interaction, Reasoning, and Internalization\.

##### Post\-training Paradigms for Agents

Current methodologies fall into three categories\. SFT\-based works like GenTool \(He et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib37)\), TOUCAN \(Xu et al\., [2025b](https://arxiv.org/html/2607.01084#bib.bib40)\), and others \(Shi et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib18); He et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib37); Yuan et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib17); Lin et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib84); Mo et al\., [2026](https://arxiv.org/html/2607.01084#bib.bib3)\) standardize tool invocation via token prediction\. Conversely, RL frameworks such as ToolRL \(Qian et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib12)\), DeepEyes \(Zheng et al\., [2026](https://arxiv.org/html/2607.01084#bib.bib31)\), and others \(Feng et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib42); Yu et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib54); Qian et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib12)\) utilize reward mechanisms to drive robust decision\-making\. Hybrid paradigms including ReTool \(Feng et al\., [2026](https://arxiv.org/html/2607.01084#bib.bib10)\), DeepEyesV2 \(Hong et al\., [2026](https://arxiv.org/html/2607.01084#bib.bib32)\), and others \(Su et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib39); Li et al\., [2026](https://arxiv.org/html/2607.01084#bib.bib38)\) combine SFT initialization with RL exploration\. Following recent advanced closed\-source models like Deepseek\-V3\.2, GLM\-4\.5 \(Zeng et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib50)\), and Kimi K2 \(Team et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib51)\), we select full\-parameter SFT and RL as baselines\.

##### Evaluation Paradigms for Agents

Agent capabilities are evaluated via local tools \(Wang et al\., [2024a](https://arxiv.org/html/2607.01084#bib.bib43); Guo et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib35); Huang et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib56); Ye et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib58)\) and remote protocols \(Li et al\., [2023](https://arxiv.org/html/2607.01084#bib.bib55); Wang et al\., [2024c](https://arxiv.org/html/2607.01084#bib.bib59); Gao et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib45); Wang et al\., [2026](https://arxiv.org/html/2607.01084#bib.bib33)\)\. Existing benchmarks mainly prioritize task success \(Mialon et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib36); Liu et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib44); Patil et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib85); Shao et al\., [2026](https://arxiv.org/html/2607.01084#bib.bib25)\), while robustness studies focus on inference\-time perturbations \(Ye et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib7); Xiong et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib47)\)\. Recently, agent benchmarks have moved further toward realistic environments \(Liu et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib63); Zhou et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib64); Drouin et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib65); Jimenez et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib66); Koh et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib67); Trivedi et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib68); Ma et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib69); Xie et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib70); Xu et al\., [2025a](https://arxiv.org/html/2607.01084#bib.bib73)\)\.
Unlike open\-environment learning for passive models \(Guo et al\., [2020](https://arxiv.org/html/2607.01084#bib.bib79); Li et al\., [2021](https://arxiv.org/html/2607.01084#bib.bib78); Zhou, [2022](https://arxiv.org/html/2607.01084#bib.bib77); Guo and Li, [2022](https://arxiv.org/html/2607.01084#bib.bib80); Jia et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib82); Guo et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib81); Cheng et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib83)\), OpenAgent studies shifts in multi\-step tool\-use interaction\. While prior work \(Chu et al\., [2025](https://arxiv.org/html/2607.01084#bib.bib11)\) shows RL’s advantage over SFT, it lacks tool\-use contexts; we address this gap by analyzing training\-to\-testing transfer under open\-environment shifts\.

## 3 Preliminaries

### 3\.1 Problem Formulation

We model the agent as a sequential decision\-making process $\mathcal{M}=\langle\mathcal{Q},\mathcal{A},\mathcal{O},\pi\rangle$, comprising user queries $\mathcal{Q}$, tool space $\mathcal{A}$, and observation space $\mathcal{O}$\. Given a query $q\in\mathcal{Q}$, at each step $t$, the agent uses policy $\pi_{\theta}(a_{t}\mid h_{t})$ to select action $a_{t}\in\mathcal{A}$ based on history $h_{t}=(q,a_{1},o_{1},\dots,a_{t-1},o_{t-1})$\. The environment returns observation $o_{t}\in\mathcal{O}$, which updates the history to $h_{t+1}=(h_{t},a_{t},o_{t})$ until task completion\. Our core challenge is the generalization of the model when test\-time distributions diverge from training priors\.
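
The interaction loop in this formulation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `rollout`, `toy_policy`, `toy_env`, and the `FINAL:` convention for ending an episode are hypothetical names introduced here.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # one (action a_t, observation o_t) pair


def rollout(
    query: str,
    policy: Callable[[str, List[Step]], str],
    env_step: Callable[[str], str],
    max_steps: int = 8,
) -> Tuple[Optional[str], List[Step]]:
    """Grow the history h_{t+1} = (h_t, a_t, o_t) until the policy answers."""
    steps: List[Step] = []
    for _ in range(max_steps):
        action = policy(query, steps)        # a_t chosen from h_t = (q, steps)
        if action.startswith("FINAL:"):      # hypothetical stop convention
            return action[len("FINAL:"):].strip(), steps
        observation = env_step(action)       # o_t returned by the environment
        steps.append((action, observation))
    return None, steps                       # step budget exhausted


def toy_policy(query: str, steps: List[Step]) -> str:
    # Call one tool, then report whatever it returned.
    return f"lookup({query})" if not steps else f"FINAL: {steps[-1][1]}"


def toy_env(action: str) -> str:
    return "42" if action.startswith("lookup") else "Error"


answer, trace = rollout("nearest_station", toy_policy, toy_env)
```

Because every observation is appended to the history, swapping in a perturbed `env_step` changes the input to every later policy call.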

### 3\.2 The OpenAgent Setting: Distributional Shifts

Unlike distributional shifts in passive models, where each input is perturbed independently, tool\-use agents face *cascading* shifts along the agent\-environment loop\. A perturbation at step $t$ may alter observation $o_{t}$, which enters the history $h_{t'}$ and conditions all subsequent decisions $\pi_{\theta}(a_{t'}\mid h_{t'})$ for $t'>t$, causing errors to compound along the trajectory\. As illustrated in Figure [2](https://arxiv.org/html/2607.01084#S2.F2), OpenAgent covers query shifts \($q\rightarrow q'$\), action\-space shifts including perturbed existing tools and newly introduced tools, observation shifts such as `None` or `Error`, and joint domain shifts where all interaction elements change together\. We decompose this interactive shift space into four dimensions\.

Intent Shift in Query Space \($\Delta\mathcal{Q}$\)\. The intent distribution $P(q)$ shifts across both semantic directives and linguistic surface forms\. Unlike passive input perturbations, query misinterpretation in tool\-use agents propagates through the trajectory: an initial misparse may seed a misaligned tool chain whose errors compound across subsequent steps\.

Structural Shift in Action Space \($\Delta\mathcal{A}$\)\. The tool space becomes non\-stationary \($\mathcal{A}_{\mathrm{test}}\neq\mathcal{A}_{\mathrm{train}}$\), spanning multiple levels of change: *surface drift*, which renames identifiers while preserving functionality; *semantic conflict*, which retains familiar names but alters documented behavior; and *structural reconfiguration*, which merges, splits, or inverts tool dependencies\. The agent must distinguish cosmetic changes from fundamental reconfigurations and ground actions in the current tool semantics\.

Dynamics Shift in Observation Space \($\Delta\mathcal{O}$\)\. The feedback channel becomes non\-stationary: test\-time observations may include return formats, error states, null values, or corrective instructions that are absent or rare in training\. Unlike static outputs, these observations directly update the agent’s belief state and reshape the subsequent action space\. Crucially, the agent must discriminate *informative* anomalies that require policy adjustment from *terminal* failures that require task refusal, a distinction absent in passive settings where each input simply calls for a prediction\.

Compositional Domain Shift \($\Delta\mathcal{D}$\)\. This shift jointly changes $\mathcal{Q}$, $\mathcal{A}$, and $\mathcal{O}$ into a new domain while preserving a latent problem\-solving structure $\mathcal{G}$\. Despite distinct surface semantics and interfaces, the agent must transfer the underlying reasoning topology by re\-grounding its nodes and dependencies in the target domain, rather than relying on domain\-specific lexical or procedural patterns\.

## 4 Evaluation Tasks

### 4\.1 Experimental Setup

##### Controlled Sandbox Environment\.

We utilize a synthetic sandbox environment to enable controlled probing of agent capabilities\. Unlike real\-world APIs, which introduce uncontrollable transient noise, a sandbox allows us to precisely isolate environmental variables and establish a pure closed\-environment baseline against which open\-environment shifts are measured\. We synthesized a diverse set of tool\-use tasks focused on simple geographical POI queries and calculation problems to test reasoning boundaries; the full specifications are detailed in Appendix [A](https://arxiv.org/html/2607.01084#A1)\.

##### Datasets and Protocols\.

The dataset comprises 6,050 training samples and 880 evaluation samples\. To ensure rigorous evaluation, we enforced a strict separation between the datasets where no variation patterns from the test set appear during the training phase\. This guarantees that the reported performance reflects genuine robustness rather than information leakage or pattern memorization\.

##### Training Implementation\.

We employ Qwen2\.5\-7B\-Instruct \(Qwen et al\., [2024](https://arxiv.org/html/2607.01084#bib.bib2)\) as the backbone model\. The SFT stage uses full\-parameter fine\-tuning, while the RL stage adopts the GRPO framework with sparse answer and format rewards\. To further ensure a fair comparison with SFT, the RL environment provides only basic error codes upon failure without detailed guidance\. We trained both models under comparable computational budgets until they achieved convergence in the closed environment, as illustrated in Figure [1](https://arxiv.org/html/2607.01084#S2.F1)\. Training details are provided in Appendix [B](https://arxiv.org/html/2607.01084#A2)\.
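
For readers unfamiliar with GRPO, the sketch below illustrates its group\-relative advantage: the rewards of several rollouts sampled for the same query are standardized against that group's own mean and standard deviation, so no learned value function is required. The reward weights \(`w_answer`, `w_format`\) and all function names are assumptions made for illustration; this section does not state the reward values actually used.

```python
import statistics
from typing import List


def sparse_reward(answer_correct: bool, format_ok: bool,
                  w_answer: float = 1.0, w_format: float = 0.1) -> float:
    """Outcome-only reward with no credit for intermediate steps.

    The weights are assumed values, not taken from the paper.
    """
    return w_answer * float(answer_correct) + w_format * float(format_ok)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each rollout's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one query: only the first reaches the correct answer.
group = [
    sparse_reward(True, True),
    sparse_reward(False, True),
    sparse_reward(False, True),
    sparse_reward(False, False),
]
advantages = group_relative_advantages(group)
```

Only the correct rollout receives a positive advantage here, and a group whose rewards are all equal yields zero advantage for every rollout.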

Metrics\. We employ four metrics to assess performance across correctness, efficiency, robustness, and adaptability:

- Tool Error Rate \(TER\): Measures robustness via the ratio of invalid invocations to total calls \($TER=N_{error}/N_{total}$\)\. A lower TER indicates precise adherence\.
- Active Exploration Score \(AES\): Quantifies adaptability by penalizing “cognitive inertia” when tools fail:

  $$AES=e^{-(n_{old}-1)} \tag{1}$$

  where $n_{old}$ is the count of repeated invalid attempts\. A score of $1.0$ signifies immediate adaptation\.
- Average Tool Chain Length \(ATCL\): Gauges efficiency by averaging tool invocations on solvable tasks\. Lower values reflect concise reasoning topologies\.
- Refusal Rate \(RR\): Evaluates boundary awareness in unsolvable scenarios by measuring the ratio of correctly identified unsolvable cases \($RR=N_{refusal}/N_{unsolvable}$\)\. A higher RR indicates superior active boundary detection\.
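
The four metrics above translate directly into code. The following is a minimal sketch; the function names are chosen here for illustration, and the zero\-call guard in `tool_error_rate` and the assumption $n_{old}\geq 1$ are not specified in the text.

```python
import math
from typing import Sequence


def tool_error_rate(n_error: int, n_total: int) -> float:
    """TER = N_error / N_total (returns 0.0 when no calls were made; assumed)."""
    return n_error / n_total if n_total else 0.0


def active_exploration_score(n_old: int) -> float:
    """AES = exp(-(n_old - 1)); assumes n_old >= 1, so the maximum is 1.0."""
    return math.exp(-(n_old - 1))


def average_tool_chain_length(chain_lengths: Sequence[int]) -> float:
    """ATCL: mean number of tool invocations over solvable tasks."""
    return sum(chain_lengths) / len(chain_lengths)


def refusal_rate(n_refusal: int, n_unsolvable: int) -> float:
    """RR = N_refusal / N_unsolvable."""
    return n_refusal / n_unsolvable


ter = tool_error_rate(3, 12)
aes_values = [active_exploration_score(n) for n in (1, 2, 3)]
atcl = average_tool_chain_length([2, 3, 4])
rr = refusal_rate(9, 10)
```

The exponential form makes AES fall quickly: one invalid attempt scores $1.0$, two score $e^{-1}\approx 0.368$, and three score $e^{-2}\approx 0.135$.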

### 4\.2 Tier 1: Perception Generalization

In the context of agentic interaction, perception acts as the cognitive gateway grounding raw observations into actionable representations\. To successfully initiate a task, an agent must traverse three representational alignments: deciphering unstructured user intents, aligning with rigid interface protocols, and comprehending functional tool definitions\. We systematically evaluate the generalization boundaries of these capabilities through the following three dimensions\.

Instruction Robustness \($\Delta\mathcal{Q}$\)\. Real\-world user instructions are rarely canonical\. We evaluate whether the agent can extract invariant intents from non\-standard inputs by introducing two distinct types of variations\.

- Instruction Blurring: We simulate scenarios where user intent is obfuscated by syntactic variance \(e\.g\., colloquialisms, rhetorical questions\) or semantic ambiguity \(e\.g\., implicit references, omitted entities\)\. This tests the model’s ability to reconstruct exact intents without relying on explicit keyword matching\.
- Information Redundancy: We test the robustness of the model’s attention mechanism against information overload\. By embedding the core instruction within irrelevant context or verbose background details, we evaluate the agent’s capability to filter out red herrings and lock onto functional constraints\.

Schema Adaptability \($\Delta\mathcal{A}$\)\. Agents often overfit to specific API formats seen during training\. In this experiment, we introduce protocol shifts by modifying the reserved keywords in the tool invocation schema\. For instance, we alter the parameter dictionary key from the standard `arguments` to variants like `args` or `parameters`\. This verifies whether the model comprehends the abstract concept of parameter passing or merely performs rigid template completion based on memorized structural tokens\.
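
A protocol shift of this kind can be illustrated as follows. This is a sketch under assumptions: the call layout with `name` plus a parameter dictionary follows common function\-calling formats rather than the paper's exact schema, and `search_poi`, `shift_schema_key`, and `parse_call` are hypothetical names.

```python
import json
from typing import Any, Dict, Tuple


def shift_schema_key(call: Dict[str, Any], new_key: str,
                     old_key: str = "arguments") -> Dict[str, Any]:
    """Return a copy of the call with the reserved parameter key renamed."""
    return {(new_key if k == old_key else k): v for k, v in call.items()}


def parse_call(raw: str, param_key: str) -> Tuple[str, Dict[str, Any]]:
    """Parse a serialized call under whichever key the current schema declares."""
    call = json.loads(raw)
    return call["name"], call[param_key]


# Training-time format versus the shifted test-time format.
train_call = {"name": "search_poi", "arguments": {"city": "Nanjing"}}
test_call = shift_schema_key(train_call, "args")
name, params = parse_call(json.dumps(test_call), "args")
```

An agent that emits `arguments` out of habit would fail `parse_call` under the shifted schema, which is the kind of invalid invocation that TER counts.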

Semantic Grounding \($\Delta\mathcal{A}$\)\. Current agents often rely on Symbolic Anchoring, a rigid dependence on memorized identifiers rather than a true understanding of functional definitions\. To decouple these factors, we investigate whether models can adapt to dynamic documentation shifts through a Drift\-Trap\-Erasure trilogy:

- Symbolic Drift: We test tolerance to perturbations via two sub\-variants: Noise Injection \(appending non\-semantic suffixes\) and Synonymous Rewriting \(using LLMs to paraphrase\)\. This verifies generalization across semantically equivalent but symbolically distinct labels\.
- Semantic Trap: We engineer an adversarial conflict between symbol and semantics by retaining a familiar tool name but modifying its description to explicitly forbid usage\. If the agent still invokes the tool, this shows that symbolic association overrides semantic understanding\.
- Identity Erasure: We perform the ultimate test of grounding by replacing meaningful tool names with random identifiers \(e\.g\., Tool\_2A3D\)\. This removes symbolic shortcuts, forcing the agent to rely exclusively on reading comprehension of the documentation to solve the task\.
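
The three perturbations can be sketched as transformations of a tool definition. The code below is illustrative only: the `name`/`description` dictionary layout, the example tool, and the trap wording are assumptions, and Synonymous Rewriting is omitted because it requires an LLM paraphraser.

```python
import copy
import random
import string
from typing import Any, Dict

Tool = Dict[str, Any]


def noise_injection(tool: Tool, rng: random.Random, suffix_len: int = 4) -> Tool:
    """Symbolic Drift: append a non-semantic suffix, keep the description."""
    out = copy.deepcopy(tool)
    suffix = "".join(rng.choice(string.ascii_lowercase) for _ in range(suffix_len))
    out["name"] = f"{tool['name']}_{suffix}"
    return out


def semantic_trap(tool: Tool) -> Tool:
    """Semantic Trap: keep the familiar name, forbid usage in the description."""
    out = copy.deepcopy(tool)
    out["description"] = "This tool is disabled and must not be called."
    return out


def identity_erasure(tool: Tool, rng: random.Random) -> Tool:
    """Identity Erasure: replace the name with a random identifier."""
    out = copy.deepcopy(tool)
    out["name"] = "Tool_" + "".join(rng.choice("0123456789ABCDEF") for _ in range(4))
    return out


rng = random.Random(0)  # seeded so the perturbed tool set is reproducible
base = {"name": "search_poi", "description": "Find points of interest near an address."}
drifted = noise_injection(base, rng)
trapped = semantic_trap(base)
erased = identity_erasure(base, rng)
```

Each function changes exactly one of the two fields, which keeps the symbolic cue and the semantic cue separable when attributing failures.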

### 4\.3 Tier 2: Interaction Generalization

Interaction generalization evaluates whether an agent can use environmental feedback to refine subsequent actions\. In open environments, observations may indicate format changes, execution failures, or new operational constraints rather than merely confirming previous tool calls\. We examine three capabilities: Format Adaptability for parsing diverse observation structures, Error Correction for recovering from explicit or silent failures, and Dynamic Alignment for adjusting actions according to feedback\-provided rules\.

##### Format Adaptability \($\Delta\mathcal{O}$\)\.

Agents typically overfit to standard JSON formats for parsing observations\. This experiment tests the fundamental capability of the agent to parse non\-standard feedback\. We introduce Format Shift by altering the structure of tool return values from structured JSON to unstructured formats such as Markdown tables or raw text logs\. This evaluates whether the agent extracts key information based on semantic content rather than relying on rigid syntactic parsing templates\.
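
As an illustration of such a Format Shift, the sketch below renders the same tool result as JSON, as a Markdown table, and as a raw text log. The record fields and the `[INFO]` log prefix are made\-up examples, not the formats used in the paper's sandbox.

```python
import json
from typing import Any, Dict, List

Record = Dict[str, Any]


def as_json(records: List[Record]) -> str:
    """Closed-set baseline: structured JSON."""
    return json.dumps(records)


def as_markdown_table(records: List[Record]) -> str:
    """Same content as a Markdown table (assumes all records share keys)."""
    headers = list(records[0])
    rows = [headers, ["---"] * len(headers)]
    rows += [[str(r[h]) for h in headers] for r in records]
    return "\n".join("| " + " | ".join(row) + " |" for row in rows)


def as_raw_log(records: List[Record]) -> str:
    """Same content as unstructured key=value log lines."""
    return "\n".join(
        "[INFO] " + " ".join(f"{k}={v}" for k, v in r.items()) for r in records
    )


records = [
    {"name": "Central Library", "distance_km": 1.2},
    {"name": "City Museum", "distance_km": 3.5},
]
table = as_markdown_table(records)
log = as_raw_log(records)
```

All three renderings carry identical information, so any accuracy gap between them can be attributed to parsing rigidity rather than missing content.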

##### Error Correction \($\Delta\mathcal{O}$\)\.

Open environments may expose agents to execution failures that interrupt the planned trajectory\. We evaluate whether the agent can diagnose such failures and recover by revising its invocation plan or selecting alternative tools\. We consider two types of failure:

- Error Return: The primary tool explicitly returns an error code or traceback message after invocation\. This tests whether the agent can interpret explicit failure signals and switch to a redundant tool to resolve the blockage\.
- Null Return: The primary tool returns a null value or an empty string without any warning\. This tests whether the agent can detect an implicit anomaly where the return is syntactically valid but semantically uninformative, and infer that corrective action is required\.
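
Both failure types can be simulated by wrapping a sandbox tool, as in the sketch below. The wrapper, the error payload, and the `geocode` tool are hypothetical, and `call_with_fallback` shows the recovery behavior the experiment looks for rather than anything a trained agent is guaranteed to do.

```python
from typing import Any, Callable, Optional

ToolFn = Callable[..., Any]


def inject_failure(tool_fn: ToolFn, mode: Optional[str]) -> ToolFn:
    """Wrap a tool so it fails explicitly ("error") or silently ("null")."""
    def wrapped(**kwargs: Any) -> Any:
        if mode == "error":
            return {"error": "E500: upstream service unavailable"}
        if mode == "null":
            return None
        return tool_fn(**kwargs)
    return wrapped


def is_failure(observation: Any) -> bool:
    """Treat explicit errors and empty returns alike as signals to change plan."""
    if observation is None or observation == "":
        return True
    return isinstance(observation, dict) and "error" in observation


def call_with_fallback(primary: ToolFn, backup: ToolFn, **kwargs: Any) -> Any:
    """Desired behavior: switch to the redundant tool when the primary fails."""
    observation = primary(**kwargs)
    return backup(**kwargs) if is_failure(observation) else observation


def geocode(address: str) -> dict:
    return {"lat": 32.06, "lon": 118.79}  # fixed sandbox answer


recovered_error = call_with_fallback(inject_failure(geocode, "error"), geocode, address="x")
recovered_null = call_with_fallback(inject_failure(geocode, "null"), geocode, address="x")
```

The null case is the harder one for an agent because nothing in the observation labels it as a failure; `is_failure` has to treat emptiness itself as the signal.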

##### Dynamic Alignment \($\Delta\mathcal{O}$\)\.

Real\-world environments are non\-stationary and may provide corrective instructions within feedback to steer the agent away from obsolete assumptions\. We evaluate whether the agent can perform feedback\-conditioned policy adjustment through two active guidance mechanisms:

- Value Redirection: The environment notifies the agent that the valid parameter logic has shifted \(e\.g\., “ID 1024 is deprecated, use ID 2048”\)\. We verify if the agent can grasp this dynamic constraint and reissue the tool call with corrected values\.
- Tool Redirection: The environment signals that a specific tool schema is obsolete and explicitly suggests an alternative one\. This tests whether the agent can abandon its pre\-planned trajectory and strictly adhere to the new operational manual provided in the feedback\.
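
The two guidance mechanisms can be sketched as feedback\-emitting wrappers together with checks on the agent's next call. Everything here is a hypothetical illustration: the tool names, the message wording for tool redirection, and the `name`/`arguments` call layout are assumptions; only the value\-redirection message mirrors the example quoted above.

```python
from typing import Any, Callable, Dict

ToolFn = Callable[..., Any]
Call = Dict[str, Any]


def with_value_redirection(tool_fn: ToolFn, param: str,
                           deprecated: Any, replacement: Any) -> ToolFn:
    """Return corrective feedback when a deprecated parameter value is used."""
    def wrapped(**kwargs: Any) -> Any:
        if kwargs.get(param) == deprecated:
            return f"ID {deprecated} is deprecated, use ID {replacement}"
        return tool_fn(**kwargs)
    return wrapped


def with_tool_redirection(old_name: str, new_name: str) -> ToolFn:
    """Replace an obsolete tool with feedback that names its successor."""
    def wrapped(**kwargs: Any) -> str:
        return f"Tool '{old_name}' is obsolete, call '{new_name}' instead"
    return wrapped


def followed_value_redirection(next_call: Call, param: str, replacement: Any) -> bool:
    return next_call.get("arguments", {}).get(param) == replacement


def followed_tool_redirection(next_call: Call, new_name: str) -> bool:
    return next_call.get("name") == new_name


lookup = with_value_redirection(lambda id: {"id": id, "status": "ok"}, "id", 1024, 2048)
feedback = lookup(id=1024)
retry = {"name": "lookup", "arguments": {"id": 2048}}
stale = {"name": "lookup", "arguments": {"id": 1024}}
```

Checking the next call rather than the final answer separates genuine adaptation from cases where an agent ignores the feedback but still reaches a plausible output.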

### 4\.4 Tier 3: Reasoning Generalization

Beyond perception and interaction, Reasoning determines whether an agent can compose individual tool calls into a valid execution plan\. Instead of triggering APIs sequentially according to memorized trajectories, a robust agent should infer the current rules governing tool behavior and construct an execution graph that satisfies the task dependencies\. We evaluate this ability through two complementary dimensions: Rule Reasoning, which tests whether the agent can adapt to changed functional definitions, and Path Planning, which tests whether it can reorganize tool dependencies under altered execution structures\.

##### Rule Reasoning \($\Delta\mathcal{A}$\)\.

Models may memorize training\-time rules instead of performing deduction from the current tool description\. We introduce a Calculation Shift by modifying the functional semantics of selected tools while keeping their interfaces unchanged\. For example, a tool originally defined to return a value multiplied by $0.5$ is redefined in the test setting to use a factor of $0.1$\. This experiment verifies whether the agent can derive the correct result from the updated rule description, or instead reverts to the memorized rule from the training distribution\.
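
The Calculation Shift can be made concrete with a small sketch. The tool name `apply_rate`, the dictionary layout, and the input value are hypothetical; only the factors $0.5$ and $0.1$ come from the example above.

```python
from typing import Any, Dict


def make_scaling_tool(factor: float) -> Dict[str, Any]:
    """Build a tool whose name and interface are fixed but whose rule varies."""
    return {
        "name": "apply_rate",
        "description": f"Returns the input value multiplied by {factor}.",
        "call": lambda value: value * factor,
    }


train_tool = make_scaling_tool(0.5)  # rule seen during training
test_tool = make_scaling_tool(0.1)   # redefined rule at test time

memorized_answer = train_tool["call"](200)  # what a rule-memorizing agent predicts
correct_answer = test_tool["call"](200)     # what the current description implies
```

With an input of 200, the memorized rule gives 100 while the updated description implies 20, so the two behaviors are distinguishable from the final answer alone.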

##### Path Planning \($\Delta\mathcal{A}$\)\.

This dimension evaluates whether the agent can construct valid execution graphs under changed tool dependencies\. We test whether it genuinely reasons over the current causal structure among tools, rather than overfitting to fixed tool chains observed during training, through two complementary sub\-experiments:

- Trace Simplification: We test whether the agent can improve execution efficiency when a shortcut becomes available\. Specifically, we introduce a composite tool that functionally replaces a multi\-step tool chain used during training\. The agent should recognize the new affordance and select the efficient composite tool, rather than rigidly following the previously memorized multi\-step trajectory\.
- Logic Inversion: We test robustness to causal reversal in the tool dependency graph\. At test time, we invert the documented information\-acquisition order, for example by making the output of tool $B$ a prerequisite for invoking tool $A$\. This verifies whether the agent can resolve preconditions from the current dependency graph, or instead adheres to the fixed topological order learned from training trajectories\.
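
Both sub\-experiments amount to planning over the current tool graph rather than replaying a stored order. The sketch below uses Python's standard `graphlib` \(3\.9\+\) to derive an execution order from declared prerequisites; the tool names and the `pick_chain` helper are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def plan_order(prerequisites: Dict[str, Set[str]]) -> List[str]:
    """Order tools so that every prerequisite runs before its dependent."""
    return list(TopologicalSorter(prerequisites).static_order())


def pick_chain(candidate_chains: List[List[str]]) -> List[str]:
    """Trace Simplification: prefer the shortest chain that solves the task."""
    return min(candidate_chains, key=len)


# Logic Inversion: the same two tools with the dependency reversed.
train_deps = {"B": {"A"}}  # training: B needs the output of A
test_deps = {"A": {"B"}}   # test: A now needs the output of B

train_plan = plan_order(train_deps)
test_plan = plan_order(test_deps)

# A composite tool makes the memorized three-step chain unnecessary.
chosen = pick_chain([["geocode", "search", "rank"], ["search_ranked"]])
```

An agent that has memorized the training trajectory would still emit the `train_plan` order under `test_deps`, which violates the declared precondition.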

### 4\.5 Tier 4: Internalization Generalization

Internalization generalization evaluates whether an agent has learned task\-level principles that go beyond local tool invocation patterns\. At this level, a robust agent should not only execute valid tool\-use trajectories, but also recognize when a task is infeasible under the current constraints and transfer abstract problem\-solving structures across domains\. We examine this through two capabilities: Active Refusal, which tests solvability boundary recognition, and Domain Transfer, which tests whether learned reasoning topologies can be reused in semantically different environments\.

##### Active Refusal \($\Delta\mathcal{Q}\cup\Delta\mathcal{O}$\)\.

A robust agent should determine whether a task is solvable given the available information and executable tools, rather than forcing a plausible but unsupported answer\. We evaluate this boundary\-awareness ability through two insufficient\-condition scenarios:

- Missing Condition: We remove necessary conditions from the user query through a human\-verified process, making the task theoretically unsolvable due to information deficiency\. This tests whether the agent can detect the logical incompleteness of the request and refuse to answer, instead of fabricating missing information to force completion\.
- Fatal Error: We simulate an unrecoverable failure where a necessary tool returns a fatal error and no alternative tool is available\. This tests whether the agent can recognize the resulting operational deadlock and conclude that the task is currently unexecutable, rather than repeatedly retrying or hallucinating a successful result\.

##### Domain Transfer \($\Delta\mathcal{D}$\)\.

This dimension evaluates whether the agent can transfer internalized problem\-solving structure across domains\. We introduce a Holistic Domain Shift, where user queries, tool definitions, and environmental observations change jointly\. For example, we migrate the testing environment from an address\-query scenario to a medical\-registration scenario\. Although the surface semantics and interfaces change substantially, the underlying causal structure is preserved as a latent structural invariance\. This experiment tests whether the agent can re\-ground the same abstract reasoning topology, such as a search\-select\-confirm chain, in a novel target domain rather than overfitting to domain\-specific lexical or procedural cues\.
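
One way to picture the preserved structure is as an abstract chain whose steps are bound to different tools in each domain. The sketch below is purely illustrative: every tool name is invented, and the paper's actual tool sets are specified in its appendix rather than here.

```python
from typing import Dict, List

# Abstract reasoning topology shared by both domains.
CHAIN: List[str] = ["search", "select", "confirm"]

# Hypothetical domain-specific bindings for each abstract step.
GROUNDING: Dict[str, Dict[str, str]] = {
    "address_query": {
        "search": "search_poi",
        "select": "select_poi",
        "confirm": "confirm_address",
    },
    "medical_registration": {
        "search": "search_department",
        "select": "select_doctor",
        "confirm": "confirm_appointment",
    },
}


def ground(chain: List[str], domain: str) -> List[str]:
    """Re-ground the abstract chain in the tools of the target domain."""
    return [GROUNDING[domain][step] for step in chain]


source_plan = ground(CHAIN, "address_query")
target_plan = ground(CHAIN, "medical_registration")
```

The two plans share no tool names but have the same length and step order, which is the invariance the experiment probes.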

## 5 Key Observations

![Refer to caption](https://arxiv.org/html/2607.01084v1/x4.png)

Figure 4: Accuracy Delta and TER in Tier\-1 Perception\. SFT exhibits brittle symbolic anchoring and underperforms compared to RL when tool semantics shift\. For detailed case analyses, refer to Appendix [E](https://arxiv.org/html/2607.01084#A5)\. Delta represents the performance gap relative to the closed\-set baseline\. Setup and absolute values are provided in Appendix [C](https://arxiv.org/html/2607.01084#A3) and [D](https://arxiv.org/html/2607.01084#A4)\.

![Refer to caption](https://arxiv.org/html/2607.01084v1/x5.png)

Figure 5: Accuracy Delta and AES Score in Tier\-2 Interaction\. Both RL and SFT degrade under ambiguous feedback, but RL maintains superior resilience under explicit guidance while SFT fails to adapt\. Delta represents the performance gap relative to the closed\-set baseline\. Setup and absolute values are provided in Appendix [C](https://arxiv.org/html/2607.01084#A3) and [D](https://arxiv.org/html/2607.01084#A4)\.

Synthesizing the results across the four tiers of generalization, we distill observations regarding the behavioral divergence between SFT and RL in open\-world agentic tasks\.

Tier 1: Tool Perception\. RL achieves semantic grounding through interaction while SFT relies on brittle symbolic anchoring that degrades as fitting increases\.

![Refer to caption](https://arxiv.org/html/2607.01084v1/x6.png)

Figure 6: Accuracy comparison on Tier 1\-Instruction Robustness \(left\) and Tier 3\-Rule Reasoning & Path Planning \(right\), evaluated at the stable training phase\. While SFT models exhibit degradation across all perturbations, RL models show drops primarily under logic inversion and query variations\. Detailed setups and full training dynamics are provided in Appendices [C](https://arxiv.org/html/2607.01084#A3) and [D](https://arxiv.org/html/2607.01084#A4)\.

As illustrated in Figure [4](https://arxiv.org/html/2607.01084#S5.F4), we observe a consistent degradation pattern where robustness declines as training steps increase\. Specifically, in the Semantic Trap and Identity Erasure experiments, the SFT model displays a sharp increase in TER accompanied by a significant drop in accuracy\. The dashed blue line indicates that SFT frequently invokes the wrong tools when surface\-level names are perturbed\. Conversely, while RL also experiences some performance decay due to overfitting, it maintains a consistently lower error rate and higher accuracy retention compared to SFT\. The RL agent demonstrates superior resistance to adversarial descriptions and effectively ignores misleading name cues\.

This disparity stems from SFT’s reliance on symbolic anchoring, where the model acts as a superficial pattern matcher that collapses once surface tokens are decoupled from their semantics. In contrast, the penalty-driven nature of RL necessitates semantic grounding, compelling the agent to scrutinize functional descriptions rather than memorizing rigid name-context associations.

Tier 2: Interaction with Guidance. RL leverages explicit guidance for dynamic policy adaptation, while SFT exhibits trajectory inertia and often hallucinates outcomes under corrective feedback.

As illustrated in Figure [5](https://arxiv.org/html/2607.01084#S5.F5), we observe a distinct behavioral divergence dependent on feedback clarity. Under ambiguous signals like null returns, both models exhibit performance degradation. Crucially, this failure mode follows a clear trend where the inability to adapt intensifies as training progresses. As the models become more fitted to the training distribution, their capacity to actively explore under uncertainty progressively declines.

A significant disparity emerges under explicit tool redirection. Qualitative case studies reveal that SFT tends to interpret observations as successful confirmations, even when the feedback explicitly redirects the agent. Consequently, it ignores the guidance and hallucinates successful observations to force the continuation of its pre-planned trajectory. In contrast, RL maintains high adaptability by parsing the instruction and migrating to the new tool immediately, and then continues reasoning.

We hypothesize that this behavioral divergence stems from differences in training paradigms\. SFT is optimized on clean demonstrations where observations consistently validate actions\. This likely conditions the model to disregard the causal link between feedback content and the subsequent policy, thereby making its behavior resemble an open\-loop execution policy\. Conversely, RL encourages a more closed\-loop behavior because ignoring error signals is penalized during interaction\. This interaction mechanism potentially forces the agent to treat environmental observations as critical state updates, ensuring that explicit environmental instructions are prioritized to maximize reward\.

Tier 3: Reasoning Generalization. RL demonstrates generalization in local adaptation, yet both paradigms suffer substantial degradation under global dependency inversion.

As illustrated in Figure [6](https://arxiv.org/html/2607.01084#S5.F6), RL exhibits superior generalization in local optimization tasks. In Trace Simplification, the RL model achieves the lowest ATCL, indicating it successfully abandons the memorized training-time sequence to utilize efficient composite tools. Similarly, in Rule Reasoning, RL outperforms SFT by dynamically adhering to updated calculation logic, whereas SFT shows a stronger tendency to regress to memorized training constants.

However, a critical shared failure emerges in Logic Inversion. When the causal dependency between tools is reversed (e.g., shifting from $A \rightarrow B$ to $B \rightarrow A$), both SFT and RL suffer precipitous performance drops. Qualitative analysis in Appendix [E](https://arxiv.org/html/2607.01084#A5) reveals that despite the documentation explicitly stating the new prerequisite order, both models frequently hallucinate the existence of the training-set topology. This suggests that while RL generalizes well to local rule shifts, both paradigms overfit to the temporal order of action sequences, treating the dependency graph as a fixed action order rather than a dynamically specified logical structure.
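To make the contrast between a fixed action order and a dynamically specified dependency graph concrete, the sketch below derives the call order from whatever prerequisites the documentation currently states. It is only an illustration: the function name `plan_order` and the tool names `A` and `B` are hypothetical, and nothing here describes how the trained agents plan internally.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def plan_order(prerequisites: Dict[str, Set[str]]) -> List[str]:
    """Return a valid call order given documented prerequisites.

    `prerequisites` maps each tool to the set of tools that must run first.
    The order is recomputed from the documentation every time, so reversing
    a dependency in the documentation reverses the resulting plan.
    """
    return list(TopologicalSorter(prerequisites).static_order())


# Training-time documentation: B requires A.
train_order = plan_order({"B": {"A"}})
# Test-time documentation after Logic Inversion: A requires B.
test_order = plan_order({"A": {"B"}})
print(train_order, test_order)
```

A policy that memorized the training-time sequence would keep emitting the first order even when handed the second specification, which is the failure pattern described above.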

Tier 4: Boundary Awareness. Both paradigms struggle with boundary awareness in unsolvable states, but exhibit distinct failure patterns: SFT often fails to perceive fatal feedback, whereas RL perceives the failure but still favors forced completion.

![Refer to caption](https://arxiv.org/html/2607.01084v1/x7.png)

Figure 7: Accuracy in Tier 1-Schema Adaptability, Tier 2-Format Adaptability, and Tier 4-Domain Transfer (left) and Refusal Rate in Tier 4-Active Refusal (right), evaluated at the stable training phase. Both SFT and RL models are robust to simple format changes, whereas SFT shows significant degradation under domain transfer. Both methods demonstrate limited active refusal for unsolvable queries. Detailed setups and full training dynamics are provided in Appendices [C](https://arxiv.org/html/2607.01084#A3) and [D](https://arxiv.org/html/2607.01084#A4).

As illustrated in Figure [7](https://arxiv.org/html/2607.01084#S5.F7), both models struggle to identify unsolvable queries, leading to low refusal success rates when tools are unavailable. However, a fine-grained analysis of non-refusal cases under Fatal Error scenarios reveals that SFT and RL fail through qualitatively distinct mechanisms. Case studies in Appendix [E.4.1](https://arxiv.org/html/2607.01084#A5.SS4.SSS1) show that upon receiving an error, SFT directly outputs an unsupported answer without acknowledging the error at all, effectively treating the error observation as if it were a normal return value. In contrast, RL explicitly recognizes the failure (e.g., “It seems the tool is not available”), demonstrating a notably high error awareness rate, yet it still proceeds to fabricate an answer rather than issuing a refusal.

This divergence is consistent with the open-loop versus closed-loop dichotomy observed in earlier tiers. SFT’s failure is perceptual: its open-loop policy does not parse environmental feedback, so the error signal is simply ignored. RL’s failure, however, is more decisional: it perceives the error but still favors completion, which is consistent with a teleological bias induced by outcome-oriented rewards. Because training rewards presuppose that every problem is solvable, the policy learns to provide an answer at all costs, prioritizing completion over honest refusal. This is distinct from mere overconfidence; it reflects a structural limitation of outcome-based reward design that lacks an explicit incentive to abstain when the task is infeasible. Such persistent forced-completion behavior raises critical agent safety concerns, as models may sacrifice truthfulness for helpfulness in unpredictable open-world deployments.

## 6 Perturbation-Augmented Fine-Tuning

Our analysis reveals that SFT’s generalization decay stems from a distributional mismatch: standard SFT optimizes $\mathcal{L} = -\mathbb{E}_{(h,a)\sim\mathcal{D}_{\text{clean}}}[\log \pi_{\theta}(a \mid h)]$, where all training histories $h$ are drawn exclusively from successful executions. This confines the learned policy to a narrow manifold $\mathcal{H}_{\text{clean}}$ that excludes error states, ambiguous feedback, and symbolic variations. To bridge this gap, we propose Perturbation-Augmented Fine-Tuning (PAFT), a data intervention strategy that augments $\mathcal{D}_{\text{clean}}$ with controlled trajectory-level perturbations:

$$\mathcal{D}_{\text{PAFT}} = (1-\alpha)\,\mathcal{D}_{\text{clean}} \;\cup\; \alpha\cdot\mathcal{G}(\mathcal{D}_{\text{clean}}) \tag{2}$$

where $\mathcal{G}$ applies perturbation operators that expand the support of $h$ into regions containing anomalous observations, unsolvable states, and symbolic variants. Unlike standard augmentation that perturbs inputs in isolation, PAFT operates at the trajectory level, injecting perturbations mid-trajectory and supervising the corrective continuations.
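One way to read Eq. (2) is as a mixing rule in which a fraction $\alpha$ of the training trajectories is replaced by perturbed copies. The sketch below follows that reading. The function `build_paft_dataset`, the `Trajectory` type, and the choice to keep the dataset size fixed (rather than appending perturbed samples) are assumptions made for illustration and are not taken from the released code.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical representation: a trajectory is a flat list of strings
# alternating between actions and observations.
Trajectory = List[str]


def build_paft_dataset(
    clean: Sequence[Trajectory],
    perturb: Callable[[Trajectory, random.Random], Trajectory],
    alpha: float = 0.3,
    seed: int = 0,
) -> List[Trajectory]:
    """Replace a fraction `alpha` of clean trajectories with perturbed copies."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = random.Random(seed)
    n_perturbed = round(alpha * len(clean))
    # Pick which trajectories play the role of G(D_clean).
    chosen = set(rng.sample(range(len(clean)), n_perturbed))
    return [
        perturb(list(traj), rng) if i in chosen else list(traj)
        for i, traj in enumerate(clean)
    ]
```

In this sketch `perturb` stands in for the operator $\mathcal{G}$; any trajectory-level perturbation with the same signature could be plugged in.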

### 6.1 Method

Environmental Feedback Perturbation (EFP). To combat the open-loop trajectory inertia, we extend successful sub-trajectories by injecting stochastic observation anomalies that simulate unexpected environmental feedback. A standard chain is transformed via a generator $\mathcal{G}_{env}$ as:

$$\tau_{\text{orig}} = \{a_{i}, o_{i}\} \xrightarrow{\mathcal{G}_{env}} \tau^{\prime} = \{a_{i}, o_{\text{change}}, a^{\prime}_{i}, o_{i}\} \tag{3}$$

Here, $o_{\text{change}}$ represents a feedback anomaly (e.g., tool redirection or value deprecation). The model is supervised to generate a corrective action $a^{\prime}_{i}$ that recovers from the disturbance, rather than blindly assuming success. This converts the open-loop pattern into a closed-loop recovery behavior.
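The transformation in Eq. (3) can be sketched as a list operation that inserts the anomaly and the corrective action between an action and its original observation. The `Step` dataclass, the function `inject_feedback_anomaly`, and the tool names in the usage lines are hypothetical and serve only to illustrate the trajectory shape.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    role: str      # "action" or "observation"
    content: str


def inject_feedback_anomaly(
    traj: List[Step], index: int, anomaly: str, corrective_action: str
) -> List[Step]:
    """Turn {a_i, o_i} into {a_i, o_change, a'_i, o_i} at position `index`."""
    if index + 1 >= len(traj):
        raise ValueError("the action at `index` has no following observation")
    if traj[index].role != "action" or traj[index + 1].role != "observation":
        raise ValueError("`index` must point at an action followed by its observation")
    return (
        traj[: index + 1]
        + [Step("observation", anomaly), Step("action", corrective_action)]
        + traj[index + 1:]
    )


# Usage with made-up tool names: the first call is redirected to another tool.
original = [
    Step("action", "search_poi(name='X')"),
    Step("observation", "{'phone': '123'}"),
]
augmented = inject_feedback_anomaly(
    original, 0, "Tool redirected: use search_poi_v2", "search_poi_v2(name='X')"
)
```

The supervision target in such a sample is the corrective action, so the model is trained on a history that contains an observation contradicting its first attempt.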

Solvability Boundary Perturbation (SBP). To address the boundary blindness identified in Tier 4, we construct negative samples that enforce explicit refusal behavior. A refusal generator injects unrecoverable errors into the observation, modifying the trajectory to:

$$\tau_{\text{orig}} = \{a_{i}, o_{i}, a_{i+1}\} \xrightarrow{\mathcal{G}_{\text{refusal}}} \tau^{\prime} = \{a_{i}, o_{\text{fatal}}, a_{\text{end}}\} \tag{4}$$

where $o_{\text{fatal}}$ signifies a fatal error with no alternative tools available, and $a_{\text{end}}$ represents the explicit refusal action. This breaks the implicit “every problem is solvable” assumption in $\mathcal{D}_{\text{clean}}$ and teaches the model to halt execution rather than hallucinate.
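Eq. (4) amounts to truncating the trajectory after the chosen action and appending a fatal observation followed by a refusal. A minimal sketch under that reading is given below; the `Step` dataclass, the function `inject_fatal_error`, and the message strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    role: str      # "action" or "observation"
    content: str


def inject_fatal_error(
    traj: List[Step], index: int, fatal_observation: str, refusal_action: str
) -> List[Step]:
    """Turn {a_i, o_i, a_{i+1}, ...} into {a_i, o_fatal, a_end}."""
    if traj[index].role != "action":
        raise ValueError("`index` must point at an action")
    # Everything after a_i is discarded: the task is now unsolvable.
    return traj[: index + 1] + [
        Step("observation", fatal_observation),
        Step("action", refusal_action),
    ]


clean = [Step("action", "a0"), Step("observation", "o0"), Step("action", "a1")]
refusal_sample = inject_fatal_error(
    clean, 0,
    "FATAL: service permanently unavailable, no alternative tool exists",
    "refuse: the task cannot be completed with the available tools",
)
```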

Symbolic Representation Perturbation (SRP). To decouple the symbolic anchoring identified in Tier 1, we apply surface-level augmentations to the tool definition space. An auxiliary agent performs synonymous rewriting and noise injection on tool names and documentation content. This forces the model to ground its decision-making in invariant functional semantics rather than memorized token sequences. To ensure evaluation fairness, all perturbation patterns are automatically generated without any prior knowledge of the specific variation types used in the test set, reducing the risk of information leakage.
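The paper uses an auxiliary agent for the rewriting. As a simplified stand-in, the sketch below replaces tool names with random opaque aliases and applies the same mapping to the recorded tool calls, so that the supervised actions stay consistent with the perturbed tool definitions. The functions `perturb_tool_names` and `rename_calls`, the dictionary layout, and the alias format are all assumptions for illustration, and documentation rewriting is not covered.

```python
import random
import string
from typing import Any, Dict, List, Tuple


def perturb_tool_names(
    tools: List[Dict[str, str]], seed: int = 0
) -> Tuple[List[Dict[str, str]], Dict[str, str]]:
    """Give every tool a random alias while leaving its description intact."""
    rng = random.Random(seed)
    mapping: Dict[str, str] = {}
    for tool in tools:
        alias = "tool_" + "".join(rng.choices(string.ascii_lowercase, k=6))
        while alias in mapping.values():  # keep aliases unique
            alias = "tool_" + "".join(rng.choices(string.ascii_lowercase, k=6))
        mapping[tool["name"]] = alias
    renamed = [{**tool, "name": mapping[tool["name"]]} for tool in tools]
    return renamed, mapping


def rename_calls(
    calls: List[Dict[str, Any]], mapping: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Rewrite recorded tool calls so they refer to the aliased names."""
    return [{**call, "tool": mapping.get(call["tool"], call["tool"])} for call in calls]
```

Because only the names change, a model trained on such samples can select the right tool only by reading the description, which is the behavior SRP aims to encourage.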

### 6.2 Experimental Results

We evaluate PAFT on the most challenging tasks from each tier: Semantic Trap (Tier-1), Tool Redirection (Tier-2), Logic Inversion (Tier-3), and Fatal Error (Tier-4). For Tiers 1–3, we report the Accuracy Delta ($\Delta$) relative to the closed-set baseline; for Tier-4, we report the absolute RR. We set $\alpha = 0.3$ as the default perturbation ratio.
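For readers who want to reproduce the reporting convention, the sketch below computes the two quantities used here. It assumes that the Accuracy Delta is the shifted-environment accuracy minus the closed-set accuracy (so degradation is negative, consistent with the "positive deltas" reported for PAFT) and that RR is the fraction of unsolvable queries on which the agent refuses. The function names and the exact-match accuracy are illustrative choices, and how a refusal is detected is left to the caller.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Exact-match accuracy over paired predictions and references."""
    if len(predictions) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


def accuracy_delta(shifted_accuracy: float, closed_set_accuracy: float) -> float:
    """Performance gap relative to the closed-set baseline (negative = worse)."""
    return shifted_accuracy - closed_set_accuracy


def refusal_rate(refused: Sequence[bool]) -> float:
    """Fraction of unsolvable queries on which the agent explicitly refused."""
    if not refused:
        raise ValueError("need at least one unsolvable query")
    return sum(refused) / len(refused)
```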

Table 1: PAFT results across open environment tiers.

Performance. As shown in the upper portion of Table [1](https://arxiv.org/html/2607.01084#S6.T1), PAFT yields consistent improvements across all training stages and tiers. For early-stage models, PAFT transforms severe degradation into positive deltas across Tiers 1–3, indicating that perturbation-augmented training can preemptively prevent overfitting to clean trajectories.

Ablation Study. The lower portion of Table [1](https://arxiv.org/html/2607.01084#S6.T1) reveals two key findings.

(1) Perturbation ratio. Increasing $\alpha$ from 0.2 to 0.3 yields clear gains across all tiers, while a further increase to 0.4 provides marginal or no additional benefit. We attribute this diminishing-return pattern to a trade-off between distributional coverage and signal quality. At low $\alpha$, the model is exposed to too few perturbed trajectories to learn robust recovery behaviors. As $\alpha$ grows, the augmented data progressively covers more failure regions. Beyond a saturation point, however, an excessive proportion of perturbed samples dilutes the clean supervision signal and introduces conflicting gradients between normal task completion and anomaly handling, leading to no further improvement or even slight regression. The result suggests that $\alpha = 0.3$ achieves a near-optimal balance between robustness and task performance.

(2) Component contributions. EFP improves Tier-2 interaction adaptability, as the injected observation anomalies teach the model to parse and react to unexpected feedback rather than executing blindly along pre-planned trajectories. Notably, EFP contributes minimally to Tier-4 refusal, indicating that error adaptation and unsolvable-state recognition are distinct capabilities. SBP almost exclusively restores Tier-4 refusal capability by introducing fatal error samples that break the implicit solvability assumption, while leaving other tiers largely unaffected. This suggests that, in our setting, boundary blindness is strongly tied to a data coverage problem: the model never learns to refuse simply because $\mathcal{D}_{\text{clean}}$ contains no refusal demonstrations. SRP primarily enhances Tier-1 perception robustness and Tier-3 reasoning generalization, as symbolic perturbations force the model to attend to functional descriptions rather than surface-level identifiers, mitigating both symbolic anchoring and the associated downstream reasoning errors. This targeted effectiveness, where each component is necessary yet individually insufficient, confirms that PAFT is a structured, diagnosis-driven intervention rather than generic data augmentation.

## 7 Conclusion

In this paper, we formalize the OpenAgent setting and leverage a controlled sandbox to diagnose the open-world adaptability of LLM agents. Our systematic analysis reveals that both SFT and RL paradigms exhibit vulnerabilities when facing non-stationary environmental shifts. Furthermore, we propose PAFT, a disturbance-based intervention strategy designed to mitigate trajectory overfitting in SFT.

## Acknowledgements

This work was supported by the Key Program of Jiangsu Science Foundation \(BK20243012\), the NSFC \(62306133\), and the 111 Center \(No\. B26023\)\.

## Impact Statement

This work studies robustness and failure modes of tool\-use agents under open\-world shifts\. Its primary positive impact is to improve the reliability and safety of deployed agents by identifying conditions under which models ignore feedback, misuse tools, or hallucinate answers\. Potential risks include misuse of stronger tool\-use agents in automated decision\-making systems\. We mitigate this by emphasizing refusal behavior, boundary awareness, and controlled evaluation rather than unrestricted deployment\.

## Appendix A Sandbox Environment

Since live APIs and MCP\-based tool ecosystems are often unstable, version\-dependent, and difficult to control during evaluation, we build a local code\-execution sandbox to simulate tool calls\. Specifically, we create a Python sandbox environment class that comprises a database, the tool implementations, and a tool call module\. Details are as follows:

### A\.1 Database

We built our experimental database using public Point of Interest \(POI\) data from Beijing\. Each POI entry has five key attributes: Phone Number, Address Name, Adname \(Administrative Name\), Coordinates \(Latitude/Longitude\), and Category\. To strictly prevent data leakage and ensure that the model relies entirely on tool usage rather than parametric memory to answer, we adopted an anonymization protocol\. Specifically, all Address Names and Phone Numbers are replaced with unique, randomly generated alphanumeric strings and non\-repeating number sequences, respectively\. In addition, we strictly separate the training and test databases so that the model cannot guess answers correctly by chance\. Examples of the database entries are shown in Table[2](https://arxiv.org/html/2607.01084#A1.T2)\.

Table 2: Samples of the Anonymized POI Knowledge Base

![Refer to caption](https://arxiv.org/html/2607.01084v1/Figures/logic_strcture.png)

Figure 8: Logic Diagram of Tools\. Each line represents one tool\. The start of an arrow denotes the input, while the end of an arrow denotes the output\.

### A\.2 Logical Architecture and Function Classification of Tools

We designed the logical architecture of the tools in accordance with the logic diagram \(Figure[8](https://arxiv.org/html/2607.01084#A1.F8)\) and the following rules:

1. For key\-value \(coords, tel, name\) query functions: by default we only support the counterclockwise query direction in the diagram; for example, coords can only be used to query tel;
2. For the fare query function: we only allow queries by name;
3. For the distance calculation function: we only allow latitude and longitude as input parameters;
4. For non\-key information queries: we allow any of the three key attributes to be used as input\.

The above rules also restrict the input type of each tool, and we finally obtained 12 tools, which are shown in Table[3](https://arxiv.org/html/2607.01084#A1.T3)\.
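The key\-value rules above imply a one\-directional cycle between the three key attributes, so some queries need more than one tool call\. The following is a minimal sketch of that idea, not the authors' implementation: the tool names come from the system prompt in Appendix B, while the `KEY_TOOLS` table and the `plan_chain` helper are hypothetical names introduced here for illustration\.

```python
from collections import deque

# Key-value tools written as (input type, output type) edges.
# Tool names follow the system prompt; the edge table itself is an
# illustrative assumption about how the logic diagram could be encoded.
KEY_TOOLS = {
    "search_map_coordinates": ("name", "coords"),
    "find_contact_at_coords": ("coords", "tel"),
    "caller_id_lookup": ("tel", "name"),
}


def plan_chain(source, target, tools=KEY_TOOLS):
    """Breadth-first search for the shortest tool chain from source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        kind, chain = queue.popleft()
        if kind == target:
            return chain
        for tool, (src, dst) in tools.items():
            if src == kind and dst not in seen:
                seen.add(dst)
                queue.append((dst, chain + [tool]))
    return None  # the target type is unreachable from the source type


# A phone number cannot be mapped to coordinates in one step,
# so the chain goes through the POI name first.
print(plan_chain("tel", "coords"))
```

Under these assumptions, the chain for a tel\-to\-coords query matches the two tool calls in the interaction case of Appendix A\.3\.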

Table 3: Definitions of training set tools with categorized input and output types\.

### A\.3 Tool Call Module

We implement the tool call module differently for RL training and for evaluation:

- For RL training, we use the Verl\-Tool framework\. The core idea is to run the code sandbox as a service on the server and expose a port\. This port receives tool names and their parameters and returns the tool call results\. During RL training, we use the terminator <tool\_call\> to detect tool call commands: when the model outputs this keyword, it stops generating text and triggers the tool detector\. After all samples in the same batch have been processed, the tool detector queries the port in a single pass to fetch the results\. This ensures that only one tool call is issued per conversation turn\. The rest of the process is the same as standard RL training\.
- For evaluation, we simulate multi\-turn conversations and likewise detect tool calls via the terminator, which ensures that only one tool call is triggered per turn\. When the model outputs </answer\>, the response is finished\.
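The evaluation procedure described above can be sketched as a simple loop\. This is an illustrative sketch under stated assumptions rather than the released code: `run_episode`, `generate`, and `call_tool` are hypothetical names, the model is replaced by a scripted stand\-in, and the tool call format \("tool" and "args" keys inside tool\_call tags\) is taken from the interaction case in this appendix\.

```python
import json
import re

TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_episode(generate, call_tool, query, max_turns=8):
    """Alternate model turns and tool observations until an answer appears."""
    messages = [{"role": "user", "content": query}]
    for _ in range(max_turns):
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})
        answer = ANSWER.search(reply)
        if answer:
            return answer.group(1).strip(), messages
        call = TOOL_CALL.search(reply)  # only the first call in a turn is executed
        if call is None:
            break  # neither a tool call nor an answer: stop the episode
        request = json.loads(call.group(1))
        observation = call_tool(request["tool"], request["args"])
        messages.append(
            {"role": "user", "content": "Observation: " + json.dumps(observation)}
        )
    return None, messages


# Scripted stand-in for the model and a two-entry stand-in for the sandbox.
script = iter([
    '<tool_call>{"tool": "caller_id_lookup", "args": {"tel": "00001433"}}</tool_call>',
    '<tool_call>{"tool": "search_map_coordinates", "args": {"name": "sgtdfxvz"}}</tool_call>',
    "<answer>116.445179,39.953269</answer>",
])
fake_sandbox = {
    "caller_id_lookup": {"name": "sgtdfxvz"},
    "search_map_coordinates": {"location": "116.445179,39.953269"},
}
answer, history = run_episode(
    generate=lambda messages: next(script),
    call_tool=lambda tool, args: fake_sandbox[tool],
    query="I have the phone number '00001433'. Where is this place?",
)
print(answer, len(history))
```

A malformed tool call would raise a `json.JSONDecodeError` in this sketch; a real harness would presumably count that as an invalid call instead of crashing\.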

In summary, our experimental environment consists of the query database, the sandbox code execution environment, the training and test sets, and the training and testing logic\. An example interaction is shown below\.

Interaction case:
\{
"role": "user",
"content": "I have the phone number ’00001433’\. Can you help me find the location of the place associated with this number?"
\},
\{
"role": "assistant",
"content": "<tool\_call\>\{"tool": "caller\_id\_lookup", "args": \{"tel": "00001433"\}\}</tool\_call\>"
\},
\{
"role": "user",
"content": "Observation: \{"name": "sgtdfxvz" \}"
\},
\{
"role": "assistant",
"content": "There is no direct tool available to retrieve the location from a phone number\. Therefore, I will first use the phone number to look up the name of the place\. <tool\_call\>\{"tool": "search\_map\_coordinates", "args": \{"name": "sgtdfxvz"\}\}</tool\_call\>"
\},
\{
"role": "user",
"content": "Observation: \{"location": "116\.445179,39\.953269" \}"
\},
\{
"role": "assistant",
"content": "Now that I have the name of the place, I can proceed to find its location using the name\-based lookup tool\. <answer\>116\.445179,39\.953269</answer\>"
\}
### A\.4 Training and Test Data

![Refer to caption](https://arxiv.org/html/2607.01084v1/x8.png)\(a\)Step Distribution
![Refer to caption](https://arxiv.org/html/2607.01084v1/x9.png)\(b\)Tool Coverage
![Refer to caption](https://arxiv.org/html/2607.01084v1/x10.png)\(c\)Source Distribution

Figure 9: Overall Statistics of the POI Multi\-step Tool\-calling Dataset: \(a\) Distribution of task complexity by step count; \(b\) Proportions of various tool invocations; \(c\) Composition of task scenarios and sources\.

Following data preparation in the main paper and the database and tools mentioned above, we employ a reverse\-synthesis approach to generate training instances\. By randomly sampling initial query conditions and target answers based on our defined tool set, we derive a symbolic reasoning trajectory\. Formally, a symbolic trajectory $\mathcal{T}_{sym}$ is defined as:

$$\mathcal{T}_{sym}=\{c,\ t_{1}\to t_{2}\to\dots\to t_{k},\ a\} \tag{5}$$

where $c$ represents the initial condition \(e\.g\., a specific location coordinate\), $t_{i}$ denotes the $i$\-th tool invocation in the chain, and $a$ is the final answer\. To align these symbolic representations with human interaction patterns, we utilize an LLM to transform and smooth $\mathcal{T}_{sym}$ into natural language question\-answer pairs, resulting in a dataset of coherent multi\-step tool\-use queries\.
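To make the symbolic trajectory concrete, the sketch below builds one instance of $\{c, t_1 \to \dots \to t_k, a\}$ from a single POI record\. It is a simplified illustration, not the authors' synthesis pipeline: it walks forward over the key\-value tools only, and `SymbolicTrajectory`, `synthesize`, and the record layout are hypothetical names introduced here\. The record values are taken from the interaction case in Appendix A\.3\.

```python
import random
from dataclasses import dataclass

# Key-value tools as (input type, output type); names follow the system prompt.
KEY_TOOLS = {
    "search_map_coordinates": ("name", "coords"),
    "find_contact_at_coords": ("coords", "tel"),
    "caller_id_lookup": ("tel", "name"),
}


@dataclass
class SymbolicTrajectory:
    condition: tuple  # c: (attribute type, attribute value)
    tools: list       # t_1 -> t_2 -> ... -> t_k
    answer: str       # a


def synthesize(record, start, hops, tools=KEY_TOOLS):
    """Follow `hops` tool edges from attribute `start` over one POI record."""
    by_input = {src: (tool, dst) for tool, (src, dst) in tools.items()}
    kind, chain = start, []
    for _ in range(hops):
        tool, kind = by_input[kind]
        chain.append(tool)
    return SymbolicTrajectory((start, record[start]), chain, record[kind])


record = {
    "name": "sgtdfxvz",
    "tel": "00001433",
    "coords": "116.445179,39.953269",
}

# Randomly sample the initial condition type and the chain length.
rng = random.Random(0)
sample = synthesize(record, rng.choice(sorted(record)), rng.randint(1, 2))
print(sample)
```

An LLM rewriting step, as described above, would then turn such a tuple into a natural language question\-answer pair\.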

The final dataset comprises a total of 6,930 samples, divided into a training set \($N=6{,}050$\) and a test set \($N=880$\)\. The dataset characteristics are designed to ensure rigorous evaluation:

- Complexity Distribution: Both the training and test sets follow an identical distribution regarding the length of tool chains \(e\.g\., 1\-hop, 2\-hop, …, $k$\-hop\)\. This ensures that the test set accurately reflects the structural complexity of the training data\.
- Database Isolation: To eliminate the possibility of data leakage, the training and test sets are synthesized from disjoint subsets of the underlying POI database\. Consequently, the model must generalize its reasoning capabilities to unseen entities rather than recalling specific POI attributes seen during training\.

Regarding task complexity distribution, the dataset employs a stepped logical chain design covering a range from one to six steps; two\-step tasks serve as the core backbone \(accounting for 28\.4%\), while nearly 40% of the samples consist of high\-difficulty reasoning tasks involving three or more steps, ensuring depth in the model’s learning of long\-range tool invocation paths\.

In terms of tool and scenario distribution, the dataset exhibits a high degree of domain focus and balance: the core tool caller\_id\_lookup represents 24\.4% of invocations, while ticket price calculation \(adult/child cost\) and geographic measurement \(coords/distance\) tools work in close coordination to form the foundation of task execution\. This distribution is further corroborated by the source scenarios, where price\-related \(34\.1%\) and distance\-related \(28\.4%\) contexts dominate, providing the model with abundant material for numerical reasoning and logical alignment\.

Overall, this dataset covers basic attribute queries \(e\.g\., POI names, phone numbers, and administrative divisions\) and utilizes complex step distributions to force the model into semantic grounding rather than simple symbolic matching during extreme testing, such as anonymization or logical inversion\. With its scientific tool sampling proportions and diverse business scenarios, this dataset serves as an ideal benchmark for training agents equipped with efficient path planning and robust generalization capabilities\. The available tools are listed in Table[3](https://arxiv.org/html/2607.01084#A1.T3)\.

## Appendix B Experimental Details

We adopt Qwen2\.5\-7B\-Instruct as the backbone model\. For the SFT paradigm, we perform full\-parameter fine\-tuning with the LLaMA\-Factory framework\. The AdamW optimizer is employed, with the learning rate and batch size set to 3e\-5 and 2, respectively\. Training runs for a total of 800 steps, with a model checkpoint saved every 200 steps\. For the RL paradigm, training uses the GRPO algorithm with tool calling as implemented on the Verl\-Tool platform\. Notably, we use standard GRPO to keep the comparison fair\. The AdamW optimizer is also used here, with the learning rate set to 1e\-6, the number of rollouts set to 8, and the total number of training steps set to 280\. The PAFT algorithm differs from SFT only in the training dataset\. The reward function is designed as follows:

$$Reward=0.8\times Acc+0.2\times score_{format}$$

Note that we ensure that both models are fully fitted and strictly control the amount of input data to ensure the fairness of the comparison\.
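As a rough illustration of the weighted reward above, the sketch below combines an accuracy term and a format term with the stated weights of 0\.8 and 0\.2\. How $Acc$ and $score_{format}$ are actually computed is not specified in this appendix, so both checks here are assumptions: accuracy is taken as an exact match on the text inside the answer tags, and the format score as the presence of exactly one well\-formed answer block\.

```python
import re

ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_score(response):
    """Assumed format check: exactly one well-formed <answer> block."""
    return 1.0 if len(ANSWER.findall(response)) == 1 else 0.0


def accuracy(response, gold):
    """Assumed accuracy check: exact string match on the extracted answer."""
    match = ANSWER.search(response)
    return 1.0 if match and match.group(1).strip() == gold else 0.0


def reward(response, gold):
    """Reward = 0.8 * Acc + 0.2 * score_format, with the weights from the text."""
    return 0.8 * accuracy(response, gold) + 0.2 * format_score(response)


for text in ("<answer>41.5</answer>", "<answer>40.0</answer>", "41.5"):
    print(repr(text), round(reward(text, "41.5"), 2))
```

Under these assumptions, a well\-formatted but wrong answer still earns the format share of the reward, while an untagged answer earns nothing\.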

Testing Setup: During the inference phase, all models were uniformly constrained with identical sampling hyperparameters: the sampling temperature was fixed at 0\.0, and the Top\-p value was set to 1\.0\. For modifying the tool settings, we conducted three experiments and analyzed the results comprehensively\.

The system prompt template for training and testing is as follows:

System Prompt:
"You are a professional geographic information assistant\. You can answer questions about places, phone numbers, distances, and ticket prices by calling POI database tools\.
\#\#\# Available Tools:
\[
\{"name": "search\_map\_coordinates", "description": "Retrieve the location coordinates \(longitude,latitude\) of a POI by its exact name\.", "parameters": \{"type": "object", "properties": \{"name": \{"type": "string"\}\}, "required": \["name"\]\}\},
\{"name": "administrative\_division\_query", "description": "Retrieve the district name \(adname\) of a POI by its exact name\.", "parameters": \{"type": "object", "properties": \{"name": \{"type": "string"\}\}, "required": \["name"\]\}\},
\{"name": "fetch\_poi\_category", "description": "Retrieve the category type of a POI by its exact name\.", "parameters": \{"type": "object", "properties": \{"name": \{"type": "string"\}\}, "required": \["name"\]\}\},
\{"name": "find\_contact\_at\_coords", "description": "Retrieve the telephone number of a POI using its location coordinates\.", "parameters": \{"type": "object", "properties": \{"location": \{"type": "string", "description": "’lon,lat’ string"\}\}, "required": \["location"\]\}\},
\{"name": "reverse\_geocoding", "description": "Retrieve the district name of a POI using its location coordinates\.", "parameters": \{"type": "object", "properties": \{"location": \{"type": "string", "description": "’lon,lat’ string"\}\}, "required": \["location"\]\}\},
\{"name": "venue\_type\_identifier", "description": "Retrieve the category type of a POI using its location coordinates\.", "parameters": \{"type": "object", "properties": \{"location": \{"type": "string", "description": "’lon,lat’ string"\}\}, "required": \["location"\]\}\},
\{"name": "caller\_id\_lookup", "description": "Retrieve the name of a POI using its telephone number\.", "parameters": \{"type": "object", "properties": \{"tel": \{"type": "string"\}\}, "required": \["tel"\]\}\},
\{"name": "regional\_mapping\_by\_tel", "description": "Retrieve the district name of a POI using its telephone number\.", "parameters": \{"type": "object", "properties": \{"tel": \{"type": "string"\}\}, "required": \["tel"\]\}\},
\{"name": "service\_catalog\_search", "description": "Retrieve the category type of a POI using its telephone number\.", "parameters": \{"type": "object", "properties": \{"tel": \{"type": "string"\}\}, "required": \["tel"\]\}\},
\{"name": "calculate\_distance\_by\_coords", "description": "Calculate the distance in meters between two specific coordinates\. Requires mode=2\.", "parameters": \{"type": "object", "properties": \{"start\_lat": \{"type": "number"\}, "start\_lon": \{"type": "number"\}, "end\_lat": \{"type": "number"\}, "end\_lon": \{"type": "number"\}, "mode": \{"type": "integer", "description": "Verification code\."\}\}, "required": \["start\_lat", "start\_lon", "end\_lat", "end\_lon", "mode"\]\}\},
\{"name": "get\_adult\_cost\_by\_name", "description": "Calculate the discounted adult ticket price \(0\.5x\) by POI name\. Requires mode=1\.", "parameters": \{"type": "object", "properties": \{"name": \{"type": "string"\}, "mode": \{"type": "integer", "description": "Verification code\."\}\}, "required": \["name", "mode"\]\}\},
\{"name": "get\_child\_cost\_by\_name", "description": "Calculate the discounted child ticket price \(0\.2x\) by POI name\. Requires mode=1\.", "parameters": \{"type": "object", "properties": \{"name": \{"type": "string"\}, "mode": \{"type": "integer", "description": "Verification code\."\}\}, "required": \["name", "mode"\]\}\}
\]
\#\#\# Protocol:
1\.ONE TOOL PER TURN: You MUST output only ONE tool call at a time\. Wait for the observation before doing anything else\.
2\.Tool Format: To use a tool, strictly output a JSON object with "name" and "arguments" keys\. Example: \{"name": "get\_tel\_by\_name", "arguments": \{"name": "Beijing\_Hotel"\}\}
3\.Final Answer: Wrap the result in <answer\></answer\>\. Keep numerical answers to 1 decimal place\."
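The tool list in the prompt above uses JSON\-Schema\-style parameter definitions, which makes it possible to check a model's tool call before executing it\. The sketch below shows one way such a check could look\. It is an assumption for illustration, not part of the described sandbox: `validate_call` and `JSON_TYPES` are hypothetical names, only one tool entry is copied from the prompt, and the call format \("name" and "arguments" keys\) follows the Protocol section of the prompt\.

```python
import json

# One entry copied from the "Available Tools" list; the others are omitted.
TOOLS = [
    {
        "name": "get_adult_cost_by_name",
        "parameters": {
            "type": "object",
            "properties": {"name": {"type": "string"}, "mode": {"type": "integer"}},
            "required": ["name", "mode"],
        },
    },
]

JSON_TYPES = {"string": str, "integer": int, "number": (int, float)}


def validate_call(raw, tools=TOOLS):
    """Return a list of problems with a JSON tool call; empty means valid."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]
    specs = {tool["name"]: tool["parameters"] for tool in tools}
    if call.get("name") not in specs:
        return [f"unknown tool: {call.get('name')!r}"]
    spec, args = specs[call["name"]], call.get("arguments", {})
    problems = [f"missing argument: {key}" for key in spec["required"] if key not in args]
    for key, value in args.items():
        expected = spec["properties"].get(key, {}).get("type")
        if expected is None:
            problems.append(f"unexpected argument: {key}")
        elif isinstance(value, bool) or not isinstance(value, JSON_TYPES[expected]):
            # bool is rejected explicitly because it is a subclass of int in Python
            problems.append(f"wrong type for {key}: expected {expected}")
    return problems


print(validate_call('{"name": "get_adult_cost_by_name", "arguments": {"name": "bighkttn"}}'))
```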
## Appendix C Implementation Details of OpenAgent Evaluation Task

### C\.1 Tier 1: Perception Generalization

#### C\.1\.1 Instruction Robustness

- Instruction Blurring: We employ API calls to rewrite user queries, introducing complexities through syntactic variations \(e\.g\., colloquialisms, rhetorical questions\) or semantic ambiguities \(e\.g\., implicit references, entity omissions\)\.
- Information Redundancy: We employ API calls to rewrite user queries by embedding core instructions within irrelevant contexts or lengthy background details, thereby simulating real\-world informational noise\. Please refer to Appendix [E\.1\.1](https://arxiv.org/html/2607.01084#A5.SS1.SSS1) for further details\.

#### C\.1\.2 Schema Adaptability

We alter the parameter dictionary key from the standard "args" to "arguments", as shown in Appendix [E\.1\.2](https://arxiv.org/html/2607.01084#A5.SS1.SSS2)\.

#### C\.1\.3 Semantic Grounding

Synonymous Rewriting: We assess the model’s capability to decouple functional intent from specific symbolic labels by replacing tool identifiers with semantically equivalent alternatives\. This requires the model to move beyond simple string matching and demonstrate a deep understanding of the underlying natural language semantics\.

Table 4: Tool Mapping for Synonym Rewriting

Noise Injection: While maintaining the integrity of the tool schema, we introduce subtle syntactic perturbations into the tool names by inserting articles or determiners\. We utilize this setup to evaluate the model’s tolerance to minor fluctuations in API naming and its stability when encountering non\-standardized inputs\.

Table 5: Tool Mapping for Surface Noise

Semantic Adversarial: We designed a “description trap” where the familiar tool names are retained, but their functional descriptions are modified to explicitly forbid their previous usage\. We require the model to overcome “path dependency” on familiar names, recognize the obsolescence of the original tool based on the updated description, and accurately invoke a new tool with matching functionality\.

Table 6: Changes in Semantic Adversarial

Identity Erasure: To isolate the model’s reliance on pure functional logic, we replace all human\-readable identifiers with randomly generated UUIDs\. In this extreme setting, we force the agent to perform reasoning based exclusively on the provided natural language descriptions\.

Table 7: Tool Mapping for Identity Erasure
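The Identity Erasure perturbation can be sketched as a renaming pass over the tool list that leaves descriptions and parameters untouched\. This is a hedged sketch rather than the authors' code: `erase_identities` is a hypothetical helper, and the `Tool_` prefix with a short random suffix only imitates the style of identifiers such as Tool\_a2fb reported in Appendix D\.

```python
import copy
import secrets


def erase_identities(tools, new_id=lambda: secrets.token_hex(2)):
    """Rename every tool to an opaque identifier, keeping descriptions intact.

    Returns the renamed tool list and the original-name -> alias mapping,
    which a sandbox would need in order to route calls back to the real tool.
    """
    mapping = {}
    renamed = []
    for tool in tools:
        alias = f"Tool_{new_id()}"
        while alias in mapping.values():  # retry on the rare collision
            alias = f"Tool_{new_id()}"
        mapping[tool["name"]] = alias
        renamed.append({**copy.deepcopy(tool), "name": alias})
    return renamed, mapping


tools = [
    {
        "name": "get_adult_cost_by_name",
        "description": "Calculate the discounted adult ticket price (0.5x) by POI name. Requires mode=1.",
    },
    {
        "name": "search_map_coordinates",
        "description": "Retrieve the location coordinates (longitude,latitude) of a POI by its exact name.",
    },
]
renamed, mapping = erase_identities(tools)
print(mapping)
```

Because only the name field changes, the agent has to select tools from the descriptions alone, which is the condition this experiment is meant to isolate\.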

### C\.2 Tier 2: Interaction Generalization

#### C\.2\.1 Format Adaptability

We diversify the structure of tool return values, transitioning from strictly structured JSON to less constrained formats such as Markdown tables or raw text logs\.

Table 8: Example of Format Shift in Tool Observations
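A format shift of this kind can be simulated by rendering the same observation in different surface forms\. The sketch below is illustrative only: `render_observation` is a hypothetical helper, and the exact Markdown table and log layouts are assumptions, since the contents of Table 8 are not reproduced here\.

```python
import json


def render_observation(obs, style="json"):
    """Render one flat tool observation as JSON, a Markdown table, or a text log."""
    if style == "json":
        return json.dumps(obs)
    if style == "markdown":
        header = "| " + " | ".join(obs) + " |"
        rule = "| " + " | ".join("---" for _ in obs) + " |"
        row = "| " + " | ".join(str(value) for value in obs.values()) + " |"
        return "\n".join([header, rule, row])
    if style == "log":
        return " ".join(f"{key}={value}" for key, value in obs.items())
    raise ValueError(f"unknown style: {style}")


observation = {"name": "sgtdfxvz"}
for style in ("json", "markdown", "log"):
    print(render_observation(observation, style))
```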

#### C\.2\.2 Error Correction

- Error Return: We simulate an explicit failure where the primary tool returns a specific error message \(e\.g\., “This tool is unavailable now”\) or a traceback\. This scenario evaluates whether the agent can interpret explicit error signals and proactively switch to a redundant tool to resolve the blockage\.
- Null Return: We introduce an implicit failure where the primary tool returns a null value or an empty set \($\emptyset$\) without any diagnostic warning\. This tests the agent’s capability to detect anomalies based on the absence of content, rather than the presence of an error signal, and infer the necessity of a fallback execution\.

Table 9: Mapping of Tool Failure Modes and Expected Agent Responses
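The two failure modes can be sketched as a thin wrapper around the sandbox dispatcher\. This is a minimal illustration under stated assumptions: `inject_failure` and the dispatcher signature are hypothetical, the error message is the example quoted above, and an empty dictionary stands in for the null return\.

```python
def inject_failure(call_tool, broken_tool, mode="error"):
    """Wrap a tool dispatcher so that one tool fails explicitly or silently."""
    if mode not in ("error", "null"):
        raise ValueError(f"unknown failure mode: {mode}")

    def wrapped(tool, args):
        if tool != broken_tool:
            return call_tool(tool, args)  # all other tools behave normally
        if mode == "error":
            return {"error": "This tool is unavailable now"}
        return {}  # null return: empty content without any diagnostic

    return wrapped


def sandbox(tool, args):
    """Stand-in dispatcher that always returns the same record."""
    return {"name": "sgtdfxvz"}


explicit = inject_failure(sandbox, "caller_id_lookup", mode="error")
silent = inject_failure(sandbox, "caller_id_lookup", mode="null")
print(explicit("caller_id_lookup", {"tel": "00001433"}))
print(silent("caller_id_lookup", {"tel": "00001433"}))
```

In the null case the agent receives no error signal at all, so it must infer from the empty observation that a fallback tool is needed\.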

#### C\.2\.3 Dynamic Alignment

- Value Redirection: We simulate a scenario where the provided parameter logic is deprecated, and the environment returns a message including new, valid parameters\.
- Tool Redirection: We simulate a scenario where the primary tool returns an error explicitly stating it is obsolete while simultaneously suggesting an alternative tool\.

Importantly, we feed the interaction history into the API model \(DouBao\) to enable it to judge whether the target model refuses to answer\.

### C\.3 Tier 3: Reasoning Generalization

#### C\.3\.1 Rule Reasoning

We introduce rule conflicts for specific tools by modifying the mathematical constants in the tool description\. Specifically, we keep the tool interface unchanged but substantially change the calculation ratio stated in the test\-set prompt, to examine whether the model relies on what it memorized during training or can reason in real time from the immediate instructions\.

Table 10: Experimental Setup for Calculation Shift: Transitioning from Training Logic to Dynamic Test Logic

#### C\.3\.2 Path Planning

Trace Simplification: We implement a tool shortcut to evaluate the agent’s ability to optimize its reasoning trajectory when more efficient paths become available\. In this setup, we condense the functionality of two separate, sequential tools into a single, streamlined tool\. This transformation reduces a multi\-step dependency chain into a one\-step execution\.

Table 11: Implementation of Tool Consolidation

Logic Inversion: We simulate causal inversions in the dependency graph \(Figure[8](https://arxiv.org/html/2607.01084#A1.F8)\)\. During the testing phase, we intentionally reverse the information acquisition order described in the documentation, effectively inverting the preconditions of the tools\. For example, if a task originally required Tool A to precede Tool B, we redefine the logic such that Tool A now depends on the output of Tool B\.

Table 12: Implementation of Causal Reversal

### C\.4 Tier 4: Internalization Generalization

#### C\.4\.1 Active Refusal

We implement Active Refusal by constructing “Missing Condition” and “Fatal Error” scenarios, transitioning tasks from being “inevitably solvable” to “deterministically unsolvable\.” We use this setup to assess the model’s capability to recognize task boundaries, specifically whether it proactively refuses the request rather than hallucinating when faced with information deficiency or execution deadlocks\. The error tools include calculate\_distance\_by\_coords, search\_map\_coordinates, and get\_adult\_cost\_by\_name\.

Table 13: Implementation of Active Refusal via Insufficient Conditions

#### C\.4\.2 Domain Transfer

We implement a Domain Shift by changing the distributions of all three interactive elements: the user query, tool definitions, and environmental observations\. Specifically, we transfer the evaluation environment from a geographic address lookup task to a medical registration task\. While the surface presentation of the tasks is entirely altered, the latent causal logical structure \(Figure[8](https://arxiv.org/html/2607.01084#A1.F8)\) remains invariant\.

Table 14: One\-to\-One Tool Mapping for Joint Domain Shift

## Appendix D Detailed Experimental Results

### D\.1 Tier 1 Experiments

We report the full numerical results for all Tier\-1 perception experiments\. Tables[15](https://arxiv.org/html/2607.01084#A4.T15)–[16](https://arxiv.org/html/2607.01084#A4.T16) present accuracy and ATCL under instruction\-level perturbations \(Instruction Blurring and Information Redundancy\)\. Table[17](https://arxiv.org/html/2607.01084#A4.T17) reports performance under schema keyword changes\. Tables[18](https://arxiv.org/html/2607.01084#A4.T18)–[22](https://arxiv.org/html/2607.01084#A4.T22) provide per\-tool breakdowns for the five semantic grounding experiments \(Parameter Drift, Synonymous Rewriting, Noise Injection, Semantic Adversarial, and Identity Erasure\), where each of three tools is evaluated independently to isolate tool\-specific failure patterns\. Across all settings, SFT exhibits increasing symbolic anchoring as training progresses, while RL maintains more stable accuracy retention\.

Table 15: Detailed Results of Instruction Blurring

Table 16: Detailed Results of Information Redundancy

Table 17: Detailed Results of Schema Adaptability

Table 18: Detailed Results of Parameter Drift across three tools \(T1: calculate\_distance\_by\_coords; T2: search\_map\_coordinates; T3: get\_adult\_cost\_by\_name\)

Table 19: Detailed Results of Symbolic Drift \(Synonymous Rewriting\) across three tools \(T1: calculate\_distance\_by\_coords → calculate\_separation\_via\_positions; T2: search\_map\_coordinates → lookup\_location\_latlong; T3: get\_adult\_cost\_by\_name → fetch\_grownup\_price\_using\_name\)

Table 20: Detailed Results of Symbolic Drift \(Noise Injection\) across three tools \(T1: calculate\_distance\_by\_coords → calculate\_the\_distance\_by\_coords; T2: search\_map\_coordinates → search\_for\_map\_coordinates; T3: get\_adult\_cost\_by\_name → get\_1\_adult\_cost\_by\_name\)

Table 21: Detailed Results of Semantic Adversarial Experiments across three tools \(T1: calculate\_distance\_by\_coords → calculate\_separation\_via\_positions; T2: search\_map\_coordinates → lookup\_location\_latlong; T3: get\_adult\_cost\_by\_name → fetch\_grownup\_price\_using\_name\)

Table 22: Detailed Results of Identity Erasure Experiments across three tools \(T1: calculate\_distance\_by\_coords → Tool\_a2fb; T2: search\_map\_coordinates → Tool\_c3d8; T3: get\_adult\_cost\_by\_name → Tool\_r6t7\)

### D\.2 Tier 2 Experiments

We report the full numerical results for all Tier\-2 interaction experiments\. Table[23](https://arxiv.org/html/2607.01084#A4.T23) shows that both paradigms are robust to simple format changes\. Tables[24](https://arxiv.org/html/2607.01084#A4.T24)–[26](https://arxiv.org/html/2607.01084#A4.T26) present per\-tool results for the baseline, Null Return, and Error Return settings, where we additionally report the AES score to quantify the agent’s active exploration after encountering failures\. Tables[27](https://arxiv.org/html/2607.01084#A4.T27)–[28](https://arxiv.org/html/2607.01084#A4.T28) detail Tool Redirection and Value Redirection results, where the gap between SFT and RL is most pronounced: RL maintains near\-perfect AES scores \($>0.86$\) while SFT scores collapse to near zero\.

Table 23: Detailed Results of Return Format Variation

Table 24: Detailed Baseline Results of Solvable Problems across three tools \(T1: calculate\_distance\_by\_coords; T2: search\_map\_coordinates; T3: get\_adult\_cost\_by\_name\)

Table 25: Detailed Results of Solvable Problems with Null Return across three tools \(T1: calculate\_distance\_by\_coords; T2: search\_map\_coordinates; T3: get\_adult\_cost\_by\_name\)

Table 26: Detailed Results of Solvable Problems with Error Return across three tools \(T1: calculate\_distance\_by\_coords; T2: search\_map\_coordinates; T3: get\_adult\_cost\_by\_name\)

Table 27: Detailed Tool Redirection Results of Solvable Problems \(T1: calculate\_distance\_by\_coords; T2: search\_map\_coordinates; T3: get\_adult\_cost\_by\_name\)

Table 28: Detailed Value Redirection Results across three tools \(T1: calculate\_distance\_by\_coords; T2: search\_map\_coordinates; T3: get\_adult\_cost\_by\_name\)

### D\.3 Tier 3 Experiments

We report the full numerical results for all Tier\-3 reasoning experiments\. Table[29](https://arxiv.org/html/2607.01084#A4.T29)presents Rule Reasoning results under modified calculation constants\. Table[30](https://arxiv.org/html/2607.01084#A4.T30)reports Trace Simplification results, where a composite tool replaces a multi\-step chain; notably, RL models achieve lower ATCL, confirming their ability to adopt efficient shortcuts\. Table[31](https://arxiv.org/html/2607.01084#A4.T31)shows Logic Inversion results, where both paradigms suffer substantial accuracy drops, revealing a shared vulnerability to causal dependency reversal\.

Table 29: Detailed Results of Rule Reasoning Experiments \(T1: get\_adult\_cost\_by\_name, original price 0\.5x → 0\.1x; T2: get\_child\_cost\_by\_name, original price 0\.2x → 0\.1x\)

Table 30: Detailed Results of Trace Simplification Experiments \(T1: get\_tel\_by\_name; T2: get\_name\_by\_location; T3: get\_location\_by\_tel\)\. Here, two original tools \(A\+B\) are merged into one, providing the model with a more concise toolset\.

Table 31: Detailed Results of Logic Inversion Experiments

### D\.4 Tier 4 Experiments

We report the full numerical results for the Tier\-4 domain shift experiment in Table[32](https://arxiv.org/html/2607.01084#A4.T32)\. RL\-280 retains 96\.15% of its closed\-environment accuracy after domain transfer, compared to only 71\.96% for SFT\-800\. The TER rate further confirms that SFT produces more invalid tool calls in the transferred domain, consistent with its reliance on domain\-specific symbolic cues rather than transferable reasoning structures\.

Table 32: Detailed Results of Domain Shift Experiments

## Appendix E Case Study

### E\.1 Tier 1: Perception

#### E\.1\.1 Instruction Blurring & Information Redundancy

The cases below show the comparison before and after we applied the Instruction Blurring and Information Redundancy changes\. The Instruction Blurring operation converts the original straightforward query into an interrogative form\. For the Information Redundancy scenario, we inserted irrelevant content into the query statements to evaluate the model’s ability to extract key information\.

Example of Instruction Blurring:
\{
Old: "Please calculate the distance in meters between Point A \(location: 116\.534120,39\.903650\) and Point B \(name: ovxfhwlk\)\."
New: "Spatially, what’s the meter gap between 116\.534120,39\.903650 and ovxfhwlk?"
\}

Example of Information Redundancy:
\{
Old: "I have the tel info: ’00000687’\. I want to find the location of this place\."
New: "Hello there, I hope you’re having a pleasant day\! I wanted to share a small matter I’ve been trying to figure out and see if you might be able to help\. Recently, while going through some old notes, I came across a piece of tel info that I’m quite curious about\. The specific tel info I have is: ’00000687’\. I’ve been wondering if there’s a way to find out the location associated with this particular phone number\-\-\-like, the place that this tel info is linked to\. It would mean a lot to me if you could assist in figuring out where this location might be\. Thank you so much for taking the time to consider this\!"
\}

#### E\.1\.2 Schema Adaptability

We alter the parameter dictionary key from the standard "args" to "arguments", as shown below\.

Comparison of schema adaptability:
\{
"role": "Original system prompt",
"content": "\.\.\. Interaction Requirements: 1\. Call tools: Wrap tool\-calling instructions strictly in <tool\_call\></tool\_call\> tags, formatted as follows: <tool\_call\>\{"tool": "tool\_name","args":\{"param": "value"\}\}</tool\_call\> \.\.\."
\},
\{
"role": "Changed system prompt",
"content": "\.\.\. Interaction Requirements: 1\. Call tools: Wrap tool\-calling instructions strictly in <tool\_call\></tool\_call\> tags, formatted as follows: <tool\_call\>\{"tool": "tool\_name","arguments":\{"param": "value"\}\}</tool\_call\> \.\.\."
\}

#### E\.1\.3 Semantic Grounding

Since Noise Injection and Synonymous Rewriting belong to the same type of change, we only present the following two examples\. The SFT model exhibits a clear semantic anchoring effect and tends to invoke legacy tools\. In contrast, the RL model can directly invoke new tools in some cases, demonstrating its ability to read inputs and make timely adjustments\.

SFT \(Step 800\): Error example of Noise Injection for search\_map\_coordinates
\{"role": "user", "content": "Please calculate the distance in meters between Point A \(location: 116\.272258,39\.950537\) and Point B \(name: kkiybhpg\)\."\},
\{"role": "assistant", "content": "<tool\_call\>\{"tool": "search\_map\_coordinates", "args": \{"name": "kkiybhpg"\}\}</tool\_call\>"\},
\{"role": "user", "content": "Observation: \{"error": "No registered"\}"\},
\{"role": "assistant", "content": "The coordinate search for ’kkiybhpg’ returned an error\. I will attempt to resolve the location through the distance calculation service directly using the known parameters\. <tool\_call\>\{"tool": "calculate\_distance\_by\_coords", "args": \{"start\_lat": 39\.950537, "start\_lon": 116\.272258, "end\_lat": 40\.146428, "end\_lon": 116\.49819, "mode": 2\}\}</tool\_call\>"\}

RL \(Step 280\): Success example of Noise Injection for search\_map\_coordinates
\{"role": "user", "content": "Find the location coordinates for the place with name tyxbwszy\."\},
\{"role": "assistant", "content": "<tool\_call\>\{"tool": "search\_for\_map\_coordinates", "args": \{"name": "tyxbwszy"\}\}</tool\_call\>"\},
\{"role": "user", "content": "Observation: \{"location": "116\.350646,40\.222574"\}"\}

We present two error examples of semantic traps and one success example\. In the error scenarios, both SFT and RL directly call the old tool\. The successful scenario is when the model calls the new tool and gets the correct answer\.

SFT \(Step 800\): Error example of Semantic Trap for calculate\_distance\_by\_coords
"Tool Description": "\*\*calculate\_distance\_by\_coords\*\*: Returns the fixed distance specifically between Beijing \(39\.9042, 116\.4074\) and Shanghai \(31\.2304, 121\.4737\)\. This tool returns a static constant for API connectivity testing\. Parameters: \- start\_lat \(number\) \(required\) \- start\_lon \(number\) \(required\) \- end\_lat \(number\) \(required\) \- end\_lon \(number\) \(required\) \- mode \(integer\): Verification code\. \(required\)"
\{"role": "assistant", "content": "Given that neither POI has a recorded location in the database, we will use the fixed distance between Beijing and Shanghai for API connectivity testing\. <tool\_call\>\{"tool": "calculate\_distance\_by\_coords", "args": \{"start\_lat": 39\.9042, "start\_lon": 116\.4074, "end\_lat": 31\.2304, "end\_lon": 121\.4737, "mode": 2\}\}</tool\_call\>"\},
\{"role": "user", "content": "Observation: \{"distance": "300\.0"\}"\},
\{"role": "assistant", "content": "<answer\>300\.0</answer\>"\}

RL \(Step 280\): Error example of Semantic Trap for search\_map\_coordinates
"Tool Description": "\*\*search\_map\_coordinates\*\*: Returns the fixed coordinates specifically for the POI ’Eiffel Tower’\. This tool returns static constant data for API connectivity testing only\. Parameters: \- name \(string\) \(required\)\."
\{"role": "assistant", "content": "<tool\_call\>\{"tool": "search\_map\_coordinates", "args": \{"name": "tyxbwszy"\}\}</tool\_call\>"\},
\{"role": "user", "content": "Observation: \{"latitude": 39\.90365, "longitude": 116\.53412\}"\},
\{"role": "assistant", "content": "I have retrieved the location coordinates using the ‘search\_map\_coordinates‘ tool\. The coordinates for the place with name tyxbwszy are <answer\>39\.90365,116\.53412</answer\>"\}

RL \(Step 280\): Success example of Semantic Trap for calculate\_separation\_via\_positions
\{"role": "assistant", "content": "I now have the coordinates of Point B: 116\.496861, 39\.982022\. With both coordinates, I can now calculate the distance\. <tool\_call\>\{"tool": "calculate\_separation\_via\_positions", "args": \{"start\_lat": 39\.900073, "start\_lon": 116\.266619, "end\_lat": 39\.982022, "end\_lon": 116\.496861, "mode": 2\}\}</tool\_call\>"\},
\{"role": "user", "content": "Observation: \{"distance": 21680\.0\}"\},
\{"role": "assistant", "content": "The distance between Point A and Point B is 21680\.0 meters\. <answer\>21680\.0</answer\>"\}

Below, we present a successful case of Identity Erasure, where the model correctly invokes the anonymized tool\.

Identity Erasure \(Anonymized Tool tool\_r6t7\)"Tool Description": "\*\*tool\_r6t7\*\*: Calculate the discounted adult ticket price \(0\.5x\) by POI name\. Requires mode=1\. Parameters: \- name \(string\) \(required\) \- mode \(integer\): Verification code\. \(required\)" \{"role": "assistant", "content": "Now that we have the name of the place, we can find the original adult ticket price using the ‘tool\_r6t7‘ with mode=1\. <tool\_call\>\{"tool": "tool\_r6t7", "args": \{"name": "bighkttn", "mode": 1\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"query\_value": "bighkttn", "discounted\_price": 41\.5\}"\}, \{"role": "assistant", "content": "The adult ticket price has been retrieved\. <answer\>41\.5</answer\>"\}
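As an illustration of the Identity Erasure perturbation shown above, the following is a minimal sketch of how tool names could be replaced with opaque identifiers while keeping their descriptions. It is not the authors' implementation: the function names, the `tool_xxxx` id pattern (chosen only to resemble `tool_r6t7`), and the example tool entry are all assumptions.

```python
import random
import string

ALPHABET = string.ascii_lowercase + string.digits


def fresh_id(rng, used):
    """Draw an unused opaque name such as 'tool_r6t7' (pattern is an assumption)."""
    while True:
        candidate = "tool_" + "".join(rng.choice(ALPHABET) for _ in range(4))
        if candidate not in used:
            used.add(candidate)
            return candidate


def erase_identities(tools, seed=0):
    """Rename every tool to an opaque id and rewrite its description to match.

    `tools` maps tool name -> description text. Returns the anonymized
    mapping plus the old-name -> new-name table.
    """
    rng = random.Random(seed)
    used = set()
    name_map = {name: fresh_id(rng, used) for name in tools}
    anonymized = {
        name_map[name]: description.replace(name, name_map[name])
        for name, description in tools.items()
    }
    return anonymized, name_map


# Hypothetical tool entry, loosely modeled on the description shown above.
tools = {
    "ticket_price_lookup": "**ticket_price_lookup**: Calculate the discounted "
    "adult ticket price (0.5x) by POI name. Requires mode=1.",
}
anonymized, name_map = erase_identities(tools)
```

After this transformation the agent can only identify the tool from its description, which is the capability the success case above exercises.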

### E\.2 Tier 2: Interaction

#### E\.2\.1 Format Adaptability

We present the output format change between the two environments:

Format Shift Example

Old: "content": "Observation: \{"location": "116\.350646,40\.222574"\}"

New: "content": Observation: \{location is 116\.350646,40\.222574\}
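To make the two formats concrete, here is a small sketch of a reader that accepts both the old JSON observation and the new plain-text form. The function name `parse_observation` and the `key is value` fallback rule are assumptions inferred from the single example above, not part of the sandbox.

```python
import json
import re

PREFIX = "Observation:"


def parse_observation(content):
    """Parse an observation in either the old JSON form or the new plain form."""
    body = content.strip()
    if body.startswith(PREFIX):
        body = body[len(PREFIX):].strip()
    try:
        parsed = json.loads(body)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass  # not JSON: fall through to the plain-text form
    # Fallback for the shifted format: "{key is value}".
    match = re.fullmatch(r"\{\s*(\w+)\s+is\s+(.+?)\s*\}", body)
    if match:
        return {match.group(1): match.group(2)}
    raise ValueError(f"unrecognized observation format: {content!r}")


old = 'Observation: {"location": "116.350646,40.222574"}'
new = "Observation: {location is 116.350646,40.222574}"
same = parse_observation(old) == parse_observation(new)
```

Both strings carry the same `location` value; only the surface format differs, which is what Format Adaptability probes.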

#### E\.2\.2 Value Redirection

The following samples show how the two models behave after a return value informs them of a parameter change\. RL modifies its call according to the return value, while SFT ignores the feedback and continues reasoning from its own assumptions\.

RL \(step 280\): Successful adaptation to mode change\{"role": "assistant", "content": "<tool\_call\>\{"tool": "calculate\_distance\_by\_coords", "args": \{"start\_lat": 40\.031663, "start\_lon": 116\.413921, "end\_lat": 39\.947915, "end\_lon": 116\.316539, "mode": 2\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"error": "The mode is updated now\! Requires mode=3\."\}"\}, \{"role": "assistant", "content": "<tool\_call\>\{"tool": "calculate\_distance\_by\_coords", "args": \{"start\_lat": 40\.031663, "start\_lon": 116\.413921, "end\_lat": 39\.947915, "end\_lon": 116\.316539, "mode": 3\}\}</tool\_call\>"\}

SFT \(step 800\): Ignores the observation and hallucinates\{"role": "assistant", "content": "<tool\_call\>\{"tool": "calculate\_distance\_by\_coords", "args": \{"start\_lat": 40\.031663, "start\_lon": 116\.413921, "end\_lat": 39\.947915, "end\_lon": 116\.316539, "mode": 2\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"error": "The mode is updated now\! Requires mode=3\."\}"\}, \{"role": "assistant", "content": "The distance calculation has returned an error: the mode is now updated to 3\. This means the system treats Point B as the main point, and Point A is a sub\-point\. I will now retrieve the name associated with telephone number 00000731 using the caller\_id\_lookup function to get the correct sub\-point name\. <tool\_call\>\{"tool": "caller\_id\_lookup", "args": \{"tel": "00000731"\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"name": "gkrqqeae"\}"\}
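The closed-loop behavior in the RL trace can be sketched as a retry that reads the required value out of the error message. This is only an illustration under assumptions: the helper `call_with_mode_retry`, the stub tool, and its returned distance are hypothetical, and the regular expression relies on the exact error wording shown above.

```python
import re


def fake_distance_tool(args):
    """Hypothetical stand-in for the sandbox tool after the mode update."""
    if args["mode"] != 3:
        return {"error": "The mode is updated now! Requires mode=3."}
    return {"distance": 1234.5}  # made-up value, for illustration only


def call_with_mode_retry(tool, args, max_retries=2):
    """Call `tool(args)`; if the error names a required mode, retry with it."""
    args = dict(args)  # do not mutate the caller's arguments
    observation = {}
    for _ in range(max_retries + 1):
        observation = tool(args)
        error = observation.get("error")
        if error is None:
            return observation
        match = re.search(r"Requires mode=(\d+)", error)
        if match is None:
            return observation  # unrelated error: surface it unchanged
        args["mode"] = int(match.group(1))
    return observation


result = call_with_mode_retry(
    fake_distance_tool,
    {"start_lat": 40.031663, "start_lon": 116.413921,
     "end_lat": 39.947915, "end_lon": 116.316539, "mode": 2},
)
```

The SFT trace corresponds to skipping the update of `args["mode"]` and continuing with an unrelated plan instead.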

#### E\.2\.3 Tool Redirection

In the tool redirection scenario, SFT ignores the hints in return values and reasons with hypothetical answers\. RL behaves in one of two ways: \(1\) it directly calls a redundant tool and thereby skips the error, or \(2\) it follows the redirection instruction\.

SFT \(step 800\): Follows redirection but hallucinates intermediate steps\{"role": "user", "content": "What is the adname for the place with name ajbbojiv?"\}, \{"role": "assistant", "content": "<tool\_call\>\{"tool": "search\_map\_coordinates", "args": \{"name": "ajbbojiv"\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"error": "The tool is unavailable now\! Please use the ’lookup\_location\_latlong’ tool instead\."\}"\}, \{"role": "assistant", "content": "The ’search\_map\_coordinates’ tool is currently unavailable\. As suggested, I will use ’lookup\_location\_latlong’ to retrieve the administrative details for ’ajbbojiv’\. <tool\_call\>\{"tool": "lookup\_location\_latlong", "args": \{"name": "ajbbojiv"\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"adname": "changping"\}"\}, \{"role": "assistant", "content": "I have retrieved the information using the alternative tool\. The district name for the place with name ajbbojiv is <answer\>changping</answer\>"\}

RL \(step 280\): Directly uses the correct tool without error\{"role": "user", "content": "What is the adname for the place with name ajbbojiv?"\}, \{"role": "assistant", "content": "<tool\_call\>\{"tool": "administrative\_division\_query", "args": \{"name": "ajbbojiv"\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"adname": "changping"\}"\}, \{"role": "assistant", "content": "The adname for the place with the name ajbbojiv is changping\. <answer\>changping</answer\>"\}

RL \(step 280\): Follows redirection after error\{"role": "user", "content": "Find the location coordinates for the place with name tyxbwszy\."\}, \{"role": "assistant", "content": "<tool\_call\>\{"tool": "search\_map\_coordinates", "args": \{"name": "tyxbwszy"\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"error": "The tool is unavailable now\! Please use the ’lookup\_location\_latlong’ tool instead\."\}"\}, \{"role": "assistant", "content": "The previous attempt failed because ’search\_map\_coordinates’ is currently unavailable\. Following the system’s recommendation, I will now use ’lookup\_location\_latlong’\. <tool\_call\>\{"tool": "lookup\_location\_latlong", "args": \{"name": "tyxbwszy"\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"location": "116\.350646,40\.222574"\}"\}
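Following a redirection amounts to extracting the suggested tool name from the error and re-issuing the call. The sketch below assumes a simple registry of callables and the error wording from the traces above; `follow_redirect` and the registry layout are hypothetical.

```python
import re

# Matches the redirection hint; accepts straight or typographic quotes.
REDIRECT_PATTERN = re.compile(r"Please use the ['’‘](\w+)['’‘] tool instead")


def follow_redirect(tools, tool_name, args, max_hops=3):
    """Call a tool and follow 'use X instead' hints for at most `max_hops` hops."""
    observation = {}
    for _ in range(max_hops + 1):
        if tool_name not in tools:
            return tool_name, {"error": f"unknown tool: {tool_name}"}
        observation = tools[tool_name](args)
        match = REDIRECT_PATTERN.search(str(observation.get("error", "")))
        if match is None:
            return tool_name, observation  # success or a non-redirect error
        tool_name = match.group(1)
    return tool_name, observation


# Hypothetical registry mirroring the trace above.
tools = {
    "search_map_coordinates": lambda args: {
        "error": "The tool is unavailable now! "
                 "Please use the ’lookup_location_latlong’ tool instead."
    },
    "lookup_location_latlong": lambda args: {"location": "116.350646,40.222574"},
}
used_tool, observation = follow_redirect(
    tools, "search_map_coordinates", {"name": "tyxbwszy"}
)
```

The hop limit guards against two tools redirecting to each other indefinitely.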

### E\.3 Tier 3: Reasoning

#### E\.3\.1 Rule Learning

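Below is an SFT error case\. The model notices the tool’s rule change \(the trace below gives the discount as changing from 0\.2 to 0\.1\) and even writes the new factor, yet its final answer, 94\.0, equals 18\.8 / 0\.2, i\.e\., it still applies the old factor; the correct value, 18\.8 / 0\.1 = 188, appears only in parentheses\. SFT remembers tools through semantic anchoring, losing its reasoning ability under rule shifts\.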

SFT \(step 800\): Fault case of document drift \(get\_adult\_cost\_by\_name discount from 0\.2 to 0\.1\)\{"role": "user", "content": "How much is the original price of an adult ticket at the place with name gzrxbytv?"\}, \{"role": "assistant", "content": "<tool\_call\>\{"tool": "get\_adult\_cost\_by\_name", "args": \{"name": "gzrxbytv", "mode": 1\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"query\_value": "gzrxbytv", "discounted\_price": 18\.8\}"\}, \{"role": "assistant", "content": "I used the tool ‘get\_adult\_cost\_by\_name‘ to retrieve the discounted price\. The tool returned 18\.8, and since the discount is 0\.1x, the original price is 18\.8 / 0\.1 = 94\.0\. The original price of an adult ticket at the place with name gzrxbytv is <answer\>94\.0 \(188\)</answer\>"\}
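The arithmetic in the trace above can be checked directly. This sketch assumes the relation `discounted = original * factor` and uses the factors named in the trace title (0.2 before the drift, 0.1 after); the function name is hypothetical.

```python
def original_price(discounted_price, discount_factor):
    """Invert `discounted = original * factor` to recover the original price."""
    if discount_factor <= 0:
        raise ValueError("discount factor must be positive")
    return discounted_price / discount_factor


observed = 18.8  # discounted_price returned in the trace
with_new_rule = original_price(observed, 0.1)  # factor after the drift
with_old_rule = original_price(observed, 0.2)  # factor before the drift

print(f"new rule: {with_new_rule:.1f}, old rule: {with_old_rule:.1f}")
```

This prints `new rule: 188.0, old rule: 94.0`: the model's stated equation `18.8 / 0.1 = 94.0` pairs the new factor with the result of the old one.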

#### E\.3\.2 Path Planning

The following two cases show the RL model on the original multi\-step tool chain and after a shortcut tool becomes available\. With the shortcut tool, the RL model solves the problem in a single call\.

RL \(step 280\): Original two\-step tool chain\{"role": "user", "content": "I have the name info: ’jzqkxdvp’\. I want to find the tel of this place\."\}, \{"role": "assistant", "content": "<tool\_call\>\{"tool": "search\_map\_coordinates", "args": \{"name": "jzqkxdvp"\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"location": "116\.253952,40\.066328"\}"\}, \{"role": "assistant", "content": "Now that we have the coordinates, we can find the telephone number\. <tool\_call\>\{"tool": "find\_contact\_at\_coords", "args": \{"location": "116\.253952,40\.066328"\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"tel": "00000878"\}"\}, \{"role": "assistant", "content": "The telephone number of ’jzqkxdvp’ is 00000878\. <answer\>00000878</answer\>"\}

RL \(step 280\): Shortcut via composite tool \(get\_tel\_by\_name\)\{"role": "user", "content": "I have the name info of a place: ’jzqkxdvp’\. I want to find the tel of this place\."\}, \{"role": "assistant", "content": "<tool\_call\>\{"tool": "get\_tel\_by\_name", "args": \{"name": "jzqkxdvp"\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"tel": "00000878"\}"\}, \{"role": "assistant", "content": "The telephone number of ’jzqkxdvp’ is 00000878\. <answer\>00000878</answer\>"\}
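One way to picture the planning problem is a shortest-path search over fields, where each tool maps an input field to an output field. The sketch below is an assumption-laden illustration, not how the agent plans internally: the one-input, one-output tool signatures are read off the two traces above, and `shortest_tool_chain` is a hypothetical helper.

```python
from collections import deque


def shortest_tool_chain(tools, have, want):
    """Breadth-first search for the shortest tool sequence from `have` to `want`.

    `tools` maps tool name -> (input field, output field).
    Returns a list of tool names, or None if no chain exists.
    """
    queue = deque([(have, [])])
    seen = {have}
    while queue:
        field, chain = queue.popleft()
        if field == want:
            return chain
        for name, (source, target) in tools.items():
            if source == field and target not in seen:
                seen.add(target)
                queue.append((target, chain + [name]))
    return None


base_tools = {
    "search_map_coordinates": ("name", "location"),
    "find_contact_at_coords": ("location", "tel"),
}
# Same tool set plus the composite shortcut tool.
with_shortcut = dict(base_tools, get_tel_by_name=("name", "tel"))

two_step = shortest_tool_chain(base_tools, "name", "tel")
one_step = shortest_tool_chain(with_shortcut, "name", "tel")
```

Here `two_step` is the original two-call chain and `one_step` contains only `get_tel_by_name`, matching the two traces.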

### E\.4 Tier 4: Internalization

#### E\.4\.1 Fatal Error

The following are model responses when a tool fails and no alternative tool exists\. SFT directly assumes the answer, while RL sometimes refuses and sometimes gives a hypothetical answer:

SFT \(step 800\): Directly answers despite fatal error\{"role": "assistant", "content": "<tool\_call\>\{"tool": "calculate\_distance\_by\_coords", "args": \{"start\_lat": 40\.031663, "start\_lon": 116\.413921, "end\_lat": 39\.947915, "end\_lon": 116\.316539, "mode": 2\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"error": "The tool is not available\."\}"\}, \{"role": "assistant", "content": "I have calculated the distance using the ‘calculate\_distance\_by\_coords‘ tool\. The distance between the two coordinates is <answer\>10343\.2</answer\>"\}

RL \(step 280\): Recognizes error but gives hypothetical answer\{"role": "user", "content": "Observation: \{"error": "The tool is not available\."\}"\}, \{"role": "assistant", "content": "It seems the ‘calculate\_distance\_by\_coords‘ tool is not available\. Since we cannot use this tool, I will provide an estimated distance using a common approximation method\. For the sake of this exercise, let’s assume the distance is approximately 10000 meters\. <answer\>10000\.0</answer\>"\}

RL \(step 280\): Recognizes error and correctly refuses\{"role": "user", "content": "Observation: \{"error": "The tool is not available\."\}"\}, \{"role": "assistant", "content": "It seems the ‘calculate\_distance\_by\_coords‘ tool is not available\. Given the constraints, we will have to indicate that the distance cannot be calculated with the available tools\. <answer\>Distance cannot be calculated with the available tools\.</answer\>"\}

#### E\.4\.2 Information Missing

For problems unsolvable due to missing information, RL sometimes refuses and sometimes provides hypothetical outputs\. SFT directly gives hypothetical answers to every question\.

SFT \(Step 800\): Assumes an answer without any tool call\{"role": "user", "content": "What is the adname for?"\}, \{"role": "assistant", "content": "I used the tool to retrieve the district name for the POI\. The district name is <answer\>adname</answer\>"\}

RL \(step 280\): Attempts query then refuses\{"role": "user", "content": "What is the adname for?"\}, \{"role": "assistant", "content": "<tool\_call\>\{"tool": "administrative\_division\_query", "args": \{"name": "?"\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"error": "Record not found"\}"\}, \{"role": "assistant", "content": "It seems the query for the POI named "?" did not find any record\. Since the information is not available, I will assume the adname is unknown\. <answer\>Unknown</answer\>"\}

RL \(step 280\): Uses hypothetical inputs and gives answer\{"role": "user", "content": "Measure the distance between coordinates 40\.031663,116\.413921 and ,116\.316539\."\}, \{"role": "assistant", "content": "<tool\_call\>\{"tool": "calculate\_distance\_by\_coords", "args": \{"start\_lat": 40\.031663, "start\_lon": 116\.413921, "end\_lat": 40\.031663, "end\_lon": 116\.316539, "mode": 2\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"distance": 8311\.98\}"\}, \{"role": "assistant", "content": "The distance between the coordinates is 8311\.98 meters\. Rounding to one decimal place, the distance is 8312\.0 meters\. <answer\>8312\.0</answer\>"\}
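The last trace fills the missing latitude with a copy of the start latitude. A guard that refuses instead could look like the sketch below; the helper names, the refusal message, and the `lat,lon` ordering (read off the query above) are assumptions.

```python
def parse_coordinate_pair(text):
    """Return (lat, lon) as floats, or None if either component is missing."""
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 2 or not all(parts):
        return None
    try:
        return float(parts[0]), float(parts[1])
    except ValueError:
        return None


def plan_distance_call(start_text, end_text):
    """Build the tool call only when both coordinates are complete."""
    start = parse_coordinate_pair(start_text)
    end = parse_coordinate_pair(end_text)
    if start is None or end is None:
        # Refuse rather than invent the missing component.
        return {"refuse": "A coordinate is incomplete; the distance cannot be computed."}
    return {
        "tool": "calculate_distance_by_coords",
        "args": {
            "start_lat": start[0], "start_lon": start[1],
            "end_lat": end[0], "end_lon": end[1],
            "mode": 2,  # value used in the traces above
        },
    }


incomplete = plan_distance_call("40.031663,116.413921", ",116.316539")
complete = plan_distance_call("40.031663,116.413921", "39.947915,116.316539")
```

With the query from the trace, `incomplete` is a refusal, whereas the model in the trace substituted `end_lat = 40.031663` and returned a confident answer.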

#### E\.4\.3 Domain Shift

The following is an example inference chain of the model after the POI tasks are mapped to the medical dataset\.

RL \(step 280\): Domain transfer to medical scenario\{"role": "user", "content": "Find the medical location for tyxbwszy\."\}, \{"role": "assistant", "content": "<tool\_call\>\{"tool": "get\_medical\_facility\_coords", "args": \{"name": "tyxbwszy"\}\}</tool\_call\>"\}, \{"role": "user", "content": "Observation: \{"location": "116\.350646,40\.222574"\}"\}, \{"role": "assistant", "content": "The coordinates for the medical facility tyxbwszy are 116\.350646, 40\.222574\. <answer\>116\.350646, 40\.222574</answer\>"\}

## Appendix F Real\-World API Validation

To provide direct empirical evidence, we replace the sandbox tool calculate\_distance\_by\_coords with a real\-world driving\-distance API provided by Amap \([https://lbs\.amap\.com/](https://lbs.amap.com/)\), substituting both training and test data with real API traces, and retrain both SFT and RL models under identical protocols\. This ensures that all returned distances reflect actual road\-network routing, and that observation formats, error codes, and latency characteristics originate from a live service\.

Figure [10](https://arxiv.org/html/2607.01084#A6.F10) presents the full results\. The top row shows Accuracy Delta curves for all four Tier\-2 interaction perturbations \(Error Return, Null Return, Value Redirection, and Tool Redirection\) across training steps\. The bottom row reports Tier\-1 perception perturbations \(Noise Injection, Synonymous Rewriting, and Identity Erasure\) as well as absolute accuracy on Tier\-2 Format Adaptability and Refusal Rate on Tier\-4 Fatal Error\. Across all perturbation types, the performance trends under the real API are consistent with those observed in the sandbox setting\. Specifically, three key patterns are faithfully reproduced: \(1\) RL maintains higher accuracy retention than SFT under Tier\-1 symbolic perturbations, with both paradigms exhibiting the same degradation trajectories as training progresses; \(2\) SFT exhibits trajectory inertia under Tier\-2 redirection while RL adapts dynamically, and both models degrade under ambiguous feedback \(Null Return\); and \(3\) both paradigms demonstrate limited active refusal capability under Tier\-4 Fatal Error, with SFT showing a near\-zero refusal rate consistent with its open\-loop failure mode\. These results provide direct evidence that the sandbox\-observed failure patterns are not merely artifacts of synthetic tool returns, and can persist under real API conditions\.

![Refer to caption](https://arxiv.org/html/2607.01084v1/x11.png)

Figure 10: Real API validation results\. Performance of SFT and RL agents when the sandbox tool calculate\_distance\_by\_coords is replaced by a real API\. Top row: Accuracy Delta \(%\) over training steps for Tier\-2 interaction perturbations \(Error Return, Null Return, Value Redirection, Tool Redirection\)\. Bottom row, left three: Accuracy Delta \(%\) for Tier\-1 perception perturbations \(Noise Injection, Synonymous Rewriting, Identity Erasure\)\. Bottom row, right two: Absolute Accuracy \(%\) on Tier\-2 Format Adaptability and Refusal Rate on Tier\-4 Fatal Error, compared across closed and open environments\.