Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

arXiv cs.CL Papers

Summary

Introduces AutoTool, a model that adaptively decides whether to invoke tools for multimodal LLM reasoning, achieving significant accuracy and efficiency gains through reinforcement learning and dual-mode reasoning.

arXiv:2605.19852v1 Announce Type: new

# Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
Source: [https://arxiv.org/html/2605.19852](https://arxiv.org/html/2605.19852)
###### Abstract

Tool\-augmented reasoning has emerged as a promising direction for enhancing the reasoning capabilities of multimodal large language models \(MLLMs\)\. However, existing studies mainly focus on enabling models to perform tool invocation, while neglecting whether tool invocation is necessary\. We argue that tool usage is not always beneficial, as redundant or inappropriate invocations substantially increase reasoning overhead and can even mislead model predictions\. To address this issue, we introduce AutoTool, a model that adaptively decides whether to invoke tools according to the characteristics of each query\. Within a reinforcement learning framework, we design an explicit dual\-mode reasoning strategy with mode\-specific reward functions to guide the model toward producing accurate responses\. Moreover, to prevent premature bias toward a single reasoning mode, AutoTool jointly explores and balances tool\-assisted and text\-centric reasoning throughout training, and promotes free exploration in later stages\. Extensive experiments demonstrate that AutoTool exhibits outstanding performance and high efficiency, yielding a 21\.8% accuracy gain on the V\* benchmark compared to the base model, and a 44\.9% improvement in efficiency over existing tool\-augmented methods on the POPE benchmark\. Code is available at [https://github\.com/MQinghe/AutoTool](https://github.com/MQinghe/AutoTool)\.

Machine Learning, ICML

## 1 Introduction

By decomposing complex problems into a sequence of reasoning steps, chain\-of\-thought \(CoT\) prompting \(Wei et al\., [2022](https://arxiv.org/html/2605.19852#bib.bib10); Kojima et al\., [2022](https://arxiv.org/html/2605.19852#bib.bib11)\) has endowed multimodal large language models \(MLLMs\) \(Team et al\., [2023](https://arxiv.org/html/2605.19852#bib.bib12); Liu et al\., [2023](https://arxiv.org/html/2605.19852#bib.bib14); Wang et al\., [2024b](https://arxiv.org/html/2605.19852#bib.bib13); Bai et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib25)\) with stronger reasoning capabilities\. However, most existing approaches follow the textual reasoning paradigm of large language models \(LLMs\) \(Achiam et al\., [2023](https://arxiv.org/html/2605.19852#bib.bib28); Dubey et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib27); Guo et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib21)\), leaving current MLLMs constrained by linguistic bias that limits their ability to effectively leverage multimodal information\. The multimodal CoT \(MCoT\) prompt \(Zhang et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib31); Wang et al\., [2025a](https://arxiv.org/html/2605.19852#bib.bib32)\), exemplified by the “Thinking with Images” approach of OpenAI o3 \(OpenAI, [2025](https://arxiv.org/html/2605.19852#bib.bib29)\), injects multimodal context into reasoning to strengthen visual cues and cross\-modal interactions\.

![Refer to caption](https://arxiv.org/html/2605.19852v1/x2.png)

Figure 1: \(a, b\) Representative queries that do or do not trigger the zoom\-in tool, illustrating that tool usage is not always necessary, while AutoTool adaptively invokes tools when beneficial\. \(c, d\) Comparison of the proportion of tool\-augmented reasoning trajectories during training, as well as the training and inference time costs between our AutoTool and SOTA DeepEyes \(Zheng et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib33)\)\.

In MCoT, visual information is typically derived from external tools such as additional search engines \(Fan et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib34); Komeili et al\., [2021](https://arxiv.org/html/2605.19852#bib.bib35)\), multiple visual models \(Ma et al\., [2025c](https://arxiv.org/html/2605.19852#bib.bib36); Qi et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib37)\), or image processing methods \(Su et al\., [2025b](https://arxiv.org/html/2605.19852#bib.bib38); Zheng et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib33)\)\. Recent progress in reinforcement learning \(Shao et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib39); Guo et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib21); Chen et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib40)\) allows models to acquire tool\-usage skills in a more cost\-efficient and flexible way \(Su et al\., [2025b](https://arxiv.org/html/2605.19852#bib.bib38); Zheng et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib33); Su et al\., [2025a](https://arxiv.org/html/2605.19852#bib.bib59)\)\. 
While MCoT demonstrates superior reasoning capabilities compared to text\-centric CoT on multiple benchmarks, it also introduces two major challenges\. The first lies in the significantly increased training and inference costs\. Existing tool\-augmented reasoning models, such as OpenThinkIMG \(Su et al\., [2025b](https://arxiv.org/html/2605.19852#bib.bib38)\) and DeepEyes \(Zheng et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib33)\), often rely on fixed tool invocation orchestration or inadequate reward designs\. Consequently, they implicitly focus on learning how to invoke tools correctly and generate accurate answers, while neglecting whether tool usage is truly necessary\. As illustrated in [Figure 1](https://arxiv.org/html/2605.19852#S1.F1)\(c\) and [Figure 1](https://arxiv.org/html/2605.19852#S1.F1)\(d\), taking DeepEyes \(Zheng et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib33)\) as an example, it consistently encourages tool invocation regardless of task difficulties\. Even for simple queries, the model tends to engage in unnecessary multi\-turn reasoning, substantially increasing computational overhead during both training and inference\. Hence, DeepEyes requires 44\.9 training hours, 20\.3% more than adaptive tool invocation, indicating that redundant tool usage severely slows down the reasoning process\. Furthermore, erroneous tool invocations may interfere with reasoning\. As shown in [Figure 1](https://arxiv.org/html/2605.19852#S1.F1)\(b\), when answering a question about the spatial relationship between a person and a car, the model should rely on global understanding, where zoom\-in tool invocation is unnecessary\. However, DeepEyes incorrectly invokes the tool to focus solely on the car region, rather than the combined area of the car and the person, introducing redundant visual information that distracts the reasoning process and ultimately leads to hallucinated responses\. 
In such cases, the autoregressive nature of LLMs makes frequent tool invocations particularly problematic, as they amplify irrelevant visual cues and cause error accumulation, further intensifying reasoning distraction and hallucination\.

In our opinion, when handling a multimodal query, an ideal model should carefully determine whether tool assistance is necessary before invocation\. Taking the zoom\-in operation as an example, intuitively, if a question requires close inspection or verification of fine\-grained visual details, the zoom\-in tool becomes essential\. As illustrated in [Figure 1](https://arxiv.org/html/2605.19852#S1.F1)\(a\), where the task involves identifying a specific object among multiple candidates, zooming into the target region substantially improves the likelihood of a correct answer\. In contrast, as shown in [Figure 1](https://arxiv.org/html/2605.19852#S1.F1)\(b\), when the question involves global understanding, overall layout reasoning, or when the target region is already sufficiently clear, invoking the zoom\-in tool yields negligible benefit and may even introduce unnecessary distractions\.

To address the issue of existing methods that overemphasize tool usage, we introduce AutoTool, which empowers the model to adaptively decide when to engage in “Think with Images” reasoning, reconsidering the common belief that “tools are always beneficial”\. By explicitly controlling tool usage through two special tokens, <tool\_on\> and <tool\_off\>, AutoTool employs dual reasoning modes that leverage tools for complex problems while recognizing that simple queries can be solved without tool assistance\. This paradigm improves both training and inference efficiency, as well as mitigating hallucinated responses\. Instead of relying on carefully curated SFT data for cold\-start training, we adopt an end\-to\-end reinforcement learning framework that encourages the model to fully explore the two reasoning modes in a simple yet effective manner\.

Within this explicit dual\-mode paradigm, we design distinct reward functions to evaluate reasoning trajectories under different reasoning modes, which we refer to as Mode\-Specific Policy Optimization \(MSPO\)\. For the <tool\_on\> mode, the model is trained to accurately utilize the tool while providing correct answers\. Unlike prior methods \(Su et al\., [2025b](https://arxiv.org/html/2605.19852#bib.bib38); Zheng et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib33)\) that primarily emphasize tool invocation, we penalize instances where the model invokes tools but produces incorrect answers, reducing unnecessary or ineffective tool operations\. For the <tool\_off\> mode, the model relies entirely on its internal reasoning to generate accurate answers\. However, learning to master dual reasoning modes is nontrivial\. Due to the inherent reasoning bias of the foundation model, the policy model tends to prefer the <tool\_off\> mode, which often yields higher rewards more easily, leaving the <tool\_on\> mode underexplored\. To mitigate this imbalance, we propose an Adaptive Mode Balancing \(AMB\) strategy that dynamically adjusts the reward coefficients to control the frequency of the two modes, ensuring sufficient exploration for both\. The constraint is relaxed in the later stage of training, allowing the model to freely determine its preferred mode\. Our contributions can be summarized as follows:

- We analyze the pros and cons of tool\-assisted reasoning for MLLMs\. While tool invocation can enhance reasoning capabilities, blindly encouraging tool usage increases both training and inference costs and may introduce distracting or redundant information\.
- We design Mode\-Specific Policy Optimization \(MSPO\), which assigns distinct optimization objectives to different reasoning modes, enabling the model to learn adaptive reasoning with or without tools\.
- We propose Adaptive Mode Balancing \(AMB\), which dynamically adjusts the frequency of the two modes to ensure sufficient exploration of dual reasoning modes throughout training\.

Extensive experiments on multiple multimodal benchmarks demonstrate that AutoTool achieves superior reasoning capability and high efficiency\.

## 2 Related Work

![Refer to caption](https://arxiv.org/html/2605.19852v1/x3.png)

Figure 2: Illustration of the AutoTool training framework\. Given a multimodal problem, the policy model first decides whether the subsequent reasoning process requires tool invocation\. For each batch of generated reasoning trajectories, different reward functions are applied to evaluate the trajectories under distinct reasoning modes via Mode\-Specific Policy Optimization \(MSPO\), and the tool invocation reward coefficient is estimated through the Adaptive Mode Balancing \(AMB\) strategy\. The model is optimized via GRPO\.

### 2\.1 Multimodal Large Language Models

The emergence of multimodal large language models \(MLLMs\) \(Liu et al\., [2023](https://arxiv.org/html/2605.19852#bib.bib14); Team et al\., [2023](https://arxiv.org/html/2605.19852#bib.bib12); Hurst et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib67); Bai et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib25); Fa et al\., [2026](https://arxiv.org/html/2605.19852#bib.bib91); Li et al\., [2026](https://arxiv.org/html/2605.19852#bib.bib92)\) marks a major milestone in artificial intelligence and has substantially promoted the development of diverse application domains \(Ma et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib93), [2025a](https://arxiv.org/html/2605.19852#bib.bib94), [2025b](https://arxiv.org/html/2605.19852#bib.bib95); Duan et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib96); Yang et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib97); Wang et al\., [2025c](https://arxiv.org/html/2605.19852#bib.bib98)\)\. Early works such as LLaVA \(Liu et al\., [2023](https://arxiv.org/html/2605.19852#bib.bib14), [2024](https://arxiv.org/html/2605.19852#bib.bib63)\), BLIP \(Li et al\., [2022](https://arxiv.org/html/2605.19852#bib.bib64), [2023a](https://arxiv.org/html/2605.19852#bib.bib65)\), and Qwen\-VL \(Bai et al\., [2023](https://arxiv.org/html/2605.19852#bib.bib66); Wang et al\., [2024b](https://arxiv.org/html/2605.19852#bib.bib13); Bai et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib25)\) adopt modular architectures that pair pretrained visual encoders \(*e\.g\.*, CLIP\-ViT \(Cherti et al\., [2023](https://arxiv.org/html/2605.19852#bib.bib68); Radford et al\., [2021](https://arxiv.org/html/2605.19852#bib.bib69)\), InternViT \(Chen et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib70)\)\) with LLMs, laying the foundation for MLLM development\. These models typically involve large\-scale multimodal alignment training followed by instruction tuning for task adaptation\. 
Subsequent studies like Flamingo \(Alayrac et al\., [2022](https://arxiv.org/html/2605.19852#bib.bib71)\) and Cambrian\-1 \(Tong et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib72)\) integrate multiple encoders for richer visual representations, while EVE \(Diao et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib73)\), MonoInternVL \(Luo et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib74)\), and SAIL \(Lei et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib75)\) pursue end\-to\-end architectures that process raw image patches and text tokens within a unified Transformer\. Recently, reinforcement learning has further advanced chain\-of\-thought \(CoT\) reasoning \(Shao et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib39); Guo et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib21); Chen et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib40)\), yet most approaches remain text\-centric \(Fan et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib76); Yao et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib77)\), limiting the model’s understanding of visual content\. To address this, we propose adaptive tool\-assisted zoom\-in reasoning for complex problems, enabling deeper visual exploitation and more interpretable answers\.

### 2\.2 Tool\-Augmented Reasoning in MLLMs

The multimodal information processing capability of MLLMs enables human\-like “Thinking with Images” through multimodal chain of thought \(MCoT\) reasoning \(Zhang et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib31); Wang et al\., [2025a](https://arxiv.org/html/2605.19852#bib.bib32); Su et al\., [2025c](https://arxiv.org/html/2605.19852#bib.bib30); Zheng et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib33); OpenAI, [2025](https://arxiv.org/html/2605.19852#bib.bib29)\)\. Recent works such as Visual Sketchpad \(Hu et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib78)\), OpenThinkIMG \(Su et al\., [2025b](https://arxiv.org/html/2605.19852#bib.bib38)\), and Thyme \(Zhang et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib31)\) equip models with planning and orchestration abilities, leveraging diverse external tools, such as semantic segmentation \(Kirillov et al\., [2023](https://arxiv.org/html/2605.19852#bib.bib79); Ravi et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib80)\), OCR, and depth estimation \(Yang et al\., [2024b](https://arxiv.org/html/2605.19852#bib.bib81), [c](https://arxiv.org/html/2605.19852#bib.bib82)\), to inject rich visual cues into the reasoning process\. Beyond explicit tool usage, methods like BAGEL \(Deng et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib83)\), Visual Planning \(Xu et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib84)\), and GoT \(Fang et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib9)\) unify generation and reasoning, generating new explicit or implicit visual states from contextual semantics to facilitate subsequent reasoning steps\. 
Current approaches for acquiring tool\-use capability typically fall into three categories: prompt\-based methods that rely on in\-context learning \(Hu et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib78); Li et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib57)\), supervised fine\-tuning that teaches procedural competence from examples \(Wu and Xie, [2024](https://arxiv.org/html/2605.19852#bib.bib44); Ma et al\., [2025c](https://arxiv.org/html/2605.19852#bib.bib36)\), and reinforcement learning that optimizes tool\-use policies through feedback \(Su et al\., [2025b](https://arxiv.org/html/2605.19852#bib.bib38); Lai et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib62)\)\. However, existing studies mainly emphasize how to teach models to use tools correctly, neglecting the critical question of whether tool invocation is necessary\. Thus, our method adaptively decides when and how to invoke tools, achieving a balance between reasoning efficiency and reliability\.

### 2\.3 Reinforcement Learning in Large Models

Reinforcement learning has demonstrated remarkable potential in enhancing the reasoning capabilities of large models \(Shao et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib39); Guo et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib21); Chen et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib40)\)\. DeepSeek\-R1 \(Guo et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib21)\) shows that even simple rule\-based RL strategies can effectively induce strong reasoning behaviors, inspiring a surge of research into RL\-based reasoning enhancement\. Building on this trend, recent works such as DeepEyes \(Zheng et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib33)\), TreeVGR \(Wang et al\., [2025a](https://arxiv.org/html/2605.19852#bib.bib32)\), and Thyme \(Zhang et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib31)\) employ Group Relative Policy Optimization \(GRPO\) \(Shao et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib39)\) to guide models in performing accurate tool\-assisted reasoning\. Distinct from these approaches, our method leverages RL not only to reinforce proper tool invocation, but also to explore and coordinate multiple reasoning modes, fostering more adaptive and context\-aware multimodal reasoning\.

## 3 Method

### 3\.1 Problem Formulation and Preliminary

Given a multimodal query $X=(Q,V)$, where $Q$ denotes the textual query and $V$ represents the visual inputs, we first revisit the traditional text\-centric reasoning paradigm\. In this paradigm, the policy model $\pi_{\theta}$ performs reasoning purely in the textual space by generating a sequence of intermediate reasoning steps $R=\{T_i\}_{i=1}^{I}$, where $T_i$ represents the internal reasoning text at the $i$\-th step\. At each step, the reasoning state $R_i \sim \pi_{\theta}(\cdot \mid X, R_1, \ldots, R_{i-1})$ is sampled from $\pi_{\theta}$ conditioned on the initial query and all previous steps\. Each new reasoning step is appended to the context and fed back into the policy model for subsequent reasoning\. This iterative process continues until the model outputs the final answer $Y$, or until a predefined limit on context length is reached\. Accordingly, the complete textual reasoning trajectory $\gamma_t$ can be formulated as:

$$\gamma_t = \{X, (T_1), \ldots, (T_I, Y)\}. \tag{1}$$

Unlike the text\-only paradigm, the multimodal reasoning paradigm augments each step with tool interactions\. Specifically, the reasoning state at the $i$\-th step is represented as a triplet $R_i=(T_i, A_i, O_i)$, where $T_i$ denotes the internal reasoning text, $A_i$ denotes the tool action along with its parameters, and $O_i$ denotes the observation returned by executing the tool action\. The complete multimodal reasoning trajectory $\gamma_m$ alternates between reasoning and interaction, and terminates with a final textual answer $Y$\. Accordingly, the resulting multimodal reasoning trajectory is defined as:

$$\gamma_m = \{X,\; (T_i, A_i, O_i)_{i=1}^{I-1},\; (T_I, Y)\}. \tag{2}$$
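The two trajectory forms in Eqs. (1) and (2) can be sketched as a small data structure. This is only an illustrative sketch: the class names `Step` and `Trajectory`, the string encoding of tool actions, and the example queries are assumptions made here, not part of the paper's released code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One reasoning state R_i (hypothetical container, for illustration only)."""
    text: str                          # T_i: internal reasoning text
    action: Optional[str] = None       # A_i: tool action with parameters (None if no tool)
    observation: Optional[str] = None  # O_i: result of the tool action, e.g. a crop path


@dataclass
class Trajectory:
    """A reasoning trajectory: the query X = (Q, V), the steps, and the answer Y."""
    query: str                         # Q: textual query
    visual: str                        # V: visual input (here just a file name)
    steps: List[Step] = field(default_factory=list)
    answer: Optional[str] = None       # Y: final textual answer

    def is_multimodal(self) -> bool:
        # gamma_m contains at least one (T_i, A_i, O_i) triplet; gamma_t contains none.
        return any(step.action is not None for step in self.steps)


# Text-centric trajectory gamma_t: reasoning text only, then the answer.
gamma_t = Trajectory(
    query="What color is the car?",
    visual="image.png",
    steps=[Step("The car is clearly visible and red.")],
    answer="red",
)

# Multimodal trajectory gamma_m: a tool-augmented step followed by the final step.
gamma_m = Trajectory(
    query="What is written on the small sign?",
    visual="image.png",
    steps=[
        Step("The sign is tiny, so zoom in.",
             action="zoom_in(bbox=[10, 20, 60, 80])",
             observation="crop_0.png"),
        Step("The crop reads 'EXIT'."),
    ],
    answer="EXIT",
)
```

As in Eq. (2), the last step of `gamma_m` carries only reasoning text and is followed by the answer, while the earlier steps carry the tool action and its observation.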

### 3\.2 Overview

Compared with the text\-centric reasoning paradigm, multimodal reasoning extends visual information processing from a one\-time encoding to an iterative editing process through explicit tool invocation\. This paradigm allows the model to step out of the textual bias and effectively leverage multimodal cues for reasoning\. However, indiscriminately encouraging the model to invoke tools leads to two major issues: \(1\) the reasoning cost increases significantly during both training and inference, and \(2\) unnecessary or incorrect tool usage may introduce noisy or misleading information, thereby deteriorating the reasoning reliability\.

We introduce AutoTool to break the conventional assumption that “tools are always beneficial” in multimodal reasoning\. It adaptively decides whether tool invocation is necessary for each task and selects the more suitable reasoning mode, achieving a better balance between reasoning efficiency and answer reliability\. As illustrated in [Figure 2](https://arxiv.org/html/2605.19852#S2.F2), given a user\-provided multimodal query, AutoTool first determines whether the current question requires the assistance of the tool\. If tool usage is deemed necessary, the policy model invokes a zoom\-in function to locate the region of interest that is most relevant to the query, and appends the resulting cropped visual observation to the reasoning context for subsequent inference\. Otherwise, the policy model performs purely textual reasoning to directly produce the final answer in a more efficient manner\. The policy for whether and how to invoke tools is learned through reinforcement learning, as detailed in the following sections\.

### 3\.3 Explicit Dual Reasoning Modes

We define two special control tokens, <tool\_on\> and <tool\_off\>, explicitly indicating whether the model employs tools in subsequent reasoning\. <tool\_on\> triggers tool\-augmented reasoning with <tool\_call\> and <tool\_response\> structures, while <tool\_off\> corresponds to pure textual reasoning without tool usage\. We carefully design the prompts, as detailed in the supplementary material, to explicitly define the applicable scenarios and output formats for both reasoning modes\. The policy model is trained via Group Relative Policy Optimization \(GRPO\) \(Shao et al\., [2024](https://arxiv.org/html/2605.19852#bib.bib39)\), a reinforcement learning algorithm that enables effective and efficient exploration of different reasoning strategies without relying on hard\-to\-obtain SFT data\.

Specifically, given a multimodal query $X=(Q,V)$, we sample a group of $G$ candidate reasoning trajectories $\{o_i\}_{i=1}^{G}$ from the policy model\. For each trajectory $o_i$ from the old policy $\pi_{\theta_{\text{old}}}$, we compute a scalar reward $r_i$ based on both the final answer and the intermediate reasoning process, as detailed in Section [3\.4](https://arxiv.org/html/2605.19852#S3.SS4)\. The rewards $\{r_i\}_{i=1}^{G}$ are then normalized to obtain the advantages $\{\hat{A}_i\}_{i=1}^{G}$\. Formally, the optimization objective of GRPO is defined as:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{X, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( \frac{\pi_{\theta}(o_i \mid X)}{\pi_{\theta_{\text{old}}}(o_i \mid X)} \hat{A}_i,\; \text{clip}\left( \frac{\pi_{\theta}(o_i \mid X)}{\pi_{\theta_{\text{old}}}(o_i \mid X)}, 1-\epsilon, 1+\epsilon \right) \hat{A}_i \right) \right], \tag{3}$$

$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_1, r_2, \ldots, r_G\})}{\text{std}(\{r_1, r_2, \ldots, r_G\})}, \tag{4}$$

where $\epsilon$ is the clipping hyperparameter and we do not include a KL regularization term\.
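To make Eqs. (3) and (4) concrete, the following is a minimal sketch of the group-normalized advantage and the clipped surrogate for a single query. Several details are assumptions made for illustration: the population standard deviation, the small `eps` added to avoid division by zero when all rewards in a group are equal, the use of sequence-level log-probabilities, and the value `clip_eps=0.2`. None of these are specified in the text above.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Eq. (4): normalize rewards within one group of G rollouts.

    `eps` is an assumption added here for numerical safety; Eq. (4) has no such term.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the paper does not say which variant is used
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(
    logp_new: List[float],
    logp_old: List[float],
    advantages: List[float],
    clip_eps: float = 0.2,  # illustrative value for epsilon, not taken from the paper
) -> float:
    """Eq. (3) for a single query: mean over the group of the clipped objective."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_theta(o_i | X) / pi_theta_old(o_i | X)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Four rollouts for one query, two rewarded and two not.
advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
objective = clipped_surrogate([0.0] * 4, [0.0] * 4, advantages)
```

With identical old and new log-probabilities the ratio is 1, so the objective reduces to the mean advantage, which is zero by construction. No KL term appears, matching the statement above.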

Nevertheless, due to the intrinsic reasoning bias inherited from the foundation model, the policy model exhibits a tendency to over\-prefer the <tool\_off\> mode, which yields higher rewards with less effort and consequently hinders adequate exploration of the <tool\_on\> mode\. To encourage sufficient exploration across both reasoning modes, we propose the Adaptive Mode Balancing \(AMB\) strategy that dynamically regulates their respective reward coefficients, ensuring that neither mode is neglected during training\.

For a batch of $N$ samples $\{X_i\}_{i=1}^{N}$, we obtain $N \times G$ rollouts from different reasoning modes\. We record the occurrence counts of the two modes as $N_{\text{on}}$ and $N_{\text{off}}$, respectively, and compute the tool invocation frequency as $F_{\text{on}} = \frac{N_{\text{on}}}{N_{\text{on}} + N_{\text{off}}}$\. Based on the initial tool invocation reward coefficient $\lambda_{\text{tool}}^{\text{base}}$, we dynamically adjust it as

$$\lambda_{\text{tool}}^{\text{mode}} = \begin{cases} \lambda_{\text{tool}}^{\text{base}} + 0.5 - F_{\text{on}}, & \text{if mode} = \text{on}, \\ \lambda_{\text{tool}}^{\text{base}} - 0.5 + F_{\text{on}}, & \text{if mode} = \text{off}, \end{cases} \tag{5}$$

where $\lambda_{\text{tool}}^{\text{mode}}$ denotes the adaptive tool invocation reward coefficient, determined by the reasoning mode of the trajectory\. When the tool invocation frequency becomes too high, $\lambda_{\text{tool}}^{\text{on}}$ decreases while $\lambda_{\text{tool}}^{\text{off}}$ increases, encouraging the model to explore the <tool\_off\> mode more actively, and vice versa\. Through adaptive adjustment, the model is encouraged to sufficiently explore both modes during training\.

As training progresses, the model becomes proficient in both reasoning modes\. At the final stage of training \(*e\.g\.*, the last 20 steps\), we remove this adaptive constraint and set $\lambda_{\text{tool}}^{\text{on}} = \lambda_{\text{tool}}^{\text{off}} = \lambda_{\text{tool}}^{\text{base}}$, allowing the policy to autonomously determine which reasoning mode to employ for each query based on its internal confidence and problem characteristics\. This transition enables the model to shift from guided exploration to self\-directed reasoning, achieving a more natural integration of both reasoning paradigms\.
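The balancing rule in Eq. (5) and the late-stage relaxation can be sketched together as one function. The function name, the `step`/`total_steps` bookkeeping, and the fallback for an empty batch are assumptions made here for illustration; the base coefficient of 1.0 in the example is likewise a placeholder, since its value is not stated in this section.

```python
from typing import Dict


def amb_coefficients(
    n_on: int,
    n_off: int,
    lam_base: float,
    step: int,
    total_steps: int,
    free_steps: int = 20,  # "the last 20 steps" is the example given in the text
) -> Dict[str, float]:
    """Return the tool reward coefficient for each mode.

    During guided exploration the coefficients follow Eq. (5); in the final
    `free_steps` steps both fall back to the base coefficient.
    """
    in_free_stage = step >= total_steps - free_steps
    if in_free_stage or (n_on + n_off) == 0:
        # Empty-batch fallback is an assumption; Eq. (5) is undefined in that case.
        return {"on": lam_base, "off": lam_base}

    f_on = n_on / (n_on + n_off)  # tool invocation frequency F_on
    return {
        "on": lam_base + 0.5 - f_on,   # shrinks when tools are over-used
        "off": lam_base - 0.5 + f_on,  # grows when tools are over-used
    }


# 30 of 40 rollouts chose <tool_on>, so the <tool_off> mode is up-weighted.
early = amb_coefficients(n_on=30, n_off=10, lam_base=1.0, step=0, total_steps=100)
# Inside the last 20 steps the constraint is removed.
late = amb_coefficients(n_on=30, n_off=10, lam_base=1.0, step=85, total_steps=100)
```

When the two modes are perfectly balanced ($F_{\text{on}} = 0.5$), both coefficients equal the base value, so the adjustment only acts when one mode dominates the batch.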

### 3\.4 Mode\-Specific Policy Optimization

To encourage the model to explore different reasoning modes through reinforcement learning while ensuring that it correctly follows the required output formats and performs valid tool invocations for accurate question answering, we design the following reward\.

The overall reward consists of three components: accuracy reward $R_{\text{acc}}$, format compliance reward $R_{\text{format}}$, and mode\-specific tool invocation reward $R_{\text{tool}}$,

$$R = R_{\text{acc}} + R_{\text{format}} + \lambda_{\text{tool}}^{\text{mode}} R_{\text{tool}}. \tag{6}$$

Accuracy reward $R_{\text{acc}}$: We evaluate whether the predicted answer is semantically equivalent to the ground truth using a combination of rule\-based metrics and an online reward model \(*e\.g\.*, Qwen2\.5\-72B\-Instruct\)\.

Format reward $R_{\text{format}}$: This ensures that the reasoning process and final answer adhere to the prescribed output format, *i\.e\.*, enclosed within <think\></think\> and <answer\></answer\> tags, respectively\.

Mode\-specific tool reward $R_{\text{tool}}$: For the <tool\_on\> mode, the model receives $R_{\text{tool}}=1$ when it correctly performs the zoom\-in tool invocations and produces a correct answer\. If the tool is invoked but the answer is incorrect, a penalty $R_{\text{tool}}=-0.5$ is applied to account for the extra cost of tool usage\. In all other cases, $R_{\text{tool}}=0$\. For the <tool\_off\> mode, the model is rewarded $R_{\text{tool}}=1$ only if it does not invoke the zoom\-in tool and provides a correct answer; otherwise, $R_{\text{tool}}=0$\.
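The reward rules above can be written out as a short sketch. The boolean flags `tool_invoked`, `tool_valid`, and `answer_correct` are an interpretation introduced here ("correctly performs the zoom-in tool invocations" is read as a well-formed, executable call), and the accuracy and format rewards are passed in as numbers because their scales are not given in this section.

```python
def tool_reward(mode: str, tool_invoked: bool, tool_valid: bool, answer_correct: bool) -> float:
    """Mode-specific tool reward R_tool, following the case analysis in the text."""
    if mode == "on":
        if tool_invoked and tool_valid and answer_correct:
            return 1.0   # correct tool use and correct answer
        if tool_invoked and not answer_correct:
            return -0.5  # tool cost was paid but the answer is wrong
        return 0.0       # all other cases
    if mode == "off":
        # Rewarded only when no tool is invoked and the answer is correct.
        return 1.0 if (not tool_invoked and answer_correct) else 0.0
    raise ValueError(f"unknown mode: {mode!r}")


def total_reward(r_acc: float, r_format: float, lam_tool: float, r_tool: float) -> float:
    """Eq. (6): R = R_acc + R_format + lambda_tool^mode * R_tool."""
    return r_acc + r_format + lam_tool * r_tool


# A <tool_on> rollout with a valid zoom-in call and a correct answer,
# weighted by an illustrative coefficient of 0.75.
r_tool = tool_reward("on", tool_invoked=True, tool_valid=True, answer_correct=True)
reward = total_reward(r_acc=1.0, r_format=1.0, lam_tool=0.75, r_tool=r_tool)
```

Because the penalty applies only in the <tool\_on\> mode, a wrong answer costs more when a tool was invoked than when it was not, which is what discourages ineffective tool calls.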

### 3\.5 Inference

During inference, we employ the same prompting scheme as used in training\. The model can autonomously select the reasoning mode based on the characteristics of the query\. Alternatively, the reasoning mode can be manually specified, either by explicitly instructing the model in the prompt to perform or skip tool invocation, or by appending the special token <tool\_on\> or <tool\_off\> to the input sequence\.
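Forcing a mode by appending a control token can be sketched as plain string handling. The helper names `build_input` and `detect_mode` are hypothetical, and the sketch assumes the control tokens appear literally in the text; the actual prompt template is described in the paper's supplementary material and is not reproduced here.

```python
from typing import Optional

TOOL_ON = "<tool_on>"
TOOL_OFF = "<tool_off>"


def build_input(prompt: str, mode: Optional[str] = None) -> str:
    """Append a control token to force a reasoning mode; None leaves the choice to the model."""
    if mode is None:
        return prompt
    if mode == "on":
        return prompt + TOOL_ON
    if mode == "off":
        return prompt + TOOL_OFF
    raise ValueError(f"unknown mode: {mode!r}")


def detect_mode(generated: str) -> Optional[str]:
    """Report which control token appears first in generated text (None if neither appears)."""
    on_pos = generated.find(TOOL_ON)
    off_pos = generated.find(TOOL_OFF)
    if on_pos == -1 and off_pos == -1:
        return None
    if off_pos == -1 or (on_pos != -1 and on_pos < off_pos):
        return "on"
    return "off"


forced = build_input("Where is the person relative to the car?", mode="off")
chosen = detect_mode("<tool_off><think>The layout is clear.</think><answer>left</answer>")
```

The same `detect_mode` helper could also be used to count $N_{\text{on}}$ and $N_{\text{off}}$ over a batch of rollouts during training, under the same assumption about how the tokens appear.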

## 4 Experiments

Table 1: Performance comparison of different models on perception benchmarks\. For models with similar sizes, the best performance for each metric is marked as bold, and the second\-best is underlined\.

Table 2: Performance comparison of different models on grounding benchmarks\. The best performance for each metric is marked as bold\.

| Model | Size | refCOCO test | refCOCO testA | refCOCO testB | refCOCO val | refCOCOg test | refCOCOg val | refCOCO+ testA | refCOCO+ testB | refCOCO+ val | ReasonSeg test | ReasonSeg val |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Bai et al., [2025](https://arxiv.org/html/2605.19852#bib.bib25)) | 7B | 84.7 | 86.6 | 78.1 | 83.4 | 77.0 | 76.6 | 82.1 | 68.5 | 76.3 | 51.1 | 59.5 |
| DeepEyes (Zheng et al., [2025](https://arxiv.org/html/2605.19852#bib.bib33)) | 7B | 86.0 | 90.5 | 79.6 | 86.1 | 80.3 | 80.4 | 87.2 | 67.8 | 79.2 | 50.6 | 61.5 |
| AutoTool | 7B | 88.5 | 92.5 | 83.1 | 88.6 | 82.8 | 82.7 | 89.7 | 72.6 | 81.6 | 53.3 | 63.0 |
| Δ vs. Qwen2.5-VL-7B | - | ↑3.8 | ↑5.9 | ↑5.0 | ↑5.2 | ↑5.8 | ↑6.1 | ↑7.6 | ↑4.1 | ↑5.3 | ↑2.2 | ↑3.5 |

Table 3: Performance comparison on hallucination and reasoning benchmarks. The best performance for each metric is marked as bold.

| Model | Size | POPE Adversarial | POPE Popular | POPE Random | POPE Overall | MathVista | MathVerse | MathVision test | MathVision testmini | WeMath | DynaMath | LogicVista |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B (Bai et al., [2025](https://arxiv.org/html/2605.19852#bib.bib25)) | 7B | 85.9 | 87.0 | 88.9 | 87.2 | 70.6 | 43.6 | 14.8 | 16.1 | 30.8 | 57.2 | 45.5 |
| InternVL3-8B (Zhu et al., [2025](https://arxiv.org/html/2605.19852#bib.bib61)) | 8B | 87.2 | 88.1 | 90.8 | 88.7 | 68.3 | 47.8 | 17.4 | 16.8 | 37.9 | 57.8 | 46.0 |
| LLaVA-OneVision (Li et al., [2024a](https://arxiv.org/html/2605.19852#bib.bib60)) | 7B | 84.7 | 87.6 | 90.4 | 87.6 | 58.4 | 34.5 | 9.6 | 12.5 | 37.5 | 35.4 | 28.3 |
| DeepEyes (Zheng et al., [2025](https://arxiv.org/html/2605.19852#bib.bib33)) | 7B | 81.4 | 85.0 | 91.7 | 86.0 | 71.6 | 45.2 | 15.1 | 19.4 | 32.6 | 57.7 | 45.3 |
| AutoTool | 7B | 86.1 | 88.4 | 92.3 | 88.9 | 72.8 | 45.9 | 15.0 | 19.4 | 34.0 | 58.0 | 46.7 |
| $\Delta$ vs. Qwen2.5-VL-7B | - | ↑0.2 | ↑1.4 | ↑3.4 | ↑1.7 | ↑2.2 | ↑2.3 | ↑0.2 | ↑3.3 | ↑3.2 | ↑0.8 | ↑1.2 |

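Table 4: Ablation experiments of each module. The best performance for each metric is marked as bold.

| ID | Tool on | Tool off | $\text{MSPO}_{\text{penalty}}$ | $\text{AMB}_{\text{free}}$ | HRbench-4K FSP | HRbench-4K FCP | HRbench-4K Overall | HRbench-8K FSP | HRbench-8K FCP | HRbench-8K Overall | V\* Attribute | V\* Spatial | V\* Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 |  | ✓ |  |  | 88.0 | 59.3 | 73.6 | 81.8 | 58.8 | 70.2 | 87.8 | 81.6 | 85.3 |
| 2 | ✓ |  |  |  | 92.0 | 57.8 | 74.9 | 85.5 | 57.5 | 71.5 | 90.4 | 82.9 | 87.4 |
| 3 | ✓ | ✓ |  |  | 91.8 | 58.8 | 75.3 | 86.3 | 58.5 | 72.4 | 88.7 | 88.2 | 88.5 |
| 4 | ✓ | ✓ | ✓ |  | 93.3 | 58.3 | 75.8 | 87.3 | 59.3 | 73.3 | 89.6 | 88.2 | 89.0 |
| 5 | ✓ | ✓ |  | ✓ | 92.8 | 60.8 | 76.8 | 85.0 | 61.5 | 73.2 | 89.6 | 89.5 | 89.5 |
| 6 | ✓ | ✓ | ✓ | ✓ | 92.5 | 61.3 | 76.9 | 88.0 | 60.0 | 74.0 | 91.3 | 88.2 | 90.1 |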

Table 5: Effect of removing the mode-balancing constraint at different training steps. The best performance is marked as bold.

Table 6: Effect of the initial tool invocation reward coefficient $\lambda_{\text{tool}}^{\text{base}}$. The best performance is marked as bold.

Table 7: Sensitivity analysis of the efficiency penalty term.

Table 8: Training and inference efficiency comparison between AutoTool and DeepEyes.

### 4.1 Benchmarks and Metrics

We evaluate our model across three categories of benchmarks to comprehensively assess its performance and compare it with existing methods\.

Perception benchmarks. These include the V\* (Wu and Xie, [2024](https://arxiv.org/html/2605.19852#bib.bib44)) and HRbench (Wang et al., [2025b](https://arxiv.org/html/2605.19852#bib.bib47)) datasets, which consist of high-resolution images (ranging from 2K to 8K). The questions in these datasets focus mainly on single-object attributes, object counting, or relative spatial relationships. The evaluation metric is question-answering accuracy.

Grounding benchmarks. This category includes RefCOCO (Caesar et al., [2018](https://arxiv.org/html/2605.19852#bib.bib48)), RefCOCO+ (Caesar et al., [2018](https://arxiv.org/html/2605.19852#bib.bib48)), RefCOCOg (Kazemzadeh et al., [2014](https://arxiv.org/html/2605.19852#bib.bib49)), and ReasonSeg (Lai et al., [2024](https://arxiv.org/html/2605.19852#bib.bib50)). Both the COCO series and ReasonSeg require the model to output the bounding box of the referred object within an image. We evaluate grounding accuracy by computing the Intersection-over-Union (IoU) between the predicted and ground-truth regions, using a threshold of 0.5 to decide whether a prediction counts as correct.
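The IoU-based grounding metric can be sketched as follows. This is a minimal illustration, not the paper's evaluation script: boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixel coordinates, the function names are hypothetical, and treating an IoU exactly equal to the threshold as correct is an assumption.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(preds: Sequence[Box], gts: Sequence[Box],
                       threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the ground truth reaches the threshold."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equally long")
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

For example, a prediction shifted by half the box side in both directions overlaps the target on only a quarter of its area, which gives an IoU of 1/7 and counts as a miss.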

Hallucination benchmark. POPE (Li et al., [2023b](https://arxiv.org/html/2605.19852#bib.bib51)) serves as a hallucination detection benchmark that evaluates whether the target object truly exists in the image, and its metric is the prediction accuracy.

Reasoning benchmarks. These include MathVista (Lu et al., [2023](https://arxiv.org/html/2605.19852#bib.bib15)), MathVerse (Zhang et al., [2024](https://arxiv.org/html/2605.19852#bib.bib52)), MathVision (Wang et al., [2024a](https://arxiv.org/html/2605.19852#bib.bib53)), WeMath (Qiao et al., [2024](https://arxiv.org/html/2605.19852#bib.bib54)), DynaMath (Zou et al., [2024](https://arxiv.org/html/2605.19852#bib.bib55)), and LogicVista (Xiao et al., [2024](https://arxiv.org/html/2605.19852#bib.bib56)). The tasks cover a wide range of reasoning types, including mathematical reasoning, geometric pattern recognition, logical and physical reasoning, chart interpretation, and commonsense reasoning in real-world scenarios. Some questions require the model to infer implicit information from the given text or image context. The performance metric is the accuracy of the answer.

### 4.2 Implementation Details

Following DeepEyes (Zheng et al., [2025](https://arxiv.org/html/2605.19852#bib.bib33)), the training data include fine-grained samples from the V\* (Wu and Xie, [2024](https://arxiv.org/html/2605.19852#bib.bib44)) dataset, chart data from ArxivQA (Li et al., [2024b](https://arxiv.org/html/2605.19852#bib.bib45)), and reasoning data from ThinkLite-VL (Wang et al., [2025d](https://arxiv.org/html/2605.19852#bib.bib46)). The reasoning data, for which purely textual reasoning and answer generation are performed without relying on tool-based interactions, are included to enhance the general reasoning robustness of the model and mitigate overfitting to modality-specific patterns. We use Qwen2.5-VL-7B (Bai et al., [2025](https://arxiv.org/html/2605.19852#bib.bib25)) as the base policy model and train it with GRPO (Shao et al., [2024](https://arxiv.org/html/2605.19852#bib.bib39); Sheng et al., [2024](https://arxiv.org/html/2605.19852#bib.bib41)) for 80 iterations on eight H200 GPUs. An additional two H200 GPUs are used to deploy the reward model, Qwen2.5-72B-Instruct (Yang et al., [2024a](https://arxiv.org/html/2605.19852#bib.bib43)), via the vLLM (Kwon et al., [2023](https://arxiv.org/html/2605.19852#bib.bib42)) inference engine. Each training batch contains 256 samples, which are divided into 4 PPO mini-batches. For each query, the model generates 16 rollouts. The initial tool invocation reward coefficient $\lambda_{\text{tool}}^{\text{base}}$ is set to 1.2, and the clipping parameter $\epsilon$ is set to 0.2. We adopt the AdamW optimizer with an initial learning rate of $1\times 10^{-6}$. The maximum response length of the policy model is set to 20,480 tokens.
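For quick reference, the stated hyperparameters can be collected in a small configuration object. The sketch below is illustrative only: the class and field names are hypothetical and do not correspond to any particular training framework, and the derived counts assume that each of the 256 samples in a batch is one query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as stated in the text (field names are assumptions)."""
    iterations: int = 80
    batch_size: int = 256            # samples per training batch
    ppo_mini_batches: int = 4        # mini-batches per batch
    rollouts_per_query: int = 16
    lambda_tool_base: float = 1.2    # initial tool invocation reward coefficient
    clip_epsilon: float = 0.2        # clipping parameter
    learning_rate: float = 1e-6      # AdamW initial learning rate
    max_response_tokens: int = 20480


cfg = TrainConfig()

# Derived quantities, assuming one sample corresponds to one query.
samples_per_mini_batch = cfg.batch_size // cfg.ppo_mini_batches
rollouts_per_iteration = cfg.batch_size * cfg.rollouts_per_query
print(samples_per_mini_batch, rollouts_per_iteration)
```

Under that assumption, each iteration generates 4,096 rollouts and each PPO mini-batch covers 64 samples.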

### 4.3 Main Results

Perception benchmarks. [Table 1](https://arxiv.org/html/2605.19852#S4.T1) presents the comparison results of our AutoTool with existing models on perception benchmarks. All models first generate internal reasoning before producing a final answer. Visual grounding reasoning models rely on their respective system prompts to trigger tool usage, whereas AutoTool adaptively decides whether to invoke tools using the same system prompt as during training. Across the majority of splits in both datasets, AutoTool achieves state-of-the-art performance, significantly surpassing both proprietary and open-source general MLLMs. Remarkably, AutoTool still maintains a clear advantage even over much larger models such as Qwen2.5-VL-32B and InternVL3-38B. Compared with models that also rely on visual grounding–based reasoning, our approach breaks away from a single reasoning paradigm, effectively leveraging the advantages of localized reasoning after accurate grounding, while avoiding redundant or misleading information introduced by unnecessary or incorrect localization. Compared with the base model Qwen2.5-VL-7B, our training paradigm leads to substantial improvements on perception tasks, achieving 21% and 11% accuracy gains on HRbench and V\* datasets, respectively.

Grounding benchmarks. As shown in [Table 2](https://arxiv.org/html/2605.19852#S4.T2), AutoTool consistently outperforms models of comparable size across all splits of the four datasets. This improvement stems from our training design: in the <tool\_on\> mode, trajectories that correctly invoke tools and produce accurate answers are rewarded, which encourages the model to precisely localize the region of interest. Conversely, trajectories that invoke tools but yield incorrect answers are penalized, reducing the likelihood of erroneous localizations. In contrast, DeepEyes may still reward trajectories where tool-based localization is incorrect but the final answer happens to be correct. Our introduction of the <tool\_off\> mode mitigates this issue by encouraging reasoning without relying on potentially misleading tool-based cues. All models are evaluated using the same prompt, and the detailed prompt specifications are provided in the supplementary material.

Hallucination and reasoning benchmarks. As illustrated in [Table 3](https://arxiv.org/html/2605.19852#S4.T3), our model demonstrates improved performance in reducing hallucinations. The adaptive tool invocation capability is also effective for hallucination tasks: when determining whether a target object is present in the image, AutoTool carefully inspects similar objects or potential regions in the <tool\_on\> mode. Our model also maintains robust reasoning capabilities and achieves excellent performance across six benchmarks encompassing a diverse range of reasoning tasks. Consistent with the perception benchmarks, all models are evaluated under the same prompt setting and first generate internal reasoning before producing the final answer, with DeepEyes and AutoTool additionally leveraging tool invocation.

We further showcase visual examples on the test benchmarks, as detailed in the supplementary material\.

### 4.4 Ablation and Analysis

The influence of each module. As shown in [Table 4](https://arxiv.org/html/2605.19852#S4.T4), Tool on denotes the reasoning process assisted by the zoom-in tool, while Tool off represents pure text-based reasoning. $\text{MSPO}_{\text{penalty}}$ refers to the negative reward $R_{\text{tool}}=-0.5$ applied when the model invokes a tool but produces an incorrect answer. $\text{AMB}_{\text{free}}$ indicates that the AMB constraint is removed in the later training stage, allowing the model to freely explore dual reasoning modes.

In the #1 setting, Qwen2.5-VL-7B is trained on the training dataset via pure-text GRPO, which substantially improves performance over the base model. In #2, DeepEyes always employs the zoom-in tool for every query, leading to further improvement compared with the #1 setting. In #3, a carefully designed prompt with RL training is adopted under mode-ratio constraints, guiding the model to fully explore both modes. Compared with the #2 setting, this flexible reasoning mode mitigates the negative impact of incorrect tool usage. In the #4 setting, a penalty is introduced when the model invokes a tool but produces an incorrect answer, enforcing more precise grounding behavior. In #5, the mode-balancing constraint is removed in the later training stage, encouraging free exploration and yielding further performance gains. Finally, the #6 setting integrates all these advantageous components and achieves the best overall performance.

Effect of mode-balancing removal step. [Table 5](https://arxiv.org/html/2605.19852#S4.T5) reports the impact of removing the mode-balancing constraint at different training steps. We observe that disabling the AMB constraint from the beginning (*i.e.*, step 0) leads to a premature dominance of the <tool\_off\> mode, resulting in inferior performance on fine-grained perception tasks. As training progresses, the model benefits from maintaining the constraint for a sufficient period, which promotes balanced exploration between the two reasoning modes. The best overall results on the HRbench-4K, HRbench-8K, and V\* benchmarks are achieved when the constraint is removed at around 60 iterations. Further delaying the removal (*e.g.*, step 70) yields a slight performance decline, likely because the model becomes overly constrained and less adaptive to problem-specific reasoning strategies in later stages.

Effect of the coefficient $\lambda_{\text{tool}}^{\text{base}}$. We further analyze the sensitivity of the initial tool invocation reward coefficient $\lambda_{\text{tool}}^{\text{base}}$, including both moderate and extreme settings. The results are shown in [Table 6](https://arxiv.org/html/2605.19852#S4.T6). The model achieves stable performance around the default value ($\lambda_{\text{tool}}^{\text{base}}=1.2$) and remains robust within a moderate range ($0.5 \sim 3.0$), suggesting that AMB is not sensitive to precise hyperparameter tuning.

However, extreme values lead to clear performance degradation due to reward imbalance. When $\lambda_{\text{tool}}^{\text{base}}$ is too small, the contribution of the tool reward becomes negligible, weakening supervision on tool-usage behavior and causing reward hacking. For example, when $\lambda_{\text{tool}}^{\text{base}}=0.0$, the model frequently selects the <tool\_on\> mode while performing pure text reasoning without valid tool calls, exploiting $R_{\text{tool}}^{\text{on}}=0$ while benefiting from a larger $\lambda_{\text{tool}}^{\text{off}}$. In contrast, excessively large values reduce the relative influence of the task reward and bias the policy toward the foundation model preference, resulting in collapse to the <tool\_off\> mode. We additionally observe degraded adherence to the required reasoning format under such settings. Overall, the default coefficient provides a good balance between adaptive tool usage and reasoning quality.

![Refer to caption](https://arxiv.org/html/2605.19852v1/x4.png)

Figure 3: The outer ring shows the proportion of the dual reasoning modes on three datasets, while the inner ring presents their distribution across different splits within each dataset.

Sensitivity to the efficiency penalty. The efficiency penalty in the mode-specific reward is introduced to discourage unnecessary tool usage, particularly when tool invocation leads to incorrect answers. Specifically, for the <tool\_on\> mode, trajectories with incorrect answers after tool invocation receive a negative reward. To evaluate the sensitivity of this design, we vary the efficiency penalty term within $\{0, -0.2, -0.5, -0.8\}$ while keeping all other hyperparameters unchanged. The results are shown in [Table 7](https://arxiv.org/html/2605.19852#S4.T7). Empirically, the performance varies only marginally across different penalty values, indicating that the proposed reward design maintains a robust operating range and does not require careful dataset-specific tuning.

Time efficiency and tool mode analysis. As shown in [Table 8](https://arxiv.org/html/2605.19852#S4.T8), we report the training and inference time costs of existing visual grounding–based reasoning models such as DeepEyes and our AutoTool. Under the same data and number of training iterations, our method reduces the total training time by approximately 9 hours. The inference time across all three datasets is also significantly shortened, with a 44.9% reduction observed on the POPE dataset.

In addition, we analyze the occurrence frequency of the dual reasoning modes across different benchmarks. As illustrated in [Figure 3](https://arxiv.org/html/2605.19852#S4.F3), the proportion of these two modes is not fixed as in the training stage but rather dynamically varies depending on the characteristics of the dataset. For high-resolution datasets such as HRbench and V\*, where target objects often occupy a small region of the image, the <tool\_on\> mode appears more frequently. In contrast, POPE contains relatively smaller images with larger target objects, leading to a notably higher proportion of <tool\_off\> mode during inference. The ratio of the dual reasoning modes during training is illustrated in [Figure 1](https://arxiv.org/html/2605.19852#S1.F1)(c). In the early and middle stages of training, we adaptively control the reward factor to encourage sufficient exploration of both reasoning modes, resulting in a roughly balanced distribution of about 50% for each. In the later stage, we remove this constraint to allow the model to freely choose its preferred reasoning strategy, where a slight increase in the proportion of the <tool\_on\> mode can be observed.
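The mode proportions discussed above amount to a simple count over the mode chosen for each query. The sketch below assumes the selected mode of every response has already been extracted as the string `"tool_on"` or `"tool_off"`; the function name and input format are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

MODES = ("tool_on", "tool_off")


def mode_proportions(selected_modes: Iterable[str]) -> Dict[str, float]:
    """Share of responses that used each reasoning mode on one dataset."""
    counts = Counter(selected_modes)
    unknown = set(counts) - set(MODES)
    if unknown:
        raise ValueError(f"unexpected mode labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no responses given")
    return {mode: counts[mode] / total for mode in MODES}
```

Applying such a count per dataset and per split would yield the kind of outer-ring and inner-ring breakdown shown in Figure 3.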

## 5 Conclusion

In this work, we reveal that tool\-augmented reasoning is not always beneficial for MLLMs\. To address this limitation, we propose AutoTool, a model that dynamically determines whether to invoke zoom\-in tools based on the characteristics of each query\. This design significantly improves both training and inference efficiency while mitigating the adverse effects of unnecessary or incorrect tool usage\. Based on the reinforcement learning framework, our approach optimizes dual reasoning modes with carefully designed reward functions and guides the model to fully explore both\. Extensive experiments on various benchmarks demonstrate that AutoTool achieves superior reasoning capability and efficiency compared to existing models\.

## Acknowledgements

This work was supported by NSFC Project \(62536005, 62192783, 62506162\), Jiangsu Science and Technology Project \(BG2024031, BK20251241\), Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China \(No\. JYB2025XDXM118\), “111 Center” \(No\. B26023\), and Fundamental Research Funds for the Central Universities \(KG202508\)\.

## Impact Statement

This work studies adaptive tool invocation for multimodal large language models \(MLLMs\)\. By enabling models to selectively determine whether external tool assistance is necessary, our method improves reasoning efficiency while reducing redundant computation and unnecessary tool interactions\. Our approach also reduces hallucinations caused by inappropriate tool usage, potentially improving the reliability of multimodal reasoning systems\. However, the proposed method does not eliminate risks associated with MLLMs, such as incorrect reasoning or failures in complex scenarios\. Careful evaluation and appropriate human oversight remain important for real\-world deployment\.

## References

- J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- J. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. (2022) Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, pp. 23716–23736.
- J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou (2023) Qwen-vl: a versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966.
- S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. (2025) Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923.
- H. Caesar, J. Uijlings, and V. Ferrari (2018) Coco-stuff: thing and stuff classes in context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1209–1218.
- H. Chen, H. Tu, F. Wang, H. Liu, X. Tang, X. Du, Y. Zhou, and C. Xie (2025) Sft or rl? an early investigation into training r1-like reasoning large vision-language models. arXiv preprint arXiv:2504.11468.
- Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al. (2024) Internvl: scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24185–24198.
- M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev (2023) Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2818–2829.
- C. Deng, D. Zhu, K. Li, C. Gou, F. Li, Z. Wang, S. Zhong, W. Yu, X. Nie, Z. Song, et al. (2025) Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683.
- H. Diao, Y. Cui, X. Li, Y. Wang, H. Lu, and X. Wang (2024) Unveiling encoder-free vision-language models. Advances in Neural Information Processing Systems 37, pp. 52545–52567.
- Y. Duan, T. Chen, L. Qi, and Y. Shi (2025) Divide-and-conquer for enhancing unlabeled learning, stability, and plasticity in semi-supervised continual learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 583–593.
- A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. (2024) The llama 3 herd of models. arXiv e-prints, pp. arXiv–2407.
- Z. Fa, Y. Duan, J. Zhang, L. Qi, and Y. Shi (2026) One token, two fates: a unified framework via vision token manipulation against mllms hallucination. arXiv preprint arXiv:2603.10360.
- K. Fan, K. Feng, H. Lyu, D. Zhou, and X. Yue (2025) SophiaVL-r1: reinforcing mllms reasoning with thinking reward. arXiv preprint arXiv:2505.17018.
- W. Fan, T. Rahman, and L. Sigal (2024) MMFactory: a universal solution search engine for vision-language tasks. arXiv preprint arXiv:2412.18072.
- R. Fang, C. Duan, K. Wang, L. Huang, H. Li, S. Yan, H. Tian, X. Zeng, R. Zhao, J. Dai, et al. (2025) Got: unleashing reasoning capability of multimodal large language model for visual generation and editing. arXiv preprint arXiv:2503.10639.
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- J. Hong, C. Zhao, C. Zhu, W. Lu, G. Xu, and X. Yu (2025) DeepEyesV2: toward agentic multimodal model. arXiv preprint arXiv:2511.05271.
- Y. Hu, W. Shi, X. Fu, D. Roth, M. Ostendorf, L. Zettlemoyer, N. A. Smith, and R. Krishna (2024) Visual sketchpad: sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems 37, pp. 139348–139379.
- A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. (2024) Gpt-4o system card. arXiv preprint arXiv:2410.21276.
- S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg (2014) Referitgame: referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 787–798.
- A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W. Lo, et al. (2023) Segment anything. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4015–4026.
- T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa (2022) Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems 35, pp. 22199–22213.
- M. Komeili, K. Shuster, and J. Weston (2021) Internet-augmented dialogue generation. arXiv preprint arXiv:2107.07566.
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pp. 611–626.
- X. Lai, J. Li, W. Li, T. Liu, T. Li, and H. Zhao (2025) Mini-o3: scaling up reasoning patterns and interaction turns for visual search. arXiv preprint arXiv:2509.07969.
- X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia (2024) Lisa: reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9579–9589.
- W. Lei, J. Wang, H. Wang, X. Li, J. H. Liew, J. Feng, and Z. Huang (2025) The scalability of simplicity: empirical analysis of vision-language learning with a single transformer. arXiv preprint arXiv:2504.10462.
- B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, et al. (2024a) Llava-onevision: easy visual task transfer. arXiv preprint arXiv:2408.03326.
- G. Li, J. Xu, Y. Zhao, and Y. Peng (2025) Dyfo: a training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9098–9108.
- J. Li, D. Li, S. Savarese, and S. Hoi (2023a) Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pp. 19730–19742.
- J. Li, D. Li, C. Xiong, and S. Hoi (2022) Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp. 12888–12900.
- L. Li, Y. Wang, R. Xu, P. Wang, X. Feng, L. Kong, and Q. Liu (2024b) Multimodal arxiv: a dataset for improving scientific comprehension of large vision-language models. arXiv preprint arXiv:2403.00231.
- Y. Li, Z. Peng, J. Zhang, J. Guo, Y. Duan, and Y. Shi (2026) When shared knowledge hurts: spectral over-accumulation in model merging. arXiv preprint arXiv:2602.05536.
- Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J. Wen (2023b) Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355.
- H. Liu, C. Li, Y. Li, and Y. J. Lee (2024) Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306.
- H. Liu, C. Li, Q. Wu, and Y. J. Lee (2023) Visual instruction tuning. Advances in Neural Information Processing Systems 36, pp. 34892–34916.
- Y. Liu, B. Peng, Z. Zhong, Z. Yue, F. Lu, B. Yu, and J. Jia (2025) Seg-zero: reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520.
- P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K. Chang, M. Galley, and J. Gao (2023) Mathvista: evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255.
- G. Luo, X. Yang, W. Dou, Z. Wang, J. Liu, J. Dai, Y. Qiao, and X. Zhu (2025) Mono-internvl: pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24960–24971.
- Q\. Ma, J\. Zhang, Z\. Li, L\. Qi, Q\. Yu, and Y\. Shi \(2025a\)Steady progress beats stagnation: mutual aid of foundation and conventional models in mixed domain semi\-supervised medical image segmentation\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 5175–5185\.Cited by:[§2\.1](https://arxiv.org/html/2605.19852#S2.SS1.p1.1)\.
- Q\. Ma, J\. Zhang, L\. Qi, Q\. Yu, Y\. Shi, and Y\. Gao \(2024\)Constructing and exploring intermediate domains in mixed domain semi\-supervised medical image segmentation\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 11642–11651\.Cited by:[§2\.1](https://arxiv.org/html/2605.19852#S2.SS1.p1.1)\.
- Q\. Ma, J\. Zhang, L\. Qi, Q\. Yu, Y\. Shi, and Y\. Gao \(2025b\)Unleashing the power of intermediate domains for mixed domain semi\-supervised medical image segmentation\.IEEE Transactions on Medical Imaging\.Cited by:[§2\.1](https://arxiv.org/html/2605.19852#S2.SS1.p1.1)\.
- Z\. Ma, J\. Zhang, Z\. Liu, J\. Zhang, J\. Tan, M\. Shu, J\. C\. Niebles, S\. Heinecke, H\. Wang, C\. Xiong,et al\.\(2025c\)TACO: learning multi\-modal models to reason and act with synthetic chains\-of\-thought\-and\-action\.InWorkshop on Reasoning and Planning for Large Language Models,Cited by:[§1](https://arxiv.org/html/2605.19852#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.19852#S2.SS2.p1.1)\.
- A\. Meta \(2024\)Llama 3\.2: revolutionizing edge ai and vision with open, customizable models\.Meta AI Blog\. Retrieved December 20, 2024\.Cited by:[Appendix F](https://arxiv.org/html/2605.19852#A6.p1.1)\.
- OpenAI \(2025\)Thinking with images\.Note:[https://openai\.com/index/thinking\-with\-images/](https://openai.com/index/thinking-with-images/)Accessed: 2025\-10\-26Cited by:[§1](https://arxiv.org/html/2605.19852#S1.p1.1),[§2\.2](https://arxiv.org/html/2605.19852#S2.SS2.p1.1),[Table 1](https://arxiv.org/html/2605.19852#S4.T1.2.2.2.1)\.
- J\. Qi, M\. Ding, W\. Wang, Y\. Bai, Q\. Lv, W\. Hong, B\. Xu, L\. Hou, J\. Li, Y\. Dong,et al\.\(2024\)Cogcom: train large vision\-language models diving into details through chain of manipulations\.Cited by:[§1](https://arxiv.org/html/2605.19852#S1.p2.1)\.
- R\. Qiao, Q\. Tan, G\. Dong, M\. Wu, C\. Sun, X\. Song, Z\. GongQue, S\. Lei, Z\. Wei, M\. Zhang,et al\.\(2024\)We\-math: does your large multimodal model achieve human\-like mathematical reasoning?\.arXiv preprint arXiv:2407\.01284\.Cited by:[§4\.1](https://arxiv.org/html/2605.19852#S4.SS1.p5.1)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark,et al\.\(2021\)Learning transferable visual models from natural language supervision\.InInternational Conference on Machine Learning,pp\. 8748–8763\.Cited by:[§2\.1](https://arxiv.org/html/2605.19852#S2.SS1.p1.1)\.
- N\. Ravi, V\. Gabeur, Y\. Hu, R\. Hu, C\. Ryali, T\. Ma, H\. Khedr, R\. Rädle, C\. Rolland, L\. Gustafson,et al\.\(2024\)Sam 2: segment anything in images and videos\.arXiv preprint arXiv:2408\.00714\.Cited by:[§2\.2](https://arxiv.org/html/2605.19852#S2.SS2.p1.1)\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\)Toolformer: language models can teach themselves to use tools\.Advances in Neural Information Processing Systems36,pp\. 68539–68551\.Cited by:[Appendix C](https://arxiv.org/html/2605.19852#A3.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§1](https://arxiv.org/html/2605.19852#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.19852#S2.SS1.p1.1),[§2\.3](https://arxiv.org/html/2605.19852#S2.SS3.p1.1),[§3\.3](https://arxiv.org/html/2605.19852#S3.SS3.p1.1),[§4\.2](https://arxiv.org/html/2605.19852#S4.SS2.p1.3)\.
- H\. Shen, K\. Zhao, T\. Zhao, R\. Xu, Z\. Zhang, M\. Zhu, and J\. Yin \(2024\)Zoomeye: enhancing multimodal llms with human\-like zooming capabilities through tree\-based image exploration\.arXiv preprint arXiv:2411\.16044\.Cited by:[Table 1](https://arxiv.org/html/2605.19852#S4.T1.10.10.10.1)\.
- G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2024\)HybridFlow: a flexible and efficient rlhf framework\.arXiv preprint arXiv: 2409\.19256\.Cited by:[§4\.2](https://arxiv.org/html/2605.19852#S4.SS2.p1.3)\.
- A\. Su, H\. Wang, W\. Ren, F\. Lin, and W\. Chen \(2025a\)Pixel reasoner: incentivizing pixel\-space reasoning with curiosity\-driven reinforcement learning\.arXiv preprint arXiv:2505\.15966\.Cited by:[§1](https://arxiv.org/html/2605.19852#S1.p2.1),[Table 1](https://arxiv.org/html/2605.19852#S4.T1.11.11.11.1)\.
- Z\. Su, L\. Li, M\. Song, Y\. Hao, Z\. Yang, J\. Zhang, G\. Chen, J\. Gu, J\. Li, X\. Qu,et al\.\(2025b\)Openthinkimg: learning to think with images via visual tool reinforcement learning\.arXiv preprint arXiv:2505\.08617\.Cited by:[§1](https://arxiv.org/html/2605.19852#S1.p2.1),[§1](https://arxiv.org/html/2605.19852#S1.p5.1),[§2\.2](https://arxiv.org/html/2605.19852#S2.SS2.p1.1)\.
- Z\. Su, P\. Xia, H\. Guo, Z\. Liu, Y\. Ma, X\. Qu, J\. Liu, Y\. Li, K\. Zeng, Z\. Yang,et al\.\(2025c\)Thinking with images for multimodal reasoning: foundations, methods, and future frontiers\.arXiv preprint arXiv:2506\.23918\.Cited by:[§2\.2](https://arxiv.org/html/2605.19852#S2.SS2.p1.1)\.
- G\. Team, R\. Anil, S\. Borgeaud, J\. Alayrac, J\. Yu, R\. Soricut, J\. Schalkwyk, A\. M\. Dai, A\. Hauth, K\. Millican,et al\.\(2023\)Gemini: a family of highly capable multimodal models\.arXiv preprint arXiv:2312\.11805\.Cited by:[§1](https://arxiv.org/html/2605.19852#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.19852#S2.SS1.p1.1)\.
- O\. Thawakar, D\. Dissanayake, K\. P\. More, R\. Thawkar, A\. Heakl, N\. Ahsan, Y\. Li, I\. Z\. M\. Zumri, J\. Lahoud, R\. M\. Anwer,et al\.\(2025\)Llamav\-o1: rethinking step\-by\-step visual reasoning in llms\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 24290–24315\.Cited by:[Appendix F](https://arxiv.org/html/2605.19852#A6.p1.1)\.
- P\. Tong, E\. Brown, P\. Wu, S\. Woo, A\. J\. V\. IYER, S\. C\. Akula, S\. Yang, J\. Yang, M\. Middepogu, Z\. Wang,et al\.\(2024\)Cambrian\-1: a fully open, vision\-centric exploration of multimodal llms\.Advances in Neural Information Processing Systems37,pp\. 87310–87356\.Cited by:[§2\.1](https://arxiv.org/html/2605.19852#S2.SS1.p1.1)\.
- H\. Wang, X\. Li, Z\. Huang, A\. Wang, J\. Wang, T\. Zhang, J\. Zheng, S\. Bai, Z\. Kang, J\. Feng,et al\.\(2025a\)Traceable evidence enhanced visual grounded reasoning: evaluation and methodology\.arXiv preprint arXiv:2507\.07999\.Cited by:[§1](https://arxiv.org/html/2605.19852#S1.p1.1),[§2\.2](https://arxiv.org/html/2605.19852#S2.SS2.p1.1),[§2\.3](https://arxiv.org/html/2605.19852#S2.SS3.p1.1)\.
- K\. Wang, J\. Pan, W\. Shi, Z\. Lu, H\. Ren, A\. Zhou, M\. Zhan, and H\. Li \(2024a\)Measuring multimodal mathematical reasoning with math\-vision dataset\.Advances in Neural Information Processing Systems37,pp\. 95095–95169\.Cited by:[§4\.1](https://arxiv.org/html/2605.19852#S4.SS1.p5.1)\.
- P\. Wang, S\. Bai, S\. Tan, S\. Wang, Z\. Fan, J\. Bai, K\. Chen, X\. Liu, J\. Wang, W\. Ge,et al\.\(2024b\)Qwen2\-vl: enhancing vision\-language model’s perception of the world at any resolution\.arXiv preprint arXiv:2409\.12191\.Cited by:[§1](https://arxiv.org/html/2605.19852#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.19852#S2.SS1.p1.1)\.
- W\. Wang, L\. Ding, M\. Zeng, X\. Zhou, L\. Shen, Y\. Luo, W\. Yu, and D\. Tao \(2025b\)Divide, conquer and combine: a training\-free framework for high\-resolution image perception in multimodal large language models\.InProceedings of the AAAI Conference on Artificial Intelligence,pp\. 7907–7915\.Cited by:[§4\.1](https://arxiv.org/html/2605.19852#S4.SS1.p2.1)\.
- X\. Wang, J\. Zhang, L\. Qi, and Y\. Shi \(2025c\)Balanced direction from multifarious choices: arithmetic meta\-learning for domain generalization\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 30577–30587\.Cited by:[§2\.1](https://arxiv.org/html/2605.19852#S2.SS1.p1.1)\.
- X\. Wang, Z\. Yang, C\. Feng, H\. Lu, L\. Li, C\. Lin, K\. Lin, F\. Huang, and L\. Wang \(2025d\)Sota with less: mcts\-guided sample selection for data\-efficient visual reasoning self\-improvement\.arXiv preprint arXiv:2504\.07934\.Cited by:[§4\.2](https://arxiv.org/html/2605.19852#S4.SS2.p1.3)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in Neural Information Processing Systems35,pp\. 24824–24837\.Cited by:[§1](https://arxiv.org/html/2605.19852#S1.p1.1)\.
- P\. Wu and S\. Xie \(2024\)V\*: guided visual search as a core mechanism in multimodal llms\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 13084–13094\.Cited by:[§2\.2](https://arxiv.org/html/2605.19852#S2.SS2.p1.1),[§4\.1](https://arxiv.org/html/2605.19852#S4.SS1.p2.1),[§4\.2](https://arxiv.org/html/2605.19852#S4.SS2.p1.3),[Table 1](https://arxiv.org/html/2605.19852#S4.T1.8.8.8.1)\.
- Y\. Xiao, E\. Sun, T\. Liu, and W\. Wang \(2024\)Logicvista: multimodal llm logical reasoning benchmark in visual contexts\.arXiv preprint arXiv:2407\.04973\.Cited by:[§4\.1](https://arxiv.org/html/2605.19852#S4.SS1.p5.1)\.
- Y\. Xu, C\. Li, H\. Zhou, X\. Wan, C\. Zhang, A\. Korhonen, and I\. Vulić \(2025\)Visual planning: let’s think only with images\.arXiv preprint arXiv:2505\.11409\.Cited by:[§2\.2](https://arxiv.org/html/2605.19852#S2.SS2.p1.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, Z\. Qiu, and … \(2024a\)Qwen2\.5 technical report\.Note:arXiv preprint arXiv:2412\.15115Version 2, submitted 3 Jan 2025Cited by:[Appendix J](https://arxiv.org/html/2605.19852#A10.p1.1),[§4\.2](https://arxiv.org/html/2605.19852#S4.SS2.p1.3)\.
- L\. Yang, B\. Kang, Z\. Huang, X\. Xu, J\. Feng, and H\. Zhao \(2024b\)Depth anything: unleashing the power of large\-scale unlabeled data\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 10371–10381\.Cited by:[§2\.2](https://arxiv.org/html/2605.19852#S2.SS2.p1.1)\.
- L\. Yang, B\. Kang, Z\. Huang, Z\. Zhao, X\. Xu, J\. Feng, and H\. Zhao \(2024c\)Depth anything v2\.Advances in Neural Information Processing Systems37,pp\. 21875–21911\.Cited by:[§2\.2](https://arxiv.org/html/2605.19852#S2.SS2.p1.1)\.
- M\. Yang, Z\. Li, J\. Zhang, L\. Qi, and Y\. Shi \(2025\)Taste more, taste better: diverse data and strong model boost semi\-supervised crowd counting\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 24440–24451\.Cited by:[§2\.1](https://arxiv.org/html/2605.19852#S2.SS1.p1.1)\.
- H\. Yao, Q\. Yin, J\. Zhang, M\. Yang, Y\. Wang, W\. Wu, F\. Su, L\. Shen, M\. Qiu, D\. Tao,et al\.\(2025\)R1\-sharevl: incentivizing reasoning capability of multimodal large language models via share\-grpo\.arXiv preprint arXiv:2505\.16673\.Cited by:[§2\.1](https://arxiv.org/html/2605.19852#S2.SS1.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. R\. Narasimhan, and Y\. Cao \(2022\)React: synergizing reasoning and acting in language models\.InThe eleventh International Conference on Learning Representations,Cited by:[Appendix C](https://arxiv.org/html/2605.19852#A3.p1.1)\.
- R\. Zhang, D\. Jiang, Y\. Zhang, H\. Lin, Z\. Guo, P\. Qiu, A\. Zhou, P\. Lu, K\. Chang, Y\. Qiao,et al\.\(2024\)Mathverse: does your multi\-modal llm truly see the diagrams in visual math problems?\.InEuropean Conference on Computer Vision,pp\. 169–186\.Cited by:[§4\.1](https://arxiv.org/html/2605.19852#S4.SS1.p5.1)\.
- Y\. Zhang, X\. Lu, S\. Yin, C\. Fu, W\. Chen, X\. Hu, B\. Wen, K\. Jiang, C\. Liu, T\. Zhang,et al\.\(2025\)Thyme: think beyond images\.arXiv preprint arXiv:2508\.11630\.Cited by:[§1](https://arxiv.org/html/2605.19852#S1.p1.1),[§2\.2](https://arxiv.org/html/2605.19852#S2.SS2.p1.1),[§2\.3](https://arxiv.org/html/2605.19852#S2.SS3.p1.1)\.
- Z\. Zheng, M\. Yang, J\. Hong, C\. Zhao, G\. Xu, L\. Yang, C\. Shen, and X\. Yu \(2025\)DeepEyes: incentivizing “thinking with images” via reinforcement learning\.arXiv preprint arXiv:2505\.14362\.Cited by:[Appendix H](https://arxiv.org/html/2605.19852#A8.p1.1),[Figure 1](https://arxiv.org/html/2605.19852#S1.F1),[Figure 1](https://arxiv.org/html/2605.19852#S1.F1.3.2),[§1](https://arxiv.org/html/2605.19852#S1.p2.1),[§1](https://arxiv.org/html/2605.19852#S1.p5.1),[§2\.2](https://arxiv.org/html/2605.19852#S2.SS2.p1.1),[§2\.3](https://arxiv.org/html/2605.19852#S2.SS3.p1.1),[§4\.2](https://arxiv.org/html/2605.19852#S4.SS2.p1.3),[Table 1](https://arxiv.org/html/2605.19852#S4.T1.12.12.12.1),[Table 2](https://arxiv.org/html/2605.19852#S4.T2.12.12.16.4.1),[Table 3](https://arxiv.org/html/2605.19852#S4.T3.12.12.18.6.1)\.
- J\. Zhu, W\. Wang, Z\. Chen, Z\. Liu, S\. Ye, L\. Gu, H\. Tian, Y\. Duan, W\. Su, J\. Shao,et al\.\(2025\)Internvl3: exploring advanced training and test\-time recipes for open\-source multimodal models\.arXiv preprint arXiv:2504\.10479\.Cited by:[Appendix H](https://arxiv.org/html/2605.19852#A8.p1.1),[Table 1](https://arxiv.org/html/2605.19852#S4.T1.5.5.5.1),[Table 1](https://arxiv.org/html/2605.19852#S4.T1.6.6.6.1),[Table 3](https://arxiv.org/html/2605.19852#S4.T3.12.12.16.4.1)\.
- C\. Zou, X\. Guo, R\. Yang, J\. Zhang, B\. Hu, and H\. Zhang \(2024\)Dynamath: a dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models\.arXiv preprint arXiv:2411\.00836\.Cited by:[§4\.1](https://arxiv.org/html/2605.19852#S4.SS1.p5.1)\.

## Appendix A Prompt Design for Training and Evaluation

To ensure consistent interaction between the model and the environment, we carefully design the system and user prompts for both the training and inference stages\. The prompts aim to explicitly distinguish between the dual reasoning modes, guiding the model to generate reasoning trajectories accordingly\.

### A\.1 Training Prompts

During training, the system prompt defines the model’s role and available reasoning modes\. The user prompt provides multimodal inputs \(text and images\) and instructs the model to reason under the specified mode\.

System Prompt:

You are a helpful assistant\.

At the beginning of your first response, you must output either <tool\_on\> or <tool\_off\> to indicate whether tools will be used to assist with subsequent answers\.

- <tool\_on\> means that you may call tools to help answer the query\.
- <tool\_off\> means that you will answer entirely without tool usage\.

\# When to choose <tool\_on\>

Use <tool\_on\> if the question requires close inspection or verification of fine details in an image, such as:

- identifying a specific object among multiple objects,
- checking small or unclear regions, sub\-tables, or fine textures,
- verifying visual details that may affect the correctness of the answer\.

In these cases, call the zoom\-in tool as needed to focus on the relevant region\.

\# When to choose <tool\_off\>

Use <tool\_off\> if:

- the question needs global or overall image understanding \(scene, layout, general context\), or the relevant region or object is already clear enough without zooming in,
- zooming in would not provide useful additional information\.

\# Tool calling format

You may call one or more functions to assist with the user query\.

You are provided with function signatures within <tools\></tools\> XML tags:

<tools\>\{"type": "function", "function": \{"name": "image\_zoom\_in\_tool", "description": "Zoom in on a specific region of an image by cropping it based on a bounding box \(bbox\) and an optional object label\.", "parameters": \{"type": "object", "properties": \{"bbox\_2d": \{"type": "array", "items": \{"type": "number"\}, "minItems": 4, "maxItems": 4, "description": "The bounding box of the region to zoom in, as \[x1, y1, x2, y2\], where \(x1, y1\) is the top\-left corner and \(x2, y2\) is the bottom\-right corner\."\}, "label": \{"type": "string", "description": "The name or label of the object in the specified bounding box \(optional\)\."\}\}, "required": \["bbox\_2d"\]\}\}\}</tools\>

\# How to call a tool

Return a json object with function name and arguments within <tool\_call\></tool\_call\> XML tags:

<tool\_call\>\{"name": <function\-name\>, "arguments": <args\-json\-object\>\}</tool\_call\>

\*\*Example\*\*:

<tool\_call\>\{"name": "image\_zoom\_in\_tool", "arguments": \{"bbox\_2d": \[10, 20, 100, 200\], "label": "the apple on the desk"\}\}</tool\_call\>

User Prompt:

Question: \{question\}

Please follow these instructions strictly:

1. First, determine whether you will use a tool by outputting <tool\_on\> or <tool\_off\>\.
2. Then, show your reasoning inside <think\> … </think\>\.
3. If tool usage is required \(<tool\_on\>\), call the image\_zoom\_in\_tool using <tool\_call\> … </tool\_call\>, and DO NOT provide an <answer\> yet — wait for the zoomed image in the next round\.
4. If no tool is needed \(<tool\_off\>\), provide your final answer inside <answer\> … </answer\>\.

Format strictly as:

<tool\_on\> <think\> … </think\> <tool\_call\> … </tool\_call\>

OR

<tool\_off\> <think\> … </think\> <answer\> … </answer\>
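The output contract defined by these prompts can be checked mechanically\. The following is a minimal Python sketch under stated assumptions, not the authors' released code: the helper names `parse_first_turn` and `crop_box` are hypothetical, and the sketch assumes `bbox_2d` is given in pixel coordinates, which the prompt does not state\.

```python
import json
import re

MODE_RE = re.compile(r"^\s*<(tool_on|tool_off)>")
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_first_turn(response: str) -> dict:
    """Split a first-turn response into mode, reasoning, tool call, and answer."""
    mode_match = MODE_RE.match(response)
    if mode_match is None:
        raise ValueError("response must start with <tool_on> or <tool_off>")
    think = THINK_RE.search(response)
    call = CALL_RE.search(response)
    answer = ANSWER_RE.search(response)
    return {
        "mode": mode_match.group(1),
        "think": think.group(1).strip() if think else None,
        # The tool call body is expected to be a JSON object.
        "tool_call": json.loads(call.group(1)) if call else None,
        "answer": answer.group(1).strip() if answer else None,
    }


def crop_box(call: dict, width: int, height: int) -> tuple:
    """Clamp the requested box to the image bounds (assumes pixel coordinates)."""
    x1, y1, x2, y2 = call["arguments"]["bbox_2d"]
    x1, x2 = sorted((max(0, min(x1, width)), max(0, min(x2, width))))
    y1, y2 = sorted((max(0, min(y1, height)), max(0, min(y2, height))))
    return (x1, y1, x2, y2)
```

A caller would pass the clamped box to whatever image library performs the crop, then feed the zoomed image back to the model for the next round\.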

### A\.2 Evaluation Prompts

For the Perception and Hallucination benchmarks, we use the same prompts as in the training phase to evaluate the model’s ability in adaptive tool invocation\. For the Reasoning datasets, we adopt the official prompts provided by each benchmark\. For the Grounding benchmark, following Seg\-Zero \(Liu et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib85)\), we use the following system and user prompts:

System Prompt:

You are a helpful assistant\.

User Prompt:

Please find "\{Question\}" with bboxs\. Compare the difference between object\(s\) and find the most closely matched object\(s\)\. Output the thinking process in <think\></think\> and final answer in <answer\></answer\> tags\. Output the bbox\(es\) inside the interested object\(s\) in JSON format\.

i\.e\. <think\>thinking process here</think\> <answer\>\{\{"bbox\_2d": \[10,100,200,210\]\}\}</answer\>

## Appendix B Reward Function Details

Accuracy reward $R_{\text{acc}}$: We evaluate whether the predicted answer is semantically equivalent to the ground truth\. The reward takes values in $\{0, 0.8\}$, where $0$ corresponds to an incorrect answer and $0.8$ to a correct one\. Evaluation combines rule\-based matching with an online reward model \(*e\.g\.* Qwen2\.5\-72B\-Instruct\)\. Specifically, we first perform exact string matching between the model output and the ground\-truth answer\. If the two strings are identical, the prediction is directly regarded as correct\. Otherwise, we evaluate semantic equivalence with the online reward model, which is prompted to judge whether the predicted answer conveys the same meaning as the ground truth under the fixed prompt detailed below\.

User Prompt:

Below are two answers to a question\. Question is \[Question\], \[Standard Answer\] is the standard answer to the question, and \[Model\_answer\] is the answer extracted from a model’s output to this question\. Determine whether these two answers are consistent\.

Note that \[Model Answer\] is consistent with \[Standard Answer\] whenever they are essentially the same\. If the meaning is expressed in the same way, it is considered consistent, for example, ’pink’ and ’it is pink’\.

If they are consistent, Judement is 1; if they are different, Judement is 0\. Just output Judement and don’t output anything else\.

\[Question\]: Is the countertop tan or blue?
\[Standard Answer\]: The countertop is tan\.
\[Model\_answer\] : tan
Judgement: 1

\[Question\]: On which side of the picture is the barrier?
\[Standard Answer\]: The barrier is on the left side of the picture\.
\[Model\_answer\] : left
Judgement: 1

\[Question\]: Is the kite brown and large?
\[Standard Answer\]: Yes, the kite is brown and large\.
\[Model\_answer\] : Yes
Judgement: 1

\[Question\]: Are the spots on a giraffe?
\[Standard Answer\]: No, the spots are on a banana\.
\[Model\_answer\] : no
Judgement: 1

\[Question\]: Who is wearing pants?
\[Standard Answer\]: The boy is wearing pants\.
\[Model\_answer\] : The person in the picture is wearing pants\.
Judgement: 1

\[Question\]: Is the man phone both blue and closed?
\[Standard Answer\]: Yes, the man phone is both blue and closed\.
\[Model\_answer\] : No\.
Judgement: 0

\[Question\]: What color is the towel in the center of the picture?
\[Standard Answer\]: The towel in the center of the picture is blue\.
\[Model\_answer\] : The towel in the center of the picture is pink\.
Judgement: 0

\[Question\]: \{question\}
\[Standard Answer\]: \{ground\_truth\}
\[Model\_answer\] : \{predict\_str\}
Judgement:

As shown above, the prompt provides seven illustrative examples covering both consistent and inconsistent cases\. The target question is placed at the end\. The reward model is instructed to output a binary judgment, where Judgement = 1 indicates semantic consistency between the prediction and the standard answer, and Judgement = 0 otherwise\. This hybrid evaluation strategy combines strict rule\-based verification with flexible semantic evaluation, enabling reliable supervision for both factual and open\-ended responses\.
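The two\-stage check described above can be sketched as follows\. This is an illustrative Python sketch, not the authors' implementation: the `judge` callable is a hypothetical stand\-in for the online reward model and is assumed to return the raw judgement string, and stripping surrounding whitespace before the exact match is an added assumption\.

```python
from typing import Callable

CORRECT_REWARD = 0.8  # value stated in the text for a correct answer


def accuracy_reward(
    question: str,
    prediction: str,
    ground_truth: str,
    judge: Callable[[str, str, str], str],
) -> float:
    """Exact string match first; fall back to the reward model otherwise."""
    # Stage 1: rule-based check (whitespace stripping is an assumption).
    if prediction.strip() == ground_truth.strip():
        return CORRECT_REWARD
    # Stage 2: ask the (hypothetical) judge for a "1"/"0" verdict.
    verdict = judge(question, ground_truth, prediction).strip()
    return CORRECT_REWARD if verdict == "1" else 0.0
```

Keeping the judge behind a plain callable means the exact\-match path never triggers a reward\-model request, which is the main cost saving of the hybrid design\.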

Format reward $R_{\text{format}}$: This reward ensures that the reasoning process and final answer adhere to the prescribed output format, i\.e\., enclosed within <think\></think\> and <answer\></answer\> tags\. The reward takes values in $\{-0.2, 0\}$, where $-0.2$ indicates a format violation and $0$ indicates correct formatting\.
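A format check of this kind might look like the sketch below\. The text specifies only the reward values and the required tags, so everything else is an assumption: the function name `format_reward` is hypothetical, only a final turn is scored, and a leading mode token is tolerated\. How intermediate <tool\_on\> turns are checked is not described here\.

```python
import re

FORMAT_PENALTY = -0.2  # value stated in the text for a format violation

_MODE_TOKEN = re.compile(r"^<(tool_on|tool_off)>\s*")
_FINAL_TURN = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)


def format_reward(final_turn: str) -> float:
    """Return 0.0 for a well-formed final turn, otherwise the penalty."""
    # Drop an optional leading mode token before checking the tag layout.
    text = _MODE_TOKEN.sub("", final_turn.strip())
    well_formed = (
        _FINAL_TURN.fullmatch(text) is not None
        and text.count("<think>") == 1
        and text.count("<answer>") == 1
    )
    return 0.0 if well_formed else FORMAT_PENALTY
```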

Mode\-specific tool reward $R_{\text{tool}}$: The computation of $R_{\text{tool}}$ follows the procedure described in [Section 3\.4](https://arxiv.org/html/2605.19852#S3.SS4)\. The reward is further modulated by $\lambda_{\text{tool}}^{\text{mode}}$ as defined in [Equation 5](https://arxiv.org/html/2605.19852#S3.E5)\.

## Appendix C Comparison with Prior Tool\-Use Methods in LLMs

Prior work on adaptive tool use in LLMs, such as ReAct \(Yao et al\., [2022](https://arxiv.org/html/2605.19852#bib.bib89)\) and Toolformer \(Schick et al\., [2023](https://arxiv.org/html/2605.19852#bib.bib90)\), primarily relies on prompt structures or local training signals to guide tool invocation\. ReAct interleaves reasoning and actions through carefully designed prompts, enabling the model to decide whether and how to call tools during generation\. Toolformer introduces a self\-supervised objective that retains tool calls based on changes in prediction cross\-entropy with and without tool usage\.

While effective, these approaches determine tool usage based on local or proxy signals, rather than directly assessing whether tool invocation is necessary for producing a correct final answer\. In contrast, our method rolls out complete reasoning trajectories under different reasoning modes \(*e\.g\.* <tool\_on\> and <tool\_off\>\) and rewards trajectories based on answer correctness\. This outcome\-driven formulation allows AutoTool to learn when tool invocation is genuinely beneficial, without relying on intermediate heuristics\.

## Appendix D Detailed Training Information

To understand how AutoTool learns to balance and adapt its reasoning behaviors, we analyze its training dynamics, focusing on the distribution of reasoning modes and the trends in tool\-invocation frequency and response length, as shown in [Figure 4](https://arxiv.org/html/2605.19852#A4.F4)\.

![Refer to caption](https://arxiv.org/html/2605.19852v1/x5.png)

Figure 4: Detailed training\-phase analysis\. \(a\) Dual reasoning trajectories, general reasoning data, and overall average reward curves\. \(b\) Average number of tool invocations under the <tool\_on\> mode\. \(c\) Response length variations throughout training\. The shaded regions denote the standard deviation across multiple runs\.

For each batch of training data, we add tool\-invocation prompts to the samples from the V\* \(where the V\* here is distinct from the V\* used in evaluation benchmarks\) and ArxivQA datasets, while the samples from the ThinkLite\-VL dataset adopt purely text\-centric reasoning and answer generation to preserve general reasoning capability\. [Figure 4](https://arxiv.org/html/2605.19852#A4.F4)\(a\) presents the average reward curves for three reasoning types, along with the overall average reward for all samples\. All rewards show a steady upward trend, demonstrating the effectiveness of our training strategy\. [Figure 4](https://arxiv.org/html/2605.19852#A4.F4)\(b\) illustrates the average number of tool invocations in <tool\_on\> reasoning trajectories during training\. The model quickly learns the correct invocation format in the early stage, and the average number of tool calls gradually stabilizes just above one per query, reflecting a more deliberate and efficient tool\-usage behavior\. [Figure 4](https://arxiv.org/html/2605.19852#A4.F4)\(c\) shows the curve of the average number of generated tokens, which gradually decreases and stabilizes around 150\. Combined with the increasing reward trend, this indicates that our method enables the model to produce more accurate answers with lower reasoning cost\.

## Appendix E Accuracy Analysis under Different Reasoning Modes

We report the accuracy under different reasoning modes across three benchmarks in Figure [5](https://arxiv.org/html/2605.19852#A5.F5)\. As shown, the error rate of the <tool\_off\> mode is consistently lower than that of the <tool\_on\> mode on all benchmarks \(10\.8% vs\. 13\.8%, 1\.6% vs\. 8\.4%, and 4\.4% vs\. 6\.7%, respectively\)\.

This observation is expected, as AutoTool tends to select the <tool\_off\> mode for relatively simple queries that can be reliably solved using the model’s internal knowledge alone\. In such cases, invoking external tools may introduce unnecessary operations or error propagation, leading to higher failure rates\. In contrast, the <tool\_on\> mode is predominantly activated for more complex or visually challenging questions, where the overall task difficulty is inherently higher\.

![Refer to caption](https://arxiv.org/html/2605.19852v1/x6.png)

Figure 5: Accuracy comparison between the two reasoning modes across three benchmarks\.

## Appendix F Comparison with Other Baselines

We further compare AutoTool with prompt\-based baselines on Qwen2\.5\-VL\-7B \(Bai et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib25)\), where reasoning modes are controlled solely through prompt design\. We additionally include LlamaV\-o1 \(Thawakar et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib87)\) for comparison, which is built upon Llama\-3\.2\-11B\-Vision\-Instruct \(Meta, [2024](https://arxiv.org/html/2605.19852#bib.bib88)\) and trained via SFT to follow a fixed reasoning pattern, generating detailed reasoning steps and final answers after producing summaries and descriptions\. As shown in [Table 9](https://arxiv.org/html/2605.19852#A6.T9), although prompt engineering can partially affect tool usage, it lacks reliable and stable control\. In contrast, our RL\-based method \(MSPO\+AMB\) consistently outperforms prompt\-only baselines\. Notably, AutoTool outperforms LlamaV\-o1 even with a smaller model size, highlighting the efficiency and adaptability of the RL\-based strategy\.

Table 9: Comparison with other baselines\.

## Appendix G Results on Other Base Models

To evaluate the robustness of our method across different base models, we conduct additional experiments using Qwen2\.5\-VL\-3B as the foundation model\. Training is performed on four H200 GPUs, while an additional two H200 GPUs are used to deploy the reward model\. [Table 10](https://arxiv.org/html/2605.19852#A7.T10) reports results on a diverse set of benchmarks\. Across all benchmarks, $\text{AutoTool}_{\text{3B}}$ consistently outperforms the corresponding base model\. These results indicate that the proposed method is not tied to a specific model scale and generalizes well to other base models\.

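Table 10: Results on Other Base Models

| Exp | HRbench\-4K FSP | HRbench\-4K FCP | HRbench\-8K FSP | HRbench\-8K FCP | V\* Attribute | V\* Spatial | POPE Adversarial | POPE Popular | POPE Random |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base Model\-3B | 87\.8 | 60\.3 | 82\.5 | 58\.5 | 87\.0 | 81\.6 | 80\.1 | 80\.3 | 80\.8 |
| AutoTool\-3B | 92\.5 | 61\.3 | 88\.0 | 60\.0 | 91\.3 | 88\.2 | 86\.1 | 88\.4 | 92\.3 |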

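Benchmarks: refCOCO, refCOCOg, refCOCO\+

| Exp | test | testA | testB | val | test | val | testA | testB | val |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base Model\-3B | 82\.0 | 46\.0 | 77\.3 | 42\.8 | 52\.2 | 56\.6 | 73\.1 | 73\.5 | 74\.0 |
| AutoTool\-3B | 86\.5 | 54\.0 | 81\.8 | 49\.3 | 79\.1 | 61\.8 | 85\.6 | 87\.4 | 91\.77 |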

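| Exp | ReasonSeg test | ReasonSeg val | MathVista | MathVerse | MathVision test | MathVision testmini | WeMath | DynaMath | LogicVista |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base Model\-3B | 28\.4 | 34\.0 | 56\.5 | 33\.2 | 11\.7 | 14\.1 | 23\.1 | 47\.0 | 40\.6 |
| AutoTool\-3B | 41\.9 | 52\.0 | 62\.5 | 36\.0 | 12\.8 | 17\.4 | 33\.5 | 50\.1 | 41\.7 |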

## Appendix H Additional Comparison on Hallucination Benchmarks

HallusionBench is a benchmark that evaluates both visual illusion and knowledge hallucination in MLLMs\. We conduct experiments on its image split to complement POPE, which mainly focuses on object existence hallucination\. For all baseline models, including Qwen2\.5\-VL\-7B \(Bai et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib25)\), InternVL3\-8B \(Zhu et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib61)\), LLaVA\-OneVision\-7B \(Li et al\., [2024a](https://arxiv.org/html/2605.19852#bib.bib60)\), and DeepEyes\-7B \(Zheng et al\., [2025](https://arxiv.org/html/2605.19852#bib.bib33)\), we use the same prompting strategy as in the POPE experiments\. As shown in [Table 11](https://arxiv.org/html/2605.19852#A8.T11), AutoTool achieves the best performance on HallusionBench, demonstrating that our method generalizes beyond POPE and effectively reduces both visual and knowledge hallucinations\.

Table 11: Comparison of different models on HallusionBench\.

## Appendix I Further Ablation Studies

To further analyze the design choices of our method, we conduct additional ablation experiments, as shown in [Table 12](https://arxiv.org/html/2605.19852#A9.T12)\. $\text{AutoTool}_{\text{SFT}}$ performs an additional SFT stage before GRPO using a small amount of data that matches the dual reasoning mode\. Although the model learns both reasoning types, this rigid and forced training procedure disrupts the model’s inherent knowledge, leading to a significant performance drop\.

$\text{AMB}_{\text{linear}}$ linearly decreases the influence of $F_{\text{on}}$ on $\lambda_{\text{tool}}^{\text{mode}}$ during training, following

$$\lambda_{\text{tool}}^{\text{mode}} = \lambda_{\text{tool}} \pm \frac{t}{t_{\text{max}}}\left(0.5 - F_{\text{on}}\right),$$

where $t$ denotes the current training step and $t_{\text{max}}$ represents the total number of training steps\. This scheme still imposes a residual constraint throughout training, merely reducing its strength over time without granting the model full freedom\.

$\text{AutoTool}_{\text{w/o AMB}}$ removes the AMB module, leaving the rollout proportions of the two reasoning modes uncontrolled\. Due to the inherent reasoning bias of the foundation model, the policy strongly favors <tool\_off\>, converging to pure text\-based reasoning\. Results show a clear performance drop compared to AutoTool, highlighting the importance of the balanced mode constraint\. In contrast, our method focuses on balanced exploration of dual reasoning modes during the early and middle stages of training, and completely removes the constraint in the later stage, allowing the model to freely explore and consolidate its preferred reasoning strategy\.
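For concreteness, the $\text{AMB}_{\text{linear}}$ expression can be evaluated numerically\. The sketch below simply computes the formula as written in this appendix and returns both sign branches; which branch applies to which reasoning mode is fixed by Equation 5 in the main text and is not assumed here\. The function name `amb_linear_coefficients` and the example values are hypothetical\.

```python
def amb_linear_coefficients(
    lambda_tool: float, f_on: float, step: int, max_steps: int
) -> tuple:
    """Return (lambda_tool + delta, lambda_tool - delta) for AMB_linear.

    delta = (step / max_steps) * (0.5 - f_on), as written in the appendix.
    """
    if max_steps <= 0 or not 0 <= step <= max_steps:
        raise ValueError("require 0 <= step <= max_steps and max_steps > 0")
    delta = (step / max_steps) * (0.5 - f_on)
    return lambda_tool + delta, lambda_tool - delta


# Illustrative values only: halfway through training with f_on = 0.25.
plus, minus = amb_linear_coefficients(lambda_tool=1.0, f_on=0.25, step=50, max_steps=100)
print(plus, minus)
```

When $F_{\text{on}} = 0.5$ the correction term vanishes and both branches reduce to $\lambda_{\text{tool}}$, so the adjustment is active only when the two modes are imbalanced\.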

Table 12: Further ablation experiments\.

To study whether the first\-step tool decision is overly restrictive, we also evaluate a variant that delays the generation of <tool\_on\> or <tool\_off\> until after an explicit thinking phase\. Specifically, the model follows <think\> … </think\> <tool\_on\> <tool\_call\> … </tool\_call\> or <think\> … </think\> <tool\_off\> <answer\> … </answer\>\. The results are reported as $\text{AutoTool}_{\text{delay}}$\. We observe no significant performance improvement from delaying the decision token\. This suggests that whether to invoke a tool can be reliably determined from the image–question pair alone, without requiring extended intermediate reasoning\. Moreover, the delayed design complicates inference\-time control, as enforcing a specific reasoning mode requires multi\-stage decoding\. Overall, the first\-step decision provides a simpler and more practical design without sacrificing performance\.

We further study a training strategy without explicit <tool\_on\>/<tool\_off\> tokens, where the model follows <think\> … </think\> <tool\_call\> … </tool\_call\> or <think\> … </think\> <answer\> … </answer\>\. The corresponding results are reported as $\text{AutoTool}_{\text{notoken}}$\. As with $\text{AutoTool}_{\text{delay}}$, the decision of whether to invoke a tool is made after the thinking phase\. Although the overall performance is comparable to that of AutoTool, this design remains less flexible at test time because it does not allow direct control over the reasoning mode\.

In our method, we do not include a KL regularization term, which allows the model to explore freely and converge faster. For comparison, $\text{AutoTool}_{\text{w/ KL}}$ reports results with a KL coefficient of 0.01. Introducing the KL term restricts the model's exploration, making it harder to learn better reasoning strategies.
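To make the role of the KL coefficient concrete, the sketch below shows how a penalty weighted by `beta` enters a generic policy objective. It is an illustration under stated assumptions, not the paper's loss: this passage does not specify the KL estimator or the base objective, so the per-token "k3" estimator (commonly used in GRPO-style training) and the function names are assumptions.

```python
import math


def k3_kl_estimate(logp_policy: float, logp_ref: float) -> float:
    """Per-token k3 estimate of KL(policy || reference).

    ASSUMPTION: the token was sampled from the policy, and the k3 form
    (r - 1 - log r with r = p_ref / p_policy) stands in for whatever
    estimator the authors actually used.
    """
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


def regularized_objective(policy_term: float, kl: float, beta: float = 0.0) -> float:
    """Objective to maximize: the policy term minus a beta-weighted KL penalty."""
    return policy_term - beta * kl
```

With `beta = 0.0` the penalty disappears and the objective reduces to the policy term, which corresponds to the no-KL setting; `beta = 0.01` corresponds to the coefficient used in the $\text{AutoTool}_{\text{w/ KL}}$ comparison.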

## Appendix J Effect of Reward Model Scale

In our main experiments, we adopt Qwen2.5-72B-Instruct (Yang et al., [2024a](https://arxiv.org/html/2605.19852#bib.bib43)) as the reward model. To study the impact of reward model capacity, we conduct an ablation using smaller models from the same Qwen2.5-Instruct family, namely 32B, 14B, and 7B. The quantitative results are summarized in [Table 13](https://arxiv.org/html/2605.19852#A10.T13). Larger reward models consistently lead to better downstream performance. We attribute this improvement to their stronger ability to provide accurate feedback for open-ended responses, which is particularly important in reinforcement learning with verifiable rewards. In contrast, smaller reward models tend to produce noisier or less discriminative reward signals, making it harder for the policy to distinguish between subtly different reasoning outcomes. Despite these differences, our method consistently improves over the base model (*i.e.*, Qwen2.5-VL-7B (Bai et al., [2025](https://arxiv.org/html/2605.19852#bib.bib25))) across all reward model scales, demonstrating its robustness to the choice of reward model.

Table 13: Effect of reward model scale.

## Appendix K Mode-forced Evaluation

In addition to adaptive tool invocation, our model also allows its reasoning behavior to be constrained manually by inserting special tokens or prompt instructions that enforce a specific reasoning mode. As shown in Table [14](https://arxiv.org/html/2605.19852#A11.T14), both the fully tool-assisted (`<tool_on>`) and tool-free (`<tool_off>`) variants achieve competitive performance, demonstrating that each reasoning mode is well trained under our training strategy. The fully tool-assisted mode achieves slightly higher accuracy on certain splits but incurs additional inference overhead. By contrast, the adaptive mode selection of AutoTool achieves the best overall performance by dynamically choosing the most suitable reasoning strategy according to the characteristics of each query.

We further explore the effect of forcing AutoTool to use the reasoning mode opposite to its preferred choice at test time. Specifically, we enforce `<tool_on>` for samples where AutoTool originally predicts `<tool_off>`, and vice versa. The corresponding results are reported as $\text{AutoTool}_{\text{reverse}}$. The majority of samples are thereby forced into unsuitable reasoning modes, leading to the worst overall performance among all variants. This observation further highlights the importance of selecting an appropriate reasoning mode for each instance.
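Mode forcing and the reverse setting can be sketched as follows. This is a minimal illustration under assumptions, not the released evaluation code: `generate` is a hypothetical callable that takes the prompt text (including any prefilled response prefix) and returns the model's continuation, and forcing is implemented by prefilling the decision token, which is one of the two options the text mentions.

```python
from typing import Callable

TOOL_ON, TOOL_OFF = "<tool_on>", "<tool_off>"


def flip(mode: str) -> str:
    """Return the opposite reasoning mode."""
    if mode == TOOL_ON:
        return TOOL_OFF
    if mode == TOOL_OFF:
        return TOOL_ON
    raise ValueError(f"unknown mode: {mode!r}")


def forced_generate(generate: Callable[[str], str], prompt: str, mode: str) -> str:
    """Prefill the decision token so decoding continues in the forced mode."""
    if mode not in (TOOL_ON, TOOL_OFF):
        raise ValueError(f"unknown mode: {mode!r}")
    return mode + generate(prompt + mode)


def reverse_generate(generate: Callable[[str], str], prompt: str) -> str:
    """Let the model pick a mode freely, then decode again in the opposite mode."""
    free_response = generate(prompt).lstrip()
    if free_response.startswith(TOOL_ON):
        chosen = TOOL_ON
    elif free_response.startswith(TOOL_OFF):
        chosen = TOOL_OFF
    else:
        raise ValueError("response does not start with a decision token")
    return forced_generate(generate, prompt, flip(chosen))
```

Because the decision is the first generated token, forcing a mode needs only a single prefilled token and one decoding pass; the reverse setting needs one extra pass to discover the model's own preference.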

Table 14: Mode-forced evaluation results.

## Appendix L Test Performance over Training Progress

To illustrate the evolution of downstream performance during RL training, we report test accuracy on three representative benchmarks (HRBench-4K, HRBench-8K, and V*), measured every 10 training steps over the full 80-step schedule. Figure [6](https://arxiv.org/html/2605.19852#A12.F6) plots the performance curves for each benchmark. The curves report Overall, FSP, and FCP accuracy (or Attribute and Spatial accuracy for V*) as the model progresses from 10 to 80 training steps. The results show that AutoTool improves steadily on all three benchmarks throughout training.

![Refer to caption](https://arxiv.org/html/2605.19852v1/x7.png)

Figure 6: Test accuracy across different training steps on HRBench-4K, HRBench-8K, and V*.

## Appendix M Extension to the Multi-tool Setting

Although the main experiments focus on a zoom-in tool for clarity and controlled analysis, our method is not restricted to a single tool type. To evaluate its generality, we further study our method in a multi-tool setting based on DeepEyesV2 (Hong et al., [2025](https://arxiv.org/html/2605.19852#bib.bib86)), a recently proposed framework that supports interleaved invocation of heterogeneous tools, including programmatic code execution and web retrieval, going well beyond simple image cropping.

In DeepEyesV2, multi-tool capabilities are primarily elicited through curated SFT data, which encourages the model to invoke appropriate tools during reasoning. During the subsequent RL stage, only accuracy and format rewards are applied, deliberately avoiding explicit tool-use rewards. While this design partially alleviates excessive tool invocation, the heavy reliance on tool-centric SFT data still biases the model toward frequent tool usage. As a result, the model exhibits a strong preference for invoking tools, and its tool-free reasoning capability remains under-optimized.

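Table 15: Performance in the multi-tool setting.

| Exp | Size | Training | HRBench-4K FSP | HRBench-4K FCP | HRBench-4K Inference | HRBench-8K FSP | HRBench-8K FCP | HRBench-8K Inference | V* Attribute | V* Spatial | V* Inference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepEyesV2 | 7B | 50.3 h | 90.5 | 62.0 | 55.75 min | 87 | 60.8 | 63.12 min | 86.1 | 82.9 | 2.62 min |
| $\text{DeepEyesV2}_{\text{AMB+MSPO}}$ | 7B | 40.4 h | 92.3 | 62.8 | 37.52 min | 88.8 | 61.5 | 42.25 min | 88.7 | 84.2 | 1.82 min |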

![Refer to caption](https://arxiv.org/html/2605.19852v1/x8.png)

Figure 7: The outer ring shows the proportion of the dual reasoning modes on two datasets, while the inner ring presents their distribution across different splits within each dataset. The left two plots correspond to DeepEyesV2, and the right two plots correspond to DeepEyesV2 integrated with AMB and MSPO.

This SFT-then-RL training paradigm shares a closely aligned objective with our AMB: both aim to first establish tool-use competence and subsequently enable more flexible exploration. However, AMB explicitly balances the relative importance of tool-based and tool-free reasoning modes, preventing premature collapse into tool-dominant behaviors. We integrate AMB and MSPO into the DeepEyesV2 training pipeline, demonstrating that our method is plug-and-play with existing reinforcement learning with verifiable rewards (RLVR) algorithms. Quantitative results are reported in [Table 15](https://arxiv.org/html/2605.19852#A13.T15), showing that our approach achieves higher overall task performance while significantly reducing both training and inference overhead.

[Figure 7](https://arxiv.org/html/2605.19852#A13.F7) further illustrates the tool invocation ratios across different benchmarks for DeepEyesV2 with and without AMB+MSPO. The results indicate that our method generalizes well to heterogeneous multi-tool settings and effectively mitigates tool over-reliance beyond the single zoom-in tool studied in the main paper.
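The overhead reductions implied by Table 15 can be checked with a short calculation. The sketch below computes the relative reduction in training and inference time, using the values as read from the table; the dictionary layout and function name are illustrative choices rather than anything from the paper.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction of the baseline cost that is saved."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - ours) / baseline


# (DeepEyesV2, DeepEyesV2 + AMB + MSPO), as read from Table 15.
TABLE_15_COSTS = {
    "training (h)": (50.3, 40.4),
    "HRBench-4K inference (min)": (55.75, 37.52),
    "HRBench-8K inference (min)": (63.12, 42.25),
    "V* inference (min)": (2.62, 1.82),
}

reductions = {
    name: relative_reduction(baseline, ours)
    for name, (baseline, ours) in TABLE_15_COSTS.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower")
```

On these numbers, training time drops by roughly a fifth and inference time by roughly a third on each of the three benchmarks.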

## Appendix N More Cases

We provide several representative question–answer examples generated by AutoTool, covering various task types including perception, hallucination, grounding, and reasoning, as illustrated in [Figure 8](https://arxiv.org/html/2605.19852#A15.F8), [Figure 9](https://arxiv.org/html/2605.19852#A15.F9), and [Figure 10](https://arxiv.org/html/2605.19852#A15.F10). These examples provide qualitative evidence of the model's capability across different dimensions.

We also present several failure cases in Figure [11](https://arxiv.org/html/2605.19852#A15.F11). In the first example, which involves counting the number of computers, the model incorrectly assesses the task as simple during mode selection and therefore chooses not to invoke the tool, failing to detect the second computer. In the second example, the model selects the correct reasoning mode but localizes the wrong region, resulting in an incorrect answer. The third example is more deceptive. At a glance, the image appears to contain three dogs, making the question seem straightforward. However, careful inspection reveals an additional small white dog located between a black dog and a yellow dog, so fine-grained visual inspection is required for accurate counting. These failure cases highlight the challenges of reliable reasoning mode selection and precise visual localization.

## Appendix O Limitations and Future Work

Our method explicitly controls whether the model invokes tools in subsequent reasoning by predicting special tokens, and it has been validated in both single-tool and multi-tool settings (Appendix [M](https://arxiv.org/html/2605.19852#A13)). However, in more complex scenarios involving sequential tool calls, both redundant and insufficient tool usage may lead to incorrect final answers. Accurately identifying ineffective steps within a tool-call chain therefore becomes a key challenge. A natural direction for future work is to extend the reward formulation to account for tool-chain quality, enabling more fine-grained supervision over the contribution of each tool invocation. Possible strategies include measuring the marginal utility of individual tool calls, or incorporating trajectory-wise evaluation rewards that assess the overall efficiency and coherence of the tool-use sequence. Exploring such reward designs in a stable and scalable manner remains an open problem.

![Refer to caption](https://arxiv.org/html/2605.19852v1/x9.png)

Figure 8: Qualitative examples of perception benchmark generated by AutoTool.

![Refer to caption](https://arxiv.org/html/2605.19852v1/x10.png)

Figure 9: Qualitative examples of hallucination benchmark generated by AutoTool.

![Refer to caption](https://arxiv.org/html/2605.19852v1/x11.png)

Figure 10: Qualitative examples of grounding and reasoning benchmark generated by AutoTool.

![Refer to caption](https://arxiv.org/html/2605.19852v1/x12.png)

Figure 11: Failure cases. Orange boxes denote the ground-truth regions of interest that the model should attend to, while the red boxes show the regions actually selected for zoom-in.
