Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

arXiv cs.CL 06/10/26, 04:00 AM Papers
physical-tool-use embodied-ai multimodal benchmark perception planning mllm
Summary
This paper introduces PhysTool-Bench, a benchmark for evaluating multimodal large language models' ability to recognize and plan the use of physical tools in real-world scenes. The authors find that even the best model identifies only 58.7% of tools and completes just 21.0% of queries end-to-end, revealing a two-level deficit in perception and functional commonsense.
arXiv:2606.10803v1 Announce Type: new
Cached at: 06/10/26, 06:12 AM
# Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use
Source: [https://arxiv.org/html/2606.10803](https://arxiv.org/html/2606.10803)
Zhixin Ma¹ (equal contribution), Yutong Zhou² (equal contribution), Yongqi Li² (advisor), Chong-Wah Ngo¹, Wenjie Li²

¹ Singapore Management University, ² The Hong Kong Polytechnic University

{zhixinma97, yutongzhou714, liyongqi0}@gmail.com

###### Abstract

Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the “brain” of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs’ ability to assist humans in real-world tasks. Despite its importance, MLLMs’ proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs’ ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.10803v1/x1.png)

Figure 1: The capability divide between digital and physical tool use. MLLMs solve structured digital tasks reliably via APIs (left), but struggle with the visual reasoning and physical commonsense required to select and sequence tools in real-world scenes (right). PhysTool-Bench evaluates exactly this physical-world capability. Target tools are highlighted only for illustration.

The ability to use tools has long been a core capability of intelligence, and Large Language Models (LLMs) have recently shown remarkable progress along this dimension. State-of-the-art LLMs now function effectively as autonomous digital agents, using software APIs to book flights, query databases, and navigate the web [yun23apibench,qin24toolllm]. However, these successes are confined to the digital world of APIs. As an essential step toward deploying AI to assist human society, the capability of these models to follow instructions and utilize tools in the physical world must also be rigorously assessed.

Multimodal LLMs (MLLMs) are increasingly regarded as the reasoning core of embodied AI [li23manipllm]. By integrating visual perception with language comprehension, MLLMs empower embodied agents to ground high-level instructions, such as “bring me the red mug on the kitchen counter”, into actions that robots can execute. Recent systems have shown strong performance on indoor navigation [danny23palme] and object manipulation [liu24robomamba], and the benchmarks driving this progress have largely focused on the same two capabilities [xiang20sapien,mu21maniskill]. Yet tool use in the physical world, arguably the next frontier for embodied AI, has received far less attention. Specifically, how well current MLLMs can recognize, comprehend, and utilize physical tools remains an open question.

To answer this question, we introduce PhysTool-Bench, a benchmark dedicated to evaluating physical tool use. PhysTool-Bench contains 2,510 queries over 2,678 distinct physical tools, drawn from manufacturing, electrical work, agriculture, healthcare, and beyond. Each query pairs a natural-language instruction with an image of a realistic environment, such as a workshop or kitchen, where the model must identify the appropriate tools for the task. As illustrated in Figure [1](https://arxiv.org/html/2606.10803#S1.F1), given an instruction such as “prepare a wooden shelf…”, the model must select the correct tools (a handsaw, plane, and vacuum cleaner) in the right order, while rejecting visually or functionally similar alternatives. We evaluate MLLMs on two progressive tasks: Task I (Physical Tool Recognition) asks the model to enumerate every tool visible in the scene; Task II (Tool Selection and Action Planning) further requires it to select the necessary tools and place them in the correct execution order, given the instruction. Together, the two tasks disentangle what the model can see from what it can reason about.

PhysTool-Bench mirrors the visual and conceptual complexity of real-world environments. Each scene contains on average 8.6 tools, of which only 3.1 are required by the instruction; the remaining items are everyday tools that may be visually or functionally related to the targets. 86.9% of queries further require multiple tools to be applied in a specific order, jointly evaluating selection and sequential planning. To capture both axes of difficulty, we report set-level F1 for tool selection and strict Exact Match (EM), which requires the predicted tools to match the ground-truth set *and* execution order. The full dataset is curated through a multi-stage quality-control pipeline (§[3.2](https://arxiv.org/html/2606.10803#S3.SS2)), and a human reference study confirms its quality: on queries rated highly familiar by an annotator, human EM reaches 75%, indicating that the ground truth aligns with informed human judgment.

We benchmark 13 leading MLLMs on PhysTool-Bench, spanning commercial models (GPT-4o, GPT-5.2, Gemini-3.1-Pro, Qwen3-VL-Plus) and open-source models (Qwen3-VL, InternVL, Kimi-VL, DeepSeek-VL, and others). Four findings stand out. (i) Recognition is non-trivial. Even the strongest model reaches only 58.7% F1 when identifying the tools in a scene; most open-source models miss more than half. (ii) Action planning is far harder. Gemini-3.1-Pro succeeds on merely 21.0% (EM) of queries, with EM collapsing from 34.5% on two-tool queries to 0.5% on queries requiring six or more. (iii) Functional confusion drives failures. 42–61% of errors stem from substituting target tools with functionally similar alternatives that are visible in the scene; a specialized open-vocabulary detector (Grounding DINO) even outperforms the best MLLM in recall by 13.4 pp, indicating that the bottleneck is physical commonsense, not perception. (iv) The model gap is real. Averaged across all familiarity levels (including unfamiliar domains), the human annotator reaches 38% EM, far exceeding the best MLLM (21.0%), confirming that the gap reflects model capability rather than task ambiguity.

In summary, our contributions are as follows:

- A new dimension for evaluating MLLMs. We introduce PhysTool-Bench, the first benchmark dedicated to physical tool use. This capability bridges digital tool mastery and real-world embodied deployment, yet has remained largely unexamined despite recent progress in Embodied AI.
- A diagnostic evaluation framework. Our two-task design, separating recognition from instruction-conditioned selection and planning, isolates failures along the perception-to-reasoning pipeline. The benchmark provides verified ground truth across 2,510 queries spanning 2,678 tools in everyday domains from manufacturing to healthcare.
- A pointed empirical diagnosis. Across 13 state-of-the-art MLLMs, we find that the bottleneck in physical tool use is not raw perception but functional commonsense: even when models correctly perceive a scene, they fail to map tools onto task semantics. This points to physical commonsense as the central research direction for practical embodied AI.

## 2 Related Work

### 2.1 Benchmarks for Digital Tool Learning

Recent studies have demonstrated that LLMs can master external tools to solve complex problems [schick23toolformer,ReAct]. Early methods confirmed the potential of tool learning to overcome the limitations of LLMs as language processors while maintaining their generality [schick23toolformer].

Encouraged by the promise of tool learning, researchers have established a variety of benchmarks and evaluation studies to systematically define the problem. General benchmarks typically evaluate LLMs’ ability in tool selection and tool calling across various APIs and their diverse use cases [patil24gorilla]. Subsequent studies expand the scope to include action planning and response generation stages [qin24toolllm], while a later version balances stability and realism via a virtual API server [guo24stabletoolbench]. However, these existing benchmarks are predominantly confined to textual modalities and digital API environments. They fail to assess how agents visually perceive real-world scenarios and manipulate physical tools.

### 2.2 Evaluations for Embodied Action Planning

The transition from digital assistants to physical robots necessitates evaluating how well high-level reasoning can be grounded in robotic affordances. Since the advent of SayCan [ahn22saycan], which introduced pre-trained robotic value functions to assess the feasibility of each planned step, researchers have worked to bridge the gap between LLMs’ high-level semantic knowledge and long-horizon task planning and completion in the real world. Although PaLM-E [driess23palme] and RT-2 [brohan23rt2] achieve a tighter integration of perception and action planning, they still focus primarily on fundamental “pick-and-place” tasks or spatial rearrangements and inherently treat objects as passive targets, without investigating “tools”, which play important roles in complex tasks and plans. BEHAVIOR-1K [li24behavior1k] challenges agents with realistic physics and demanding interaction with rigid bodies, deformable materials, and complex thermal states. Yet it does not explicitly assess the zero-shot cognitive capacity of multimodal foundation models to comprehend and plan with specialized equipment.

More recently, studies have begun to explicitly explore the intersection of LLMs and robotic tool use. For example, RoboTool [xu24robotool] leverages a multi-agent LLM pipeline to generate executable code, enabling robots to use objects creatively to overcome implicit physical constraints. However, its evaluation is severely limited in scale, encompassing a mere six task scenarios, which falls far short of a comprehensive assessment of tool-use capabilities. Because these frameworks often bypass the raw visual perception challenge by relying on predefined states or simplified environments, they fundamentally fail to evaluate an agent’s capability to visually recognize diverse, professional physical tools from complex real-world scenes.

## 3 The Physical Tool Bench

![Refer to caption](https://arxiv.org/html/2606.10803v1/x2.png)

Figure 2: Overview of the PhysTool-Bench construction pipeline. Gemini generates each query (task instruction, target tools, distractors) from the tool bank, with novel distractors recycled back via *Tool Bank Extension*; Nano Banana Pro then renders the scene. Three quality-control stages follow: QC-I refines targets and assigns step labels; QC-II verifies tool-description alignment; QC-III applies human review for visual realism.

This section details the construction and characteristics of our proposed benchmark. We first outline the definitions of two primary tasks (§[3.1](https://arxiv.org/html/2606.10803#S3.SS1)). Next, we describe the annotation pipeline and quality assurance procedures for benchmark construction, which encompass target tool combination, instruction design, the injection of confounding tools, and the generation of visual scenarios (§[3.2](https://arxiv.org/html/2606.10803#S3.SS2)). Finally, we present an analysis of the dataset statistics (§[3.3](https://arxiv.org/html/2606.10803#S3.SS3)).

### 3.1 Problem Formulation

Each evaluation instance (a *query*) is a tuple $(I, L)$, where $I$ is an image depicting a physical scenario with a set of available tools and $L$ is a natural language instruction (e.g., “bond the cracked ceramic fragments”). Let $\mathcal{C} = \{c_1, \dots, c_N\}$ denote the complete set of tools visible in $I$, which includes both task-relevant targets and other items present in the scene. We evaluate MLLMs $f_\theta$ on two progressive tasks.

Task I: Physical Tool Recognition. Given the image $I$ and a recognition prompt $P_{rec}$, the model produces a predicted tool set $\hat{\mathcal{C}} = f_\theta(I, P_{rec})$, and the goal is to recover $\mathcal{C}$. This task isolates the model’s ability to enumerate fine-grained physical tools from cluttered scenes, independent of any task instruction.

Task II: Tool Selection and Action Planning. Given $I$ and $L$, the model outputs an ordered sequence $\hat{Y} = f_\theta(I, L) = (y_1, \dots, y_K)$ with each $y_i \in \mathcal{C}$. The ground truth is $\mathcal{T}^* = \{(t_j, s_j)\}_{j=1}^{M}$, where $t_j \in \mathcal{C}$ and $s_j \in \mathbb{Z}_{\geq 1}$ is the execution-step index of $t_j$. Tools sharing the same $s$ are interchangeable, while tools with different $s$ values must follow their precedence ($s_j < s_k \Rightarrow t_j$ precedes $t_k$). This unifies ordered and order-free queries under one formulation. A prediction $\hat{Y}$ is correct iff (i) it forms a bijection with $\{t_j\}_{j=1}^{M}$ (no missing targets, no extras, no duplicates) and (ii) the step labels of its matched tools form a non-decreasing sequence.
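The correctness criterion for Task II can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function name `is_correct_plan`, the dictionary representation of the ground truth, and the example tools are assumptions made here, and tool names are compared as exact strings.

```python
from typing import Dict, Sequence


def is_correct_plan(pred: Sequence[str], gt_steps: Dict[str, int]) -> bool:
    """Return True iff `pred` satisfies conditions (i) and (ii).

    `gt_steps` maps each target tool t_j to its step index s_j (>= 1).
    This representation is an assumption of the sketch.
    """
    # (i) Bijection with the target set: no missing, extra, or duplicate tools.
    if len(pred) != len(gt_steps) or set(pred) != set(gt_steps):
        return False
    # (ii) Step labels read off in predicted order must be non-decreasing.
    steps = [gt_steps[tool] for tool in pred]
    return all(a <= b for a, b in zip(steps, steps[1:]))


# Hypothetical ground truth: plane and sandpaper share step 2,
# so they may appear in either order.
gt = {"handsaw": 1, "plane": 2, "sandpaper": 2, "vacuum cleaner": 3}
print(is_correct_plan(["handsaw", "sandpaper", "plane", "vacuum cleaner"], gt))
print(is_correct_plan(["plane", "handsaw", "sandpaper", "vacuum cleaner"], gt))
```

The first call is accepted because the two step-2 tools are swapped only with each other; the second is rejected because a step-2 tool precedes the step-1 tool. An order-free query is simply the case where every tool carries the same step index.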

### 3.2 Dataset Construction Pipeline

Constructing PhysTool-Bench requires balancing two goals: covering the diversity of real-world physical tools while ensuring each query admits a clear, verifiable ground-truth solution. We achieve this through a three-stage pipeline (Figure [2](https://arxiv.org/html/2606.10803#S3.F2)): tool bank initialization (§[3.2.1](https://arxiv.org/html/2606.10803#S3.SS2.SSS1)), query generation (§[3.2.2](https://arxiv.org/html/2606.10803#S3.SS2.SSS2)), and multi-stage quality assurance (§[3.2.3](https://arxiv.org/html/2606.10803#S3.SS2.SSS3)). All prompts used in this section are provided in Appendix [C](https://arxiv.org/html/2606.10803#A3).

#### 3.2.1 Tool Bank Initialization and Extension

We begin from a manually curated seed set of 310 commonly used physical tools and iteratively expand it during dataset construction along two complementary paths. First, we prompt the LLM to propose new tools across diverse application domains and functional categories, enforcing breadth across the bank. Second, novel distractor tools introduced by Gemini-3.1-Pro [gemini25gemini] during query generation (§[3.2.2](https://arxiv.org/html/2606.10803#S3.SS2.SSS2)) are recycled back into the bank (see the *Tool Bank Extension* loop in Figure [2](https://arxiv.org/html/2606.10803#S3.F2)), systematically capturing functionally adjacent and visually similar confounders rather than only canonical task-completing tools. The expansion terminates at saturation, yielding 2,678 distinct tools.

#### 3.2.2 Query Generation

Target tool combinations. We prompt Gemini-3-Pro [gemini25gemini] to formulate physical tool combinations restricted to the tool bank. Each query requires 1–3 target tools at generation time: 310 single-tool queries (one per tool, ensuring coverage), 500 two-tool queries, and 500 three-tool queries. When sampling these target tool combinations, we strictly control the selection frequency to prevent any specific tool from being overrepresented or underrepresented, thereby maintaining a balanced distribution.
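One simple way to keep selection frequencies balanced is to always draw from the least-used tools. The sketch below illustrates only that frequency-control idea; it is not the authors' procedure (in the paper the combinations are proposed by an LLM, which also has to ensure the tools make sense together), and the function name, parameters, and greedy rule are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def sample_balanced_combinations(
    tools: Sequence[str], n_combos: int, size: int, seed: int = 0
) -> List[Tuple[str, ...]]:
    """Draw `n_combos` combinations of `size` distinct tools.

    Hypothetical greedy rule: each draw takes the currently least-used
    tools, so usage counts never differ by more than one.
    """
    if size > len(tools):
        raise ValueError("combination size exceeds the number of tools")
    rng = random.Random(seed)
    usage = Counter({tool: 0 for tool in tools})
    combos = []
    for _ in range(n_combos):
        # Rank by usage count; a random second key breaks ties.
        ranked = sorted(tools, key=lambda t: (usage[t], rng.random()))
        combo = tuple(ranked[:size])
        usage.update(combo)
        combos.append(combo)
    return combos


bank = [f"tool_{i}" for i in range(10)]  # placeholder tool names
for combo in sample_balanced_combinations(bank, n_combos=4, size=3):
    print(combo)
```

A realistic pipeline would additionally filter each candidate combination for functional coherence before accepting it, which this sketch deliberately leaves out.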

Step labeling. For multi-tool combinations, we also assign each tool an execution-step index $s_j$: tools that must be used before others receive earlier step indices, while functionally interchangeable tools share the same step. This step structure forms part of the ground truth $\mathcal{T}^*$ defined in §[3.1](https://arxiv.org/html/2606.10803#S3.SS1).

Instruction and additional-tool injection. For each target combination, GPT-4o [openai24gpt4o] derives two distinct query scenarios. Instructions are phrased to describe the objective without naming required tools. Each scenario also receives 3–10 additional tools (i.e., distractors) selected for visual similarity, functional proximity, or domain relevance to the targets, reflecting the visual complexity of real-world environments.

Image description. To prepare each query for visual rendering, we synthesize a detailed image description $d_i$ that specifies the scene composition. The description explicitly lists every candidate tool (both targets and additional tools) and instructs target tools to be randomly placed or partially obstructed to mimic real-world clutter.

Image rendering. Using each description $d_i$, we render the corresponding scene image $I_i = \text{ImgGen}(d_i)$ with Nano Banana Pro¹, supplemented with prompts enforcing adherence to physical laws.

¹ We additionally validate that our findings generalize beyond synthetically generated images on a real-world image subset in §[5.4](https://arxiv.org/html/2606.10803#S5.SS4).

#### 3.2.3 Multi-Stage Quality Assurance

To ensure the correctness of the ground truth and eliminate ambiguities, as illustrated in Figure[2](https://arxiv.org/html/2606.10803#S3.F2), we implement three Quality Control \(QC\) checkpoints\.

QC-I: Ground Truth Verification. We refine each ground-truth target set with Gemini-3.1-Pro. Given the instruction, the scene description, and a shuffled list of all candidate tools, the model evaluates each tool against three criteria: (1) essential and professional for the task, (2) consistent with the scenario state, and (3) supporting a valid execution sequence. Based on this audit, tools may be reassigned between the target and distractor sets to eliminate cases where a distractor could substitute for a target.

QC-II: Image Description Alignment. For each query, we run a programmatic check ensuring that every tool in the candidate set $\mathcal{C}$ appears as a literal mention in the image description $d_i$, preventing missing or hallucinated tools at rendering time.
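A literal-mention check of this kind can be as small as the sketch below. It is an illustration rather than the authors' script: the function name `missing_mentions`, the case-insensitive substring matching, and the example description are assumptions.

```python
from typing import Iterable, List


def missing_mentions(candidate_tools: Iterable[str], description: str) -> List[str]:
    """Return candidate tools that never appear verbatim in the description.

    Assumption: matching is a case-insensitive substring test.
    """
    text = description.casefold()
    return [tool for tool in candidate_tools if tool.casefold() not in text]


# Hypothetical scene description and candidate set.
description = "A cluttered workbench with a handsaw, a block plane, and a claw hammer."
print(missing_mentions(["handsaw", "block plane", "chisel"], description))
# prints ['chisel']
```

A query passes when the returned list is empty. Plain substring matching can over-accept (for example, "plane" inside "planer"), so a stricter variant might match on word boundaries instead.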

QC-III: Visual Quality Verification. Each rendered image undergoes a final human verification stage to filter out: (1) physically unrealistic scenarios, (2) images where candidate tools are not clearly visible, and (3) critically, images containing artificial cues such as unnatural highlighting or centralizing of target tools, which would allow models to bypass physical reasoning. After filtering, the final dataset contains 2,510 verified scenarios.

### 3.3 Dataset Statistics

PhysTool-Bench encompasses 2,510 distinct evaluation scenarios over a diverse pool of 2,678 unique physical tools, comprising 1,168 target (positive) tools and 1,519 tools that only appear as confounders. All tools are classified into 57 segments based on the United Nations Standard Products and Services Code (UNSPSC), spanning manufacturing, electrical work, healthcare, agriculture, and beyond. To evaluate resistance to visual distractors, each scenario presents a complex environment densely populated with candidate items, containing on average 8.62 tools (3.11 positive, 5.51 distractors). 86.9% of scenarios require a strict sequential execution order, while the remainder evaluate order-free combinations. Query instructions are concise (avg. 103 characters), whereas the synthesized image descriptions used for rendering are highly detailed (avg. 1,736 characters), ensuring physical realism and exact alignment with the candidate tool constraints.

## 4 Experimental Setup

Table 1: Quantitative results on the proposed benchmark across various MLLMs. Order-Agnostic reports Task I (visual recognition: identify every available tool in the image) with Precision, Recall, and F1. Order-Aware reports Task II (selection / planning) with Exact Match (EM), Task-Completable Rate (TCR), and Success Rate @$k$ (SR@$k$). The ± values on Task II cells are the Wilson 95% confidence half-widths over the scenario sample. “I” and “T” denote the Instruct and Thinking models. Best results are bolded. All values are percentages.

| Model | Task I Precision | Task I Recall | Task I F1-score | Task II Overall EM | Task II TCR | Task II SR@1 | Task II SR@2 | Task II SR@3 |
|---|---|---|---|---|---|---|---|---|
| GPT-4o [openai24gpt4o] | **65.15** | 55.08 | 58.54 | 5.62 ± 0.90 | 23.04 ± 1.65 | 38.53 ± 2.04 | 15.14 ± 1.50 | 3.99 ± 0.82 |
| Qwen3-VL-Plus [bai25qwen3vl] | 61.93 | **65.41** | **62.37** | 5.66 ± 0.91 | 20.81 ± 1.59 | 39.05 ± 2.05 | 16.52 ± 1.56 | 4.59 ± 0.88 |
| GPT-5.2 [openai26gpt52] | 63.76 | 59.86 | 60.26 | 10.66 ± 1.21 | 24.72 ± 1.69 | 47.59 ± 2.10 | 22.07 ± 1.74 | 6.80 ± 1.06 |
| Gemini-3.1-Pro [gemini25gemini] | 64.98 | 56.42 | 58.68 | **20.96 ± 1.59** | **32.12 ± 1.83** | **55.83 ± 2.08** | **33.35 ± 1.98** | **13.90 ± 1.45** |
| Deepseek-VL2 [wu24deepseekvl2] | 51.31 | 43.74 | 44.48 | 0.44 ± 0.27 | 12.48 ± 1.29 | 16.01 ± 1.54 | 4.50 ± 0.87 | 0.78 ± 0.38 |
| MiniCPM [yu25minicpmv] | 48.39 | 56.90 | 49.86 | 1.00 ± 0.40 | 15.23 ± 1.41 | 26.24 ± 1.85 | 6.93 ± 1.07 | 1.79 ± 0.56 |
| mPLUG-Owl3 [ye24mplugowl3] | 43.32 | 22.18 | 27.60 | 1.12 ± 0.42 | 11.56 ± 1.25 | 16.97 ± 1.58 | 3.99 ± 0.82 | 0.73 ± 0.37 |
| Qwen3-VL-32B-I [bai25qwen3vl] | 47.55 | 57.17 | 49.83 | 1.24 ± 0.44 | 19.97 ± 1.56 | 30.46 ± 1.93 | 11.56 ± 1.34 | 3.07 ± 0.73 |
| OpenFlamingo [awadalla23openflamingo] | 19.48 | 19.79 | 18.37 | 1.79 ± 0.52 | 3.59 ± 0.73 | 4.54 ± 0.88 | 0.69 ± 0.36 | 0.00 ± 0.09 |
| InternVL3.5-38B [wang25internvl35] | 50.87 | 42.41 | 44.70 | 2.51 ± 0.62 | 13.71 ± 1.35 | 27.02 ± 1.86 | 8.67 ± 1.18 | 1.70 ± 0.55 |
| Ovis 2.6 [lu24ovis] | 64.83 | 49.25 | 53.18 | 6.02 ± 0.93 | 15.46 ± 1.41 | 33.76 ± 1.98 | 12.57 ± 1.39 | 3.03 ± 0.72 |
| Kimi-VL-A3B-T [bai26kimik25] | 58.60 | 50.82 | 52.91 | 6.78 ± 0.98 | 14.39 ± 1.37 | 31.56 ± 1.95 | 11.47 ± 1.34 | 2.61 ± 0.67 |
| Qwen3-VL-32B-T [bai25qwen3vl] | 64.16 | 47.87 | 53.15 | 9.33 ± 1.14 | 18.17 ± 1.51 | 40.50 ± 2.06 | 16.79 ± 1.57 | 4.63 ± 0.89 |
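For readers who want to reproduce the confidence half-widths attached to the Task II cells, the Wilson score interval can be computed as in the sketch below. The formula is the standard one; the sample size plugged in is an assumption, since the source says only that the intervals are taken "over the scenario sample" and does not state the sample size per column.

```python
import math


def wilson_half_width(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for a binomial proportion."""
    denom = 1.0 + z * z / n
    spread = math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    return z * spread / denom


# Assumed sample size: all 2,510 queries (an assumption of this sketch).
print(round(100 * wilson_half_width(0.2096, 2510), 2))
# prints 1.59
```

Under that assumption the result agrees with the ±1.59 reported next to Gemini-3.1-Pro's 20.96% EM. Unlike the simpler normal-approximation interval, the Wilson half-width stays positive even when the observed rate is 0%, which is why a cell such as 0.00 can still carry a nonzero margin.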

To establish a comprehensive baseline for physical tool use, we rigorously evaluate a suite of state\-of\-the\-art Multimodal Large Language Models \(MLLMs\)\. This section details the selected models, the prompting strategies, and the specific metrics used to quantify performance across our two primary evaluation tasks\.

### 4.1 Implementation Details

We select a representative set of leading MLLMs, encompassing both proprietary (closed-source) and open-weight architectures. For proprietary models, we evaluate GPT-4o, GPT-5.2, Gemini-3.1-Pro, and Qwen3-VL-Plus [openai24gpt4o,openai26gpt52,gemini25gemini,bai25qwen3vl]. For open-weight models, we include Qwen3-VL, InternVL3.5, Kimi-VL, DeepSeek-VL2, mPLUG-Owl3, OpenFlamingo, MiniCPM, and Ovis 2.6 [bai25qwen3vl,wang25internvl35,bai26kimik25,wu24deepseekvl2,ye24mplugowl3,awadalla23openflamingo,yu25minicpmv,lu24ovis] to assess the capabilities of publicly available architectures.

All evaluations are conducted in a zero-shot setting to test the models’ inherent physical reasoning and generalization capabilities, without query-specific fine-tuning or few-shot demonstrations. To ensure consistent outputs, we use a standardized prompt template that instructs the models to first analyze the visual scene before outputting the required tool list or sequence. The exact prompt templates are provided in the Appendix.

### 4.2 Evaluation Metrics

We define quantitative metrics for the two tasks formulated in §[3.1](https://arxiv.org/html/2606.10803#S3.SS1). For Task I, we evaluate the predicted tool set $\hat{\mathcal{C}}$ against the ground truth $\mathcal{C}$ using standard Precision, Recall, and F1-score. For Task II, we evaluate both *which* tools are selected and *whether* they are arranged in the correct order.

Selection (Order-Agnostic). We apply Precision, Recall, and F1 to compare the predicted tool set against the ground-truth target set $\{t_j\}_{j=1}^{M}$, ignoring order. This isolates selection accuracy from sequential planning.
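The set-level metrics are standard and can be written down directly. The sketch below assumes that predicted and reference tool names have already been normalized so that exact string comparison is meaningful; the source does not describe how names are matched, and the function name and example tools are illustrative.

```python
from typing import Iterable, Tuple


def set_prf1(predicted: Iterable[str], reference: Iterable[str]) -> Tuple[float, float, float]:
    """Set-level precision, recall, and F1 between two tool collections."""
    pred, ref = set(predicted), set(reference)
    hits = len(pred & ref)  # tools present in both sets
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: two of three predictions are correct,
# and two of four reference tools are recovered.
p, r, f1 = set_prf1(
    ["handsaw", "plane", "chisel"],
    ["handsaw", "plane", "vacuum cleaner", "clamp"],
)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

For Task I the reference is the full visible set $\mathcal{C}$; for the order-agnostic Task II score it is the target set $\{t_j\}_{j=1}^{M}$.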

Exact Match (EM). EM is a strict criterion that requires a prediction to perfectly match the ground truth. A prediction scores 1 only if (i) its selected tools exactly match the target set $\{t_j\}$, with no missing or extra tools, and (ii) the tools appear in an order consistent with their step labels $s_j$, i.e., tools assigned to earlier steps precede those assigned to later steps. Tools sharing the same step may appear in any order. Any deviation yields a score of 0, and EM is reported as the average across all queries.

Task-Completable Rate (TCR). TCR relaxes EM by allowing additional tools beyond the ground truth. A prediction scores 1 if all target tools appear in a step-consistent order, even if extra unnecessary tools are included. TCR thus reflects whether an agent could still complete the task, while EM additionally requires a *minimal* plan.
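The relaxation from EM to TCR amounts to discarding extra tools before checking the order. The following sketch shows one reading of that definition; the function name, the dictionary form of the ground truth, and the decision to reject predictions that repeat a target tool are assumptions, since the source does not say how repeats are handled.

```python
from typing import Dict, Sequence


def is_task_completable(pred: Sequence[str], gt_steps: Dict[str, int]) -> bool:
    """TCR check: every target appears in step-consistent order; extras allowed."""
    # Drop non-target tools while keeping the predicted order of the targets.
    kept = [tool for tool in pred if tool in gt_steps]
    # Every target must appear exactly once (assumption: repeats are rejected).
    if len(kept) != len(gt_steps) or set(kept) != set(gt_steps):
        return False
    steps = [gt_steps[tool] for tool in kept]
    return all(a <= b for a, b in zip(steps, steps[1:]))


# Hypothetical ground truth with three ordered steps.
gt = {"handsaw": 1, "plane": 2, "vacuum cleaner": 3}
print(is_task_completable(["handsaw", "hammer", "plane", "vacuum cleaner"], gt))
print(is_task_completable(["plane", "handsaw", "vacuum cleaner"], gt))
```

The first prediction contains an unnecessary hammer, so it would fail EM but passes TCR; the second fails both because the order is violated.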

Success Rate @$k$ (SR@$k$). SR@$k$ ($k \in \{1, 2, 3\}$) measures EM restricted to the first $k$ tools in the predicted sequence. SR@$k$ captures how early in the sequence a model begins to fail and complements the all-or-nothing nature of EM.
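One way to make "EM restricted to the first $k$ tools" concrete is to require that the first $k$ predictions could open some valid plan. The sketch below implements that prefix reading; it is an interpretation rather than the authors' definition, and the function name, the `None` return for undefined cases, and the example tools are assumptions.

```python
from typing import Dict, Optional, Sequence


def success_at_k(pred: Sequence[str], gt_steps: Dict[str, int], k: int) -> Optional[bool]:
    """SR@k under a prefix reading: the first k predictions open a valid plan."""
    if k > len(gt_steps):
        return None  # Treated as undefined: fewer than k target tools.
    head = list(pred[:k])
    # The prefix must consist of k distinct target tools.
    if len(head) < k or len(set(head)) < k or any(t not in gt_steps for t in head):
        return False
    # Its step labels must equal the k smallest ground-truth labels, in order.
    head_steps = [gt_steps[t] for t in head]
    return head_steps == sorted(gt_steps.values())[:k]


# Hypothetical ground truth; plane and sandpaper share step 2.
gt = {"handsaw": 1, "plane": 2, "sandpaper": 2, "vacuum cleaner": 3}
pred = ["handsaw", "sandpaper", "drill"]
print([success_at_k(pred, gt, k) for k in (1, 2, 3)])
# prints [True, True, False]
```

Comparing against the sorted step labels guarantees that the accepted prefix can always be extended to a full step-consistent sequence, which is what makes SR@$k$ a progressively stricter view of the same plan.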

## 5 Results and Analysis

We empirically evaluate the suite of MLLMs listed in Table [1](https://arxiv.org/html/2606.10803#S4.T1) on our benchmark. Our analysis proceeds from overall performance (§[5.1](https://arxiv.org/html/2606.10803#S5.SS1)) to fine-grained breakdowns (§[5.2](https://arxiv.org/html/2606.10803#S5.SS2)), targeted probing studies (§[5.3](https://arxiv.org/html/2606.10803#S5.SS3)), validation on real-world images (§[5.4](https://arxiv.org/html/2606.10803#S5.SS4)), and a fine-grained error analysis (§[5.5](https://arxiv.org/html/2606.10803#S5.SS5)).

### 5.1 Main Results

Table [1](https://arxiv.org/html/2606.10803#S4.T1) reports overall performance across all evaluated MLLMs on both Task I (Physical Tool Recognition over all available tools in the scene) and Task II (Tool Selection and Action Planning conditioned on the task instruction). Three findings stand out.

##### Recognition is non\-trivial, even for SOTA models\.

When asked to enumerate every tool visible in a scene (Task I), no model exceeds 63% F1: the best score is Qwen3-VL-Plus at 62.37%, and 6 of the 13 models fall below 50%. Smaller open-weight models such as mPLUG-Owl3 (27.60% F1, 22.18% recall) and OpenFlamingo (18.37% F1, 19.79% recall) miss more than 70% of the tools present. Adding the task instruction (Task II) does not consistently help: only 4 of 13 models improve over their Task I F1, with the rest performing comparably or worse (full comparison in Appendix [B](https://arxiv.org/html/2606.10803#A2)). This is because Task II requires not only perceiving the tools, but also reasoning about their *functional relevance* to the instruction. Many MLLMs recognize tools in Task I yet fail to map them onto task semantics in Task II, pointing to a cognitive rather than purely perceptual bottleneck that we examine in §[5.3](https://arxiv.org/html/2606.10803#S5.SS3).

##### A large gap separates recognition from planning\.

Even with the task instruction available as additional context, the highest overall Exact Match (EM) on Task II is only 20.96% (Gemini-3.1-Pro). The order-aware metrics deteriorate even more sharply: the best Success Rate at $k=3$ is 13.90% (Gemini-3.1-Pro), and no model exceeds 56% even at $k=1$. This decoupling suggests that current MLLMs may *see* the right tools without being able to reason about which subset to use, in what order, and why.

##### Closed\-source models lead, but the gap is narrowing\.

The strongest proprietary models, GPT-5.2 and Gemini-3.1-Pro, outperform every open-source model on all order-aware metrics, with Gemini-3.1-Pro leading on each of them. Nevertheless, the strongest open-source reasoning models are closing in on older proprietary systems: Qwen3-VL-32B-Thinking (9.33% EM) exceeds GPT-4o (5.62% EM) on every order-aware metric except TCR, and Kimi-VL-A3B-Thinking (6.78% EM) exceeds it on EM, narrowing the gap on planning-style tasks.

### 5.2 Fine-grained Analysis

#### 5.2.1 Effect of Query Complexity

![Refer to caption](https://arxiv.org/html/2606.10803v1/x3.png)

Figure 3: Exact-match performance of Gemini-3.1-Pro on Task II across the number of target tools $k$. SR@$j$ requires the first $j$ predicted tools to match the ground-truth prefix; EM additionally forbids extra tools beyond the ground truth. SR@3 is undefined when $k=2$. While SR@1 stays at 54–57% across all complexities, EM collapses from 34.5% at $k=2$ to 0.5% at $k \geq 6$, exposing a sharp degradation in multi-step planning.

Figure [3](https://arxiv.org/html/2606.10803#S5.F3) presents Gemini-3.1-Pro’s Task II performance by the number of target tools. SR@1 remains nearly constant across complexities (54–57%), indicating that selecting the first appropriate tool is largely insensitive to query length. In sharp contrast, EM collapses from 34.5% at $k=2$ to 0.5% at $k \geq 6$, with SR@3 falling below 20% once the query requires four or more tools. The widening gap between SR@1 and EM thus directly quantifies the model’s failure to maintain a globally consistent execution plan: even when individual tools are correctly identified at the start, the probability of completing the full sequence drops steeply as complexity grows. This pattern indicates that the dominant source of difficulty is multi-step physical planning rather than single-step tool recognition.

#### 5\.2\.2 Performance Across UNSPSC Domains

We further disaggregate Task II EM across the seven broad UNSPSC domains \(see Appendix [A](https://arxiv.org/html/2606.10803#A1) for the full breakdown\)\. Models perform substantially better on *Healthcare* and *Office* scenarios, where procedures are well\-defined and tool sets are small, but degrade markedly on *Manufacturing* and *Electrical Work*, where ordering constraints are strict and confounding tools share both visual and functional similarities\. This pattern points to a systematic deficit in domain\-specific physical commonsense rather than a uniform recognition limitation\.

### 5\.3 Probing Studies

#### 5\.3\.1 Perception Ceiling

To localize the MLLM perception bottleneck, we evaluate state\-of\-the\-art open\-vocabulary object detectors on the same scenes\. Given the candidate tool list as a text prompt, Grounding DINO achieves a recall of 70\.53% — exceeding the best MLLM \(Gemini\-3\.1\-Pro at 57\.09% on Task I\) by 13\.44 percentage points\. This indicates that the visual evidence required for tool recognition is present in the images, and MLLM failures are not driven by raw perception but by the inability to enumerate visible tools or to ground them in the task instruction\.

#### 5\.3\.2 Human Reference

To contextualize model performance, one annotator from our research team completed a stratified sample of 100 queries, rating their own domain familiarity from 1 to 5 per query\. On items rated highly familiar \(confidence 5\), the annotator achieves 75% EM, 75% TCR, and 95% F1, indicating that the benchmark admits well\-defined answers aligned with informed human judgment\. Across all familiarity levels, the annotator reaches 38% EM, 49% TCR, and 80\.6% F1, still substantially exceeding the best MLLM \(Gemini\-3\.1\-Pro at 21\.0% EM\)\. The model deficits thus reflect capability limitations rather than task ambiguity\. We leave a multi\-annotator study to future work\.

### 5\.4 Real\-World Image Validation

A natural concern is whether our findings generalize beyond synthetically generated images\. To address this, we construct a real\-world image subset of 201 queries collected from web sources, manually matching the images to task instructions from the benchmark while preserving the original target labels\.

Based on the evaluation results, precision drops by 8\.95 percentage points on the real\-world subset, whereas EM remains nearly unchanged \(19\.9% on generated images vs\. 19\.4% on real\-world images\)\. The degradation in the order\-agnostic metrics is consistent with the lower image quality of in\-the\-wild photographs, which exhibit varied resolution, lighting conditions, and motion blur\. These results indicate that synthetic image generation provides a charitable testbed: the capability gap exposed by our benchmark is not an artifact of the synthetic distribution and would likely be more pronounced under real\-world deployment\.

### 5\.5 Error Analysis

![Refer to caption](https://arxiv.org/html/2606.10803v1/x4.png)

Figure 4: Task II failure decomposition across seven representative MLLMs, sorted by Exact Match \(EM\)\. Each prediction is assigned to one of five mutually exclusive outcomes\. The first two \(EM, Extra Only\) are task\-completable; the remaining three are task\-blocking\. The dashed line marks the expected Out\-of\-Order rate under random tool selection and ordering \(33\.5%\)\.

Figure [4](https://arxiv.org/html/2606.10803#S5.F4) decomposes each Task II prediction into five mutually exclusive outcomes, of which the first two \(Exact Match, Extra Only\) are task\-completable and the remaining three are task\-blocking\. To probe the underlying causes, we additionally annotate 100 failure cases from Gemini\-3\.1\-Pro\. Three observations stand out, with qualitative examples for each error category provided in Appendix [D](https://arxiv.org/html/2606.10803#A4)\.

##### Substitution dominates, with functional confusion as the primary driver\.

As shown in Figure [4](https://arxiv.org/html/2606.10803#S5.F4), Substitute, where at least one target tool is replaced by a distractor, is the largest failure mode for every model\. Our manual annotation reveals that the missing\-target component of these failures is rarely caused by perception: only 22% of missed tools are visually occluded or too small to recognize, while 41\.3% are *functional omissions*, that is, tools correctly identified in Task I but excluded from the Task II plan because the model fails to recognize their functional relevance to the instruction\. A further 36\.7% are tools clearly visible in the scene but not recognized in either task, reflecting a fine\-grained recognition gap rather than visual difficulty\. On the spurious\-selection side, 60% of incorrectly selected tools are distractors actually present in the scene, and 40% are hallucinated tools not visible at all\. Together, these results indicate that the bottleneck is task\-conditioned functional reasoning rather than raw perception\.

##### Ordering competence exists but is fragile\.

Out\-of\-Order \(OoO\) rates \(Figure [4](https://arxiv.org/html/2606.10803#S5.F4)\) sit well below the 33\.5% random baseline, indicating non\-trivial sequencing ability\. However, root\-cause analysis shows that 50% of OoO failures stem from misinterpreting the task instruction rather than from a generic ordering weakness, suggesting that improving instruction grounding may directly reduce ordering errors\.

##### Failure profiles diverge across model families\.

Thinking models \(Qwen3\-VL\-32B\-Thinking, Kimi\-VL\-A3B\-Thinking\) trade lower ordering errors for higher Substitute rates, while GPT\-4o and Qwen3\-VL\-Plus show the opposite pattern: high Out\-of\-Order rates with comparatively lower Substitute rates\. Explicit reasoning thus improves sequential planning yet leaves the functional\-disambiguation gap untouched\. These contrasts are invisible at the aggregate EM level but become apparent in the per\-category decomposition shown in Figure [4](https://arxiv.org/html/2606.10803#S5.F4)\.
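The five\-way decomposition used in this section can be illustrated with a small classifier\. The sketch below is one plausible reading of the outcome names, not the authors' procedure: it assumes tool names are already matched to ground\-truth strings, treats a prediction with both missing targets and extra tools as Substitute, and checks ordering only when no target is missing\. It also assumes the Task\-Completable Rate \(TCR\) is the share of Exact Match and Extra Only outcomes\. The function names are hypothetical\.

```python
from typing import List, Sequence

COMPLETABLE = {"Exact Match", "Extra Only"}


def classify_outcome(pred: Sequence[str], gold: Sequence[str]) -> str:
    """Assign one of five mutually exclusive outcomes to a Task II prediction."""
    pred_n = [t.strip().lower() for t in pred]
    gold_n = [t.strip().lower() for t in gold]
    missing = set(gold_n) - set(pred_n)   # targets the model left out
    extra = set(pred_n) - set(gold_n)     # tools the model added

    if missing and extra:
        return "Substitute"               # a target was swapped for another tool
    if missing:
        return "Missing Only"
    # All targets are present: compare their relative order with the ground truth.
    targets_as_predicted = [t for t in pred_n if t in set(gold_n)]
    if targets_as_predicted != gold_n:
        return "Out-of-Order"
    return "Extra Only" if extra else "Exact Match"


def task_completable_rate(outcomes: List[str]) -> float:
    """Assumed TCR: fraction of predictions that are EM or Extra Only."""
    return sum(o in COMPLETABLE for o in outcomes) / len(outcomes)


gold = ["sandpaper", "primer", "paint roller"]  # hypothetical ground truth
predictions = [
    ["sandpaper", "primer", "paint roller"],
    ["sandpaper", "primer", "paint roller", "drop cloth"],
    ["sandpaper", "paint roller"],
    ["sandpaper", "spray gun", "paint roller"],
    ["primer", "sandpaper", "paint roller"],
]
outcomes = [classify_outcome(p, gold) for p in predictions]
print(outcomes, task_completable_rate(outcomes))
```

Under these assumptions the five hypothetical predictions fall into the five outcomes in the order Exact Match, Extra Only, Missing Only, Substitute, Out\-of\-Order, giving a TCR of 0\.4\.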

## 6 Conclusion

We introduce PhysTool\-Bench, the first benchmark dedicated to evaluating physical tool use in MLLMs\. Across 13 leading models, we find a substantial gap between digital and physical tool use: even the strongest MLLMs complete only a small fraction of queries end\-to\-end, and most failures arise from substituting target tools with functionally similar alternatives that are visible in the scene\.

This bottleneck is not raw perception\. Specialized detectors and humans both substantially outperform current MLLMs, and recognition recall persists on real\-world images\. The deficit lies in the functional commonsense required to map perceived tools onto task semantics\. Closing this gap is unlikely to come from scaling visual encoders alone; we believe progress will require explicit grounding in multi\-step physical reasoning, particularly for the long tail of specialized domains where embodied AI is most likely to be deployed\.

## Limitations

##### Coverage of tool categories\.

While PhysTool\-Bench spans 57 UNSPSC segments and 2,678 distinct tools, certain specialized domains are underrepresented due to the difficulty of obtaining realistic visual references\. Expanding to these long\-tail domains is a natural direction for future iterations\.

##### Static visual contexts\.

PhysTool\-Bench evaluates tool use from a single static scene image, without modeling dynamic state changes \(e\.g\., the workpiece evolving as it is processed\) or interactive feedback \(e\.g\., the model querying additional viewpoints\)\. Extending to multi\-turn, interactive tool\-use evaluation is a promising direction for future work\.

## Ethical Considerations

##### Data sources and licensing\.

Synthetic images were generated by Nano Banana Pro in compliance with its terms of service\. The real\-world image subset \(§[5\.4](https://arxiv.org/html/2606.10803#S5.SS4)\) was collected from publicly available web sources under fair use for academic research\.

##### Model access\.

All evaluated proprietary models were accessed through their official APIs in accordance with each provider’s terms of use\.

##### Human reference\.

The human reference \(§[5\.3\.2](https://arxiv.org/html/2606.10803#S5.SS3.SSS2)\) and QC\-III visual verification were conducted by a research team member who consented to the task and was informed of the research purpose\. The task involves only assessing visual scenes of physical tools and contains no personally identifiable information or sensitive content\.

## References

![Refer to caption](https://arxiv.org/html/2606.10803v1/x5.png)

Figure 5: Per\-segment Task\-Completable Rate \(TCR, %\) across 28 UNSPSC functional segments, for six representative MLLMs\. Segments are sorted left\-to\-right by mean TCR across models \(descending\)\. All models exhibit a consistent decline from left to right, with the cross\-model ranking of segments highly correlated, indicating that category\-level difficulty is intrinsic to the task rather than model\-specific\.

Table 2: Comparison of Task I \(visual recognition: identify every tool visible in the image\) and Task II \(tool selection / planning\) on the same 2,510 scenarios\. Task I GT = shuffled\_available\_tools; Task II GT = task\-relevant target tools\. ΔF1 = Task II F1 − Task I F1; positive values indicate the model is stronger at closed\-set selection than open\-set recognition\. Task I columns report recognition \(%\) and Task II columns report selection \(%\)\. Best results per column are bolded\.

| Model | Task I Precision | Task I Recall | Task I F1 | Task II Precision | Task II Recall | Task II F1 | ΔF1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT\-4o \[openai24gpt4o\] | **65\.15** | 55\.08 | 58\.54 | 53\.56 | 63\.90 | 55\.86 | −2\.68 |
| Qwen3\-VL\-Plus | 61\.93 | **65\.41** | **62\.37** | 56\.78 | 61\.04 | 56\.26 | −6\.11 |
| GPT\-5\.2 | 63\.76 | 59\.86 | 60\.26 | 59\.44 | 64\.82 | 59\.50 | −0\.76 |
| Gemini\-3\.1\-Pro | 64\.98 | 56\.42 | 58\.68 | **71\.61** | **67\.87** | **67\.32** | **\+8\.64** |
| Deepseek\-VL2 | 51\.31 | 43\.74 | 44\.48 | 34\.86 | 51\.03 | 39\.34 | −5\.14 |
| MiniCPM | 48\.39 | 56\.90 | 49\.86 | 37\.96 | 54\.45 | 42\.25 | −7\.61 |
| mPLUG\-Owl3 | 43\.32 | 22\.18 | 27\.60 | 29\.28 | 50\.09 | 33\.35 | \+5\.75 |
| Qwen3\-VL\-32B\-Instruct | 47\.55 | 57\.17 | 49\.83 | 40\.81 | 63\.31 | 47\.32 | −2\.51 |
| OpenFlamingo | 19\.48 | 19\.79 | 18\.37 | 15\.37 | 11\.42 | 12\.05 | −6\.32 |
| InternVL3\.5\-38B | 50\.87 | 42\.41 | 44\.70 | 46\.89 | 49\.64 | 45\.99 | \+1\.29 |
| OVis 2\.6 | 64\.83 | 49\.25 | 53\.18 | 54\.66 | 50\.26 | 49\.64 | −3\.54 |
| Kimi\-VL\-A3B\-Thinking | 58\.60 | 50\.82 | 52\.91 | 54\.17 | 49\.80 | 49\.11 | −3\.80 |
| Qwen3\-VL\-32B\-Thinking | 64\.16 | 47\.87 | 53\.15 | 60\.86 | 53\.53 | 54\.39 | \+1\.24 |

## Appendix A Per\-Category Performance Analysis

To understand whether MLLMs exhibit uniform competence in physical tool use or whether their performance varies by tool category, we disaggregate the Task\-Completable Rate \(TCR\) across the 28 UNSPSC functional segments covered by PhysTool\-Bench\. Figure [5](https://arxiv.org/html/2606.10803#A0.F5) reports the TCR of six representative MLLMs \(Gemini\-3\.1\-Pro, GPT\-4o, GPT\-5\.2, Qwen3\-VL\-Plus, Qwen3\-VL\-32B\-Thinking, and Qwen3\-VL\-32B\-Instruct\), with segments sorted by mean score across models\.

##### Overall trend\.

Performance varies dramatically across categories, spanning from above 30% TCR on the easiest segments to near\-zero on the hardest\. Gemini\-3\.1\-Pro maintains the highest TCR on the leftmost segments \(e\.g\., 30\.8% on Farming Machinery, 27\.5% on Cleaning Equipment\) but collapses on the rightmost categories, falling to 4\.8–9\.1% on segments such as Industrial Cleaning Services, Sports & Recreation, and Electronic Components\. This pattern is consistent across all six evaluated models, with cross\-model correlation of segment rankings exceeding 0\.85: the same segments are easy or hard for every model\.
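A cross\-model correlation of segment rankings can be computed as a rank correlation between two models' per\-segment TCR vectors\. The text does not say which coefficient is used, so the sketch below assumes Spearman's rho with average ranks for ties\. The per\-segment values are made up for illustration and are not taken from Figure 5\.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical TCR (%) of two models on the same five segments.
model_a = [30.0, 25.0, 15.0, 8.0, 5.0]
model_b = [20.0, 22.0, 10.0, 6.0, 2.0]
print(round(spearman(model_a, model_b), 3))
```

Here the two hypothetical models disagree only on the order of the two easiest segments, which yields a rho of 0\.9\.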

##### What makes a category easy or hard?

The leftmost \(easiest\) segments, namely Farming Machinery, Cleaning Equipment, Construction Machinery, Vehicles, and Power Generation, share two properties: \(i\) tools are *visually distinctive* \(e\.g\., a tractor or a leaf blower is unlikely to be confused with another tool\), and \(ii\) the mapping from instruction to tool is largely *one\-to\-one* \(e\.g\., “mow the lawn” unambiguously implies a lawn mower\)\. Under these conditions, even moderate functional reasoning suffices to recover the correct tool set\.

In contrast, the rightmost \(hardest\) segments, namely Industrial Cleaning Services, Sports & Recreation, Apparel & Personal, and Electronic Components, exhibit *fine\-grained functional overlap* among candidates\. For instance, distinguishing between an arbor press, a hydraulic press, and a punch press in Industrial Mfg Services requires specialized domain knowledge that current MLLMs do not reliably possess\. Similarly, Electronic Components scenarios frequently demand multi\-step procedures involving visually similar instruments \(e\.g\., a multimeter, an oscilloscope probe, and a logic analyzer\), which our analysis in §[5\.5](https://arxiv.org/html/2606.10803#S5.SS5) identifies as a primary trigger of functional substitution errors\.

## Appendix B Task I vs\. Task II: Recognition Under Instruction Conditioning

To better understand whether task instructions help MLLMs identify relevant tools, we compare model performance on Task I \(visual recognition: identify every tool visible in the image\) and Task II \(selection: identify only the tools required by the instruction\) on the same 2,510 scenarios\. Both tasks are evaluated with set\-level Precision, Recall, and F1, computed against their respective ground\-truth sets: Task I against the full set of available tools in the scene, and Task II against the task\-relevant target tools\. Table [2](https://arxiv.org/html/2606.10803#A0.T2) reports the per\-model breakdown along with ΔF1 = Task II F1 − Task I F1, where positive values indicate that the model benefits from instruction conditioning\.
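The set\-level metrics and ΔF1 can be sketched for a single scenario as follows\. This is an illustrative sketch under stated assumptions: predicted names are assumed to be already matched to ground\-truth strings, comparison is case\-insensitive, and how per\-query scores are aggregated into the table values is not specified here\. The scene and predictions are hypothetical\.

```python
from typing import Iterable, Tuple


def set_prf(pred: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Order-agnostic precision, recall, and F1 between two sets of tool names."""
    p = {t.strip().lower() for t in pred}
    g = {t.strip().lower() for t in gold}
    tp = len(p & g)                              # correctly predicted tools
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Hypothetical scenario: four tools in the scene, two of them task-relevant.
available_tools = ["hammer", "chisel", "mallet", "saw"]   # Task I ground truth
target_tools = ["chisel", "mallet"]                        # Task II ground truth

task1_pred = ["hammer", "saw"]                 # model lists only half the scene
task2_pred = ["chisel", "mallet", "saw"]       # model selects targets plus one extra

_, _, f1_task1 = set_prf(task1_pred, available_tools)
_, _, f1_task2 = set_prf(task2_pred, target_tools)
delta_f1 = f1_task2 - f1_task1                 # positive: selection beats recognition
print(round(f1_task1, 4), round(f1_task2, 4), round(delta_f1, 4))
```

In this hypothetical scenario the model reaches a higher F1 on selection than on recognition, so ΔF1 is positive, which is the sense in which a model is said to benefit from instruction conditioning\.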

##### Aggregate trend\.

Across the 13 evaluated models, only 4 exhibit a positive ΔF1 in Table [2](https://arxiv.org/html/2606.10803#A0.T2) \(Gemini\-3\.1\-Pro, mPLUG\-Owl3, InternVL3\.5\-38B, and Qwen3\-VL\-32B\-Thinking\), while the remaining 9 perform worse on Task II\. The largest gains are observed on Gemini\-3\.1\-Pro \(\+8\.64 pp\) and mPLUG\-Owl3 \(\+5\.75 pp\), suggesting that these models are able to leverage the instruction as an effective attentional prior\. In contrast, most other models, including GPT\-4o \(−2\.68 pp\), GPT\-5\.2 \(−0\.76 pp\), Qwen3\-VL\-Plus \(−6\.11 pp\), and several open\-source models, show no improvement or a degradation\.

##### Why does instruction conditioning not uniformly help?

The result is initially counterintuitive: one might expect the instruction to narrow the model’s attention to a smaller, task\-relevant subset of tools and thereby simplify the problem\. However, Task II imposes an additional reasoning demand on top of perception\. The model must not only *see* the tools, but also judge their *functional relevance* to the instruction \(e\.g\., recognizing that epoxy resin, rather than duct tape, is the appropriate adhesive for repairing ceramic\)\. For models that lack robust physical commonsense, this additional reasoning step introduces errors that outweigh the benefit of a narrower target set: they may drop correct targets whose relevance is not obvious, or substitute them with functionally adjacent alternatives\. Stronger models such as Gemini\-3\.1\-Pro appear better able to exploit the instruction without incurring these costs, whereas others are essentially neutralized or pulled down by the added cognitive burden\.

##### Implications\.

This pattern reinforces our central diagnosis: the bottleneck in physical tool use is not raw visual perception, but the higher\-level reasoning required to ground perceived tools in task semantics\. We provide complementary evidence for this view through a perception\-ceiling experiment with open\-vocabulary detectors \(§[5\.3](https://arxiv.org/html/2606.10803#S5.SS3)\) and an error decomposition that identifies functional substitution as the dominant failure mode \(§[5\.5](https://arxiv.org/html/2606.10803#S5.SS5)\)\.

## Appendix C Prompt Templates

This appendix lists the prompt templates used throughout the dataset construction pipeline \(Sections[C\.1](https://arxiv.org/html/2606.10803#A3.SS1)–[C\.5](https://arxiv.org/html/2606.10803#A3.SS5)\) and the evaluation procedure \(Sections[C\.6](https://arxiv.org/html/2606.10803#A3.SS6)–[C\.8](https://arxiv.org/html/2606.10803#A3.SS8)\)\. For brevity, system messages and minor formatting tokens are omitted; full versions are released with the dataset\.

### C\.1 Target Tool Combination Generation

We further expand the tool bank by prompting Gemini\-3\.1\-Pro to propose common combinations of 2 or 3 tools that are typically used together to complete a task\. Notably, the task instructions generated here serve only as a guide to ensure that each tool combination is feasible and commonly recognized in real\-world practice, which avoids random combinations\. These task instructions are discarded once the tool combinations have been obtained\.

Prompt: Tool Combination Generation\[You are an expert in tool selection and tool usage across diverse real\-world domains\. I have attached a set of tools\. Your goal is to propose 100 distinct combinations of exactly 2 tools from this set\. For each combination, design a specific, realistic target task that requires the usage of all tools to be successfully completed\. For each combination, output a single JSON object containing exactly the following two fields: \(1\) task\_instruct: A clear task instruction written in English\. The task must require the use of all the 2 target tools to be completed\. Do NOT mention or imply any specific tools, including any target tools listed in tools\_target in \(2\)\. \(2\) tools\_target: 2 required tools needed to complete the task\. The tools must be exactly from the attached tool list\. If the tools are used in a specific order, list them in the correct operational sequence\.\]

### C\.2 Task Instruction Generation

The task instruction is generated by prompting GPT\-4o with the pre\-determined initial target tools \(in the form of a single tool or a tool combination\)\.

Prompt: Task Instruction Generation\["task\_instruct": A clear task instruction in English\. The task must require ALL the target tools \[tools\_list\] to be completed\. Do NOT mention or imply any specific tool or contain part of the tool name word\.\]

### C\.3 Distractor Selection

The distractors are selected by prompting GPT\-4o with the target tool\(s\), in the same call that generates the task instruction and image description\. For each set of initial target tools, two distinct task scenarios are constructed, and the two distractor counts, one per scenario, are chosen at random between 3 and 10\.

Prompt: Distractor Selection\["tools\_negative": A list of tools that are NOT required for this task\. \- Scenario 1 must have exactly neg\_counts\[0\] items\. \- Scenario 2 must have exactly neg\_counts\[1\] items\. These tools should be confusing or misleading \- they might: \- Look similar to the target tools \- Have similar functions to the target tools \- Be used on similar objects but be wrong choices \- Be commonly associated with the same work domain Make these negative tools realistic distractors\.\]
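The distractor counts that fill `neg_counts[0]` and `neg_counts[1]` in the prompt could be drawn as in the sketch below\. It assumes the two counts are sampled independently and uniformly with both bounds inclusive, which the text does not state explicitly\. The helper names and the example scenarios are hypothetical\.

```python
import random
from typing import Dict, List


def sample_neg_counts(rng: random.Random, low: int = 3, high: int = 10) -> List[int]:
    """Draw one distractor count per scenario (assumed independent, inclusive bounds)."""
    return [rng.randint(low, high) for _ in range(2)]


def scenarios_respect_counts(scenarios: List[Dict], neg_counts: List[int]) -> bool:
    """Check that each generated scenario lists exactly the requested number of distractors."""
    return len(scenarios) == len(neg_counts) and all(
        len(s["tools_negative"]) == n for s, n in zip(scenarios, neg_counts)
    )


rng = random.Random(0)            # fixed seed so the sketch is reproducible
neg_counts = sample_neg_counts(rng)

# Hypothetical generator output with placeholder distractor names.
scenarios = [
    {"tools_negative": [f"distractor_{i}" for i in range(neg_counts[0])]},
    {"tools_negative": [f"distractor_{i}" for i in range(neg_counts[1])]},
]
print(neg_counts, scenarios_respect_counts(scenarios, neg_counts))
```

A check of this kind is useful because the prompt asks for an exact item count per scenario, and a generator response that misses the count can be rejected before it enters the dataset\.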

### C\.4 Image Description Generation

The image description is generated simultaneously with the task instruction and distractors by prompting GPT\-4o\. Before a task scenario \(comprising the target tool\(s\), task instruction, distractors, and image description\) is saved, we check that every tool is clearly mentioned in the corresponding image description\.

Prompt: Image Description for nano\-banana\-pro\["img\_desc": A detailed English description of a single image depicting the scenario\. The image must: \- Clearly imply the task to be completed \- Show ALL tools from both tools\_target and tools\_negative \- Make the correct target tools look randomly placed and partially hidden; they should NOT be highlighted, should not be placed conspicuously, and should not appear ready to complete the task\. \- When describing technical or professional workspaces, ensure that tools adhere to their mechanical function\. \- Include specific details about environment, lighting, angles, tool placement, and scene context \- Be detailed enough to generate a realistic, plausible image\. \]

### C\.5 QC\-I: Target Tool Verification

In QC\-I, we refine the target tools and determine the chronological step order more rigorously by prompting Gemini\-3\.1\-Pro with the ’Task Instruction’, the ’Current Scene’ \(the first paragraph of the image description\), and the ’Available Tools’ \(the combined set of initial target tools and distractors\)\.

Prompt: QC\-I Target/Distractor Audit\[You are an expert AI agent orchestrator evaluating tool selection capabilities across diverse professional domains\. I will provide a ’Task Instruction’, a ’Current Scene’ description, and ’Available Tools’\. Your objective is to identify the ABSOLUTE MINIMAL REQUIRED SET of professional tools and sequence them based on scene progress\.THE THREE LAWS OF TOOL ORCHESTRATION: 1\. THE UNIFIED VIABILITY TEST: A tool is strictly REQUIRED only if its removal causes the task to physically fail, violate safety, or violate professional industry standards\. \- Implicit Constraints: You must consider implicit constraints\. \(e\.g\., studying animals "without disturbing habitat" standardly requires an unattended tool like a ’Wildlife Camera Trap’ to avoid human presence, making it professional necessity\)\. \- Technical Standards: You must prioritize professional\-grade methods over amateur workarounds \(e\.g\., prefer ’Heat Gun’ over ’Electrical Tape’ for professional automotive wiring\)\. \- Nice\-to\-Haves: Reject any tool that merely provides convenience but isn’t required for success \(e\.g\., GPS, Tripods\)\. 2\. SHARP REDUNDANCY ELIMINATION: If multiple tools overlap in fulfilling the requirement of Law 1 \(e\.g\., Telescope vs\. Field Binoculars for mobile observation\), you MUST select ONLY the single most contextually appropriate tool and move all alternatives and their specific accessories to ’negative\_tools’\. 3\. TASK LIFECYCLE TRACKING: Evaluate ‘<img\_desc\>‘ to determine what has already been completed\. \- REJECT tools meant ONLY for phases already finished in the image\. \- RETAIN and DELAY tools needed for remaining phases, final reassembly, or closing up to the LAST steps of the sequence\.Rules for Output \(Strictly Follow JSON Schema\): 1\. ’tool\_analysis’: Step\-by\-step evaluation of EVERY available tool\. 
\- ’viability\_and\_standard\_justification’: Explain why this tool is a professional and physical necessity based on Law 1 and 2\. Write ’Failed’ if it is non\-essential, amateurish, or redundant\. \- ’status’: "Target" OR "Negative" \(state exact reason: Non\-Essential / Substandard / Redundant / Already Completed\)\. \- ’sequence\_logic’: Timing rationale based on scene progress, or ’None’\. 2\. ’target\_tools’: List of selected tools\. 3\. ’target\_steps’: Integers representing the execution order \(starting at 1, continuous, same number for parallel tools\)\. 4\. ’negative\_tools’: List of rejected tools\. \]
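The output rules in the QC\-I prompt lend themselves to an automatic sanity check\. The sketch below is an assumption about how such a check might look, not part of the released pipeline: it assumes `target_steps` is index\-aligned with `target_tools`, and that every available tool must end up in either `target_tools` or `negative_tools`\. The function names and the example response are hypothetical\.

```python
import json
from typing import Any, Dict, List


def validate_qc1(raw: str, available_tools: List[str]) -> Dict[str, Any]:
    """Parse a QC-I response and check the constraints stated in the prompt."""
    data = json.loads(raw)
    targets = data["target_tools"]
    steps = data["target_steps"]
    negatives = data["negative_tools"]

    # Assumption: target_steps is index-aligned with target_tools.
    if len(targets) != len(steps):
        raise ValueError("target_tools and target_steps differ in length")
    if not all(isinstance(s, int) for s in steps):
        raise ValueError("target_steps must be integers")
    # Steps start at 1 and are continuous; parallel tools share a number.
    if steps and sorted(set(steps)) != list(range(1, max(steps) + 1)):
        raise ValueError("target_steps must be continuous and start at 1")
    # Every available tool is labelled exactly once as Target or Negative.
    if set(targets) & set(negatives):
        raise ValueError("a tool is listed as both Target and Negative")
    if set(targets) | set(negatives) != set(available_tools):
        raise ValueError("every available tool must be classified")
    return data


def plan_by_step(targets: List[str], steps: List[int]) -> List[List[str]]:
    """Group tools by step number so that parallel tools share one stage."""
    plan: Dict[int, List[str]] = {}
    for tool, step in zip(targets, steps):
        plan.setdefault(step, []).append(tool)
    return [plan[s] for s in sorted(plan)]


available = ["Wire Stripper", "Crimping Tool", "Heat Gun", "Electrical Tape"]
raw = json.dumps({
    "target_tools": ["Wire Stripper", "Crimping Tool", "Heat Gun"],
    "target_steps": [1, 2, 2],
    "negative_tools": ["Electrical Tape"],
})
data = validate_qc1(raw, available)
print(plan_by_step(data["target_tools"], data["target_steps"]))
```

Grouping by step number keeps tools that share a step in the same stage, which matches the prompt's rule that parallel tools receive the same integer\.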

### C\.6 Evaluation Prompt: Task I \(Tool Recognition\)

We test each MLLM’s ability to recognize all available tools in the scene with the following prompt\.

Prompt: Task I — Tool Recognition\[List all tools in this image\. Please provide only the names of the tools, separated by commas\. Do not include any explanations or extra text\.\]

### C\.7 Evaluation Prompt: Task II \(Tool Selection and Planning\)

We further evaluate each MLLM’s ability to solve the task in the provided scene with the following prompt\.

Prompt: Task II — Tool Selection and Action Planning\[Given the following TASK, which tool\(s\) in the image are most appropriate to complete the task? Please list the name\(s\) of the selected tools in the order they should be used and separate them by commas\. No explanation needed\. TASK: task\_instruct\. SELECTED TOOL\(S\) \(in order of use\):\]

### C\.8 LLM\-as\-Judge Prompt

To match model predictions against the ground truth, we employ a hybrid pipeline combining case\-insensitive string matching with the following LLM judge\.

Prompt: LLM\-as\-Judge\[You are an expert evaluator\. I have a list of ’Identified Tools’ predicted by a model\. Your task is to map each ’Identified Tool’ to the correct ’Target Tool’ name \(if applicable\) for the provided task, while ensuring it does not refer to any ’Negative Tools’ \(distractors\)\.Rules: 1\. Only match if the Identified Tool is clearly the same tool as a Target Tool\. 2\. If the Identified Tool is ambiguous and could potentially refer to a Negative Tool, DO NOT match it\. 3\. Use the exact string from the Target Tools list for the value in your mapping\. 4\. Return ONLY a valid JSON object where keys are the Identified Tool strings and values are the corresponding Target Tool strings\. 5\. DO NOT map multiple Identified Tools to the same Target Tool – each target can appear at most once\.\]
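One way to combine the two matching stages is sketched below\. The exact interplay between string matching and the judge is not described here, so this sketch assumes a string\-matching first pass followed by the judge on the leftovers, and it post\-checks the judge's JSON mapping against rules 3 and 5 of the prompt\. No model is called: `judge_map` stands in for a parsed judge response, and all names are hypothetical\.

```python
from typing import Dict, List, Tuple


def string_match(pred: List[str], targets: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """First pass: case-insensitive exact matching of predictions to target names."""
    lookup = {t.strip().lower(): t for t in targets}
    mapping: Dict[str, str] = {}
    unmatched: List[str] = []
    for name in pred:
        target = lookup.get(name.strip().lower())
        if target is not None and target not in mapping.values():
            mapping[name] = target
        else:
            unmatched.append(name)
    return mapping, unmatched


def filter_judge_map(
    judge_map: Dict[str, str],
    unmatched: List[str],
    targets: List[str],
    already_matched: Dict[str, str],
) -> Dict[str, str]:
    """Second pass: keep only judge entries that respect rules 3 and 5."""
    used = set(already_matched.values())
    clean: Dict[str, str] = {}
    for pred_name, target_name in judge_map.items():
        if pred_name not in unmatched:
            continue                      # the judge only decides leftover predictions
        if target_name not in targets:
            continue                      # rule 3: value must be an exact target string
        if target_name in used:
            continue                      # rule 5: each target appears at most once
        clean[pred_name] = target_name
        used.add(target_name)
    return clean


targets = ["Torque Wrench", "Socket Set"]
pred = ["torque wrench", "sockets", "ratchet"]

exact, leftover = string_match(pred, targets)
judge_map = {"sockets": "Socket Set", "ratchet": "Socket Set"}  # hypothetical judge output
final = {**exact, **filter_judge_map(judge_map, leftover, targets, exact)}
print(final)
```

In this example the first prediction is resolved by string matching, the second by the judge, and the third is dropped because its proposed target is already taken\.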

## Appendix D Qualitative Examples

### D\.1 Successful Cases

Figure [6](https://arxiv.org/html/2606.10803#A4.F6) shows three queries solved correctly by the strongest model, illustrating the capabilities currently within reach\.

### D\.2 Failure Cases by Error Type

Figures [7](https://arxiv.org/html/2606.10803#A4.F7)–[10](https://arxiv.org/html/2606.10803#A4.F10) present representative failure cases for each of the four error categories identified in §[5\.5](https://arxiv.org/html/2606.10803#S5.SS5)\.

![Refer to caption](https://arxiv.org/html/2606.10803v1/x6.png)

Figure 6: The example and analysis of the "Exact Match" case\.

![Refer to caption](https://arxiv.org/html/2606.10803v1/x7.png)

Figure 7: The example and analysis of the "Extra Only" error\.

![Refer to caption](https://arxiv.org/html/2606.10803v1/x8.png)

Figure 8: The example and analysis of the "Missing Only" error\.

![Refer to caption](https://arxiv.org/html/2606.10803v1/x9.png)

Figure 9: The example and analysis of the "Substitute" error\.

![Refer to caption](https://arxiv.org/html/2606.10803v1/x10.png)

Figure 10: The example and analysis of the "Out\-of\-Order" error\.