TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

Summary

TOBench is a new benchmark for evaluating AI agents on real-world, task-oriented tool use with multimodal inputs and closed-loop verification. Experiments show top models like Qwen 3.5 Plus achieve only 41% success, far below the 94% human benchmark, highlighting a significant gap.

arXiv:2605.16909v1 Announce Type: new Abstract: Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or transformed artifacts, and self-correct when outputs fail task-specific requirements. To make such evaluation scalable and verifiable, MM-ToolBench couples MCP-based execution with task-specific grounded evaluators and a semi-automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that MM-ToolBench remains highly challenging: Claude Opus 4.6, commonly regarded as one of the strongest coding-agent models, achieves only 32.0% task success, far below the 94.0% human benchmark. We envision MM-ToolBench as a practical foundation for evaluating and advancing next-generation omni-modal tool-using agents through closed-loop multimodal verification.

Cached at: 05/19/26, 06:37 AM

# TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
Source: [https://arxiv.org/html/2605.16909](https://arxiv.org/html/2605.16909)
Zhiqiang Liu¹,², Wenhui Dong¹\*, Yilang Tan³, Yuwen Qu¹, Haochen Yin⁴, Chenyang Si¹

¹ Nanjing University, ² Huazhong University of Science and Technology, ³ Southwest Jiaotong University, ⁴ The Chinese University of Hong Kong

[https://github.com/Pi3AI/TOBench](https://github.com/Pi3AI/TOBench)

###### Abstract

Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce TOBench, a benchmark and evaluation harness for *task-oriented omni-modal tool use*. TOBench contains 100 executable tasks from two macro task families, *Customer Service* and *Intelligent Creation*, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of TOBench is *closed-loop multimodal verification*: agents must execute tools, inspect rendered or transformed artifacts, and self-correct when outputs fail task-specific requirements. To make such evaluation scalable and verifiable, TOBench couples MCP-based execution with task-specific grounded evaluators and a semi-automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that TOBench remains highly challenging: Claude Opus 4.6, commonly regarded as one of the strongest coding-agent models, achieves only 32.0% task success, far below the 94.0% human benchmark. We envision TOBench as a practical foundation for evaluating and advancing next-generation omni-modal tool-using agents through closed-loop multimodal verification.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.16909v1/fig1_performance.png)

Figure 1: Performance of leading LLMs on TOBench. Qwen 3.5 Plus achieves the highest model score of 41.0%, while most other models fall below 30.0%. The human benchmark is 94.0%, highlighting a significant performance gap and indicating substantial room for improvement in real-world multimodal tool-invocation tasks.

Tool-using language agents, powered by foundation models such as GPT-4 \[[1](https://arxiv.org/html/2605.16909#bib.bib58)\] and GPT-4o \[[10](https://arxiv.org/html/2605.16909#bib.bib59)\], are moving from isolated function invocation toward practical interaction with web services, office software, knowledge sources, and external applications. The Model Context Protocol (MCP) further accelerates this shift by providing a standard interface for connecting agents to diverse tools and services. As a result, recent benchmarks have made substantial progress in evaluating API use, function calling, planning, and MCP-based tool interaction, including $\tau$-bench \[[30](https://arxiv.org/html/2605.16909#bib.bib4)\], ToolBench \[[22](https://arxiv.org/html/2605.16909#bib.bib2)\], BFCL \[[20](https://arxiv.org/html/2605.16909#bib.bib11)\], ToolTalk \[[5](https://arxiv.org/html/2605.16909#bib.bib3)\], Toolathlon \[[14](https://arxiv.org/html/2605.16909#bib.bib16)\], MCP-RADAR \[[6](https://arxiv.org/html/2605.16909#bib.bib20)\], MCP-Bench \[[28](https://arxiv.org/html/2605.16909#bib.bib23)\], and MCP-Universe \[[18](https://arxiv.org/html/2605.16909#bib.bib22)\].

Despite this progress, existing benchmarks still leave a critical gap for real\-world professional workflows\. Many practical tasks are not purely textual or purely API\-based: an agent may need to read screenshots or documents, extract information from audio or video, edit a spreadsheet or presentation, render the output, inspect whether the result satisfies visual and semantic constraints, and then revise the artifact if necessary\. This diversity gap goes beyond adding more tool names or longer tool lists\. The difficulty lies in coordinating tool execution with multimodal perception, artifact transformation, and iterative verification over changing workspace states\.

Multimodal and computer-use benchmarks such as OSWorld \[[29](https://arxiv.org/html/2605.16909#bib.bib8)\], VitaBench \[[8](https://arxiv.org/html/2605.16909#bib.bib19)\], M3-Bench \[[35](https://arxiv.org/html/2605.16909#bib.bib25)\], and OmniGAIA \[[15](https://arxiv.org/html/2605.16909#bib.bib28)\] broaden evaluation beyond text, but multimodal perception and tool use are still often evaluated as separate capabilities. Tool-use benchmarks typically emphasize schema fidelity, tool selection, or final-state checking, while multimodal benchmarks often focus on perception, GUI control, or final-answer quality. Realistic omni-modal workflows require all of these capabilities simultaneously: agents must perceive heterogeneous inputs, act through executable tools, inspect intermediate artifacts, and self-correct under task-specific constraints.

To address this gap, we introduce TOBench, a benchmark and evaluation harness for *task-oriented omni-modal tool use*. TOBench contains 100 executable tasks across two macro task families, *Customer Service* and *Intelligent Creation*, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. Tasks are designed around realistic user needs and professional roles rather than synthetic tool combinations, and many of them require cross-tool composition over documents, images, audio, video, spreadsheets, slides, search, browser automation, and file operations.

The central design of TOBench is *closed-loop multimodal verification*. Instead of treating tool use as a one-shot action sequence followed by final-answer matching, TOBench requires agents to execute tools, inspect rendered or transformed artifacts, and revise their behavior when the artifact does not satisfy the task. Each task is paired with a grounded verifier that combines code-based checks, tool-call constraints, format constraints, and multimodal artifact inspection. This makes the benchmark an executable harness for evaluating the full perceive–act–inspect–revise loop.

Experiments on 15 contemporary agentic models show that TOBench is far from saturated. As shown in Figure [1](https://arxiv.org/html/2605.16909#S1.F1), the strongest evaluated model, Qwen3.5-Plus, achieves only 41.0% task success, while the human benchmark reaches 94.0%. Our failure analysis shows that errors concentrate in unreliable tool execution, incorrect tool parameters, multimodal reasoning failures, and missing self-verification before stopping. These results suggest that closed-loop multimodal verification is an indispensable evaluation primitive for next-generation omni-modal tool-using agents.

## 2 Related Work

Table 1: Comparison of TOBench with representative related benchmarks across scale and capability dimensions. Symbols: ✓ = supported, ✗ = not supported. Detailed explanations for each trait are provided in [Appendix B.1](https://arxiv.org/html/2605.16909#A2.SS1).

### 2.1 Tool-Use, Long-Horizon, MCP Benchmarks

Foundational work on tool-augmented LLMs established external tool use as a core capability \[[25](https://arxiv.org/html/2605.16909#bib.bib30),[31](https://arxiv.org/html/2605.16909#bib.bib29),[21](https://arxiv.org/html/2605.16909#bib.bib32),[26](https://arxiv.org/html/2605.16909#bib.bib31)\]. Subsequent agent frameworks and benchmarks expanded evaluation toward multi-step execution, planning, and reproducibility, including ToolBench, BFCL, ToolTalk, Toolathlon, $\tau$-bench, GAIA, $\tau^{2}$-Bench, and related suites \[[22](https://arxiv.org/html/2605.16909#bib.bib2),[20](https://arxiv.org/html/2605.16909#bib.bib11),[5](https://arxiv.org/html/2605.16909#bib.bib3),[14](https://arxiv.org/html/2605.16909#bib.bib16),[30](https://arxiv.org/html/2605.16909#bib.bib4),[19](https://arxiv.org/html/2605.16909#bib.bib14),[3](https://arxiv.org/html/2605.16909#bib.bib15)\]. Recent MCP-oriented benchmarks such as MCP-RADAR, MCPToolBench++, MCP-Universe, MCP-Bench, and OSWorld-MCP \[[6](https://arxiv.org/html/2605.16909#bib.bib20),[4](https://arxiv.org/html/2605.16909#bib.bib21),[18](https://arxiv.org/html/2605.16909#bib.bib22),[28](https://arxiv.org/html/2605.16909#bib.bib23),[11](https://arxiv.org/html/2605.16909#bib.bib24)\] further emphasize live tool ecosystems. These works reveal key challenges in tool selection, schema fidelity, and long-horizon execution, but most remain primarily textual and do not explicitly evaluate inspection-and-revision loops over multimodal artifacts.

### 2.2 Multimodal and Computer-Use Agent Benchmarks

OSWorld, AndroidWorld, VisualWebArena, VitaBench, $\tau$-Voice, MMDR-Bench, VisualAgentBench, ProSoftArena, M3-Bench, Tool-LMM, UniVA, and OmniGAIA broaden evaluation toward GUI grounding and multimodal interaction \[[29](https://arxiv.org/html/2605.16909#bib.bib8),[23](https://arxiv.org/html/2605.16909#bib.bib50),[13](https://arxiv.org/html/2605.16909#bib.bib35),[8](https://arxiv.org/html/2605.16909#bib.bib19),[24](https://arxiv.org/html/2605.16909#bib.bib64),[9](https://arxiv.org/html/2605.16909#bib.bib69),[17](https://arxiv.org/html/2605.16909#bib.bib63),[2](https://arxiv.org/html/2605.16909#bib.bib74),[35](https://arxiv.org/html/2605.16909#bib.bib25),[27](https://arxiv.org/html/2605.16909#bib.bib10),[16](https://arxiv.org/html/2605.16909#bib.bib26),[15](https://arxiv.org/html/2605.16909#bib.bib28),[33](https://arxiv.org/html/2605.16909#bib.bib75),[7](https://arxiv.org/html/2605.16909#bib.bib76),[12](https://arxiv.org/html/2605.16909#bib.bib77),[36](https://arxiv.org/html/2605.16909#bib.bib78),[32](https://arxiv.org/html/2605.16909#bib.bib79)\]. TOBench is closest to this line, but differs in three ways: it targets realistic professional task completion, uses a unified MCP-based tool ecosystem, and centers evaluation on iterative artifact inspection with task-specific grounded verifiers. Table [1](https://arxiv.org/html/2605.16909#S2.T1) summarizes this comparison from the perspective of benchmark scale, ecosystem assumptions, and multimodal execution requirements.

## 3 TOBench

TOBench evaluates whether an agent can complete realistic omni\-modal tasks with executable tools\. Each task instance specifies the user instruction, task assets, available tool environment, and grounded verifier used to determine success\. Together, these components define a professional role, multimodal inputs, an executable tool ecosystem, and a task\-specific verification path\.

### 3.1 Task Formalism

![Refer to caption](https://arxiv.org/html/2605.16909v1/x1.png)

Figure 2: Overview of the TOBench construction pipeline. TOBench organizes tasks into two macro families, Customer Service and Intelligent Creation, instantiates them with user instructions, multimodal inputs, executable MCP tools, and domain rules, and validates each task through a generate-inspect-revise loop grounded in actual tool execution.

We formalize each TOBench instance as an executable harness

$$h = (\mathcal{I}, \mathcal{E}, \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{C}, \mathcal{V}), \tag{1}$$

where $\mathcal{I}$ denotes the task instruction package, $\mathcal{E}$ the executable MCP environment, $\mathcal{S}$ the latent execution state, $\mathcal{A}$ the action space, $\mathcal{O}$ the observation space, $\mathcal{T}$ the transition dynamics, $\mathcal{C}$ the approved evaluation criteria, and $\mathcal{V}$ the grounded verifier.

The instruction package is

$$\mathcal{I} = (u, r, d, x), \tag{2}$$

where $u$ is the user request, $r$ is the professional role assigned to the agent, $d$ denotes concise domain rules that the agent is required to follow, and $x$ collects multimodal input assets. Unlike static QA benchmarks, the environment $\mathcal{E}$ includes both callable tools and mutable artifacts in the workspace.

At turn $t$, the latent state is decomposed as

$$s_t = \bigl(s_t^{\mathrm{tool}}, s_t^{\mathrm{ws}}, s_t^{\mathrm{ext}}, h_t\bigr) \in \mathcal{S}, \tag{3}$$

where $s_t^{\mathrm{tool}}$ captures tool-side runtime state, $s_t^{\mathrm{ws}}$ the current workspace artifacts, $s_t^{\mathrm{ext}}$ any external world state exposed through tools, and $h_t$ the interaction history. This decomposition is important for TOBench because many tasks require modifying files, rendering intermediate artifacts, and grounding against time-sensitive information.

The agent action space contains both tool use and natural\-language interaction:

$$\mathcal{A} = \mathcal{A}^{\mathrm{tool}} \cup \mathcal{A}^{\mathrm{text}}, \qquad a_t^{\mathrm{tool}} = (m_t, \theta_t), \tag{4}$$

where $m_t$ is an available MCP tool and $\theta_t$ denotes its arguments. Observations likewise mix tool outputs, rendered artifacts, and textual feedback:

$$\mathcal{O} = \mathcal{O}^{\mathrm{tool}} \cup \mathcal{O}^{\mathrm{render}} \cup \mathcal{O}^{\mathrm{text}}. \tag{5}$$

The execution dynamics are governed by

$$(s_{t+1}, o_{t+1}) \sim \mathcal{T}(s_t, a_t), \tag{6}$$

so a tool call may update files or external state and then return structured outputs, while a rendering or inspection action exposes multimodal evidence that can trigger a corrective follow-up step.

This leads to a trajectory

$$\tau = (\mathcal{I}, s_0, a_1, o_1, \ldots, a_T, o_T), \tag{7}$$

which makes explicit that TOBench evaluates the full perceive–act–inspect–revise loop rather than only the final answer string. In particular, many creation tasks require a closed-loop pattern in which an agent first produces an artifact, then obtains $o_t^{\mathrm{render}} \in \mathcal{O}^{\mathrm{render}}$ by rendering or inspecting it, and only then decides whether revision is needed.
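The produce, render, inspect, revise pattern described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the benchmark's agent scaffold: `run_closed_loop`, `Step`, `Trajectory`, and the four callables are hypothetical names, and the "render" step is a toy stand-in (string length) for a real rendered-artifact observation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Step:
    action: str        # a_t: label of the action taken (hypothetical encoding)
    observation: Any   # o_t: tool output, rendered view, or text feedback


@dataclass
class Trajectory:
    instruction: str
    steps: list = field(default_factory=list)


def run_closed_loop(instruction: str,
                    produce: Callable[[str], Any],
                    render: Callable[[Any], Any],
                    acceptable: Callable[[Any], bool],
                    revise: Callable[[Any, Any], Any],
                    max_turns: int = 100):
    """Produce an artifact, then alternate render/inspect and revise steps."""
    traj = Trajectory(instruction)
    artifact = produce(instruction)
    traj.steps.append(Step("produce", artifact))
    while len(traj.steps) < max_turns:
        view = render(artifact)                 # obtain a rendered observation
        traj.steps.append(Step("render", view))
        if acceptable(view) or len(traj.steps) >= max_turns:
            break                               # stop only after inspecting
        artifact = revise(artifact, view)       # corrective follow-up step
        traj.steps.append(Step("revise", artifact))
    return artifact, traj


# Toy usage: a slide title that "overflows" until it is shortened.
final, traj = run_closed_loop(
    "Make a title slide",
    produce=lambda instr: "A very long slide title that overflows",
    render=lambda art: len(art),                # stand-in for measured width
    acceptable=lambda width: width <= 20,
    revise=lambda art, width: art[:20],
)
```

An open-loop agent would return right after `produce`; the loop differs only in that it refuses to stop before it has looked at a rendered observation.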

![Refer to caption](https://arxiv.org/html/2605.16909v1/fig3_statis.png)

Figure 3: Distribution of tools, task difficulty, and task scenarios in TOBench. Left: the number of calls for each tool, showing a long-tailed tool-usage distribution. Middle: the distribution of task difficulty levels. Right: the scenario distribution of tasks across two high-level categories, Customer Service and Intelligent Creation.

### 3.2 TOBench Construction Framework

Figure [2](https://arxiv.org/html/2605.16909#S3.F2) illustrates the overall construction pipeline of TOBench. We build tasks from realistic professional scenarios by selecting omni-modal MCP tools, discovering user-centered scenarios, instantiating executable tasks, and curating multimodal assets for closed-loop verification.

Omni MCP Tool Selection. We extend the Toolathlon \[[14](https://arxiv.org/html/2605.16909#bib.bib16)\] MCP stack toward omni-modal workflows. We retain broadly useful tools for browser automation, retrieval, office editing, filesystems, and search, and add multimodal servers for PPT editing, text-to-speech, speech recognition, and video or audio processing. We also implement two benchmark-specific servers, Image Generation Server and Image Processing Toolkits, to support creation tasks and closed-loop visual inspection. The final benchmark integrates 27 MCP servers and 324 tools in total. This diversity is necessary because many tasks require cross-tool composition rather than a single API. The full inventory appears in Appendix [A](https://arxiv.org/html/2605.16909#A1).

Omni-modal Scenario Discovery. We begin from realistic user needs rather than synthetic tool combinations. Our scenario-discovery prompt takes `category`, `subcategory`, and the available MCP servers as input, and asks a language model to produce 10 candidate scenarios in JSON format. Each candidate contains a scenario name, a vivid description that couples *user need* with an appropriate *agent role*, and a candidate MCP set. The prompt explicitly enforces four constraints that mirror our design goals: (1) each scenario must be expressed as “user need + agent role”, (2) multimodal evidence must arise naturally in the input, (3) the required workflow must be feasible under the provided tools, and (4) the scenario should rely on simple and commonly verifiable domain rules rather than niche expert knowledge. The prompt also prefers image-based inputs over unnecessarily long videos unless temporal information is essential, which improves realism and keeps benchmark execution efficient. Across 20 subcategories, this process yields roughly 200 candidate scenarios in total.

Omni-modal Benchmark Task Instantiation. Given a discovered scenario, we instantiate executable tasks through a structured task-generation prompt framed as a user–agent role-play. Each generated task is serialized as a fixed JSON object containing `task_name`, `task_difficulty`, `turn_mode`, `required_mcp`, `agent_config`, `user_request`, and `input_files`, which makes the result directly runnable and auditable. The prompt requires the user request to remain natural and free of tool-name leakage, while the agent is assigned a professional role with concise but verifiable domain rules, as elaborated in Appendix [C](https://arxiv.org/html/2605.16909#A3). Difficulty is controlled primarily by requirement complexity, ambiguity, and workflow length rather than by artificially large assets. The prompt further enforces tool feasibility, everyday realism, resource efficiency, and flexible single-turn or multi-turn interaction, followed by a final reflection step that revises unsupported or incomplete tasks before they are admitted into the benchmark. For each scenario, we generate three task candidates corresponding to easy, medium, and hard difficulty levels, yielding roughly 600 task candidates overall.
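To make the serialized task format concrete, the sketch below builds one task object with the seven field names listed above and checks that none is missing. Only the field names and the easy/medium/hard levels come from the text; every value, the nested keys inside `agent_config`, the `turn_mode` encoding, and the helper `missing_fields` are hypothetical and may differ from the released schema.

```python
import json

# Field names as listed in the paper; their value types are assumptions.
REQUIRED_FIELDS = [
    "task_name", "task_difficulty", "turn_mode", "required_mcp",
    "agent_config", "user_request", "input_files",
]

# Hypothetical task instance, for illustration only.
example_task = {
    "task_name": "refund_request_from_receipt_photo",
    "task_difficulty": "medium",            # easy / medium / hard
    "turn_mode": "single_turn",             # assumed encoding
    "required_mcp": ["filesystem", "excel"],
    "agent_config": {
        "role": "e-commerce after-sales agent",
        "domain_rules": ["Refunds require an order number on the receipt."],
    },
    # The request stays natural and names no tools.
    "user_request": "I was charged twice for this order, can you sort it out?",
    "input_files": ["receipt.jpg"],
}


def missing_fields(task: dict) -> list:
    """Return the required field names that are absent from a task object."""
    return [name for name in REQUIRED_FIELDS if name not in task]


# Round-trip through JSON, as a runner would when loading a task file.
serialized = json.dumps(example_task, indent=2)
loaded = json.loads(serialized)
```

A check of this kind is one cheap way to reject malformed candidates before the manual audit stage.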

Multimodal Asset Curation. We favor compact but information-dense multimodal artifacts. In line with the prompts above, images are used whenever they are sufficient, while video or audio is reserved for cases in which temporal reasoning is genuinely necessary. Assets may come from public web content or controlled generation pipelines when needed, and we normalize them for privacy, reproducibility, and practical execution cost at benchmark scale. Asset curation required substantial manual effort: two AI PhD students spent approximately one month collecting realistic cases and corresponding input files from real-world workflows. During this process, we filtered out scenarios that were unrealistic, weakly grounded, or difficult to support with suitable input artifacts. In total, roughly two-thirds of the initially collected cases were discarded, leaving about 200 high-quality cases for subsequent task instantiation and benchmark construction.

Since some MCP tools did not provide sufficiently reliable execution capabilities to support task completion, our final benchmark contains 100 tasks organized into two macro families:

- *Customer Service* (67 tasks): service-oriented scenarios such as education, e-commerce, government services, medicine, insurance, technical support, and travel.
- *Intelligent Creation* (33 tasks): artifact-creation scenarios such as office editing, advertising, social content, game assets, and design-oriented workflows.

These two macro categories cover two major application spaces for agentic systems. We further instantiate 20 subcategory slices in total. Figure [3](https://arxiv.org/html/2605.16909#S3.F3) summarizes the taxonomy.

## 4 Evaluation Harness

In many TOBench tasks, correctness depends on output structure, multimodal content, role-specific constraints, intermediate tool usage, and externally grounded information. Final success therefore cannot be reduced to string matching or a single software-state check. In TOBench, evaluation is constructed as a task-level harness: each task binds an executable environment to a grounded verifier. Figure [4](https://arxiv.org/html/2605.16909#S4.F4) summarizes the pipeline.

For task $i$, we organize its approved evaluation criteria as

$$\mathcal{C}_i = \mathcal{C}_i^{\mathrm{fmt}} \cup \mathcal{C}_i^{\mathrm{judge}} \cup \mathcal{C}_i^{\mathrm{tool}}, \tag{8}$$

corresponding to format constraints, judge-based multimodal constraints, and tool/result constraints. Given the executed trajectory $\tau_i$, the final workspace snapshot $w_i$, and the tool log $\ell_i$, the grounded evaluator returns a binary vector

$$\mathcal{V}_i(\tau_i, w_i, \ell_i) = \bigl[e_{i,1}, \ldots, e_{i,m_i}\bigr], \qquad e_{i,j} \in \{0,1\}, \tag{9}$$

where each $e_{i,j}$ is allowed to depend on auxiliary preprocessing such as document rendering, image conversion, speech transcription, or re-querying time-sensitive tools. This formulation captures why TOBench is a harness: the verifier is an executable program over the realized trajectory and artifacts, not a static answer key.
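A verifier of this shape can be sketched as a list of criterion functions evaluated over the trajectory, workspace snapshot, and tool log. The code below is an illustrative assumption rather than the released evaluator: `Criterion`, `run_verifier`, the dictionary layouts, the tool name `weather_lookup`, and the stubbed judge (a constant standing in for a VLM call) are all hypothetical.

```python
from typing import Callable, NamedTuple


class Criterion(NamedTuple):
    kind: str    # "fmt", "judge", or "tool", mirroring the three criterion sets
    name: str
    check: Callable[[list, dict, list], bool]


def run_verifier(criteria, trajectory, workspace, tool_log):
    """Return the binary vector [e_1, ..., e_m] for one task."""
    return [int(c.check(trajectory, workspace, tool_log)) for c in criteria]


# Hypothetical realized run: one output file, one logged tool call.
trajectory = ["search", "write_file"]
workspace = {"report.xlsx": {"rows": 12}}
tool_log = [{"tool": "search", "args": {"q": "city forecast"}}]

criteria = [
    # Format constraint: the expected artifact exists in the workspace.
    Criterion("fmt", "report_exists",
              lambda t, w, l: "report.xlsx" in w),
    # Judge-based constraint: stubbed here; a real one would call a VLM judge.
    Criterion("judge", "chart_is_readable",
              lambda t, w, l: True),
    # Tool constraint: the answer must be grounded in a specific tool call.
    Criterion("tool", "used_weather_lookup",
              lambda t, w, l: any(c["tool"] == "weather_lookup" for c in l)),
]

outcomes = run_verifier(criteria, trajectory, workspace, tool_log)
```

In this toy run the last criterion fails because the log contains no `weather_lookup` call, so the vector is not all ones even though the output file looks fine.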

![Refer to caption](https://arxiv.org/html/2605.16909v1/x2.png)

Figure 4: Overview of the evaluation pipeline in TOBench. Rather than using one benchmark-wide evaluator, we generate and audit task-specific evaluation points and task-specific evaluation code, and then execute the resulting evaluator under time-sensitive, grounded conditions.

### 4.1 Task-Specific Evaluation Point Generation

The first stage generates task-specific evaluation points from the user request, agent role, domain rules, expected outputs, and ground-truth workspace. Rather than using one rubric for the whole benchmark, we derive $\mathcal{C}_i$ separately for each task. The resulting points fall into three categories: format constraints, judge-based multimodal constraints, and tool/result constraints (Table [5](https://arxiv.org/html/2605.16909#A4.T5)). Because TOBench contains heterogeneous and partially open-ended tasks, all generated evaluation points are manually reviewed to remove omissions, unsupported assumptions, and duplicate checks.

### 4.2 Task-Specific Grounded Evaluator Synthesis and Human Audit

We then generate a grounded evaluation script for each task rather than applying a single benchmark-wide evaluator. The synthesized code implements $\mathcal{V}_i$ by combining deterministic checks, VLM-based judging \[[34](https://arxiv.org/html/2605.16909#bib.bib57)\], and tool-aware verification over MCP logs or live external results. Shared utilities handle common operations such as spreadsheet parsing, document rendering, image conversion, and judge invocation, while task-specific logic is specialized per criterion. Each evaluator is manually audited before use. Representative prompts and reference evaluator code will be released with the benchmark pipeline.

### 4.3 Execution-Time Evaluation and Timeliness

TOBench adopts task-level success as the primary metric: a task is counted as solved only when all relevant evaluation points pass. If task $i$ has $m_i$ approved evaluation points with binary outcomes $e_{i,j} \in \{0,1\}$, we define task success as

$$S_i = \prod_{j=1}^{m_i} e_{i,j}, \tag{10}$$

so a task passes only when every required criterion passes. The overall benchmark accuracy over $N$ tasks is then

$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} S_i. \tag{11}$$

In practice, evaluation is performed at execution time rather than by comparing against a static answer string. Documents may need to be rendered into images before visual inspection; audio outputs may need transcription; spreadsheets and office files may require structured parsing; and some criteria require re-querying MCP tools or checking tool-call traces to confirm that the agent relied on grounded results rather than unsupported generation. This execution-time verifier is what makes TOBench a harness rather than a static answer set.

Execution-time validation is critical for time-sensitive benchmark tasks involving live data such as search, maps, weather, finance, or changing web content. Evaluators should run soon after task completion to avoid external changes corrupting ground truth. Unlike static file checks, tool-result checks may re-run MCP queries or inspect tool logs.
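The all-or-nothing success rule and the benchmark accuracy can be computed directly from per-task binary outcome vectors. The following is a small worked sketch; the function names and the toy outcome vectors are illustrative assumptions, and each task is assumed to have at least one evaluation point.

```python
from math import prod


def task_success(outcomes: list) -> int:
    """S_i: product of binary outcomes, so a single 0 fails the whole task."""
    return prod(outcomes)


def accuracy(per_task_outcomes: list) -> float:
    """Acc: mean of S_i over all N tasks."""
    return sum(task_success(o) for o in per_task_outcomes) / len(per_task_outcomes)


# Toy example with N = 4 tasks and differing numbers of evaluation points.
results = [
    [1, 1, 1],   # all criteria pass        -> S = 1
    [1, 0, 1],   # one criterion fails      -> S = 0
    [1, 1],      # all pass                 -> S = 1
    [0],         # the only criterion fails -> S = 0
]
acc = accuracy(results)
```

Note that a task passing two of three criteria contributes exactly as much as one passing none, which is why partial progress is invisible in the headline success rate.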

## 5 Experiments

### 5.1 Experimental Setting

We evaluate TOBench on all 100 tasks, spanning 67 *Customer Service* tasks and 33 *Intelligent Creation* tasks, with easy/medium/hard splits. Each task exposes only its relevant subset of MCP servers and common utilities, and each run is capped at 100 interaction turns. Table [2](https://arxiv.org/html/2605.16909#S5.T2) reports 15 representative proprietary and openly accessible models together with average tool calls and token usage. The testing efficiency of TOBench is elaborated in Appendix [B.2](https://arxiv.org/html/2605.16909#A2.SS2).

### 5.2 Main Results

Table 2: Main results on TOBench. Scores are task success rates (%). Models are sorted by overall score (Avg.). Tool Calls, Tokens, and Cost are average values per task across the entire benchmark. For some larger open-source models, we used the official APIs. For some smaller open-source models that do not provide APIs, we deployed them ourselves, so the cost was 0. CS = Customer Service, IC = Intelligent Creation.

| Model | CS Easy | CS Medium | CS Hard | IC Easy | IC Medium | IC Hard | Avg. | Tool Calls | Tokens (k) | Cost (\$) |
|---|---|---|---|---|---|---|---|---|---|---|
| **Closed Source Models** | | | | | | | | | | |
| Claude-Opus-4.6 | 44.83 | 21.43 | 20.00 | 70.00 | 40.00 | 0.00 | 32.00 | 28.2 | 329.7 | 2.37 |
| Gemini-3-Pro | 48.28 | 25.00 | 10.00 | 60.00 | 40.00 | 0.00 | 32.00 | 18.0 | 1300.5 | 2.62 |
| Gemini-3.1-Pro | 48.28 | 14.29 | 10.00 | 60.00 | 40.00 | 7.69 | 30.00 | 21.5 | 1506.6 | 3.03 |
| Claude-Haiku-4.5 | 41.03 | 28.95 | 0.00 | 51.72 | 28.57 | 0.00 | 27.00 | 22.9 | 244.0 | 0.27 |
| GPT-5 | 35.71 | 21.43 | 11.11 | 70.00 | 11.11 | 7.69 | 26.80 | 24.3 | 620.0 | 0.94 |
| Claude-Sonnet-4.6 | 27.59 | 21.43 | 20.00 | 50.00 | 20.00 | 0.00 | 23.00 | 24.40 | 370.10 | 1.67 |
| GPT-5.2 | 27.59 | 25.00 | 0.00 | 60.00 | 10.00 | 0.00 | 22.00 | 20.2 | 322.7 | 0.58 |
| GPT-4o | 7.14 | 3.57 | 0.00 | 20.00 | 10.00 | 0.00 | 6.12 | 8.4 | 114.0 | 0.34 |
| **Open Source Models** | | | | | | | | | | |
| Qwen3.5-Plus | 65.52 | 28.57 | 0.00 | 70.00 | 50.00 | 15.38 | 41.00 | 25.0 | 559.1 | 0.17 |
| Qwen3.6-plus | 55.17 | 25.00 | 10.00 | 40.00 | 20.00 | 15.38 | 32.00 | 23.76 | 589.2 | 0.52 |
| Kimi-K2.5 | 58.62 | 28.57 | 20.00 | 30.00 | 0.00 | 7.69 | 31.00 | 25.0 | 668.3 | 0.41 |
| GLM-4.6V | 10.34 | 10.71 | 0.00 | 30.00 | 40.00 | 7.69 | 14.00 | 16.7 | 442.6 | 0.14 |
| Gemma-4-31B | 17.24 | 3.57 | 0.00 | 20.00 | 10.00 | 0.00 | 9.00 | 10.65 | 238.5 | 0 |
| MiniMax-M2.5 | 13.79 | 0.00 | 0.00 | 20.00 | 10.00 | 0.00 | 7.00 | 22.5 | 605.4 | 0.21 |
| Qwen3-VL-30B-A3B-Ins | 6.90 | 7.14 | 0.00 | 20.00 | 0.00 | 0.00 | 6.00 | 12.75 | 549.7 | 0 |
| Qwen3-VL-8B-Ins | 6.90 | 0.00 | 0.00 | 20.00 | 0.00 | 0.00 | 4.00 | 10.28 | 402.9 | 0 |
| Qwen3-VL-4B-Ins | 6.90 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | 6.59 | 618.4 | 0 |
| Qwen3-VL-2B-Ins | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 7.33 | 605.6 | 0 |

Table [2](https://arxiv.org/html/2605.16909#S5.T2) shows that TOBench is challenging for all tested models. The strongest model, Qwen3.5-Plus, reaches only 41.0% average task success, while the best proprietary result is 32.0%. Difficulty is the dominant factor: performance is unsaturated even on easy tasks and collapses on hard splits, where the best scores are 20.00% on *Customer Service-Hard* and 15.38% on *Intelligent Creation-Hard*. The two macro families stress different capabilities: Customer Service rewards grounded retrieval and faithful tool use, whereas Intelligent Creation is especially sensitive to multimodal editing and final-result verification. We also observe a clear decoupling between inference cost and accuracy, suggesting that the main bottlenecks are not context length or budget alone, but reliable tool execution, multimodal reasoning, and verification before stopping.
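As a sanity check on how per-split rates relate to the overall score, the sketch below converts split success rates into solved-task counts and recomputes the average. The split sizes are an inference, not something the paper states: 29/28/10 for Customer Service and 10/10/13 for Intelligent Creation are the sizes that sum to 67 and 33 and reproduce the reported percentages for the two rows used here. Other rows in Table 2 may not fit these denominators.

```python
# Inferred split sizes (assumption): they sum to 67 + 33 = 100 tasks.
SPLIT_SIZES = {
    "cs_easy": 29, "cs_medium": 28, "cs_hard": 10,
    "ic_easy": 10, "ic_medium": 10, "ic_hard": 13,
}


def solved_counts(rates: dict) -> dict:
    """Convert per-split success rates (%) into whole solved-task counts."""
    return {k: round(rates[k] / 100 * n) for k, n in SPLIT_SIZES.items()}


def overall(rates: dict) -> float:
    """Overall success rate (%) implied by the per-split rates."""
    return 100 * sum(solved_counts(rates).values()) / sum(SPLIT_SIZES.values())


# Per-split rates copied from Table 2.
qwen35_plus = {"cs_easy": 65.52, "cs_medium": 28.57, "cs_hard": 0.00,
               "ic_easy": 70.00, "ic_medium": 50.00, "ic_hard": 15.38}
claude_opus = {"cs_easy": 44.83, "cs_medium": 21.43, "cs_hard": 20.00,
               "ic_easy": 70.00, "ic_medium": 40.00, "ic_hard": 0.00}

qwen_avg = overall(qwen35_plus)   # 19 + 8 + 0 + 7 + 5 + 2 = 41 solved tasks
opus_avg = overall(claude_opus)   # 13 + 6 + 2 + 7 + 4 + 0 = 32 solved tasks
```

Under these assumed sizes, the 15.38% best score on *Intelligent Creation-Hard* corresponds to 2 of 13 tasks, which gives a sense of how few hard tasks any model solves.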

### 5.3 Error Analysis

To understand why performance remains low, we manually organize benchmark failures into five top-level categories: Tool Call Error, Tool Parameter Error, Multimodal Capability Deficit, Self-Verification Failure, and Non-Agent Error. Appendix [D](https://arxiv.org/html/2605.16909#A4) summarizes the full taxonomy and subcategories used in our analysis.

![Refer to caption](https://arxiv.org/html/2605.16909v1/fig5_error_analysis.png)

Figure 5: Heatmap of model-specific error counts across fine-grained subcategories, grouped into tool call, tool parameter, multimodal ability, and self-verification failures.

Tool call and parameter errors remain the most pervasive execution bottleneck. Many trajectories fail before high-level reasoning becomes relevant: models choose the wrong tool, omit a required operation, hallucinate unsupported actions, or pass invalid arguments. These failures show that realistic MCP environments demand stronger tool-grounded action modeling than simplified function-calling benchmarks.

Multimodal reasoning errors become dominant once basic execution succeeds. When models reach the correct tool family, failures often shift to perception and cross-modal inference, including fine-grained visual extraction, spatial reasoning, temporal localization, and evidence alignment across modalities. More detailed bad cases and analysis for this category are provided in Appendix [E.1](https://arxiv.org/html/2605.16909#A5.SS1).

Missing visual verification is a harness-specific failure mode. In many image editing, PPT authoring, and visual-generation tasks, models perform a plausible edit and stop without inspecting the rendered result, or rely on metadata checks instead of true visual verification. This directly explains why *Intelligent Creation-Hard* remains difficult: the harness penalizes open-loop completion and rewards closed-loop self-correction.

The error heatmap suggests distinct failure regimes across model tiers\. Stronger models reduce low\-level schema mistakes, but their remaining errors concentrate in multimodal reasoning and final verification; weaker models accumulate both low\-level tool failures and high\-level perception errors\. Overall, many current agents still struggle to operate inside a realistic evaluation harness even when they can produce superficially plausible first\-pass outputs\.

## 6 Limitations and Discussion

![Refer to caption](https://arxiv.org/html/2605.16909v1/x3.png)

Figure 6: Two examples of failed tasks with initial and final states are presented: the first agent skipped visual validation of the generated PPT, leading to layout overlap and formatting errors, while the second failed to align the ASR transcription with the original video, resulting in incorrect frame evidence\. The complete tool execution trajectories for both tasks are provided in [Appendix E\.2](https://arxiv.org/html/2605.16909#A5.SS2)\.

Current scope\. Although TOBench covers 100 executable tasks across 20 subcategory slices, it still covers only a subset of real\-world omni\-modal work\.

Evaluator noise\. The mixed evaluation strategy is necessary but imperfect: code\-based checks can miss subtle failures, and multimodal judges may introduce bias\.

MCP dependence\. The MCP\-centric design improves realism for current ecosystems, but benchmark behavior may depend on evolving server descriptions, response formats, and implementation details\.

Benchmark realism versus controllability\. A central trade\-off is realism versus reproducibility: realistic software stacks and multimodal assets better reflect user workloads, but they also increase evaluation complexity and variance\.

## 7 Conclusion

We introduced TOBench, a benchmark and evaluation harness for omni\-modal tool\-using agents that emphasizes realistic professional workflows, MCP\-native tool ecosystems, and closed\-loop verification over multimodal intermediate artifacts\. Experiments on 15 contemporary models show that current systems remain far from robust omni\-modal task completion: the best model reaches 41\.0% overall success, hard tasks remain largely unsolved, and failures cluster around tool execution, multimodal reasoning, and missing verification loops\. We hope TOBench serves as a practical testbed for studying how agents perceive, act, verify, and self\-correct in realistic omni\-modal environments\.


## Appendix A Catalog of MCP Servers

Table 3 lists all MCP servers used in TOBench\. Most of them are sourced from existing community\-developed projects, providing a broad foundation of realistic services for the benchmark\. We refactored a substantial proportion of these community resources in a standardized way, both optimizing tool output formats and strengthening error handling so that agents receive more instructive feedback during long\-horizon tasks\.

To support the core perceive–act–inspect–revise loop, we developed several MCP servers for the vision and audio modalities\. We constructed the Image\-Processing\-Toolkits, which provides functions such as cropping, pixelation, and watermarking, and integrates rendering engines to export PPTX, Excel, and Word documents as images\. This enables agents to visually inspect layout and formatting for autonomous self\-correction\. Furthermore, we built the Image\_Generation\_Server, powered by Seedream 4\.0, to support generative tasks\. For multimedia workflows, we introduced the video\_audio\_editor\_toolkits, a tool for omni\-modal agents that enables precise editing of video clips and audio tracks, allowing agents to perform temporal\-modal operations\. For broader audio interaction, we also integrated Whisper \(ASR\) and Microsoft Edge TTS to bridge the gap in current native omni\-modal models\. We will make these optimized and newly developed MCP servers publicly available to the community\.

Table 3: MCP servers currently integrated into TOBench\.

## Appendix B Experimental Details of TOBench

### B\.1 Definition of Benchmark Traits

To provide a clear understanding of the benchmark comparisons in Table [1](https://arxiv.org/html/2605.16909#S2.T1), we define each column header as follows: Tasks refers to the total number of evaluation instances, while Dom\. and Tools represent the count of distinct application domains and available external APIs, respectively\. MCP Eco\. indicates if the tool environment integrates the Model Context Protocol ecosystem, and Cross Scen\. denotes whether tasks require operating across multiple scenarios in a single workflow\. Real Env\. signifies that observations derive from real\-world applications with realistic state initialization, and Info\. Grnd\. ensures actions are supported by external tool evidence rather than internal parametric knowledge\. Fuzzy Prompt means requests are natural and potentially under\-specified without predefined trajectories\. Regarding modalities, Vis\. and Aud\. represent the need to process vision and audio information, Cross Modal denotes the seamless integration of multiple modalities simultaneously, and Gen\. indicates tasks requiring the invocation of multimodal generation models to synthesize new artifacts\.

### B\.2 Execution Efficiency and Runtime

The TOBench harness supports both sequential and parallel task execution\. For our evaluation, we conducted tests on an AWS instance equipped with a 4\-core CPU and 16GB RAM, utilizing a parallelism factor of 2\. Under this configuration, the total execution time for the agent to complete all tasks ranges from 3 to 8 hours\. The actual evaluation efficiency depends on factors such as model reasoning complexity, the underlying response latency of MCP servers, and the specific parallelism configuration\.
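As an illustration of what a parallelism factor of 2 means in practice, the sketch below runs a list of tasks with at most two in flight at a time\. The function names and task identifiers are hypothetical; the actual harness scheduler is not described in this paper and may differ\.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def run_tasks(task_ids: Iterable[str], run_one: Callable[[str], str],
              parallelism: int = 2) -> list[str]:
    """Run run_one(task_id) for every task with at most `parallelism` in flight.

    Results come back in the same order as task_ids.
    """
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(run_one, task_ids))


# Hypothetical stand-in for executing one benchmark task end to end.
def fake_run(task_id: str) -> str:
    return f"{task_id}: done"


print(run_tasks(["task_a", "task_b", "task_c"], fake_run, parallelism=2))
```

Threads suit this setting because each task mostly waits on model and MCP server responses rather than on local computation\.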

## Appendix C Agent Context and Interaction Framework

In TOBench, tasks are executed via dialogues between an agent and a simulated user, governed by two independent system prompt architectures characterized by a strategic information gap\. Currently, the benchmark focuses on single\-turn tasks, where the complete request is delivered in a single interaction\. The Agent System Prompt defines the ReAct reasoning framework and incorporates task\-specific Domain Rules\. These rules simulate professional standards and platform constraints \(e\.g\., specific formatting styles or quality self\-correction protocols\) that are not visible to the user, requiring the agent to autonomously align with implicit professional requirements without direct guidance\. Conversely, the User System Prompt ensures the simulated user acts as the single source of truth\. It mandates strict adherence to the User Persona and Task Instructions, ensuring complete information transfer while strictly prohibiting the disclosure of task solutions or misleading cues to maintain a rigorous and fair evaluation\. We use the Customer\_Service/Hotel\_Welcome\_PPTX task as a representative example to illustrate the specific context configuration below\.

Agent System Prompt

\#\# Role

You are an intelligent Omni\-Modality ReAct \(Reasoning and Acting\) Agent that solves user tasks by role\-playing a specified persona and strictly following that persona’s Domain Rules\. Your primary goal is to accurately solve user tasks by engaging in a cycle of reasoning, planning, tool execution, and reflection\. You must dynamically adapt your actions based on tool feedback to ensure high\-quality outcomes\.

\#\#\# Persona

Creative Experience Designer

\#\#\# Domain Rules

1. Visual Consistency: When designing the hotel’s welcome\-themed PPTX, you should first review the template file example\.pptx and then design the PPTX according to the template’s layout\. After completing the PPT, you should check that the layout is consistent and the design is visually appealing, and adjust the PPT as needed\.
2. If the generated image does not meet the user’s actual requirements, you need to re\-generate the image until it meets the user’s requirements\.
3. PPTx details: The text for the Welcome Title should be set to 32 pt\. The Greeting section text should be set to 24 pt\. The provided hotel\_logo\.png must be inserted at the bottom of the slide, and the image should not be overly eye\-catching\.

\#\# Operational Guidelines

1. Task Processing Protocol
   - Analyze & Plan: Upon receiving a request, explicitly reason about the requirements and formulate a structured preliminary plan\.
   - Iterative Execution: Execute your plan step\-by\-step\. After each tool call, analyze the result to decide the next step\. If a step fails or produces unexpected results, reflect on the cause and adjust your plan accordingly\.
   - Outcome Verification: Merely executing tool calls is insufficient\. You must verify that the actual content generated or retrieved \(e\.g\. cropped image, clipped video\) strictly meets the user’s requirements before considering a step complete\.
2. Multimodal Data Handling
   - Selective Inspection: You have access to multimodal inputs \(text, images, audio, video\)\. Inspect these assets only when essential for task comprehension or result verification using available viewer tools\.
   - Document Standards: For document generation tasks \(PPT, LaTeX, Word\), you are responsible for ensuring professional formatting, layout consistency, and visual appeal\.
3. Tool Usage & File Management
   - Parameter Alignment: When using generation tools \(e\.g\., for media creation\), carefully select parameters that align with the specific context, style, and constraints of the user’s request\.
   - Absolute Paths Mandatory: You must use absolute paths for all file references, whether writing input\_file\_path or output\_file\_path params for tools\. Relative paths are strictly prohibited\.

\#\# Critical Constraints

- Workspace Boundary: All file operations must occur within the accessible workspace directory: \!\!<<<<\|\|\|\|workspace\_dir\|\|\|\|\>\>\>\>\!\!\. Output files must be saved here\.
- Task Termination: If the user sends the signal "\#\#\#\# STOP", or if you have self\-verified that the task is fully completed, you must conclude the session\. To finish, either call the local\-claim\_done tool \(if available\) or respond with a final text message without any tool calls\. This action is irreversible\.
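The two file\-handling constraints in this prompt \(absolute paths only, and all operations inside the workspace directory\) lend themselves to a simple mechanical check\. The sketch below is one possible way to validate a path argument; the function name and the `/ws` workspace root are hypothetical, and the check is purely lexical, so it does not follow symbolic links\.

```python
import posixpath


def check_tool_path(path: str, workspace_dir: str) -> str:
    """Return "ok" or the reason a file argument violates the path rules."""
    if not posixpath.isabs(path):
        return "relative path"
    # Collapse "." and ".." so "/ws/../etc" is not mistaken for a workspace path.
    normalized = posixpath.normpath(path)
    root = posixpath.normpath(workspace_dir)
    inside = normalized == root or normalized.startswith(root.rstrip("/") + "/")
    return "ok" if inside else "outside workspace"


print(check_tool_path("out/welcome_tv.pptx", "/ws"))
print(check_tool_path("/ws/out/welcome_tv.pptx", "/ws"))
print(check_tool_path("/ws/../etc/passwd", "/ws"))
```

Comparing against the root with a trailing separator avoids accepting sibling directories such as `/ws-other` that merely share a prefix with the workspace\.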

User System Prompt

\# Role Setting

You represent a user interacting with an intelligent agent\. You must strictly role\-play the specific User Persona defined below\. Your goal is to convey the task to the agent through dialogue based on the content in instructions\.

Core Principle: All your expressions and demands must be:

1. Faithful to the Persona: Speak and act according to the identity, tone, and context defined in User Persona\.
2. Faithful to the Instructions: Your requests must strictly align with the Task Configuration\. Do not hallucinate or deviate from the given task details\.

\#\# Task Configuration

\#\#\# User Persona

I am the General Manager of Nebula Heights Resort\.

\#\#\# Instructions

We have a very important guest checking in today: a boy named Leo who is celebrating his 10th birthday and loves astronauts and space exploration\. First, I would like you to use the photo from Leo’s parents, Leo\.png, to generate an image of Leo wearing a spacesuit, standing on the moon, with a spaceship next to him and a blue starry sky as the background\. This image should be named Leo\_astronaut\.png\. Then, to welcome Leo, I need you to create a welcome\_tv\.pptx for the in\-room TV\. The PPT’s Welcome Title should be ’Space Explorer’, and the image you just created should be inserted into the position specified by the template\. The Greeting section should say ’Happy 10th Birthday, Astronaut Leo\!’\. Finally, please insert the provided hotel\_logo\.png into the PPTX\.

\#\#\# Turn Mode

single\_turn

\#\# Guidelines

\#\#\# Conversation Style Rules

- Context & Needs Combination: When making statements, adopt the “Background Description \+ Specific Request” pattern\. First explain the background situation, then propose the specific task request\.
- Statement Pacing:
  - In single\-turn scenarios, please state all content completely at once\.
  - In multi\-turn scenarios, autonomously break down the task and state it gradually across multiple rounds\. Do not dump all information at once\.
- Response Principles: When the agent asks for information, provide only the conditions and preferences explicitly stated in instructions\. Strictly prohibit assumptions, expansions, substitutions, or generalizations\.

\#\#\# Information Disclosure Rules

- Faithful & Natural: Directly convey the core information from instructions, but fine\-tune the expression based on the persona context to make it sound natural\.
- Full Detail Coverage: You must ensure every detail in instructions is mentioned in the conversation \(even seemingly trivial background information\), as this information may affect the agent’s recommendations and decisions\.

\#\#\# Information Processing Rules

- Fact\-Based Answers: When the agent asks for information, answer strictly based on instructions\. If the information is not provided in instructions, explicitly answer “I don’t remember” or “I don’t know”\. Never fabricate information not provided\.
- Strict Adherence to Needs: Strictly follow the requirements stated in instructions when proposing needs\.
- Resistance to Persuasion: When the agent tries to persuade you to change your needs, you must stick to the original requirements in instructions and do not compromise easily\.

## Appendix DDetailed Error Taxonomy

Table 4: Detailed error taxonomy used in TOBench analysis\. The main paper discusses the five top\-level categories, while this appendix enumerates the finer\-grained subcategories used for trajectory annotation and error heatmaps\.

| Category | Subcategory | Definition and Representative Symptoms |
| --- | --- | --- |
| Tool Call Error | Wrong Tool Selection | The agent chooses a tool whose functionality does not match the required operation, such as reading PDFs with plain\-text tools or using filesystem writes to fake Office artifacts\. |
| | Missing Essential Tool | The workflow omits a required tool invocation or a necessary concluding step, such as skipping search for time\-sensitive facts or forgetting a save/export action\. |
| | Tool Planning Error | The agent uses tools in an inefficient, illogical, or non\-adaptive way, for example by repeating avoidable tool calls, oversampling videos frame\-by\-frame, or persisting with the same failed plan after an error\. |
| | Hallucinated Tool | The agent invokes a tool that does not exist in the available MCP environment\. |
| Tool Parameter Error | Parameter Type Error | The agent chooses the correct tool but passes values that have invalid data types or violate basic numeric validity, such as floats where integers are required or non\-positive intervals for frame extraction\. |
| | Parameter Range Error | The parameter value exceeds objective bounds of the target file, page range, or media length, such as requesting PDF pages beyond the document length\. |
| | Parameter Object Selection Error | The parameter points to the wrong target object, URL, path, slide, worksheet, or region, such as broken URLs, nonexistent image paths, or editing the wrong worksheet\. |
| | Missing Parameter | Required tool arguments are omitted, such as missing content fields or mandatory identifiers\. |
| | Hallucinated Parameter | The agent invents unsupported argument names or unsupported option values, such as invalid operation enums or nonexistent transition settings\. |
| Multimodal Capability Deficit | Fine\-grained Perception | The agent sees the asset but fails to extract crucial small details or structured information, such as misreading invoice fields, tables, or screenshots\. |
| | Visual\-Spatial Understanding | The agent cannot reason accurately about coordinates, geometry, or layout constraints, leading to wrong crops, wrong mosaic targets, overlapping PPT elements, or boundary violations\. |
| | Video Temporal Relationship Understanding | The agent fails to localize or causally interpret time\-dependent events in video or audio streams, such as choosing the wrong second for an event or missing causally important actions between sampled frames\. |
| | Cross\-modal Information Fusion | The agent fails to align and jointly reason across text, image, audio, video, or document evidence, such as misaligning spoken descriptions with a video timeline or failing multi\-hop reasoning across text and a map\. |
| | Multimodal Hallucination | Instead of admitting uncertainty or calling another tool, the agent fabricates visual or cross\-modal content, such as inventing product defects, invoice amounts, or scene descriptions from blurry evidence\. |
| Self\-Verification Failure | No Verification | The agent edits or generates a visual artifact and declares success without any visual inspection\. |
| | Pseudo\-Verification | The agent performs only non\-visual checks that cannot validate the actual output quality, such as checking file existence, dimensions, or metadata instead of previewing the artifact\. |
| | Incomplete Verification | The agent verifies early or local steps but skips final global inspection after composition, such as inspecting source images but not the final PPT page\. |
| Non\-Agent Error | False Positive on Error Trajectory | The log contains the word “error” although the tool call itself is not actually an agent mistake, such as success payloads with error: null or filenames that naturally contain “error”\. |
| | MCP Issue | The failure is caused by the MCP implementation, provider, or surrounding infrastructure rather than by the agent decision, such as external API outages or MCP\-side parsing issues\. |
| | Agent Base Model Issue | The base model fails due to system\-level limitations rather than task\-specific reasoning mistakes, such as context\-window overflow on large multimodal inputs\. |

Table 5: Taxonomy of task\-specific evaluation points in TOBench\. Different tasks mix these criterion types in different ways, which motivates per\-task evaluator synthesis rather than a single benchmark\-wide scoring template\.

## Appendix E Extra Analysis for TOBench

### E\.1 Multimodal Capability Deficit

#### E\.1\.1 Fine\-grained Perception

Task Context

User Request: I have a rough floor plan for the Johnson\-Smith wedding in layout\_sketch\.jpg\. Because the planning was so rushed, one small table and its chairs in the drawing edge was cancelled, and I marked it with pencil as CANCELED\. Please exclude this table and its chairs from all subsequent budget calculations\. Based on your hotel’s standard table and chair pricing, please create an Excel file named wedding\_budget\.xlsx\. In the Excel sheet, list the items \(tables and chairs\), quantities, unit prices, and total costs\. Be sure to use actual Excel formulas for the calculations so I can update it later\.

Domain Rules: Senior Event Coordinator

1. Inventory & Pricing: The hotel’s standard rates are $25 for round tables, $30 for rectangular tables, and $5 per chair\.
2. Excel Format: The Excel workbook should contain only one worksheet named budget, and all calculations and summaries must be performed on this sheet\.

![Refer to caption](https://arxiv.org/html/2605.16909v1/wedding_layout_sketch.jpg)

Figure 7: Rough floor plan for the Johnson\-Smith wedding \(layout\_sketch\.jpg\)\.

![Refer to caption](https://arxiv.org/html/2605.16909v1/wedding_excel_gt.png) \(a\)
![Refer to caption](https://arxiv.org/html/2605.16909v1/wedding_excel_result.png) \(b\)

Figure 8: Comparison of output results: \(a\) ground truth for the wedding budget task; \(b\) final Excel output generated by Gemini 3 Pro\.

In the task Customer\_Service/Hotel\_Wedding\_Layout, the agent \(Gemini 3 Pro\) needs to count the types and quantities of tables and chairs required for the wedding based on the seating sketch provided by the user\. It must then enter the unit prices specified in the hotel’s domain rules into an Excel sheet and use Excel formulas to calculate the final results\. During execution, the agent successfully viewed the layout sketch image but failed to accurately count the number of chairs to be included in the budget: it wrote 28 chairs into the Excel file, whereas the ground truth was 27\.
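To show how a one\-chair miscount propagates through a formula\-based budget, the sketch below builds the cell contents of a budget sheet as plain Python lists, with formula strings in the total column\. Only the unit prices and the chair counts \(27 versus 28\) come from the task description above; the column layout and function names are assumptions, and table quantities are deliberately left out because they are not reported here\. Writing the rows to an actual workbook would require a spreadsheet library\.

```python
# Unit prices from the task's domain rules (USD).
UNIT_PRICES = {"Round table": 25, "Rectangular table": 30, "Chair": 5}


def budget_rows(quantities: dict[str, int]) -> list[list]:
    """Build header, one row per item, and a total row using spreadsheet formulas."""
    rows: list[list] = [["Item", "Quantity", "Unit Price", "Total Cost"]]
    for item, qty in quantities.items():
        r = len(rows) + 1  # spreadsheet rows are 1-based; row 1 is the header
        rows.append([item, qty, UNIT_PRICES[item], f"=B{r}*C{r}"])
    last = len(rows)
    rows.append(["Total", None, None, f"=SUM(D2:D{last})"])
    return rows


def evaluated_total(quantities: dict[str, int]) -> int:
    """Value the SUM formula should evaluate to, for cross-checking."""
    return sum(qty * UNIT_PRICES[item] for item, qty in quantities.items())


for row in budget_rows({"Chair": 27}):
    print(row)

# Chair cost with the ground-truth count versus the agent's count.
print(evaluated_total({"Chair": 27}), evaluated_total({"Chair": 28}))
```

Because every downstream cell is a formula, the workbook is internally consistent even when the count is wrong, so only a check against the sketch itself can reveal the perception error\.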

#### E\.1\.2 Visual\-Spatial Understanding

Task Context

User Request: I am a rental agency intermediary and need to place two electronic signatures onto the correct positions in a lease agreement image\. Place landlord\.png onto the LANDLORD’S SIGNATURE line at the bottom of rental\_agreement\.jpg, and place tenant\.png onto the TENANT’S SIGNATURE line\. Output a finalized signed contract image\.

Domain Rules:

1. Output Scope: Only generate the final composited image file\. Do not create Word, PDF, or other document formats\.
2. Placement Requirement: The landlord’s signature must be placed on the line labeled LANDLORD’S SIGNATURE at the bottom of the contract\. The tenant’s signature must be placed on the line labeled TENANT’S SIGNATURE\.
3. Size and Proportion: Each signature must be proportionally scaled according to the length of the signature line, preserving the original aspect ratio without stretching\. Signature strokes must remain clear and legible\.
4. No Obstruction: The signatures must not cover the labels \(LANDLORD’S SIGNATURE / TENANT’S SIGNATURE\) or any critical contract text\. They should only overlap the signature line area\.
5. Output Naming: Save the final output file as signed\_rental\_agreement\.png\.

![Refer to caption](https://arxiv.org/html/2605.16909v1/rental_agreement_input.jpg)\(a\)
![Refer to caption](https://arxiv.org/html/2605.16909v1/rental_agreement_result.png)\(b\)

Figure 9: Initial and output files for the signature placement task: \(a\) Original lease agreement image \(rental\_agreement\.jpg\); \(b\) Output generated by Qwen 3\.5 Plus showing spatial misalignment\.

In the task Customer\_Service/Real\_Estate\-rental\_agreement, the agent \(Qwen 3\.5 Plus\) must visually locate the signature labels and lines at the bottom of the agreement in the source image\. It must then scale each signature image according to the line length and composite it, with transparency, at the corresponding pixel coordinates without obstructing any contract text\. During execution, the agent repeatedly adjusted the placement and scaling of the images from turn 5 to turn 40\. Nevertheless, in the final result the landlord’s signature overlaps the text labels, while the tenant’s signature deviates severely from the designated area and even obscures the footer\. The signatures are also oversized, which violates the domain rules on proportional scaling and positional alignment\.
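The proportional\-scaling and no\-obstruction rules of this task reduce to a small amount of box arithmetic once the signature line has been located\. The following is a minimal sketch, not the benchmark’s evaluator: the Box type, the fit\_signature and overlaps helpers, the fill fraction of 0\.8, and all pixel values are hypothetical and chosen only for illustration\.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned pixel box: top-left corner plus size."""
    left: int
    top: int
    width: int
    height: int


def fit_signature(sig_w: int, sig_h: int, line: Box, fill: float = 0.8) -> Box:
    """Scale a signature to a fraction of the line length and rest it on the line.

    `line` is the bounding box of the signature line. Locating it is the
    visual step; here it is simply assumed to be known.
    """
    if sig_w <= 0 or sig_h <= 0:
        raise ValueError("signature dimensions must be positive")
    scale = (line.width * fill) / sig_w      # one factor for both axes
    new_w = round(sig_w * scale)
    new_h = round(sig_h * scale)             # aspect ratio is preserved
    left = line.left + (line.width - new_w) // 2
    top = line.top - new_h                   # bottom edge sits on the line
    return Box(left, top, new_w, new_h)


def overlaps(a: Box, b: Box) -> bool:
    """True when two boxes share at least one pixel."""
    return (a.left < b.left + b.width and b.left < a.left + a.width
            and a.top < b.top + b.height and b.top < a.top + a.height)


# Hypothetical geometry: a 400 px line at y=900 with its label printed below it.
line = Box(left=100, top=900, width=400, height=2)
label = Box(left=100, top=910, width=300, height=20)
placed = fit_signature(sig_w=600, sig_h=200, line=line)
print(placed, "covers label:", overlaps(placed, label))
```

Using a single scale factor for both axes is what keeps the signature from being stretched, and the overlap check gives a cheap way to reject a placement before compositing\.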

#### E\.1\.3 Video Temporal Relationship Understanding

Task Context

User Request: My name is Alex\. I encountered a very aggressive food delivery courier\. My order number is 35536\. During the delivery, the courier smashed my food and even ate part of my order, causing me a loss\. I have provided a video as evidence and hope the platform can investigate this incident and offer appropriate compensation for my loss\.

Domain Rules: 1\. Order and Courier Verification: Based on the order number provided by the user, the agent must search the Excel order table to identify the corresponding courier information \(such as name or courier ID\)\. If the order cannot be found in the table, this must be explicitly stated in the report, and the agent must not speculate about the courier’s identity\. 2\. Video Evidence Review Principle: All judgments regarding the courier’s behavior \(including but not limited to damaging food, consuming customer items, or engaging in a conflict\) must be strictly based on what is directly observable in the video\. The agent must not infer actions that are not clearly shown\. If key details are unclear or not visible, the report must clearly state that the behavior cannot be confirmed\. 3\. Report Output Requirement: The agent must generate a formal investigation report in Word format\. The file name should follow a generic incident\-based naming convention and must not hard\-code the order number\. 4\. Report Structure Specification: The report must contain the following top\-level sections in order: \(1\) Incident Summary \(2\) Courier Information \(sourced from order data\) \(3\) Video Evidence Observations \(objective, factual description only\) \(4\) Conflict Cause Analysis \(5\) Handling Recommendations and Conclusion\. 5\. Penalty Recommendation Rules: A fine or disciplinary action may only be recommended if the video evidence clearly supports that the courier engaged in improper behavior that harmed the customer’s interests\. If a dispute is visible but malicious behavior cannot be confirmed, the recommendation should focus on further investigation, warning, or training, with the reasoning clearly explained\. 6\. Compensation and Redelivery Decision: If it can be confirmed that the food was damaged or consumed, the conclusion must clearly state whether redelivery or refund/compensation is recommended, and the decision must be justified based on verifiable evidence\. 7\. The final report must be named as incident\_report\.docx\.

turn8: read\_media\_file"path": workspace/video\_frames/sec\_0000\.jpg

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/food_sec_0000.jpg)

turn9: read\_media\_file"path": workspace/video\_frames/sec\_0010\.jpg

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/food_sec_0010.jpg)

turn10: read\_media\_file"path": workspace/video\_frames/sec\_0020\.jpg

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/food_sec_0020.jpg)

turn11: read\_media\_file"path": workspace/video\_frames/sec\_0030\.jpg

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/food_sec_0030.jpg)

turn12: read\_media\_file"path": workspace/video\_frames/sec\_0040\.jpg

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/food_sec_0040.jpg)

turn13: read\_media\_file"path": workspace/video\_frames/sec\_0050\.jpg

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/food_sec_0050.jpg)

turn14: read\_media\_file"path": workspace/video\_frames/sec\_0055\.jpg

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/food_sec_0055.jpg)

In the task Customer\_Service\_TAN/Food\_Delivery\-delivery\_problems\_report, the agent needs to first verify the courier’s identity by retrieving data from an Excel order table based on the provided order number\. Subsequently, the agent is required to perform a frame\-by\-frame analysis of the video evidence to objectively record violations such as food damage or verbal conflicts, reconstruct the cause of the dispute \(e\.g\., a tip\-related confrontation\), and ultimately generate a formal investigation report titled incident\_report\.docx following a predefined structure that includes penalty and compensation recommendations\. During the execution process, due to a fixed\-frame extraction strategy \(every 10 seconds\) and a lack of deep video causality modeling, the agent \(MiniMax M2\.5\) failed to capture or understand the continuous actions of the courier within the temporal dimension\. This led the agent to completely miss critical aggressive behaviors, such as the deliberate throwing of food and the confrontation over a tip, resulting in the false conclusion that there was no evidence of aggressive behavior\.
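The effect of the sampling interval described above can be illustrated with a few lines of arithmetic\. This is a sketch under stated assumptions: the 55\-second duration is inferred from the last frame name in the trace \(sec\_0055\), and the event window of 13 to 17 seconds is hypothetical rather than taken from the actual video\.

```python
def sample_times(duration_s: int, interval_s: int) -> list[int]:
    """Timestamps (in seconds) at which a fixed-interval extractor grabs a frame."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return [i * interval_s for i in range(duration_s // interval_s + 1)]


def is_observed(event_start: int, event_end: int, samples: list[int]) -> bool:
    """True when at least one sampled frame falls inside the event window."""
    return any(event_start <= t <= event_end for t in samples)


DURATION_S = 55          # assumed from the last frame name in the trace
EVENT = (13, 17)         # hypothetical 4-second action between two samples

coarse = sample_times(DURATION_S, 10)   # the strategy seen in the trace
dense = sample_times(DURATION_S, 1)     # one frame per second

print("10 s sampling sees the event:", is_observed(*EVENT, coarse))
print(" 1 s sampling sees the event:", is_observed(*EVENT, dense))
```

Any action shorter than the sampling interval can fall entirely between two extracted frames, so denser sampling around suspicious segments is one way to reduce the chance of missing it\.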

#### E\.1\.4 Cross\-modal Information Fusion

Task Context

User Request: I am a content creator who wants to share my real experience and explain why I do not recommend working as an UberEats courier on Instagram\. I will provide a selfie\-style video in which I explain three reasons why I believe working for UberEats is not worth it\. Please summarize and refine these three reasons, write an Instagram caption, and organize the written content into a Word document\. When I discuss the first reason, there is a screenshot showing my working hours in the video; please capture that frame and save it as a separate image file\.

Domain Rules: 1\. Video Understanding: The agent must fully watch the input video and identify exactly three reasons stated by the speaker explaining why working for UberEats is not recommended\. No additional reasons may be inferred or added\. 2\. Reason Condensation: Each reason must be summarized into one concise, impactful sentence suitable for social media consumption\. 3\. Image Extraction: When the first reason is being discussed, if the video displays any work\-hour or income\-related statistics interface, the agent must capture that frame and save it as a standalone image file named UberEats\_WorkHours\_Screenshot\.png\. 4\. Final Output: The agent must generate a complete Instagram caption and save all written content into a Word document named UberEats\_Instagram\_Content\.docx\. The image file must not be embedded in the Word document\.

turn4: audio\_to\_text"path": workspace/UberEats\_audio\.mp3

Tool Response"srt\_text": "Here are the top three reasons not to work for food delivery services like door\-dash and Uber eats\. The first is that unlike a normal job, there’s no overtime\. I work a little over 60 hours a week right now and although I’m definitely making a lot more of my last job, I’m working way longer hours and getting paid the same no matter how much I work\. The second reason is you’re very dependent on tips\. More than half of what I make is in the form of tips, so if you live an area ab… \(truncated\)

turn6: export\_frames\_every\_second"interval\_seconds":1,"output\_dir":"workspace/frames","video\_path":"workspace/UberEats\.mp4"

Tool Response"exported\_count":42, "output\_directory":"workspace/frames", "frames": \["workspace/frames/sec\_0000\.jpg", "workspace/frames/sec\_0001\.jpg", "workspace/frames/sec\_0002\.jpg", … \(truncated\)

turn7: read\_media\_file"path":"workspace/frames/sec\_0005\.jpg"

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/uber_sec_0005.jpg)

turn11: read\_multiple\_files"paths":\["workspace/frames/sec\_0018\.jpg","workspace/frames/sec\_0040\.jpg"\]

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/uber_sec_0018.jpg)![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/uber_sec_0040.png)

turn13: move\_file"destination":"workspace/UberEats\_WorkHours\_Screenshot\.png", "source":"workspace/frames/sec\_0040\.jpg"

![Refer to caption](https://arxiv.org/html/2605.16909v1/uber_gt.jpg)

Figure 10: Ground truth screenshot displaying the work hours record interface\.

In the task Customer\_Service/Food\_Delivery\-ubereats\_negative, the agent acts as a multimodal content assistant processing a video clip\. It must first use speech recognition to extract and condense the author’s three core reasons for not recommending work as an UberEats courier\. These reasons must be turned into a concise caption suitable for Instagram and organized into a Word document named UberEats\_Instagram\_Content\.docx\. At the same time, the agent must demonstrate cross\-modal alignment by identifying the interface showing working hours or income statistics during the segment in which the author discusses the first reason, and saving that frame as a standalone image file named UberEats\_WorkHours\_Screenshot\.png, in line with the naming and file organization requirements\. During execution, Gemini 3\.1 Pro successfully used the audio\_to\_text tool to obtain the transcript and identify where the discussion of the first reason begins; however, it showed significant deficiencies in multimodal information fusion\. Specifically, the agent failed to map the semantic timeline of the audio onto the visual timeline of the video, so the keyframe capture failed\. Instead of extracting the statistics interface from the beginning of the video, it captured a frame at the 40\-second mark showing an unrelated outdoor scene\. This failure in cross\-modal alignment caused the result to deviate severely from the user’s requirements and exposes the model’s limited understanding of complex audiovisual semantic correlations\.
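The alignment step that failed here, mapping a transcript segment to the frames exported for the same seconds, can be sketched as follows\. This is an illustrative sketch only: the cue timestamps in SAMPLE\_SRT are hypothetical \(the transcript in the trace is truncated and shows no timings\), the helper names are invented for this example, and the sec\_NNNN\.jpg naming follows the frame paths that appear in the tool responses\.

```python
import math
import re

# One SRT cue: "HH:MM:SS,mmm --> HH:MM:SS,mmm" followed by the cue text.
CUE_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> "
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\n(.+?)(?:\n\n|\Z)",
    re.S,
)


def _to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(srt_text):
    """Return (start_s, end_s, text) tuples for every cue in an SRT string."""
    cues = []
    for match in CUE_RE.finditer(srt_text):
        g = match.groups()
        text = " ".join(g[8].split())
        cues.append((_to_seconds(*g[0:4]), _to_seconds(*g[4:8]), text))
    return cues


def segment_window(cues, start_phrase, stop_phrase):
    """Span from the cue containing start_phrase to the cue containing stop_phrase."""
    start = next((c[0] for c in cues if start_phrase.lower() in c[2].lower()), None)
    if start is None:
        raise ValueError(f"phrase not found: {start_phrase!r}")
    stop = next(
        (c[0] for c in cues if c[0] > start and stop_phrase.lower() in c[2].lower()),
        cues[-1][1],  # fall back to the end of the last cue
    )
    return start, stop


def candidate_frames(start_s, stop_s, frame_dir="workspace/frames"):
    """Per-second frame files whose timestamps fall inside the window."""
    first, last = math.ceil(start_s), math.floor(stop_s)
    return [f"{frame_dir}/sec_{sec:04d}.jpg" for sec in range(first, last + 1)]


# Hypothetical timings; only the wording follows the transcript in the trace.
SAMPLE_SRT = """1
00:00:00,000 --> 00:00:04,000
Here are the top three reasons not to work for food delivery services.

2
00:00:04,000 --> 00:00:09,500
The first is that unlike a normal job, there's no overtime.

3
00:00:09,500 --> 00:00:15,000
The second reason is you're very dependent on tips.
"""

cues = parse_srt(SAMPLE_SRT)
start_s, stop_s = segment_window(cues, "the first", "the second")
frames = candidate_frames(start_s, stop_s)
print(frames)
```

Restricting the visual search to the frames inside the spoken segment is the point of the sketch: under these assumed timings, a frame at the 40\-second mark would never be a candidate for the first reason\.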

#### E\.1\.5 Multimodal Hallucination

Task Context

User Request: I am a QA tester preparing a high\-signal bug report for a Minecraft duplication glitch that allows duplicating enchanted gear\. Use the provided gameplay recording to locate the first moment the item duplication glitch is clearly triggered\. Extract three keyframes and write a standard Word bug report containing precise MM:SS timestamps and a step\-by\-step reproduction procedure\. All generated files must comply with official QA file naming specifications\.

Domain Rules: Game QA Evidence Specialist 1\. Glitch Moment Definition: The glitch moment refers to the earliest timestamp where duplicated enchanted items spontaneously appear in inventory or container interface without legitimate crafting, trading, or natural loot acquisition behavior\. 2\. Timestamp Identification: The agent must review the full video footage to confirm the exact first trigger time of the duplication glitch\. 3\. Keyframes: Extract exactly three PNG keyframes from the original video: \(a\) 2 seconds before the glitch trigger moment, \(b\) the exact glitch trigger moment, \(c\) 2 seconds after the glitch trigger moment\. Save them as glitch\_pre\.png, glitch\_trigger\.png, glitch\_post\.png\. 4\. Report: Create mc\_item\_duplication\_bug\_report\.docx with fixed sections: Title, Test Environment \(mark unknown items as Not provided\), Numbered Reproduction Steps, Expected Behavior, Actual Behavior, and Evidence Attachment \(list screenshot files with corresponding MM:SS timestamps\)\. All timestamps must follow MM:SS standard format\.

turn4: export\_frames\_every\_second"video\_path":"workspace/bug1\.mp4", "interval\_seconds":1, "output\_dir":"workspace/frames"

Tool Response"exported\_count":9, "output\_directory": "workspace/frames", "frames": \[ "workspace/frames/sec\_0000\.jpg", "workspace/frames/sec\_0001\.jpg", "workspace/frames/sec\_0002\.jpg",… \(truncated\)

turn5: read\_media\_file"path":"workspace/frames/sec\_0000\.jpg"

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/game_sec_0000.jpg)

turn6: export\_frame\_at\_second"second":0, "video\_path":"workspace/bug1\.mp4", "output\_path":"workspace/frame\_before\.png"

Tool Responseworkspace/frame\_before\.png

turn7: create\_document"filename":"workspace/mc\_infinite\_diamonds\_bug\_report\.docx"

In the task Intelligent\_Creation/Game\-bug\_report, the agent must locate the trigger moment of a Minecraft item duplication glitch by reviewing gameplay footage to identify the exact second at which enchanted gear abnormally appears\. It must then extract three PNG keyframes representing the moments before, during, and after the trigger and generate a standardized report named mc\_item\_duplication\_bug\_report\.docx that includes precise MM:SS timestamps, detailed reproduction steps, and an evidence attachment list\. During execution, GPT 4o exhibited severe hallucination: after viewing only the first frame of the video, it skipped the actual visual reasoning process\. It then fabricated the glitch trigger moment, all subsequent reproduction steps, and the timestamp data, violating the domain rule that requires confirming the actual trigger time\.
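The keyframe and timestamp rules of this task are mechanical once the trigger second has actually been confirmed from the footage\. The sketch below shows that bookkeeping only; it does not detect the glitch\. The helper names, the trigger second of 5, and the 8\-second duration are hypothetical, while the three file names and the 2\-second offset come from the domain rules\.

```python
def mmss(seconds: int) -> str:
    """Format a non-negative second count as MM:SS."""
    if seconds < 0:
        raise ValueError("timestamp cannot be negative")
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


def keyframe_plan(trigger_s: int, duration_s: int, offset_s: int = 2) -> dict[str, str]:
    """Map each required keyframe file to its MM:SS timestamp."""
    targets = {
        "glitch_pre.png": trigger_s - offset_s,
        "glitch_trigger.png": trigger_s,
        "glitch_post.png": trigger_s + offset_s,
    }
    for name, t in targets.items():
        if not 0 <= t <= duration_s:
            raise ValueError(f"{name} would fall outside the video at {t} s")
    return {name: mmss(t) for name, t in targets.items()}


# Hypothetical values: the real trigger second must come from reviewing the video.
plan = keyframe_plan(trigger_s=5, duration_s=8)
for filename, stamp in plan.items():
    print(f"{filename}: {stamp}")
```

The range check makes one consequence of the rules explicit: a trigger within the first 2 seconds leaves no valid position for the pre\-trigger keyframe, so such a claim should be rejected rather than silently clamped\.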

### E\.2 Complete Example Task Trajectories

#### E\.2\.1 Traj 1

Task Context

User Request: I am the General Manager of Nebula Heights Resort\. We have a very important guest checking in today: a boy named Leo who is celebrating his 10th birthday and loves astronauts and space exploration\. First, I would like you to use the photo from Leo’s parents, Leo\.png, to generate an image of Leo wearing a spacesuit, standing on the moon, with a spaceship next to him and a blue starry sky as the background\. This image should be named Leo\_astronaut\.png\. Then, to welcome Leo, I need you to create a welcome\_tv\.pptx for the in\-room TV\. The PPT’s Welcome Title should be “Space Explorer”, and the image you just created should be inserted into the position specified by the template\. The Greeting section should say “Happy 10th Birthday, Astronaut Leo\!”\. Finally, please insert the provided hotel\_logo\.png into the PPTX\.

Domain Rules: Creative Experience Designer 1\. Visual Consistency: When designing the hotel’s welcome\-themed PPTX, you should first review the template file example\.pptx and then design the PPTX according to the template’s layout\. After completing the PPT, you should check that the layout is consistent and the design is visually appealing, and adjust the PPT as needed\. 2\. If the generated image does not meet the user’s actual requirements, you need to re\-generate the image until it meets the user’s requirements\. 3\. PPTX details: The text for the Welcome Title should be set to 32 pt\. The Greeting section text should be set to 24 pt\. The provided hotel\_logo\.png must be inserted at the bottom of the slide, and the image should not be overly eye\-catching\.

ThinkingLet me break down this task:1\. First, I need to generate an image of Leo wearing a spacesuit, standing on the moon, with a spaceship next to him and a blue starry sky as the background\. The image should be named "Leo\_astronaut\.png"\.2\. Then, I need to create a welcome PPTX \("welcome\_tv\.pptx"\) using the template "example\.pptx": \- Welcome Title: "Space Explorer" \- Insert the generated image \("Leo\_astronaut\.png"\) into the position specified by the template \- Greeting section: "Happy 10th Birthday, Astronaut Leo\!" \- Insert "hotel\_logo\.png" at the bottom of the slideLet me start by: 1\. Reading the template file "example\.pptx" to understand its layout 2\. Reading the "Leo\.png" and "hotel\_logo\.png" files to understand what we’re working with 3\. Generating the astronaut imageLet me first check the workspace directory structure and read the necessary files\.

list\_directory"path": "workspace"

Tool Response\[FILE\]Leo\.png \[FILE\]example\.pptx \[FILE\]hotel\_logo\.png

read\_media\_file"path": "workspace/Leo\.png"

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/ppt_Leo.png)

read\_media\_file"path": "workspace/hotel\_logo\.png"

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/ppt_hotel_logo.png)

get\_template\_file\_info"template\_path": "workspace/example\.pptx"

Tool Response\{ "template\_path": "workspace/example\.pptx", "file\_size\_bytes": 34819, "slide\_count": 1, "layout\_count": 11, "slide\_layouts": \[ \{ "index": 0, "name": "Title Slide", "placeholder\_count": 5 \}, \{ "index": 1, "name": "Title and Content", "placeholder\_count": 5 \}, \{ "index": 2, "name": "Section Header", "placeholder\_count": 5 \}, \{ "index": 3, "name": "Two Content",… \(truncated\)

ThinkingNow I have a clear picture of the template and the images\. Let me analyze:1\. \*\*Leo\.png\*\*: Shows a young boy in a blue striped shirt holding a yellow soccer ball, running on grass with a playground in the background\.2\. \*\*hotel\_logo\.png\*\*: The Hilton Hotels & Resorts logo\.3\. \*\*Template \(example\.pptx\)\*\*: Has 11 layouts\. I need to understand the structure better\. The template has layouts like "Title Slide", "Title and Content", etc\.Now let me: 1\. First generate the astronaut image of Leo on the moon with a spaceship and blue starry sky 2\. Then create the PPTX using the templateLet me start by generating the astronaut image\.

image2image\_generate"image\_aspect\_ratio": "16:9", "image\_save\_path": "workspace/Leo\_astronaut\.png", "input\_image\_path": "workspace/Leo\.png", "prompt": "Transform this boy into an astronaut wearing a white spacesuit with helmet, standing on the moon’s gray cratered surface\. Next to him should be a futuristic silver spaceship\. The background should be a beautiful blue starry sky with stars and possibly Earth visible in the distance\. Keep the boy’s face visible and recognizable, maintaining his cheerful expression\. Create a magical space exploration scene suitable for a 10th birthday celebration\."

Tool Response\{ "success": true, "message": "Image generated successfully\.", "content": null, "output\_path": "workspace/Leo\_astronaut\.png" \}

ThinkingThe astronaut image has been generated\. Let me verify it looks good before proceeding with the PPTX creation\.

read\_media\_file"path": "workspace/Leo\_astronaut\.png"

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/ppt_Leo_astronaut.png)

ThinkingThe astronaut image looks great \- Leo is wearing a spacesuit, standing on the moon with a spaceship next to him and a blue starry sky background\. Now I need to create the welcome PPTX using the template\. Let me open the template and create the presentation\.Based on the template info, I’ll use layout index 1 \(Title and Content\) which has placeholders for title and content\. I need to: 1\. Create a presentation from the template 2\. Set the "Welcome Title" to "Space Explorer" \(32 pt\) 3\. Add the greeting "Happy 10th Birthday, Astronaut Leo\!" \(24 pt\) 4\. Insert the Leo\_astronaut\.png image 5\. Insert the hotel\_logo\.png at the bottomLet me open the template and start building the presentation\.

create\_presentation\_from\_template"id": "welcome\_presentation", "template\_path": "workspace/example\.pptx"

Tool Response\{ "presentation\_id": "welcome\_presentation", "message": "Created new presentation from template ’workspace/example\.pptx’ with ID: welcome\_presentation", "template\_path": "workspace/example\.pptx", "slide\_count": 1, "layout\_count": 11 \}

ThinkingGood, the presentation is created\. Now let me get the slide info to understand the placeholders better, and then add the content\.

get\_slide\_info"presentation\_id": "welcome\_presentation", "slide\_index": 0

Tool Response\{ "slide\_index": 0, "layout\_name": "Title Slide", "placeholder\_count": 1, "placeholders": \[ \{ "idx": 1, "type": "SUBTITLE \(4\)", "name": "Subtitle 2" \} \], "shape\_count": 4, "shapes": \[ \{ "index": 0, "name": "Subtitle 2", "shape\_type": "PLACEHOLDER \(14\)", "left": 3742267, "top": 664105, "width": 3953933, "height": 563562 \},… \(truncated\)

ThinkingI see the slide has placeholders\. Let me add the title and greeting text\. I’ll use the manage\_text tool to add the "Space Explorer" title and the greeting\. I also need to add the astronaut image and the hotel logo\.Let me first add the title text, then the greeting, then the images\.

manage\_text"alignment": "center", "font\_size": "32", "operation": "add", "presentation\_id": "welcome\_presentation", "slide\_index": 0, "text": "Space Explorer", "top": 0\.5, "width": 9

Tool Response\{ "message": "Added text box to slide 0", "shape\_index": 4, "text": "Space Explorer" \}

ThinkingGood, the title is added\. Now let me add the greeting text below it\.

manage\_text"alignment": "center", "font\_size": "24", "operation": "add", "presentation\_id": "welcome\_presentation", "slide\_index": 0, "text": "Happy 10th Birthday, Astronaut Leo\!", "top": 2, "width": 9

Tool Response\{ "message": "Added text box to slide 0", "shape\_index": 5, "text": "Happy 10th Birthday, Astronaut Leo\!" \}

ThinkingGood, now I need to add the astronaut image and the hotel logo\. Let me add the Leo\_astronaut\.png image first, then the hotel\_logo\.png at the bottom\.

manage\_image"height": "4", "image\_source": "workspace/Leo\_astronaut\.png", "left": 2\.5, "operation": "add", "presentation\_id": "welcome\_presentation", "slide\_index": 0, "top": 3, "width": "6"

Tool Response\{ "message": "Added image to slide 0", "shape\_index": 6, "image\_path": "workspace/Leo\_astronaut\.png" \}

ThinkingGood, the astronaut image is added\. Now I need to add the hotel logo at the bottom of the slide\. The logo should not be overly eye\-catching, so I’ll make it smaller\.

manage\_image"height": "0\.8", "image\_source": "workspace/hotel\_logo\.png", "left": 4\.5, "operation": "add", "presentation\_id": "welcome\_presentation", "slide\_index": 0, "top": 7\.5, "width": "2"

Tool Response\{ "message": "Added image to slide 0", "shape\_index": 7, "image\_path": "workspace/hotel\_logo\.png" \}

ThinkingGood, now let me save the presentation to the specified file path\.

save\_presentation"file\_path": "workspace/welcome\_tv\.pptx", "presentation\_id": "welcome\_presentation"

Tool Response\{ "message": "Presentation saved to workspace/welcome\_tv\.pptx", "file\_path": "workspace/welcome\_tv\.pptx" \}

ThinkingThe presentation is saved\. Let me verify the final result by extracting the text and checking the slide info to make sure everything is in place\.

extract\_presentation\_text"include\_slide\_info": true, "presentation\_id": "welcome\_presentation"

Tool Response\{ "success": true, "presentation\_id": "welcome\_presentation", "total\_slides": 1, "slides\_with\_text": 1, "total\_text\_shapes": 1, "slides\_text": \[ \{ "slide\_index": 0, "text\_content": \{ "slide\_title": "", "placeholders": \[ \{ "shape\_index": 0, "shape\_name": "Subtitle 2", "shape\_type": "PLACEHOLDER \(14\)", "text": "Welcome Title" \} \] \} \} \] \}

get\_slide\_info"presentation\_id": "welcome\_presentation", "slide\_index": 0

Tool Response\{ "slide\_index": 0, "layout\_name": "Title Slide", "placeholder\_count": 1, "placeholders": \[ \{ "idx": 1, "type": "SUBTITLE \(4\)", "name": "Subtitle 2" \} \], "shape\_count": 8, "shapes": \[ \{ "index": 0, "name": "Subtitle 2", "shape\_type": "PLACEHOLDER \(14\)", "left": 3742267, "top": 664105, "width": 3953933, "height": 563562 \},… \(truncated\)

ThinkingThe presentation looks good\! Let me verify the files are all in place by listing the directory:

list\_directory"path": "workspace"

Tool Response\[FILE\]Leo\.png \[FILE\]Leo\_astronaut\.png \[FILE\]example\.pptx \[FILE\]hotel\_logo\.png \[FILE\]welcome\_tv\.pptx

ThinkingAll files are in place\. Let me summarize what was accomplished:1\. Generated "Leo\_astronaut\.png" \- An image of Leo wearing a spacesuit, standing on the moon, with a spaceship next to him and a blue starry sky background\. 2\. Created "welcome\_tv\.pptx" from the template\. 3\. Added "Space Explorer" as the Welcome Title \(32 pt font\)\. 4\. Added "Happy 10th Birthday, Astronaut Leo\!" as the Greeting \(24 pt font\)\. 5\. Inserted the "Leo\_astronaut\.png" image into the slide\. 6\. Inserted the "hotel\_logo\.png" at the bottom of the slide \(smaller size so it’s not overly eye\-catching\)\.The task is complete\!

local\-claim\_done

Tool Responseyou have claimed the task is done\!
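The get\_slide\_info responses in this trajectory report shape geometry as large integers, while the manage\_text and manage\_image calls pass small decimal positions\. A minimal conversion sketch follows, assuming the large integers are English Metric Units \(EMUs, 914,400 per inch in Office Open XML\) and that the small decimals are inches\. Neither unit is stated in the tool responses, so both are assumptions, and the helper names are hypothetical\.

```python
EMU_PER_INCH = 914_400  # EMUs per inch, as defined by Office Open XML


def emu_to_inches(emu: int) -> float:
    """Convert English Metric Units to inches."""
    return emu / EMU_PER_INCH


def inches_to_emu(inches: float) -> int:
    """Convert inches to the nearest whole EMU."""
    return round(inches * EMU_PER_INCH)


# Geometry of the "Subtitle 2" placeholder as reported in the trace
# (assumed here to be in EMUs).
subtitle = {"left": 3742267, "top": 664105, "width": 3953933, "height": 563562}
subtitle_in = {key: round(emu_to_inches(value), 2) for key, value in subtitle.items()}
print(subtitle_in)
```

Putting both sets of numbers in one unit makes it possible to compare the positions of newly added shapes against the template placeholder they are meant to align with\.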

#### E\.2\.2 Traj 2

Task Context

User Request: I’m a software instructor at Creative Tech Academy\. Hello, I have recorded a short video \(Notebook\.mp4\) showing how to use the ‘pandas’ package to read an Excel table\. Could you please listen to my video, and then record the two main parts of the tutorial: Data Preparation and Code Example in tutorial\.md file? Each part should be started with a second\-level header, and describe the content of each step in clear detail\. Also, it’s important to find the key supporting frame related to each part \(data\_preparation\.png for Data Preparation, code\_example\.png for Code Example\)\. Please save the captured images in the assets folder and insert them into the markdown document? The image titles are not needed\.

Domain Rules: Technical Writing Assistant 1\. After taking the screenshot, please check that the screenshot content is complete and correct\. 2\. Please use Markdown syntax for inserting images and writing their captions\.

ThinkingThe user wants me to: 1\. Listen to a video \(‘Notebook\.mp4‘\) about using pandas to read an Excel table 2\. Identify two main parts: ‘Data Preparation‘ and ‘Code Example‘ 3\. Create a ‘tutorial\.md‘ file with second\-level headers for each part 4\. Find key supporting frames: \- ‘data\_preparation\.png‘ for Data Preparation \- ‘code\_example\.png‘ for Code Example 5\. Save the images in the ‘assets‘ folder 6\. Insert the images into the markdown documentLet me start by exploring the workspace to find the video file\.

list\_directory\{"path": "workspace/"\}

Tool Response\[FILE\]Notebook\.mp4

ThinkingFound the video file\. Let me first get the video metadata to understand its duration, and then I’ll extract audio to transcribe it\.

get\_video\_metadata\_json\{ "video\_path": "Notebook\.mp4" \}

Tool Response\{ "duration\_seconds": 46\.52, "frame\_size": \[ 1280, 720 \], "fps": 30\.0, "approx\_total\_frames": 1396 \}

ThinkingThe video is about 46 seconds long at 30 fps\. Now let me extract the audio and transcribe it\.

extract\_audio\_from\_video\{ "audio\_codec": "mp3", "output\_audio\_path": "notebook\_audio\.mp3", "video\_path": "Notebook\.mp4" \}

Tool ResponseAudio extracted successfully to notebook\_audio\.mp3

ThinkingNow let me transcribe the audio to understand the content\.

audio\_to\_text\{ "audio\_path": "notebook\_audio\.mp3" \}

Tool Response\{ "srt\_text": " Today, I’m going to show you how to import data contained in a CSV file into a Jupyter notebook\. I have here CSV file that I’ve put into the same folder as my Jupyter notebook, as you can see\. Now, I’m just going to go into my notebook and type Import Pandas\. Then Pandas\.Read under Score CSV, and then in parentheses, and quotes, the name of your CSV file\. Now, I’m going to set a variable equivalent to that\. Let’s call our variable Data Frame\. Now, we’re going to ask Jupyter not… \(truncated\)" \}

ThinkingInteresting\! The audio actually talks about CSV files, not Excel files\. But the user mentioned it’s about reading an Excel table\. Let me watch the video to see what’s actually happening on screen \- perhaps there’s a mistake in the transcription or the user described it differently\. Let me export some frames to understand what’s actually being shown\.Let me export frames at different points in the video to understand the content better\.

export\_frames\_every\_second\{ "interval\_seconds": 5, "output\_dir": "frames", "video\_path": "Notebook\.mp4" \}

Tool Response\{ "exported\_count": 10, "output\_directory": "frames", "frames": \[ "frames/sec\_0000\.jpg", "frames/sec\_0005\.jpg", "frames/sec\_0010\.jpg", "frames/sec\_… \(truncated\)" \}

ThinkingLet me check the frames to see what’s shown in the video\. I’ll look at a few key frames to understand the content better\.

read\_media\_file\{ "path": "frames/sec\_0000\.jpg" \}

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/pandas_sec_0000.jpg)

read\_media\_file\{ "path": "frames/sec\_0010\.jpg" \}

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/pandas_sec_0010.jpg)

ThinkingI can see the video content now\. Let me analyze the frames:1\. First image \(sec\_0000\.jpg\): Shows the file explorer with "data\.csv" and "Untitled\.ipynb" files 2\. Second image \(sec\_0010\.jpg\): Shows the Jupyter notebook with: \- "import pandas" \- "pandas\.read\_csv\(’data\.csv’\)" \- "df = pandas\.read\_csv\(’data\.csv’\)" \- "df" \- The output shows a DataFrame with "kind" column containing "cat" and "dog" values Let me continue viewing more frames to find the key frames for Data Preparation and Code Example sections\.

read\_media\_file\{ "path": "frames/sec\_0020\.jpg" \}

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/pandas_sec_0020.jpg)

read\_media\_file\{ "path": "frames/sec\_0030\.jpg" \}

Tool Response![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/pandas_sec_0030.jpg)

ThinkingBased on the video frames I’ve seen, I can now understand the content:1\. \*\*Data Preparation \(sec\_0000 and sec\_0010 frames\)\*\*: \- Shows a folder with ‘animals\.csv‘ and ‘Jupyter Notebook\.ipynb‘ \- The instruction is to put the CSV file in the same folder as the Jupyter notebook 2\. \*\*Code Example \(sec\_0020 and sec\_0030 frames\)\*\*: \- Shows the Jupyter notebook with the code: \- ‘import pandas‘ \- ‘df = pandas\.read\_csv\("animals\.csv"\)‘ \- ‘df‘ \- Shows the result with cat and dog data \(the audio mentions "cats and dogs"\) Now I need to: 1\. Create the ‘assets‘ folder 2\. Capture the key frames for each part: \- ‘data\_preparation\.png‘ \- should show the folder with CSV file and notebook \- ‘code\_example\.png‘ \- should show the Jupyter notebook code and result 3\. Create the ‘tutorial\.md‘ file Let me identify the best frames: \- For Data Preparation: sec\_0000 or sec\_0010 \- they show the folder with CSV file \- For Code Example: sec\_0020 or sec\_0030 \- they show the code and result Let me capture these specific frames and save them to the assets folder\.

create\_directory\{ "path": "assets" \}

Tool ResponseSuccessfully created directory assets

ThinkingNow let me capture the key frames for each part\. I’ll use the frames I already exported\. Let me save them with the appropriate names\.For Data Preparation, I’ll use sec\_0000\.jpg \(shows the folder with animals\.csv\) For Code Example, I’ll use sec\_0020\.jpg or sec\_0030\.jpg \(shows the code and result\) Actually, I should export specific frames\. Let me check the frame at around 15 seconds for a clearer code example view, and save the appropriate frames\.

export_frame_at_second { "output_path": "assets/data_preparation.png", "second": 2, "video_path": "Notebook.mp4" }

Tool Response: assets/data_preparation.png

export_frame_at_second { "output_path": "assets/code_example.png", "second": 22, "video_path": "Notebook.mp4" }

Tool Response: assets/code_example.png
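The trace does not show how `export_frame_at_second` is implemented. As a rough illustration only, the sketch below assumes an ffmpeg-based equivalent and builds the command lines for the two exports above without running them. The helper name `build_export_command` is hypothetical and is not part of the benchmark.

```python
def build_export_command(video_path: str, second: float, output_path: str) -> list[str]:
    """Build an ffmpeg argument list that saves one frame at `second`.

    Hypothetical helper: the benchmark's real tool may work differently.
    """
    if second < 0:
        raise ValueError("second must be non-negative")
    return [
        "ffmpeg",
        "-y",                # overwrite the output file if it already exists
        "-ss", str(second),  # seek to the requested timestamp
        "-i", video_path,    # input video
        "-frames:v", "1",    # write exactly one video frame
        output_path,
    ]


# The two exports from the trace, expressed as commands (not executed here).
commands = [
    build_export_command("Notebook.mp4", 2, "assets/data_preparation.png"),
    build_export_command("Notebook.mp4", 22, "assets/code_example.png"),
]
for cmd in commands:
    print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would perform the export on a machine where ffmpeg is installed.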

Thinking: Now let me verify the captured images are correct by reading them.

read_media_file { "path": "assets/data_preparation.png" }

Tool Response: ![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/data_preparation.png)

read_media_file { "path": "assets/code_example.png" }

Tool Response: ![[Uncaptioned image]](https://arxiv.org/html/2605.16909v1/code_example.png)

Thinking: The images look good:

1. **data_preparation.png** - Shows the folder with `animals.csv` file and `Jupyter Notebook.ipynb` - perfect for the Data Preparation section
2. **code_example.png** - Shows the Jupyter notebook with:
   - Code: `import pandas` and `pandas.read_csv('animals.csv')` and `dataframe = pandas.read_csv('animals.csv')` and `dataframe`
   - Output: A table showing data about cats and dogs with columns like `id`, `name`, `type`, `age`, `weight`

Now I can create the tutorial.md file with:

1. A title for the tutorial
2. `## Data Preparation` section with description and image
3. `## Code Example` section with description and image

Let me write the markdown file now.
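For reference, the notebook code the agent describes can be reproduced with a minimal sketch like the one below. It requires pandas to be installed. Only the file name and the column names come from the agent's description; the two sample rows are hypothetical, and a temporary folder stands in for the notebook's own directory.

```python
import tempfile
from pathlib import Path

import pandas

# Hypothetical sample rows; only the column names appear in the trace.
csv_text = "id,name,type,age,weight\n1,Tom,cat,3,4.2\n2,Rex,dog,5,20.1\n"

with tempfile.TemporaryDirectory() as folder:
    # In the video the CSV sits next to the notebook, so a bare file name
    # is enough there; here the temporary folder plays that role.
    csv_path = Path(folder) / "animals.csv"
    csv_path.write_text(csv_text, encoding="utf-8")
    dataframe = pandas.read_csv(csv_path)

print(dataframe)
```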

write_file

Tool Response: Successfully wrote to tutorial.md

Thinking: The tutorial.md file has been created. Let me verify the content to make sure everything is correct.

read_file { "path": "tutorial.md" }

Tool Response: `` # Pandas Tutorial: Reading CSV Data in Jupyter Notebook This tutorial demonstrates how to use the pandas package to read a CSV file into a Jupyter notebook. ## Data Preparation Before writing code, you need to prepare your data files properly: 1. **Place your CSV file in the same folder as your Jupyter notebook**: Ensure that your data file (e.g., `animals.csv`) is located in the same directory as your `.ipynb` notebook file. This makes it easy to reference the file without specifying a full… (truncated) ``

Thinking: The task is complete. Let me verify:

1. I’ve listened to the video and understood the content
2. Created two main parts in the tutorial.md:
   - `## Data Preparation` - describes how to prepare the data files
   - `## Code Example` - describes the code steps to read CSV with pandas
3. Captured the key frames:
   - `data_preparation.png` for Data Preparation (shows the folder with animals.csv and Jupyter Notebook)
   - `code_example.png` for Code Example (shows the Jupyter notebook with code and output)
4. Saved images in the `assets` folder
5. Inserted images into the markdown document using Markdown syntax

The tutorial.md file is complete and meets all requirements.

local-claim_done

Tool Response: you have claimed the task is done!
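The agent's final self-check is stated in prose only. One way such a check could be made mechanical is sketched below: it lists Markdown image targets that do not exist on disk. This is an illustration under assumptions, not the benchmark's evaluator. The function name `missing_images` and the sample `tutorial.md` body are hypothetical, since the real file content is truncated in the trace.

```python
import re
import tempfile
from pathlib import Path

# Matches Markdown image syntax ![alt](target) and captures the target.
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def missing_images(markdown_path: Path) -> list[str]:
    """Return image targets referenced in the file that are not on disk."""
    text = markdown_path.read_text(encoding="utf-8")
    base = markdown_path.parent
    return [t for t in IMAGE_LINK.findall(text) if not (base / t).is_file()]


with tempfile.TemporaryDirectory() as folder:
    root = Path(folder)
    (root / "assets").mkdir()
    # Create only one of the two images so the check has something to report.
    (root / "assets" / "data_preparation.png").write_bytes(b"")
    (root / "tutorial.md").write_text(
        "## Data Preparation\n![Data](assets/data_preparation.png)\n"
        "## Code Example\n![Code](assets/code_example.png)\n",
        encoding="utf-8",
    )
    missing = missing_images(root / "tutorial.md")

print(missing)  # ['assets/code_example.png']
```

An empty result only shows that the referenced files exist; it says nothing about whether each image depicts the right frame, which the agent checks visually in the trace.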
