Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

arXiv cs.CL 06/02/26, 04:00 AM Papers
coding-agents omnimodal sandboxed multimodal-llm research benchmarks
Summary
This paper shows that text+image coding agents using sandboxed tool-use can match or outperform native omni-modal models on audio-video benchmarks, converting omni-modal tasks into retrieval and information-processing problems.
arXiv:2606.00579v1 Announce Type: new Abstract: As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed tool-use interface can match, and in several settings outperform, SOTA native omnimodal models and predefined multimodal agent scaffolds across multiple audio-video benchmarks. Our trajectory analysis suggests that their strength comes from writing code and orchestrating tools to extract relevant evidence from transcripts, frames, and other modality signals, thereby converting omnimodal tasks into retrieval and information-processing problems rather than ingesting entire media streams. We further characterize their limitations through a failure taxonomy and process-level trace analysis, and show that simple skill injection, including human-written and self-distilled skills, substantially improves performance. To explore open-source elicitation, we introduce Code-X, a training recipe with the OmniCoding trajectory dataset and verifiable reward, and provide baselines on Qwen-3.5-9B and Qwen-3.6-27B. Finally, we argue that the next frontier is many-modality processing, and introduce TerminalBench-O, a process-level benchmark for real-world omnimodal processing tasks. Code will be available at https://github.com/Dongping-Chen/OmniCoding.
Original Article
View Cached Full Text
Cached at: 06/02/26, 03:37 PM
# Sandboxed Coding Agents are Competitive Omni-modal Task Solvers
Source: [https://arxiv.org/html/2606.00579](https://arxiv.org/html/2606.00579)
\\correspondingauthor

Tianyi Zhou: david\.tianyi\.zhou@gmail\.com\.

Xuanao Huang\*Zhihan HuQingyuan ShiDianqi LiTianyi Zhou2‡ 1University of Maryland2MBZUAI

###### Abstract

As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models\. We show that this is not always the case: coding agents with only text\+image access and a sandboxed tool\-use interface can match, and in several settings outperform, SOTA native omnimodal models and predefined multimodal agent scaffolds across multiple audio\-video benchmarks\. Our trajectory analysis suggests that their strength comes from writing code and orchestrating tools to extract relevant evidence from transcripts, frames, and other modality signals, thereby converting omnimodal tasks into retrieval and information\-processing problems rather than ingesting entire media streams\. We further characterize their limitations through a failure taxonomy and process\-level trace analysis, and show that simple skill injection, including human\-written and self\-distilled skills, substantially improves performance\. To explore open\-source elicitation, we introduceCode\-X, a training recipe with theOmniCodingtrajectory dataset and verifiable reward, and provide baselines on Qwen\-3\.5\-9B and Qwen\-3\.6\-27B\. Finally, we argue that the next frontier is many\-modality processing, and introduceTerminalBench\-O, a process\-level benchmark for real\-world omnimodal processing tasks\. Code will be available at[https://github\.com/Dongping\-Chen/OmniCoding](https://github.com/Dongping-Chen/OmniCoding)\.

![Refer to caption](https://arxiv.org/html/2606.00579v1/x1.png)Figure 1:We discover that coding agents are strong omnimodal processors, achieving competitive performance and even surpassing native omnimodal models with fewer tokens on video and audio content through terminal tool\-use\.## 1Introduction

As AI agents move from screen\-bound workflows toward the broader world around users, they are expected to spend increasing amounts of time participating in everyday life as general\-purpose assistants\. To do so, they must handle multi\-modal information from the embodied environments, especially video and audio\. This immediately raises a question: Is native video and audio understanding necessary to an agent that is expected to process video or audio in real\-world tasks?

In practice, however, many commonly\-adopted agents nowadays, such as OpenClaw\(openclaw2026\)and Hermes\(nousresearch2026hermesagent\), are still text\-only or text\+image systems, while extending them to handle video and audio natively requires substantial additional modeling and engineering effort\. This creates a large gap between real\-world agent deployment and the dominant paradigm for video\-audio processing, namely native omni\-modal models that directly ingest audio and video\(qwen35Omni,googleGemini31Pro\)\. At the same time, modern coding agents already make highly effective use of terminal environments for search, editing, execution, and long\-horizon task solving\(yang2024sweagent,anthropicClaudeCode,openaiCodexHarness\)\. This raises a straightforward question:

Can coding agents, which are not designed for native multimodal perception, solve video and audio tasks by exploiting existing tools likeffmpeg, ASR, or new tools they write in a sandbox?

We investigate this problem using the most native Codex and Claude Code harnesses, without any human\-provided tools or skills\. Text\+image coding agents, namely GPT\-5\.4high/xhighand Claude Opus 4\.6\(openaiGPT54,anthropicOpus46,anthropicClaudeCode\), achieve strong performance on video\-audio benchmarks such as OmniGAIA, SocialOmni, and LVOmniBench\(li2026omnigaia,xie2026socialomni,tao2026lvomnibench\)\. Codex with GPT\-5\.4xhighreaches 75\.0% on OmniGAIA without directly perceiving raw video or audio, outperforming the strongest native omni\-modal model in our study, Gemini 3\.1 Pro\(googleGemini31Pro\), by more than 8%, while Claude Opus 4\.6 exceeds it by 2\.5%\. Trajectory analysis shows that this advantage does not come from end\-to\-end perception\. Instead, coding agents selectively decompose video and audio into transcripts, sampled frames, and metadata through staged tool use, which also consumes substantially fewer tokens than packing raw streams into model context\.

To push coding agents on omni content further, we ask where they still fail and whether such failures can be repaired without retraining the base model\. We sample 200 trajectories across models and benchmarks and propose anOmni Terminal Failure Taxonomytogether with a process\-level evaluation, which reveals six recurring failure modes ranging from premature exploration of multimodal content to brittle tool and environment infrastructure\(kim2025omhbench,yang2025avut,liang2025videobrowsecomp\)\. Guided by this taxonomy, we then study three skill\-injection settings inspired by recent work on reusable agent skills and harness optimization\(li2026skillsbench,lee2026metaharness\): human\-in\-the\-loop skills that encode expert workflows, calibration\-set self\-iteration where the agent revises its skills using only binary correctness feedback, and log\-driven self\-distillation where a separate agent mines reusable skills from raw execution traces\. All three improve over the no\-skill baseline, raising average OmniGAIA accuracy from 61\.4% to 76\.7% in the best setting, with execution\-trace distillation outperforming both binary self\-iteration and human\-written skills\.

The above results are obtained with closed\-source frontier models, which raises a natural follow\-up question: can the same capability be elicited on open\-source models through targeted post\-training? As an exploratory baseline, we proposeCode\-X, a complete recipe for many\-modality terminal agents\. We constructOmniCoding, a 6,035\-example trajectory dataset spanning video, audio, image, and cross\-modal tasks, and train Qwen\-3\.5\-9B and Qwen\-3\.6\-27B with a supervised warm\-start followed by GSPO\-based reinforcement learning under a process\-aware verifiable reward that jointly penalizes modality\-skipping, malformed tool calls, and answer mismatches\. Evaluating on four challenging benchmarks,Code\-Xyields substantial gains over direct MLLM baselines and achieves state\-of\-the\-art results among open\-source models, with the 27B model reaching 43\.3% on OmniGAIA, 60\.0% on LVOmniBench, where it surpasses all existing coding agents, and more than doubling performance on VideoZeroBench\.

Given the maturity of many\-modality understanding, we believe the meaningful frontier lies in many\-modality processing: editing videos, auditing image databases, producing spec\-compliant multimodal outputs, where coding agents enjoy a structural advantage over end\-to\-end omni models\. To draw attention to this second half of many\-modality, we introduceTerminalBench\-O, the first process\-level benchmark for coding agents on omni\-modal processing, with 50 real\-world tasks across video, audio, image, document, and cross\-modal categories, each evaluated along weighted dimensions over intermediate artifacts and final output compliance\. Despite strong omni\-understanding results, these tasks remain hard for current agents as they demand long\-horizon planning, reliable tool chaining, cross\-modal grounding, artifact verification, and strict output adherence; even the strongest GPT\-5\.5 Codex reaches only 24% onTerminalBench\-O\. We hope our analysis, training recipe, and benchmark open new directions for many\-modality processing\.

## 2Analysis of Sandboxed Agent Performance on Omni\-Modal Tasks

Preliminary: Adapting Sandboxed Agents for Omni\-Modal Tasks\.We represent each omni\-modal task asx=\(q,ℱ,𝒴\)x=\(q,\\mathcal\{F\},\\mathcal\{Y\}\), whereqqis a natural\-language instruction,ℱ\\mathcal\{F\}is a set of input files, and𝒴\\mathcal\{Y\}is the set of acceptable final answers \(formal definitions and the full rollout formulation are deferred to Appendix[C](https://arxiv.org/html/2606.00579#A3)\)\. A native omni\-modal model and a sandboxed agent differ mainly inwhereomni\-modal processing happens:

native omni:\(q,ℱ\)→packmodel context→Mθy^,\\displaystyle\(q,\\mathcal\{F\}\)\\xrightarrow\{\\mathrm\{pack\}\}\\text\{model context\}\\xrightarrow\{M\_\{\\theta\}\}\\hat\{y\},sandboxed agent:\(q,ℱ\)→stageworkspace→terminal interactionsτ→Ansy^\.\\displaystyle\(q,\\mathcal\{F\}\)\\xrightarrow\{\\mathrm\{stage\}\}\\text\{workspace\}\\xrightarrow\{\\text\{terminal interactions\}\}\\tau\\xrightarrow\{\\mathrm\{Ans\}\}\\hat\{y\}\.A native omni\-modal modelMθM\_\{\\theta\}packs all files into model context via media\-specific encoding and sampling, treating video and audio as first\-class model inputs to generate answersy^\\hat\{y\}\. A sandboxed agent, by contrast, stagesℱ\\mathcal\{F\}into an isolated workspace and is given only the instruction and file paths; it then inspects files, invokes tools such asffmpeg,ffprobe, OCR, and ASR, writes and executes scripts, produces intermediate artifacts \(transcripts, sampled frames, OCR text, timestamps, tables, crops\), which compose a trajectoryτ\\tau, and submits a final answery^\\hat\{y\}parsed fromτ\\tau\. This view treats raw video, audio, images, and documents as environment state rather than mandatory model context, so the agent can solve omni\-modal tasks by adaptively reducing raw media into compact evidence that fits the native reasoning interface of a general\-purpose coding model\.

Agent Harness and Models\.We use Claude Code\(anthropicClaudeCode\)and Codex\(openaiCodexLoop,openaiCodexHarness\)as our two primary agent harnesses\. Under CC, we evaluate Claude Opus 4\.6 and Claude Sonnet 4\.6\(anthropicOpus46,anthropicSonnet46\), as well as MiniMax M2\.7\(minimaxM27\)and Kimi K2\.5\(kimiK25\)\. Claude Opus 4\.6 and Claude Sonnet 4\.6 both use the defaultmaxsetting\. Under Codex, we evaluate GPT\-5\.4 withlow,medium,high, andxhighreasoning effort settings\(openaiGPT54\)\. Throughout evaluation, we follow the official harnesses as closely as possible in order to approximate realistic end\-user usage\.

For native omni models, we evaluate Gemini 3\.1 Pro\(googleGemini31Pro\), Gemini 3 Flash\(googleGemini3Flash\), Qwen 3\.5\-Omni\(qwen35Omni\), and MiMo\-V2\-Omni\(xiaomiMiMoV2Omni\)\. We further include two modular omni\-agent baselines, OmniAgent\(tao2025active\)and Agent\-Omni\(lin2025agentomni\), as strong system\-level comparators\. Unless otherwise specified, both agent baselines are instantiated with GPT\-5\.4 usinghighreasoning effort\(openaiGPT54\)\.

Evaluation Protocol\.We use four benchmarks with complementary emphases: OmniGAIA for open\-ended, tool\-mediated factual and multi\-hop omni\-modal reasoning\(li2026omnigaia\), SocialOmni for audio\-visual social perception and identity recognition\(xie2026socialomni\), LVOmniBench for long\-form audio\-video understanding\(tao2026lvomnibench\), and VideoZeroBench for challenging video\-centric reasoning\(meng2026videozero\)\. Given their scale, we evaluate the first 100 examples per level on SocialOmni and LVOmniBench, and on SocialOmni we report only Level\-1 and Level\-2 Q1 accuracy, omitting the aggregate Level\-2 score since it is computed only over Level\-2\-eligible examples and introduces substantial selection bias\. For VideoZeroBench, questions on the same video are asked in a grouped manner\. Unless otherwise noted, all models are evaluated with pass@1, and the agent receives only the task instruction and filesystem path\(s\) to the input files, with no pre\-built Skills\.

Environments\.All sandboxed\-agent experiments run in a controlled workspace based onUbuntu 22\.04\.3 LTSwithPython 3\.13\.2and standard Unix utilities\. By default, models may access web search and download additional models or dependencies; we inspect trajectories to detect answer leakage and evaluator hacking\. The environment provides common multimedia and scientific tooling \(e\.g\.,ffmpeg/ffprobe, Tesseract OCR, ImageMagick, OpenCV, Librosa\), and agents may freely install packages and author task\-specific scripts for video and audio processing\.

### 2\.1Main Results and In\-depth Analysis

Table 1:Given the limitations of the API and credits before the paper’s release, some benchmark results will be updated in the next manuscript\.‡Minimax M2\.7 uses its own MCP tool for image reading\.§\\SDirect image read is disabled for GPT\-5\.4\-highin this ablation setting\.Model/SystemHarnessOmniGAIASocial OmniLVOmniBenchVideoZeroBenchEasyMediumHardAvg\.Level1Level2 Q1 AccEasyMediumHighAvg\.Level 3Omnimodal LLMsGemini 3\.1 Pro–77\.863\.852\.666\.190\.062\.080\.171\.848\.069\.017\.6Gemini 3 Flash–67\.246\.937\.251\.785\.067\.072\.264\.132\.059\.017\.8Qwen 3 Omni–19\.710\.69\.013\.361\.039\.030\.617\.928\.025\.05\.2Mimo V2 Omni–33\.922\.519\.125\.864\.040\.051\.425\.224\.035\.07\.4Qwen 3\.5 Omni Plus–43\.539\.429\.038\.788\.040\.058\.335\.944\.046\.010\.4Coding Agents \(Our Findings\)GPT\-5\.4 LowCodex63\.153\.844\.955\.064\.059\.058\.346\.236\.048\.021\.0GPT\-5\.4 MediumCodex66\.455\.043\.656\.471\.049\.072\.243\.640\.053\.024\.0GPT\-5\.4 HighCodex70\.460\.050\.061\.472\.051\.066\.659\.044\.058\.027\.2GPT\-5\.4 xHighCodex82\.075\.064\.175\.075\.060\.066\.764\.132\.057\.027\.6Claude Opus 4\.6CC74\.669\.457\.768\.654\.068\.047\.241\.028\.040\.0\-Claude Sonnet 4\.6CC73\.862\.548\.863\.3\-\-\-\-\-\-\-MiniMax\-M2\.7‡CC38\.534\.423\.133\.328\.046\.050\.015\.424\.030\.05\.8Kimi K\-2\.5CC72\.150\.639\.755\.646\.050\.055\.628\.041\.043\.015\.6Tool\-use Agent \(GPT\-5\.4 High as backbone\)OmniAgent\-77\.963\.155\.166\.455\.056\.050\.021\.028\.033\.012\.2Agent\-Omni\-54\.141\.919\.241\.137\.048\.056\.028\.020\.036\.05\.0Text\-Only Coding AgentGPT\-5\.4 High§\\SCodex70\.563\.856\.464\.474\.048\.052\.838\.528\.041\.019\.0

From Table[1](https://arxiv.org/html/2606.00579#S2.T1), coding agents prove highly competitive with native omnimodal LLMs across all four benchmark suites\. On OmniGAIA, GPT\-5\.4xhigh\(75\.0\) clearly outperforms Gemini 3\.1 Pro \(66\.1\), especially on medium\- and hard\-level tasks; on VideoZeroBench, the best coding agent \(27\.6\) also exceeds Gemini 3 Flash \(17\.8\)\. Native omni models lead on Social Omni Level\-1, though Claude Opus 4\.6 slightly edges out the best native score on Level\-2 Q1\. On LVOmniBench, Gemini 3\.1 Pro remains the strongest overall, but coding agents stay competitive on the easy and medium subsets\. These patterns suggest that benchmarks emphasizing fine\-grained temporal perception and audio–visual alignment favor end\-to\-end omni modeling, while tool\-oriented benchmarks such as OmniGAIA align more naturally with the coding\-agent paradigm\.

Finding 1:Coding agents with only text\+image capability can perform competitively with, and in some cases outperform,*state\-of\-the\-art*native omnimodal LLMs on omnimodal tasks\.

Across the GPT\-5\.4 variants under the same Codex harness, increased reasoning effort yields clear gains on omnimodal performance\. On OmniGAIA, average accuracy rises steadily from 55\.0 \(low\) to 56\.4 \(medium\), 61\.4 \(high\), and 75\.0 \(xhigh\), with improvements appearing across easy, medium, and hard subsets\. The same trend is visible on LVOmniBench, where higher\-effort settings outperform lower\-effort ones, though not strictly monotonically\. Since the model family and harness are held fixed, this isolates the contribution of inference\-time computation alone\.

Finding 2:Increasing reasoning effort generally improves coding\-agent performance, suggesting that omnimodal task success depends not only on model perception capacity but also on the depth of agentic computation\.

![Refer to caption](https://arxiv.org/html/2606.00579v1/Figs/tool_use_1x6.png)Figure 2:Tool\-use distributions of GPT\-5\.4highand Claude Opus 4\.6maxacross four benchmarks\.Tool\-use Analysis\.We analyze tool\-use behavior in coding agents\. Table[2](https://arxiv.org/html/2606.00579#S2.T2)report the six most behaviorally salient tools \(ffmpeg,ffprobe,python3,whisper,WebSearch,tesseract\), together with their average calls per case \(Avg\.\), success/failure gap \(Gap, in percentage points\), and total tool calls per case \(Avg\. Tool Use\); Figure[2](https://arxiv.org/html/2606.00579#S2.F2)further visualizes overall usage patterns for GPT\-5\.4highand Claude Opus 4\.6Max\. Across benchmarks, coding agents handle non\-text inputs via a staged, tool\-mediated pipeline in whichffprobe/ffmpeginspect and transform media whilewhisperandpython3support transcription and downstream analysis\. Media extraction, transcription, and external search yield the clearest positive gains, whereaspython3, our proxy for self\-authored tools, is frequently invoked but shows mixed gains, suggesting ad\-hoc scripting is common but not reliably beneficial\. Holding the harness and model family fixed, higher reasoning\-effort settings tend to use more tools, most visibly on OmniGAIA and LVOmniBench\.

Finding 3:Omnimodal problem solving proceeds through a staged tool\-mediated pipeline, where media extraction, transcription, and external search \(ffmpeg,ffprobe,whisper,web\_search\) deliver the most consistent gains\. Agents invokepython3frequently but unreliably; tool\-use performance scales with reasoning effort when harness and model family are fixed\.

![Refer to caption](https://arxiv.org/html/2606.00579v1/x2.png)Figure 3:Pareto\-front of Acc\-Token tradeoff\. We find that sandboxed coding agents are efficient and competitive omnimodal task solvers on avg\. of four benchmarks\. The gray line indicates the estimated MLLM baseline\.Token Usage Analysis\.Given the concern of using more tokens for better performance, we calculate the avg\. token input and accuracy across four benchmarks\. As shown in Figure[3](https://arxiv.org/html/2606.00579#S2.F3), surprisingly, coding agents consume much fewer tokens compared to omnimodal models\. Based on our analysis of model trajectories and tool use patterns, we found that the image and audio content actually consumed by the coding agent is far smaller than what would otherwise be fed in its entirety to an omnimodal model\. Through proactive tool use, the model selectively retrieves only the content it needs, substantially reducing the multimedia footprint in its context and thereby eliminating a significant amount of redundant tokens\.

Text\-Only Ablation\.Since some agents \(*e\.g\.*, DeepSeek V4\(deepseekai2026deepseekv4\), MiniMax M2\.7\(minimaxM27\)\) are text\-only LLMs, we include a text\-only ablation in Table[1](https://arxiv.org/html/2606.00579#S2.T1), evaluating Codex GPT\-5\.4\-highunder the same coding\-agent setup but without direct image access\. It remains competitive with many native omnimodal LLMs, showing only a modest gap to its vision\-enabled counterpart onOmniGAIAandSocialOmnieven matching or slightly exceeding it, while on the more vision\-centricLVOmniBenchandVideoZeroBenchperformance drops sharply\. This exposes a limitation of current omnimodal benchmarks: many tasks can be solved through text or shallow perception \(e\.g\., OCR\) rather than deep visual understanding\.

![Refer to caption](https://arxiv.org/html/2606.00579v1/x3.png)Figure 4:Trajectory DAG of aOmniGAIAsample \(Baylor campus\-tour sign Texas sports\-facility audio; ground\-truth169169km\)\. Nodes are annotated as agent steps, coloured by the step\-supervisor reward \( /0/ for\+1\+1/0/−1\-1\)\.*Audio*sub\-goal includes three parallel strategies:*speech\_recognition*,*whisper*, and*YouTube\-ID lookup*; evidence from the*Image*and*Audio*sub\-goals merges into*Distance*\.![Refer to caption](https://arxiv.org/html/2606.00579v1/x4.png)Figure 5:Distribution of primary error types across four coding agents onOmniGAIA, one pie per agent\. The sample sizennis shown under each agent name\.Table 2:Tool\-use statistics of representative multimodal tools across coding\-agent systems onOmniGAIA,SocialOmni\(Level 1 & Level 2\), andLVOmniBench\. “Avg\.” denotes the average number of calls per case\. “Gap” is reported in percentage points and denotessuccess\_case\_ratio−failure\_case\_ratio\\text\{success\\\_case\\\_ratio\}\-\\text\{failure\\\_case\\\_ratio\}\. Positive gaps indicate tools more commonly used in successful cases, while negative gaps indicate tools more associated with failure\.SystemHarnessAvg\. Tool Useffmpegffprobepython3whisperweb\_searchtesseractAvg\.Gap \(%\)Avg\.Gap \(%\)Avg\.Gap \(%\)Avg\.Gap \(%\)Avg\.Gap \(%\)Avg\.Gap \(%\)OmniGAIAKimi K\-2\.5CC22\.631\.45\+12\.10\.34\+7\.52\.15\+9\.10\.62\+6\.62\.89\+19\.20\.44\+0\.0MiniMax\-M2\.7CC44\.861\.50\-0\.80\.49\-8\.37\.43\-5\.40\.31\+0\.82\.94\+2\.90\.06\-0\.8Claude Sonnet 4\.6CC40\.851\.97\+0\.40\.54\+1\.22\.49\+16\.50\.77\+8\.93\.90\+14\.50\.02\+0\.2GPT\-5\.4 LowCodex25\.502\.59\-2\.31\.19\-6\.64\.36\-9\.90\.08\+2\.75\.74\+1\.40\.69\-9\.9GPT\-5\.4 MediumCodex31\.833\.05\-4\.01\.25\-0\.14\.98\-6\.60\.06\+1\.57\.97\-4\.30\.81\-7\.3SocialOmni Level1GPT\-5\.4 LowCodex16\.286\.94\+11\.12\.04\+4\.02\.94\+9\.40\.22\-1\.70\.00\+0\.00\.00\+0\.0GPT\-5\.4 MediumCodex21\.529\.18\+3\.42\.06\+0\.03\.22\+3\.20\.52\+24\.70\.00\+0\.00\.00\+0\.0GPT\-5\.4 HighCodex19\.685\.06\+12\.92\.18\+16\.52\.32\+10\.60\.46\+12\.20\.00\+0\.00\.00\+0\.0GPT\-5\.4 xHighCodex39\.909\.24\+33\.32\.06\+34\.715\.64\+28\.01\.22\+14\.70\.00\+0\.00\.04\-2\.7SocialOmni Level2GPT\-5\.4 LowCodex5\.591\.98\-4\.00\.76\-5\.20\.41\-5\.70\.01\-2\.40\.00\+0\.00\.00\+0\.0GPT\-5\.4 MediumCodex10\.074\.92\+0\.01\.20\+2\.01\.28\-16\.80\.04\-1\.90\.00\+0\.00\.00\+0\.0GPT\-5\.4 HighCodex13\.476\.75\+0\.01\.25\+0\.01\.05\-3\.10\.00\+0\.00\.10\+5\.90\.00\+0\.0GPT\-5\.4 xHighCodex12\.763\.67\-0\.81\.26\+0\.81\.55\-9\.20\.17\+6\.70\.00\+0\.00\.00\+0\.0LVOmniBenchGPT\-5\.4 LowCodex31\.207\.10\+0\.01\.40\+0\.06\.45\-16\.01\.31\-9\.50\.00\+0\.00\.44\-8\.0GPT\-5\.4 MediumCodex28\.726\.62\+2\.11\.26\+0\.04\.67\+12\.81\.46\+15\.90\.00\+0\.00\.29\+8\.2GPT\-5\.4 HighCodex38\.758\.56\-3\.41\.20\+0\.07\.64\-15\.01\.34\-1\.60\.00\+0\.00\.66\+3\.8GPT\-5\.4 xHighCodex42\.2910\.93\+0\.61\.23\+0\.08\.21\-4\.21\.65\+13\.20\.00\+0\.00\.69\+10\.1

### 2\.2Failure Analysis: Taxonomy and Process\-level Trajectory\.

Given the strong performance of coding agents, we investigate the challenges they face in omni content processing\. We propose a new failure mode taxonomy based on task type, and sample 200 trajectories \(6,268 steps in total\) across different models and benchmarks, along with their primary failure reasons\. For each trajectory, we construct a DAG to identify its main trunk and branches, enabling process\-level annotation\. The resulting taxonomy comprises six failure modes: \(i\)Audio Perception and Extraction, \(ii\)Video Perception and Extraction, \(iii\)Insufficient Exploration of Modal Content, \(iv\)Knowledge Retrieval and Factual, \(v\)Logical Reasoning and Calculation, and \(vi\)Tool and Environment Infrastructure\. Detailed definitions and representative examples are deferred to Appendix[G](https://arxiv.org/html/2606.00579#A7)\. To scale the analysis beyond the 200 hand\-annotated trajectories, we use GPT\-5\.5 as the process\-level judge, which we validate against human labels on this small benchmark \(see Appendix[8](https://arxiv.org/html/2606.00579#A7.T8)for the validation protocol and results\)\.

Figure[5](https://arxiv.org/html/2606.00579#S2.F5)reports the empirical distribution of the six failure modes across four coding agents onOmniGAIA\. Knowledge\-Retrieval dominates the residual errors for three of four agents \(Kimi 40%, Sonnet 43%, MiniMax 35%\), while Codex\-GPT5\.4\-Low is distinctive in that Tool/Environment Failure \(21%\) and Insufficient Exploration \(20%\) each rival its knowledge errors \(27%\)\. MiniMax further stands out for heavy Video\-Perception errors \(28%\), indicating that different agents break down along different axes of the taxonomy\. Beyond aggregate rates, Figure[4](https://arxiv.org/html/2606.00579#S2.F4)dissects a singleOmniGAIAtrajectory as a DAG of annotated steps, showing how multiple modes from our taxonomy—audio\-perception errors, rule violations, and scraping failures—co\-occur with exploration\-with\-recovery behavior inside the same run, and how the agent ultimately recovers the correct answer\.

Case StudyQuestion\.Watch the video showcasing a prominent surfing destination in British Columbia, Canada\. Identify the specific town featured\. Then, according to the Canadian Marine Mammal Regulations and 2024/2025 management measures for killer whale protection, what is the minimum mandatory vessel approach distance in meters required for killer whales in the waters immediately offshore of that town?Labeled Answer\.200Shared Prefix\.I will first use the video to identify the town shown in the clip\. From the coastal scenery and visible landmarks, I will narrow the location down toTofinobefore looking up the whale\-protection rule\.tool\_callExtract frames from the video and inspect visible location cues\./tool\_calltool\_responseThe clip shows a prominent surfing destination in British Columbia, consistent withTofino\./tool\_responsedivergenceAfter identifyingTofino, the trajectories diverge on how they handle the Southern B\.C\. special protection measure: the failed branch applies the stricter400 mrule directly, while the successful branch checks whether that special corridor actually covers waters immediately offshore of Tofino\./divergenceFailed TrajectoryI have identified the town asTofino\. Next I will check official 2024/2025 killer\-whale protection pages and use the applicable vessel\-distance rule for this location\.tool\_callweb\_search:official killer\-whale measures near Tofino, 2024/2025/tool\_calltool\_responseThe search returns official pages mentioning a400 mapproach distance in a protected Southern B\.C\. corridor\./tool\_responseI will use that stricter regional measure as the answer for waters offshore ofTofino\.final\_answerTofino, 400 meters/final\_answerSuccessful TrajectoryI have identified the town asTofino\. Next I will check whether the stricter Southern B\.C\. killer\-whale protection corridor actually applies to waters immediately offshore of this town, rather than assuming that the regional400 mrule governs here\.tool\_callweb\_search:official killer\-whale measures near Tofino, 2024/2025/tool\_calltool\_responseThe search returns official pages mentioning a400 mapproach distance in a protected Southern B\.C\. corridor\./tool\_responseI will then check the general regulation for killer whales in Pacific waters so I can compare the special regional measure against the baseline legal rule\.tool\_callweb\_search:Marine Mammal Regulations killer whales Pacific waters 200 m/tool\_calltool\_responseThe official Marine Mammal Regulations give a baseline minimum approach distance of200 mfor killer whales in Pacific waters\./tool\_responseI will verify that the Southern B\.C\. special corridor does not directly govern the waters immediately offshore ofTofino, so I will use the general200 mrule instead\.final\_answerTofino, 200 metres/final\_answerKey Difference\.The failed trajectory directly applies the Southern B\.C\. special400 mmeasure to Tofino, whereas the successful trajectory explicitly checks whether that special corridor covers Tofino’s offshore waters before falling back to the general200 mrule\.Figure 6:A representative OmniGAIA case comparing GPT\-5\.4highwithout Skills and with Human\-in\-the\-loop Skills\. The no\-Skills run answers incorrectly by directly applying the Southern B\.C\. special400 mrule to Tofino, whereas the Human\-in\-the\-loop\-Skills run answers correctly using200 mrule for the waters immediately offshore of Tofino\.
### 2\.3Skills and Self\-Evolution for Strong Many\-modality Content Processing

Experiment Setups\.Following recent work, we treat Skills as structured, inference\-time packages of procedural knowledge that supply reusable workflows, instructions, and optionally executable resources to an agent\(li2026skillsbench,xu2026agentskills\)\. Unless explicitly stated otherwise, the main results in this paper are reported without Skills\. We study three settings as follows:

- •Calibration\-set self\-iteration\.The agent is given a small calibration set and iteratively revises its Skills using only binary correctness feedback; we reveal whether an attempt is correct or incorrect, but do not provide any human\-written diagnosis of the intermediate process\.
- •Log\-driven self\-distillation\.A separate agent reviews raw execution logs, extracts recurring success and failure patterns, and writes Skills from these traces\. This setting is inspired by recent log\- and trace\-driven harness optimization work such as Meta\-Harness\(lee2026metaharness\)\.
- •Human\-in\-the\-loop Skills\.We provide manually written Skills that encode a strong human route or best\-known workflow\. This setting serves as an approximate upper bound on the value of explicit procedural knowledge, while also introducing human prior and route bias\.

Results and Analysis\.Figure[7](https://arxiv.org/html/2606.00579#S2.F7)shows that explicitSkillsconsistently improveOmniGAIAperformance over theNo Skillsbaseline across all difficulty levels\. All skill\-enhanced variants raise average accuracy, with gains remaining visible on theHighsplit\. The best setting improves average accuracy from 61\.4% to 76\.7%, suggesting that inference\-time procedural knowledge can substantially strengthen omni\-modal agent performance\. Figure[6](https://arxiv.org/html/2606.00579#S2.F6)provides a representative example, where Skills help the agent avoid an incorrect application of regional rules and reach the correct answer\.

![Refer to caption](https://arxiv.org/html/2606.00579v1/Figs/omni_coding_agent_sheet2_skill_accuracy.png)Figure 7:Performance of GPT\-5\.4highon OmniGAIA under four settings\.Both self\-improving methods outperform the baseline, withLog\-driven self\-distillationachieving the strongest results: 76\.7% average accuracy versus 73\.0% forCalibration\-set self\-iteration, and the gap widens on harder problems \(65\.4% vs\. 59\.0% on theHighsplit\)\. Execution traces appear to provide richer supervision than binary feedback from a small calibration set, aggregating recurring success and failure patterns from broader agent behavior rather than adapting to local calibration\-set patterns, likely because they offer process\-level signals with broader coverage, enabling more stable and generalizable skill refinement in practice\.

Finding 4:Execution\-trace\-based skill refinement yields more effective and robust improvements than both calibration\-based self\-iteration and manually written Skills\.

## 3Code\-X: Towards Stronger Coding Agent for Omnimodal Content

SubsetAllSFTRLType / Modality*By source*Omnimodal\-Agent\-SFT\-2K2,1101,413697open\-ended, multimodalOmniVideoBench997670327MCQ, video\+audioAVUTBenchmark1,7001,136564MCQ, video\+audioVideo\-MME\-v2 selected1,228823405MCQ, video*By modality*Video\+audio2,3691,585784dual\-stream mediaVideo2,2251,493732visual\-temporal mediaAudio\+image1,437961476cross\-modal mediaAudio431audio\-only mediaTotal6,0354,0421,993

Table 3:Statistics ofOmniCoding\.![Refer to caption](https://arxiv.org/html/2606.00579v1/x5.png)
Figure 8:Duration distribution\.
Given that closed\-source coding models outperform native omni\-modal models in both performance and efficiency, we ask whether such capability can be elicited through post\-training and reproduced on open\-source models\. To this end, we propose a strong baseline on Qwen\-3\.5\-9B and Qwen\-3\.6\-27B, with a complete SFT\-then\-RL training recipe, aiming to explore this direction on open\-source coding agents on omnimodal content\.

### 3\.1OmniCoding: Synthesized Omnimodal Coding Agent Trajectory Dataset

We constructOmniCoding, a dataset of 6,035 verifiable question–answer examples paired with video, audio, or image files for multimodal terminal\-tool\-use agents\. Each example follows a unified schema with a question, a list of acceptable ground\-truth answers, media paths, and source/modality/task metadata\. We draw from four complementary sources, OmniGAIA\-SFT\-2K\(li2026omnigaia\), OmniVideoBench\(li2025omnivideobench\), AVUTBenchmark\(yang2025avut\), and Video\-MME\-v2\(fu2026video\), keeping only the Video\-MME\-v2 subset requiring audio\-visual or temporal reasoning\. Table[3](https://arxiv.org/html/2606.00579#S3.T3)and Figure[8](https://arxiv.org/html/2606.00579#S3.F8)report the per\-source composition and media\-duration distribution\. Construction proceeds in three steps:\(i\) Normalization, unifying all sources into the same schema with deduplicated questions and media consolidated into a flat directory;\(ii\) Verifiability, regenerating open\-ended questions with ground\-truth lists covering common surface variants \(units, abbreviations, capitalization, number formats\) and converting multiple\-choice into option\-letter answer sets; and\(iii\) Cross\-review, verifying answer lists and task coverage\. We stratified\-split the data into 4,042 SFT and 1,993 RL examples by source\-task key, and use exact match against the ground\-truth list \(extracted from an<answer\>tag\) as the binary RL reward\.

### 3\.2Method and Training Infrastructure

We train the terminal agent with supervised warm\-start followed by reinforcement learning from verifiable rewards\. All training examples follow the schemax=\(q,ℱ,𝒴\)x=\(q,\\mathcal\{F\},\\mathcal\{Y\}\), where multiple\-choice tasks use option\-letter answer sets and open\-ended tasks are assigned curated answer sets covering common aliases, units, abbreviations, capitalization, and number formats\. The agent submits its final answer inside`<answer\>`\.

Supervised warm start\.For examples with demonstration trajectories, we fine\-tune on assistant\-generated tokens only, masking terminal outputs, tool responses, and environment observations from the loss\. This stage teaches the basic interaction protocol: inspect files, run commands, create intermediate artifacts, recover from tool failures, and submit answers in the required format\.

Reward design\.We design a process\-aware verifiable reward that goes beyond binary answer matching, addressing three failure patterns: \(1\) the agent guesses without engaging the workspace, \(2\) it ignores the relevant modality \(e\.g\., answering a video question from text alone\), and \(3\) it emits malformed, escape\-attempting, or otherwise broken tool calls\. For each rolloutτ\\tauwe compute

R\(x,τ\)=\{0,ifτis removed \(timeout / step\-limit / context overflow\),base\(τ\)\+rmod\(τ\)\+rtool\(τ\),otherwise,R\(x,\\tau\)=\\begin\{cases\}0,&\\text\{if \}\\tau\\text\{ is removed \(timeout / step\-limit / context overflow\)\},\\\\\[2\.0pt\] \\text\{base\}\(\\tau\)\+r\_\{\\mathrm\{mod\}\}\(\\tau\)\+r\_\{\\mathrm\{tool\}\}\(\\tau\),&\\text\{otherwise\},\\end\{cases\}with the following components:

- •Three\-tier base reward\.base\(τ\)=wcorrect\\text\{base\}\(\\tau\)=w\_\{\\text\{correct\}\}if gated\-correct \(normalized exact match against𝒴\\mathcal\{Y\}, modality engaged,1\\ge 1tool call, no escape attempt\); elsewfmt\+w\_\{\\text\{fmt\}\+\}if the agent terminated cleanly viatask\_completewith a non\-empty`<answer\>`; elsewfmt−w\_\{\\text\{fmt\}\-\}\. The format bonus does*not*stack with correctness; it exists solely to keep gradient signal on tried\-correctly\-but\-answered\-wrong trajectories that would otherwise collapse the group to zero variance\.
- •Modality penaltyrmodr\_\{\\mathrm\{mod\}\}\.A fixed negative weightwmodw\_\{\\mathrm\{mod\}\}is charged when the task contains video or audio but the agent invokes no tool from the corresponding universe \(ffmpeg,whisper, etc\.\), detected by substring matching on bash commands\.
- •Bad\-tool penaltyrtoolr\_\{\\mathrm\{tool\}\}\.A weighted ratio−wtoolPbad\-w\_\{\\mathrm\{tool\}\}\\cdot P\_\{\\text\{bad\}\}withPbad=\(nunparseable\+ndisallowed\+nescape\+nsyntax\-fail\)/\(1\+ntool\)P\_\{\\text\{bad\}\}=\(n\_\{\\text\{unparseable\}\}\+n\_\{\\text\{disallowed\}\}\+n\_\{\\text\{escape\}\}\+n\_\{\\text\{syntax\-fail\}\}\)/\(1\+n\_\{\\text\{tool\}\}\)aggregates malformed calls, disallowed tools, gold\-answer leakage attempts, and persistent shell errors\.

Weight values are listed in Appendix[H\.1](https://arxiv.org/html/2606.00579#A8.SS1)\. We further apply group\-level mechanisms:\(i\) group advantage normalizationcenters rewards within each group ofGGsibling rollouts;\(ii\) zero\-variance maskingdrops groups where all rollouts succeed or fail;\(iii\) active\-only baselineexcludes removed samples from group statistics; and\(iv\) dynamic filterdrops groups with1\\le 1active sample or near\-zero variance originated from DAPO\(yu2025dapo\)\.

GSPO update\.For each task we sampleGGrolloutsτiπθold\(x\)\\tau\_\{i\}\\sim\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(\\cdot\\mid x\)and compute group\-normalized advantagesAi=\(Ri−R¯\)/\(σR\+ϵA\)A\_\{i\}=\(R\_\{i\}\-\\bar\{R\}\)/\(\\sigma\_\{R\}\+\\epsilon\_\{A\}\)\. Letzi=\(zi,1,…,zi,Li\)z\_\{i\}=\(z\_\{i,1\},\\ldots,z\_\{i,L\_\{i\}\}\)be the model\-generated tokens inτi\\tau\_\{i\}\(tool outputs masked from the loss\) andci,ℓc\_\{i,\\ell\}the transcript prefix beforezi,ℓz\_\{i,\\ell\}\. Following GSPO\(zheng2025gspo\), we use a length\-normalized sequence\-level importance ratio and optimize the clipped objective:

ρi\(θ\)=exp⁡\(1Li\\slimits@ℓ=1Li\[log⁡πθ\(zi,ℓci,ℓ\)−log⁡πθold\(zi,ℓci,ℓ\)\]\),𝒥GSPO\(θ\)=𝔼x\[1G\\slimits@i=1Gmin⁡\(ρi\(θ\)Ai,clip\(ρi\(θ\),1−ϵ−,1\+ϵ\+\)Ai\)\]\.\\displaystyle\\rho\_\{i\}\(\\theta\)=\\exp\\\!\\left\(\\frac\{1\}\{L\_\{i\}\}\\tsum\\slimits@\_\{\\ell=1\}^\{L\_\{i\}\}\\big\[\\log\\pi\_\{\\theta\}\(z\_\{i,\\ell\}\\mid c\_\{i,\\ell\}\)\-\\log\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(z\_\{i,\\ell\}\\mid c\_\{i,\\ell\}\)\\big\]\\right\),\\qquad\\mathcal\{J\}\_\{\\mathrm\{GSPO\}\}\(\\theta\)=\\mathbb\{E\}\_\{x\}\\\!\\left\[\\frac\{1\}\{G\}\\tsum\\slimits@\_\{i=1\}^\{G\}\\min\\\!\\Big\(\\rho\_\{i\}\(\\theta\)A\_\{i\},\\,\\mathrm\{clip\}\\\!\\big\(\\rho\_\{i\}\(\\theta\),1\-\\epsilon\_\{\-\},1\+\\epsilon\_\{\+\}\\big\)A\_\{i\}\\Big\)\\right\]\.

Since reward is assigned to whole trajectories, sequence\-level clipping is a natural fit: one importance ratio weights all generated tokens while tool outputs stay masked\.

Training infrastructure\.Our RL system decouples sandboxed rollout generation from policy optimization\. Rollout workers run the current policy in isolated workspaces, execute terminal commands, collect trajectories, parse answers, compute reward components, and return old log\-probabilities and masks; the trainer consumes these logs and applies GSPO without executing tools or touching raw media\. This separation is useful because omni\-modal rollouts involve non\-differentiable operations, while the optimizer only needs token sequences, masks, old log\-probabilities, and scalar rewards\. Sandboxes are hosted on a Slurm cluster with CPU/GPU nodes and storage, the trainer runs on cloud servers, and the two are connected via a*Cloudflare*tunnel for trajectory collection\.

### 3\.3Experiment Setups and Results

Agent Harness\.To trainOmniCoding, we build a local terminal\-agent harness adapted from Terminus\-KIRA\(terminuskira2026\)for converting raw multimodal files into small, checkable agent operations\. At each episode, the environment creates an isolated workspace with the question and staged media files\. The media remains in the filesystem rather than being packed into the model context, so the agent must decide how to inspect it through audio extraction, frame sampling, OCR/ASR, metadata reading, scripting, and intermediate verification\. The visible action space is deliberately compact:execute\_commandsruns one command in a persistent shell,image\_readinspects an image as a native multimodal observation, andtask\_completeends the episode\. Higher\-level media processing is expressed through code and command\-line tools rather than benchmark\-specific APIs, keeping the harness lightweight but expressive\. Unlike general software\-engineering harnesses\(wang2024openhands,minisweagent2025\), our scaffold focuses on offline multimedia processing and trajectory collection: tool calls, terminal/image observations, final answers, exit reasons, and rewards are serialized for SFT filtering, RL replay, and failure analysis\. Full details are in Appendix[I\.1](https://arxiv.org/html/2606.00579#A9.SS1)\.

Baselines and Evaluation Protocol\.We train Qwen3\.5\-9B and Qwen3\.6\-27B and compare against three baselines:\(i\)training\-free direct MLLM inference, where the model is fed visual frames sampled at fps=1 \(capped at 512\) together with Whisper\-large ASR transcripts;\(ii\)the same MLLM evaluated under our harness with direct image reading; and\(iii\)SFT warm\-up baseline\. We additionally include Codex GPT\-5\.4 \(low\) under our harness as a reference\. The thinking budget is set to 4K tokens; if the model fails to produce a correctly formatted answer, we issue a multi\-turn reminder to re\-answer in the required format, and if reasoning is truncated, the model is allowed to continue thinking\. We evaluate on the same benchmark suite as before, reporting pass@1 throughout\.

Experiment Results and Analysis\.As shown in Table[4](https://arxiv.org/html/2606.00579#S3.T4), model\-tool\-environment finetuned models are more adaptive to harness and perform much better than baseline and even direct MLLM settings that take up to 512 images\.Code\-X\-RL\-9B shows extremely reasoning and terminal tool\-use capability, achieving 15\.8 on VideoZeroBench, which is almost competitive to Gemini 3\.1 Pro\.

Table 4:Performance of Qwen3\.5\-9B and Qwen3\.6\-27B under different settings on four benchmarks\. For the Direct MLLM setting in the OmniGAIA benchmark, we provide models with tools following their codebase\. The thinking budget is fixed at 4K tokens for all settings\.Model/SettingOmniGAIASocial OmniLVOmniBenchVideoZeroBenchEasyMediumHardAvg\.Level1Level2EasyMediumHighAvg\.Level 3Strong BaselinesOmniAtlas\-Omni\-30B\-A3B31\.118\.89\.020\.850\.061\.044\.423\.136\.034\.06\.0Codex\-GPT\-5\.4 Low63\.153\.844\.955\.064\.059\.058\.346\.236\.048\.021\.0Qwen3\.5\-9BDirect MLLM23\.016\.223\.120\.037\.059\.044\.433\.316\.033\.07\.2Baseline \(direct image\)23\.814\.417\.918\.341\.058\.041\.715\.436\.030\.05\.9Code\-X\(Ours, SFT\)30\.021\.017\.023\.150\.061\.048\.628\.916\.033\.09\.5Code\-X\(Ours, RL\)32\.826\.324\.428\.152\.063\.072\.241\.044\.053\.015\.8Δ\\Deltavs Direct MLLM\+42\.6%\+62\.3%\+5\.6%\+40\.5%\+40\.5%\+6\.8%\+62\.6%\+23\.1%\+175\.0%\+60\.6%\+119\.4%Qwen3\.6\-27BDirect MLLM44\.338\.128\.638\.261\.047\.066\.746\.236\.051\.07\.4Baseline \(direct image\)32\.818\.823\.124\.446\.055\.052\.833\.332\.040\.013\.0Code\-X\(Ours, SFT\)50\.045\.028\.243\.364\.067\.075\.053\.848\.060\.018\.4Δ\\Deltavs Direct MLLM\+12\.9%\+18\.1%−\-1\.4%\+13\.4%\+4\.9%\+42\.6%\+12\.4%\+16\.5%\+33\.3%\+17\.6%\+148\.6%

## 4TerminalBench\-O: From Many\-modality Understanding to Processing

![Refer to caption](https://arxiv.org/html/2606.00579v1/x6.png)Figure 9:Overview of our benchmark, illustrated with Task T01\. The coding agent is required to generate a highlight clip and caption from a soccer video and a player query, and its outputs are evaluated by an LLM\-based judge in terms of event accuracy, video quality, and task correctness\.As many\-modality understanding capabilities mature, the frontier is shifting from passiveunderstandingto activeprocessing: editing videos, auditing image databases, and producing spec\-compliant multimodal outputs\. This shift reveals a gap between native omni\-models, which excel at end\-to\-end perception but cannot directly operate on files, and coding agents, which can decompose tasks into tool\-use pipelines\. Yet existing benchmarks do not evaluate this capability: omnimodal benchmarks such as OmniGAIA\(li2026omnigaia\)and LVOmniBench\(tao2026lvomnibench\)focus on QA\-based perception, while coding\-agent benchmarks such as Terminal\-Bench\(merrill2026terminal\)and Claw\-Eval\(ye2026claw\)mainly target text or text\+image tasks\. As shown in Table[6](https://arxiv.org/html/2606.00579#S4.T6), none jointly cover many\-modality understanding \(MMU\) and many\-modality processing \(MMP\) on real\-world cases\. To close this gap, we introduceTerminalBench\-O, the first process\-level benchmark for coding agents on omnimodal tasks, requiring agents to process realistic inputs such as long videos, multilingual documents, and noisy audio through multi\-stage pipelines, and evaluating both multimodal understanding and operation capability\.

Design Principles\.TerminalBench\-Ocomprises 50 challenging tasks across video \(38%\), audio \(26%\), image \(14%\), document \(4%\), and cross\-modal \(18%\) categories\. Four core principles guide the design:\(i\) Real\-world cases, where all tasks are drawn from authentic sources \(Internet Archive, HuggingFace datasets, public videos\) rather than synthetic constructions;\(ii\) Multi\-dimensional grading, where each task is scored along 4–8 weighted dimensions covering accuracy, cross\-document consistency, and output quality;\(iii\) Shortcut\-resistant design, which combines closed vocabularies, cross\-file ID consistency checks, decoy injection, and word limits to prevent gaming; and\(iv\) Process\-level evaluation, which validates intermediate artifacts and reasoning chains rather than only final answers\.

Task Categories\.Representative tasks include:Video Understanding & Processing\(T01a: extract 60\-90s highlights from 90\-min soccer broadcast by jersey number; T03a: auto\-edit 5\-min final cut from 3\-camera meeting footage\);Audio Forensics\(T107a: segment 22\-min concert recording into songs with tempo/defect detection; T108a: extract rhetorical devices and thematic structure from political speech\);Image Database Analysis\(T13a: detect wood defects with pixel\-level masks; T50a: audit mislabeled images in CIFAR\-10 subset\)\. Unlike QA benchmarks,TerminalBench\-Orequires agents tocreatemultimodal outputs \(edited videos, text\+video interleaved reports\) that satisfy strict format and consistency constraints\.

Table 5:Overall accuracy and average score of different coding agents onTerminalBench\-O\.†Models without native vision capabilities use the image\-read implemented with Qwen3\.5\-Flash\.ModelHarnessPass Rate \(%\)Avg\. ScoreGPT\-5\.5 LowCodex240\.68GPT\-5\.5 MediumCodex180\.69GPT\-5\.5 HighCodex240\.71GPT\-5\.5 xHighCodex240\.71GLM\-5\.1 \(V\)†CC160\.69DeepSeek V4 Pro \(V\)†CC120\.64DeepSeek V4 Flash \(V\)†CC100\.60Kimi K2\.6CC80\.56MiniMax M2\.7 \(V\)†CC40\.55

Evaluation Protocol\.Each task defines a weighted scoring formulascore=\\slimits@iwimi\\text\{score\}=\\tsum\\slimits@\_\{i\}w\_\{i\}\\cdot m\_\{i\}wheremim\_\{i\}are normalized metrics \(IoU, F1, accuracy, LLM\-judged quality\)\. A submission passes ifscore70\\text\{score\}\\geq 70andall gating conditions hold: required files exist, JSON schemas validate, cross\-document IDs match, and outputs comply with closed vocabularies\. Gating failures result in zero score regardless of partial correctness\. For open\-ended outputs \(summaries, captions\), we use LLM judges with detailed rubrics; for temporal tasks \(video segmentation, audio boundaries\), we apply tolerance windows \(±2s\) and Hungarian matching for IoU computation\.

Results Analysis\.As shown in Table[5](https://arxiv.org/html/2606.00579#S4.T5), GPT\-5\.5 obtains the strongest results among the evaluated settings, with high and xHigh achieving the highest average scores\. MiniMax M2\.7 \(V\) performs the worst, reaching only a 4% pass rate and an average score of 0\.55\. Overall, the results show thatTerminalBench\-Oremains challenging across different model and harness configurations, especially for tasks requiring reliable long\-horizon tool use\.

Table 6:Comparison ofTerminalBench\-Owith existing benchmarks\.BenchmarkModalities\# SizeTarget CapabilityRealCross\-sourceMMUMMPOmniGAIA\(li2026omnigaia\)V\+A\+I\+T360QA, factuality seeking◐✘✔✘Terminal\-Bench 2\.0\(merrill2026terminal\)T\+F89Terminal workflows✔◐✘◐Claw\-Eval\(ye2026claw\)T\+I\+V\+D300Agent workflows✔◐◐◐WildClawBench\(wildclawbench\)T\+I\+V\+A\+D60In\-the\-wild agent tasks✔◐◐◐TerminalBench\-O\(Ours\)T\+V\+A\+I\+D\+F50Many\-modality Processing✔✔✔✔

✔ denotes full support, ◐ denotes partial or mixed support, and ✘ denotes absent support\. MMU and MMP mean many\-modality understanding and processing\.V: video,A: audio,I: image,T: text,D: document,F: files\.

## 5Related Works

Omnimodal Models and EvaluationRecent multimodal foundation models have advanced from single\-modality extensions to omni\-modal systems that jointly process speech, audio, images, and video\(chu2024qwen2audio,sun2024videosalmonn,xu2025qwen25omni,qwen35Omni,google2026gemini31pro\)\. As models push toward unified vision\-and\-speech interaction, evaluation has correspondingly shifted from task\-specific video or audio benchmarks\(li2023mvbench,liu2024tempcompass,fu2024videomme,wu2024longvideobench,li2022avqa\)to omni\-modal suites that more explicitly target joint reasoning over visual, auditory, and textual cues, with growing attention to long videos, grounded interaction, and real\-world complexity\(li2024omnibench,hong2025worldsense,zhang2025omnieval,li2026omnigaia,radwan2026sonico1,tao2026lvomnibench,xie2026socialomni\)\. Yet our audit shows that many supposedly audio\-visual tasks still allow strong shortcuts or rely only weakly on audio\(kim2025needtolisten\), motivating more careful benchmark construction\.

Multimodal Agentic Tool\-Use and Sandboxed Agents\.Tool use has become a central mechanism for extending LLM agents beyond their parametric knowledge\(yao2022react,schick2023toolformer,qin2023toolllm,patil2025bfcl,yao2024taubench,wang2025mcpbench\), and increasingly improves multimodal understanding via active search, code generation, and tool orchestration without retraining a monolithic model\(hu2024visualsketchpad,zhao2025pyvision,zhang2025thyme,zhang2024omagent,zhang2025deepvideodiscovery,tao2025omniagent,lin2025agentomni,li2026lenswalk\)\. The most mature form of sandboxed agency has emerged in software engineering, where models resolve real issues inside an execution environment\(jimenez2023swebench,yang2024sweagent,wang2024openhands,kwa2025measuringlongsoftware,cheng2026llminsandbox\), but existing work primarily targets natively textual or computational tasks\. Our work takes a step beyond: we show, for the first time, that a model can use a sandboxed terminal with self\-constructed tool use to solve non\-native multimodal tasks, outperforming many specialized omni models and suggesting that sandboxed agency is a general mechanism for broader many\-modality problem\-solving\.

## 6Conclusion

We study offline omnimodal understanding through the lens of sandboxed coding agents in both training\-free and fine\-tune settings\. Across multiple benchmarks, these agents solve a surprisingly large fraction of video\-audio tasks by selectively retrieving and processing evidence from raw inputs, often rivaling or surpassing native omnimodal systems\. We hope this work motivates a deeper study of multimodal intelligence at the model–tool–environment interface\.

## Ackownledgement

We thank Chenlong Wang, Jieneng Chen, Xirui Li, Ming Li, Kwesi Cobbina and Misaki Wang for their invaluable support and discussion\. Dongping Chen is supported by Modal, Tavily, and Linear’s academic research project\.

## References

## Appendix ALimitations

Our study has several limitations\. First, due to computing\-resource constraints, theCode\-Xtraining recipe is only validated at the 9B and 27B scales \(Qwen\-3\.5\-9B and Qwen\-3\.6\-27B\), and we have not yet characterized how the recipe scales to frontier\-size open\-source models\. Second, due to API credit constraints, several closed\-source coding agents could not be evaluated on the full benchmark suite\. In particular, Claude Opus 4\.6 results on VideoZeroBench and Claude Opus 4\.7 results across all four benchmarks are incomplete in the current draft, which means the comparison between coding agents and native omni\-modal models is somewhat uneven on the closed\-source side\. Third, our scope is restricted to*offline*omni\-modal tasks, where files are staged into a workspace ahead of time; conclusions do not transfer directly to streaming or interactive settings\.

## Appendix BBroader Impacts

Our findings establish a more compute\- and token\-efficient pathway for offline omni\-modal understanding, showing that text\+image coding agents with sandboxed tool use can match or surpass native omni\-modal models on many video\- and audio\-centric tasks while consuming fewer tokens\. This lowers the resource barrier for omni\-modal applications and broadens access for researchers and users who lack dedicated omni\-model APIs, while our open\-sourcedCode\-Xrecipe,OmniCodingdataset, andTerminalBench\-Obenchmark further support reproducible research at the model–tool–environment interface\. As with any tool\-using agent that processes video and audio, downstream deployments should incorporate standard safeguards such as content provenance, access controls, and user consent; we note that our agents operate only on user\-staged files within an isolated workspace and do not autonomously take real\-world actions\.

## Appendix CAdapting Sandboxed Agents for Omni\-Modal Tasks

We represent each omni\-modal task as

x=\(q,ℱ,𝒴\),x=\(q,\\mathcal\{F\},\\mathcal\{Y\}\),whereqqis a natural\-language instruction,ℱ\\mathcal\{F\}is a set of input files, and𝒴\\mathcal\{Y\}is the set of acceptable final answers\. We use an answer set rather than a single reference string so that aliases, option letters, numerical formats, and unit variants can be handled uniformly in both evaluation and reinforcement learning\.

A native omni\-modal model solves the task by packing the files into model context:

y^omni=Mθ\(Pack\(q,ℱ\)\),\\hat\{y\}\_\{\\mathrm\{omni\}\}=M\_\{\\theta\}\\\!\\left\(\\mathrm\{Pack\}\(q,\\mathcal\{F\}\)\\right\),wherePack\(\)\\mathrm\{Pack\}\(\\cdot\)denotes model\-specific media encoding, sampling, and serialization\. In this formulation, video and audio are first\-class model inputs\.

In contrast, our sandboxed agent does not need to ingest all raw media as context\. The filesℱ\\mathcal\{F\}are staged into an isolated workspace, and the agent is given the instruction and the corresponding file paths\. It then interacts with the workspace through a terminal interface: it can inspect files, invoke tools such asffmpeg,ffprobe, OCR, or ASR systems, write and execute scripts, create intermediate artifacts, and finally submit an answer\.

A sandbox rollout is written as

τ=\(o0,a0,o1,a1,…,oT,aT\),\\tau=\(o\_\{0\},a\_\{0\},o\_\{1\},a\_\{1\},\\ldots,o\_\{T\},a\_\{T\}\),whereoto\_\{t\}is the current observation, including command outputs and workspace feedback, andata\_\{t\}is the agent’s next message or terminal action\. During a rollout, raw omni\-modal files may be converted into task\-relevant artifacts such as transcripts, sampled frames, OCR text, timestamps, tables, crops, or executable analysis scripts\. The final prediction is parsed from the trajectory:

y^=Ans\(τ\),\\hat\{y\}=\\mathrm\{Ans\}\(\\tau\),whereAns\(\)\\mathrm\{Ans\}\(\\cdot\)extracts the submitted answer, e\.g\., the content inside an`<answer\>`tag when this format is required\.

Thus the two settings differ mainly in where omni\-modal processing happens:

native omni:\(q,ℱ\)→Packmodel context→Mθy^,\\displaystyle\(q,\\mathcal\{F\}\)\\xrightarrow\{\\mathrm\{Pack\}\}\\text\{model context\}\\xrightarrow\{M\_\{\\theta\}\}\\hat\{y\},sandboxed agent:\(q,ℱ\)→stageworkspace→terminal interactionτ→Ansy^\.\\displaystyle\(q,\\mathcal\{F\}\)\\xrightarrow\{\\mathrm\{stage\}\}\\text\{workspace\}\\xrightarrow\{\\text\{terminal interaction\}\}\\tau\\xrightarrow\{\\mathrm\{Ans\}\}\\hat\{y\}\.This view treats raw video, audio, images, and documents as environment state rather than mandatory model context\. The agent can therefore solve omni\-modal tasks by adaptively reducing raw media into compact evidence that fits the native reasoning interface of a general\-purpose coding model\.

## Appendix DDiscussion and Future Works

Do non\-native modalities really require native end\-to\-end perception?Our results suggest that non\-native modalities do not always require native end\-to\-end perception\. Even in the simplest Codex and Claude Code settings, coding agents solve a surprisingly large fraction of video\-audio tasks, and on OmniGAIA they can even outperform strong native omnimodal baselines\. The key is not that the base model suddenly becomes an audio\-video perceiver, but that the sandbox lets it decompose raw multimedia into artifacts that match its native reasoning interface, such as transcripts, key frames, timestamps, OCR, metadata, and compact summaries\. In this sense, multimodal processing becomes analogous to protocol\-based software access in text\-domain agents: the model delegates non\-native operations to the environment and then reasons over the returned evidence\. This perspective is consistent with the broader shift from static function calling to environment\-mediated problem solving in tool\-use and terminal\-agent evaluation\(wang2025mcpbench,merrill2026terminalbench,cheng2026llminsandbox\)\. It also resonates with recent tool\-augmented perception work, where explicit visual\-tool policies and tool\-derived perceptual signals both improve fine\-grained multimodal reasoning\(zhou2025reinforced,wei2026zooming\)\. A natural next question is therefore not only whether a model is natively omni, but which capabilities should remain externalized, which can be compiled into reusable skills, and how agents should detect and recover from imperfect tool outputs\. Our failure taxonomy and process\-level benchmark point in this direction, suggesting that the central bottleneck is often tool selection, verification, and repair rather than the mere absence of raw multimodal encoders\.

Shortcuts in current video\-audio benchmarks\.Are current video\-audio benchmarks measuring what they claim to measure? Our findings suggest only partially\. When a text\+image coding agent can answer many ostensibly omnimodal tasks by selectively extracting transcripts, sparse frames, or metadata, benchmark accuracy no longer cleanly measures native end\-to\-end audio\-video understanding\. This does not make such benchmarks uninformative, but it does mean that they frequently conflate at least three capabilities: native perception, external evidence acquisition, and general knowledge\-based reasoning\. Our results are aligned with recent audits and diagnostic studies showing that many video or audio\-visual benchmarks admit strong visual, textual, or temporal shortcuts\(kim2025needtolisten,lim2026video\)\. They are also consistent with benchmark\-design efforts that explicitly enforce stronger audio dependence or reduce text\-based shortcut channels\(wei2026zooming\)\. Going forward, we believe benchmark design should standardize at least three controls: reporting text\-only, text\+image, and agentic baselines; auditing modality dependence with masking or filtering tests; and minimizing leakage from subtitles, transcripts, metadata, and external retrieval whenever the target claim is native multimodal understanding\. Conversely, when the target is multimodal agentic competence, retrieval, tool orchestration, and evidence processing should be treated as first\-class evaluation targets rather than accidental shortcuts\. Making this distinction explicit would help separate backbone perception from system\-level problem solving, which are both important but not interchangeable\.

Can the same sandbox move from understanding to generation?We believe this is the most promising next step\. Once an agent can inspect raw video, audio, and documents, compress them into structured evidence, and maintain task\-relevant state over long horizons, it can in principle also plan edits, retrieve assets, invoke generative models, and validate outputs in a closed loop\. Early multimedia\-generation and agentic video\-editing systems already point in this direction, showing that multimodal agents can coordinate text, images, video, music, and external tools under explicit planning and feedback loops\(zhang2026versatile,zhao2026cutclaw\)\. The next frontier is to connect that generative loop with persistent multimodal memory, so that an assistant can build private memos, recap videos, highlight reels, or even broadcast\-ready drafts grounded not only in a single uploaded file but in a user’s longer\-term audiovisual history\(liu2026omnimem\)\. Doing so will require stronger memory management, preference modeling, provenance tracking, and process\-level verification, since generated content must remain faithful both to source media and to user intent\. More broadly, this points to a different picture of future omni assistants: not a single monolithic model that passively consumes everything in one shot, but a sandboxed system that can inspect, remember, transform, and generate across modalities with the right mix of reusable skills and external tools\. Under this view, offline omnimodal understanding is not the endpoint of this line of work, but the substrate for practical multimedia copilots\.

## Appendix EFull Related Works

### E\.1Multimodal Models and Evaluation

Recent progress toward omni\-modal assistants reflects a shift from modality\-specific perception modules to unified models that can process speech, general audio, images, and video within a single system\. On the model side, audio\-language systems such as Qwen2\-Audio\(chu2024qwen2audio\)and video\-centered models such as VideoLLaMA 2\(cheng2024videollama2\)and video\-SALMONN\(sun2024videosalmonn\)established strong foundations for audio\-visual understanding\. More recent omni\-modal systems, including VITA\(fu2024vita\), OpenOmni\(luo2025openomni\), Baichuan\-Omni\(li2024baichuanomni\), Moshi\(defossez2024moshi\), and Qwen2\.5\-Omni\(xu2025qwen25omni\), move toward unified and often real\-time interaction across speech and vision\. These systems substantially improve the practicality of omni interaction, but they also make it harder to separate intrinsic model ability from gains due to decoding strategies, longer context, or external processing pipelines\.

Benchmarking has evolved in parallel\. Earlier suites mainly targeted either video understanding, such as MVBench\(li2023mvbench\), Video\-MME\(fu2024videomme\), and MLVU\(zhou2024mlvu\), or audio / audio\-visual understanding, such as AIR\-Bench\(yang2024airbench\), MMAU\(sakshi2024mmau\), and AV\-Odyssey\(gong2024avodyssey\)\. While these benchmarks advanced multimodal evaluation, many still under\-represent audio, weakly couple audio with video, or emphasize short\-context perception more than long\-horizon multimodal reasoning\.

Recent work therefore moves toward explicitly omni\-modal and long\-form evaluation\. Benchmarks such as OmniBench\(li2024omnibench\), OmniEval\(zhang2025omnieval\), OmniGAIA\(li2026omnigaia\), LVOmniBench\(tao2026lvomnibench\), SocialOmni\(xie2026socialomni\), MMOU\(goel2026mmou\), and OmniPlay\(bie2025omniplay\)target more realistic settings that require integrating visual, auditory, and textual evidence over longer horizons\. At the same time, recent audits show that several audio\-video benchmarks admit strong visual shortcuts\(kim2025needtolisten\), suggesting that progress in omni\-modal evaluation depends not only on stronger models but also on shortcut\-resistant benchmark design and explicit measurement of true modality dependence\.

### E\.2Agentic Tool\-Use

The literature on tool\-augmented language models treats external tools as a way to expand the action space of models rather than as simple add\-ons\. Foundational work such as ReAct\(yao2022react\)and Toolformer\(schick2023toolformer\)showed that reasoning can be interleaved with actions and that models can learn when to invoke tools\. Later efforts, including Gorilla\(patil2023gorilla\), ToolBench\(xu2023toolbench\), ToolLLM\(qin2023toolllm\), BFCL\(patil2025bfcl\), andτ\\tau\-bench\(yao2024taubench\), scaled this paradigm to large tool collections and more interactive, stateful environments\. Across these works, the central shift is from static capability elicitation to environment\-mediated problem solving\.

Executable code has emerged as one of the most expressive forms of tool use\. CodeAct\(wang2024codeact\)argues that executable programs provide a general interface for acting in digital environments, allowing agents to manipulate files, call libraries, and compose multi\-step behaviors more flexibly than fixed action vocabularies\. In multimodal settings, Visual Sketchpad\(hu2024visualsketchpad\)and MLLM\-Tool\(wang2024mllmtool\)show that models can externalize perception into sketches, programs, and structured tool calls\. PyVision\(zhao2025pyvision\)and Thyme\(zhang2025thyme\)further demonstrate that models can synthesize and execute Python\-based image operations on demand, blurring the boundary between model capability and environment capability\.

This perspective is increasingly visible in video and omni\-modal reasoning\. OmAgent\(zhang2024omagent\), OmniAgent\(tao2025omniagent\), Agent\-Omni\(lin2025agentomni\), and LensWalk\(li2026lenswalk\)all treat multimodal understanding as an active process of decomposition, search, and selective inspection rather than passive end\-to\-end perception\. LongShOTBench\(kurpath2025longshotbench\)explicitly evaluates tool use and iterative refinement in long\-video settings\. More broadly, LLM\-in\-Sandbox argues that rich environments can elicit capabilities not apparent in static chat settings, reinforcing the view that multimodal competence may emerge from the interaction between a strong base model and an expressive tool substrate\(cheng2026llminsandbox\)\. Together, these works motivate viewing multimodal intelligence as a property of a model–environment system rather than of a standalone backbone\.

### E\.3Terminal Coding Agents and Benchmarks

The most mature form of sandboxed agency has so far emerged in software engineering\. SWE\-bench\(jimenez2023swebench\)established the canonical issue\-resolution setting by grounding evaluation in real GitHub issues and executable repository states\. Systems such as SWE\-agent\(yang2024sweagent\), AutoCodeRover\(zhang2024autocoderover\), Agentless\(xia2024agentless\), and OpenHands\(wang2024openhands\)explored different agent designs for this setting, ranging from search\-heavy repair to lightweight execution\-driven pipelines\. This line of work reframes coding as long\-horizon environment interaction rather than static code generation\.

Benchmarks have expanded accordingly\. SWE\-bench Verified\(openai2024swebenchverified\)improves reliability through a filtered subset, while SWE\-bench Multimodal\(yang2024swebenchmultimodal\)extends evaluation to visually grounded software issues\. Terminal\-Bench\(merrill2026terminalbench\)and LongCLI\-Bench\(feng2026longclibench\)move beyond repository patching toward broader command\-line tasks and longer\-horizon CLI interaction\. Other recent benchmarks, including SetupBench\(arora2025setupbench\), SWE\-Bench Pro\(deng2025swebenchpro\), and SEC\-bench\(lee2025secbench\), further broaden evaluation to environment setup, longer\-horizon issue resolution, and security workflows\.

The ecosystem has also expanded beyond benchmark construction into self\-improvement and broader autonomy evaluation\. SWE\-smith\(yang2025swesmith\)studies scalable data generation for software engineering agents, Live\-SWE\-agent\(xia2025livesweagent\)explores self\-evolution during deployment, and SWE\-Fuse\(wen2026swefuse\)uses trajectory learning to improve agent performance\. Beyond software\-specific tasks, RE\-Bench\(wijk2024rebench\), HCAST\(rein2025hcast\), and TheAgentCompany\(xu2024theagentcompany\)evaluate agents on longer\-horizon work in realistic environments\. At the same time, recent meta\-analyses caution that benchmark realism, contamination, and task validity remain open concerns, and that high scores do not automatically imply broader engineering competence\(martinez2026whatsinbenchmark,prathifkumar2025swebenchmemory\)\. These critiques are especially relevant when interpreting terminal\-agent performance as evidence of general agentic ability\.

## Appendix FTerminalBench\-OConstruction Details

TerminalBench\-0 contains 50 curated terminal tasks under a shared workspace interface, each requiring the agent to create concrete artifacts under/workspace/outputrather than return a short answer\.

Table 7:Summary statistics forTerminalBench\-O\.StatisticValueTasks50Agent\-visible fixture files3,013audio/video/image/PDF/FITS files1,293##### Raw data\.

We ground the suite in real\-source data: public research datasets, public archives, public\-domain scans, and direct\-download media\. The sources cover broadcast/web video, AMI and legal speech, action and egocentric video, industrial and scientific imagery, geospatial data, music stems, bioacoustics, score\-audio pairs, FITS files, and historical comics\. Synthetic content is reserved for controlled corruptions, decoys, overlays, and auxiliary labels needed to make evaluation precise\.

##### Preprocessing\.

Each source is normalized into a stable workspace representation before task packaging\. We also prepare closed vocabularies, masks, catalogs, WCS headers, timestamp maps, and source\-to\-fixture mappings for downstream grading\.

##### Construction\.

Step 1: Candidate mining\.We first select source segments whose native structure supports a realistic operation task, such as eventful video intervals, speaker turns, action clips, score\-aligned music excerpts\. Candidate segments are filtered for duration, visibility, signal quality, annotation availability, and non\-triviality before fixture generation\.

Step 2: Task\-specific transformation\.We then impose an operation or creation objective on the selected source\. Some tasks add controlled corruptions, hidden mappings, decoys, overlays, splice events, synchronization errors, or multi\-source audio compositions\. Others keep the media unchanged but require structured deliverables such as ledgers, catalogs, action datasets, OMR alignments, reading\-order tables, or redaction reports\.

Step 3: Hidden reference synthesis\.For each instance, we write hidden references that specify the answer geometry used by the grader\. These references include event windows, speaker maps, transcripts, quote spans, crop paths, clean media, masks, stem targets, timestamp curves, object tracks and closed\-vocabulary labels\. They are withheld from the agent but used to score both final outputs and intermediate artifacts\.

Step 4: Output\-surface packaging\.Finally, we expose each task as a small working directory rather than a question\-only prompt\. The visible files may include media, manifests, closed vocabularies, policy snippets, PDFs, CSVs, point clouds, FITS files, or auxiliary metadata, and the expected outputs are ordinary files such as MP4/WAV/PNG artifacts, CSV/JSON ledgers, subtitles,plots, and reports\. This format keeps the benchmark native to terminal tools,ffmpeg, OCR/ASR, numerical Python, and file\-system coordination\.

##### Filtering\.

We apply filtering before and after packaging\. Source candidates are removed if they are too short, ambiguous, silent, weakly annotated, or trivially solvable\. Packaged tasks are then checked for complete schemas, aligned hidden references, active closed vocabularies, decoy coverage, and resistance to shortcuts such as pass\-through media, empty manifests, constant labels, or copied templates\.

##### Evaluation\.

Each task uses a dedicated grader with weighted dimensions and gating conditions\. Deterministic checks cover temporal IoU, event recall, closed\-vocabulary accuracy, WER/CER, speaker attribution, ASR preservation, signal metrics, SSIM/PSNR, mask IoU, timestamp error, catalog matching, photometric error, geospatial overlap, table consistency, and schema validity\. For naturally open\-form artifacts such as memos, overlays, or semantic summaries, some graders add constrained LLM/MLLM checklist axes\. The final score therefore measures process\-level completion: media quality, structured intermediate artifacts, format compliance, and cross\-document consistency\.

## Appendix GFailure Mode Taxonomy and Process\-level Trajectory Dataset

![Refer to caption](https://arxiv.org/html/2606.00579v1/Figs/bench_overview.png)Figure 10:Overview of our process\-level benchmark\.\(a\) Dataset Overview:domain distribution across the union of OmniGAIA, LVOmniBench and SocialOmniBench \(200 tasks\); a word cloud of the question text; the distribution of annotated logical steps per task overlaid with the agent’s actual turn count\.\(b\) Capability Analysis:per\-task counts of image / audio / video inputs; required\-versus\-actual tool\-category coverage on the OmniGAIA subset \(n=152\) – “Required” is drawn fromrequired\_external\_toolssources,“Actual”is recovered by argument\-level classification of the agent trajectory; and the distribution of primary error types on the 100 incorrect trajectories we annotate\.Here we provide detailed definitions of the six failure modes introduced in the main text\.

- •Audio Perception and Extraction Error\.The agent mis\-transcribes, mis\-segments, or mis\-attributes the audio signal itself \(*e\.g\.*, ASR errors under noise or speaker confusion\), so all downstream reasoning is conditioned on content that diverges from the true recording\.
- •Video Perception and Extraction Error\.The visual stream is misread at the source via blank or off\-key frame sampling, OCR mistakes, or inverted event ordering, leaving later stages with incorrect evidence\.
- •Insufficient Exploration of Modal Content\.With multimodal inputs available, the agent answers from textual cues or priors, or terminates exploration before critical evidence is verified\. The failure lies in the decision to stop probing rather than in perception quality\.
- •Knowledge Retrieval and Factual Error\.Given correct perception, the agent links it to wrong external knowledge through poorly specified queries, unreliable sources, conflated entities, or fabricated facts when retrieval fails\.
- •Logical Reasoning and Calculation Error\.Evidence is largely correct, but the derivation breaks down via wrong formulas \(*e\.g\.*, planar vs\. spherical distance\), arithmetic mistakes, or dropped conditions in multi\-step deductions\.
- •Tool and Environment Infrastructure Failure\.The agent picks the right tool but is blocked by missing dependencies, sandbox permissions, or resource limits, forcing later steps to build on corrupted intermediates\.

Table 8:LLMs as process\-level judges on our 200\-trajectory subset\.Judge modelStepPrimaryBudgetGPT\-5\.4\-mini0\.5880\.6573\.17GPT\-5\.40\.7370\.83010\.54GPT\-5\.50\.7460\.84032\.49Qwen3\.6\-35B\-A3B0\.6050\.8131\.68DeepSeek\-V4\-Flash0\.6360\.7860\.46To scale failure mode analysis beyond the 200 human\-annotated trajectories, we benchmark off\-the\-shelf LLMs as process\-level judges and validate the chosen judge against human labels on this annotated set\. Each model replays all 200 trajectories under a unified prompt, predicting a step reward\{−1,0,\+1\}\\in\\\{\-1,0,\+1\\\}for every assistant step and, on the 100 incorrect trajectories, a primary error type from our taxonomy\. We follow the AgentProcessBench\(fan2026agentprocessbench\)definition for process reward, with the rubric summarized below\.

Table 9:Step\-type\-specific criteria for the\{−1,0,\+1\}\\\{\-1,0,\+1\\\}process reward\.Step type\+1\+1\(effective\)0\(neutral\)−1\-1\(harmful\)Text\-only step \(planning, reasoning\)Useful plan or constraint with concrete task value\.Generic narration or low\-value planning\.States false facts or commits to a bad premise\.Tool callTool choice is appropriate and materially useful\.Reasonable but low\-yield, redundant, externally blocked, or based on incomplete information\.Clearly wrong tool, violates instructions, or repeats a failed pattern without strategy change\.Post\-tool interpretationInterpretation is supported by the tool output and advances the task\.Cautious but inconclusive\.Misreads the output or overclaims beyond the evidence\.Final answer / decisive claimSupported by evidence already collected in the trajectory\.Cautious partial synthesis that does not materially help\.Guesses, hallucinates, or depends on an unresolved earlier error\.Step reward rubric\.Onlyassistantmessages are scored;userandtoolmessages serve as evidence\. Each assistant step is judged using only information visible up to that step \(no hindsight\), and the three\-level label is defined as follows:

- •\+1\+1\(correct and effective\)\.The step is factually correct given the available evidence and clearly advances the task, e\.g\., appropriate tool use, evidence\-supported interpretation of a tool output, an effective decision, or an explicit correction of a prior mistake\.
- •0\(neutral or exploratory\)\.The step is reasonable but yields limited or unclear progress, e\.g\., sensible exploration, tentative planning, redundant recap, or a reasonable tool attempt that fails for external reasons such as a 404, timeout, or tool limitation\.
- •−1\-1\(incorrect or harmful\)\.The step is factually wrong, misreads or fabricates evidence, violates system or policy constraints, repeats a failed pattern without a meaningful strategy change, or otherwise moves the trajectory away from success\.

We additionally applyerror propagation: once a step is labeled−1\-1, dependent later assistant steps remain−1\-1until the agent explicitly corrects the mistake or starts an independent subtask no longer relying on the bad premise\. We provide our annotation interface in Figure[11](https://arxiv.org/html/2606.00579#A7.F11)\.

![Refer to caption](https://arxiv.org/html/2606.00579v1/Figs/process_benchmark_anno.png)Figure 11:Annotation interface of our process\-level dataset\.Table[8](https://arxiv.org/html/2606.00579#A7.T8)compares predictions against human labels via Step Acc\. and Primary Acc\. GPT\-5\.5 wins on both metrics; GPT\-5\.4 is second on step agreement; Qwen3\.6\-35B\-A3B matches GPT\-5\.4 on error\-type accuracy at much higher throughput; GPT\-5\.4\-mini and DeepSeek\-V4\-Flash trail on step reward, indicating that smaller models are not yet adequate for fine\-grained process supervision here\. Based on these results, we adopt GPT\-5\.5 as the process\-level judge for all large\-scale failure mode analyses in the main paper\.

## Appendix HImplementation Details

### H\.1Code\-XReward Hyperparameters

Table[10](https://arxiv.org/html/2606.00579#A8.T10)lists the reward weights and group\-level coefficients used in all RL experiments\. The values are shared across the 9B and 27B runs\.

Table 10:Reward and group\-shaping hyperparameters used in our RL training\.SymbolValueDescriptionPer\-trajectory reward componentswcorrectw\_\{\\text\{correct\}\}\+1\.0\+1\.0Base reward for gated\-correct trajectories\.wfmt\+w\_\{\\text\{fmt\}\+\}\+0\.2\+0\.2Format bonus: wrong answer, cleantask\_completeexit, non\-empty`<answer\>`\.wfmt−w\_\{\\text\{fmt\}\-\}−0\.2\-0\.2Format penalty: any other exit \(timeout, step\-limit, no answer, etc\.\)\.wmodw\_\{\\mathrm\{mod\}\}−0\.3\-0\.3Modality penalty when video/audio task is solved without a corresponding tool\.wtoolw\_\{\\mathrm\{tool\}\}0\.50\.5Bad\-tool weight; effective penalty is−wtoolPbad\[−0\.5,0\]\-w\_\{\\mathrm\{tool\}\}\\cdot P\_\{\\text\{bad\}\}\\in\[\-0\.5,0\]\.Group\-level shapingGG88Sibling rollouts per task \(i\.e\.,n\_samples\_per\_prompt\)\.α\\alpha0\.20\.2Length\-penalty slope\.fminf\_\{\\min\}0\.50\.5Minimum length factor for correct samples\.ϵA\\epsilon\_\{A\}10−610^\{\-6\}Stabilizer in advantage normalization\.GSPO clippingϵ−\\epsilon\_\{\-\}0\.200\.20Lower clip threshold forρi\(θ\)\\rho\_\{i\}\(\\theta\)\.ϵ\+\\epsilon\_\{\+\}0\.280\.28Upper clip threshold \(DAPO clip\-higher\)\.Tool universes for modality detection\.Modality engagement \(used bywmodw\_\{\\mathrm\{mod\}\}and the correctness gate\) is detected by substring\-matching the bash text insideexecute\_commandsarguments against the following tool universes:

- •Video:ffmpeg,ffprobe,ffplay,yt\-dlp,mpv,vlc,VideoCapture,VideoWriter,VideoFileClip,moviepy,decord,VideoReader,imageio,mediainfo\.
- •Audio:whisper,faster\_whisper,librosa,torchaudio,soundfile,sf\.read,pydub,AudioSegment,wave\.open,sox,demucs,spleeter,lame\.
- •Image:PIL\.Image\.open,Image\.open,cv2\.imread,cv2\.imwrite,tesseract,pytesseract,convert,magick,identify,mogrify,easyocr,paddleocr,pdftotext,pdftoppm,exiftool,image\_read\.

A task withhas\_video=True\\texttt\{has\\\_video\}\{=\}\\texttt\{True\}requires at least one match from the video or image set; a task withhas\_audio=True\\texttt\{has\\\_audio\}\{=\}\\texttt\{True\}requires at least one match from the audio set\. Image\-only tasks are not modality\-gated\.

Bad\-tool fail categories\.The numerator ofPbad=\(nunparseable\+ndisallowed\+nescape\+nsyntax\-fail\)/\(1\+ntool\)P\_\{\\text\{bad\}\}=\(n\_\{\\text\{unparseable\}\}\+n\_\{\\text\{disallowed\}\}\+n\_\{\\text\{escape\}\}\+n\_\{\\text\{syntax\-fail\}\}\)/\(1\+n\_\{\\text\{tool\}\}\)aggregates four categories:

- •Unparseable:malformed JSON in tool call, missing or non\-dictarguments, or missing toolname\.
- •Disallowed:tool name outside the allowed set\{execute\_commands,image\_read,task\_complete\}\\\{\\texttt\{execute\\\_commands\},\\,\\texttt\{image\\\_read\},\\,\\texttt\{task\\\_complete\}\\\}, or arguments matching network primitives \(curl,wget,ssh,scp,rsync,nc,telnet,http\(s\)://\)\.
- •Escape:arguments matching gold\-answer patterns \(ground\_truth,answer\_key,gold\_answer,\*\.gold,all\_final\.jsonl,rl\_train\.jsonl,sft\_train\.jsonl\), or absolute paths resolving outside the workspace\. Any escape event additionally triggers a hard zero on correctness\.
- •Syntax\-fail:the tool reply containsSyntaxError,invalid syntax,unexpected EOF,parse error,unexpected token,missing operand, orunterminated string\.

Unparseable tool calls short\-circuit the other static checks, since they never actually execute in the sandbox\.

Removed\-trajectory criterion\.A trajectory is marked*removed*\(score zeroed, sample masked from the policy\-gradient loss\) iff the agent emits no`<answer\>`*and*the exit reason is one oftimeout,step\_limit, orcontext\_overflow\. Group\-level statistics \(mean, std, dynamic filter\) are computed over active \(non\-removed\) samples only\.

### H\.2Prompts in Benchmark

The exact prompts used in our experiments are listed in Figure[12](https://arxiv.org/html/2606.00579#A8.F12),[13](https://arxiv.org/html/2606.00579#A8.F13),[14](https://arxiv.org/html/2606.00579#A8.F14),[15](https://arxiv.org/html/2606.00579#A8.F15),[16](https://arxiv.org/html/2606.00579#A8.F16),[17](https://arxiv.org/html/2606.00579#A8.F17)\.

Prompt 1: Calibration\-set self\-iterationYou are updating a reusable benchmark agent guide\.Read these files in the current workspace:\-\{previous\_agent\_md\_name\}\-\{summary\_json\_name\}Task:1\. Update the previous guide into a better next\-round guide\.2\. Use only generic, reusable tactics suggested by the sanitized summary\.3\. Do not include benchmark\-specific facts, named entities, dates, exact answers, or any clues tied to individual cases\.4\. Do not quote or paraphrase specific questions\.5\. Keep the guide concise, operational, and directly useful during future runs\.What to extract from the summary:\- recurring failure patterns\- search\-breadth problems\- weak verification habits\- answer\-format mistakes\- underused or misused tools\- signals about when a workflow should escalate from local inspection to search, OCR, ASR, calculation, or multi\-step verification\- patterns in required tools and tool\-call usage that suggest missing playbook guidance\- the abstract difference between OmniGAIA reference answer processes and the model’s observed processes\- reusable workflow lessons from cases where the reference process used stronger verification, better ordering, or better skill/tool composition than the model didFigure 12:First part of prompt for calibration\-set self\-iteration\.Prompt 2: Calibration\-set self\-iterationAdditional requirements for the new guide:\- explicitly state that the agent must not access, read, inspect, rely on, or search for any content outside the current workspace\- include a short section that summarizes when to use specific skills or capability families\- include a short section that summarizes tool\-use heuristics\- include a short section that summarizes better end\-to\-end answering workflows derived from reference\-process\-vs\-model\-process comparisons\- for tool use, prioritize generic guidance such as: \- whenweb\_searchshould be preferred over guesswork \- when shell inspection should precede search \- when OCR / ASR / frame extraction / calculation should be combined \- when to stop searching and verify the answer format instead\- when the summary provides reference\-process signals, compare them against the model process and convert the differences into better workflow steps or skill\-selection rules\- make the guide explain not just which tool to use, but in what order and with what verification checkpoints\- if the reference process suggests a stronger answer pipeline, rewrite that into a reusable skill/playbook description rather than a case\-specific tip\- if the summary suggests over\-searching or under\-searching, reflect that explicitly in the updated guide\- if the summary suggests tools were used but not verified against the final answer, add a guardrail for that\- if the summary suggests repeated empty or failed predictions, add concise recovery rulesRequired output:\- return markdown only\- start with\# Agent Guide\- prefer short sections and flat bullet lists\- keep the guide high\-level enough to generalize, but concrete enough to change behaviorFocus areas:\- search strategy\- tool selection\- skill / capability\-family selection\- workflow / process design\- verification discipline\- common failure guards\- answer\-format disciplineHard prohibitions:\- no benchmark\-specific examples\- no copied phrases from individual questions\- no named entities, dates, locations, titles, or answer strings from the sampled cases\- no hidden hints that would advantage the exact benchmark subsetFigure 13:Second part of prompt for calibration\-set self\-iteration\.Prompt for testing SocialOmni Leve1You are evaluating a SocialOmni benchmark item in an isolated local workspace\.Work only inside the current workspace\.Your current working directory is the workspace root for this run\.Treat the current working directory tree as the only allowed workspace root\.Do not access, read, inspect, rely on, or search for files, directories, or other local content outside the current workspace\.Never open absolute paths or parent\-directory paths outside the workspace, and never use symlinks, environment discovery, shell expansion, or search commands to reach files beyond the workspace root\.Treat any host file, cache, home\-directory content, sibling directory, mounted dataset source, or path outside this workspace as strictly forbidden, even if it appears readable from the shell\.Use only the staged files listed below; do not assume any hidden benchmark files exist\.You may install dependencies without sudo when necessary, but keep them inside the current workspace\.User\-level install prefixes and caches are redirected into the workspace for this run\.Available staged files:\[LOCAL\_VIDEO\_PATH\]Network access may be available for shell commands in this run\.You may use the network only when it materially helps process the current sample, including downloading tools or packages into the current workspace only\.Do not use the network to search for benchmark answers, leaked annotations, dataset\-specific solutions, or existing evaluation outputs\.\[SYSTEM\]\[LEVEL1\_SYSTEM\_PROMPT\]\[LOCAL\_VIDEO\]Analyze the local video file at:\[LOCAL\_VIDEO\_PATH\]\[ASR\]\[ASR\_CONTENT\_IF\_ENABLED\]Options:A\. \.\.\.B\. \.\.\.C\. \.\.\.D\. \.\.\.\[LEVEL1\_USER\_PROMPT\_BASE\]\[ANSWER\_FORMAT\]Output requirements:Return ONLY the single option letter \(A, B, C, D\)\.Do not repeat the option text\.Do not provide any explanation, reasoning, analysis, punctuation, markdown, or extra words\.Your entire response must be exactly one of:A/B/C/D\.\[QUESTION\_TEXT\]Figure 14:Prompt for testing SocialOmni Level1\.Prompt for testing SocialOmni Level2You are evaluating a SocialOmni benchmark item in an isolated local workspace\.Work only inside the current workspace\.Your current working directory is the workspace root for this run\.Treat the current working directory tree as the only allowed workspace root\.Do not access, read, inspect, rely on, or search for files, directories, or other local content outside the current workspace\.Never open absolute paths or parent\-directory paths outside the workspace, and never use symlinks, environment discovery, shell expansion, or search commands to reach files beyond the workspace root\.Treat any host file, cache, home\-directory content, sibling directory, mounted dataset source, or path outside this workspace as strictly forbidden, even if it appears readable from the shell\.Use only the staged files listed below; do not assume any hidden benchmark files exist\.You may install dependencies without sudo when necessary, but keep them inside the current workspace\.User\-level install prefixes and caches are redirected into the workspace for this run\.Available staged files:\[LOCAL\_PREFIX\_VIDEO\_PATH\]Network access may be available for shell commands in this run\.You may use the network only when it materially helps process the current sample, including downloading tools or packages into the current workspace only\.Do not use the network to search for benchmark answers, leaked annotations, dataset\-specific solutions, or existing evaluation outputs\.\[SYSTEM\]\[LEVEL2\_SYSTEM\_PROMPT\]\[LOCAL\_VIDEO\]Analyze the local prefix video file at:\[LOCAL\_PREFIX\_VIDEO\_PATH\]\[ASR\]\[FULL\_ASR\_IF\_ENABLED\]\[LEVEL2\_USER\_PROMPT\_BASE\]Task 1 \(Q1\):Question:\[Q1\_QUESTION\_TEXT\]Options:A\. \[OPTION\_A\]B\. \[OPTION\_B\]Task 2 \(Q2\):Question:\[Q2\_QUESTION\_TEXT\]If your Q1 answer is A, provide a concise natural utterance for Q2\.If your Q1 answer is B, set Q2 to N/A\.Output requirements:Do not provide explanation, reasoning, analysis, markdown, or extra words\.Return exactly in the following format:<q1\>A\_or\_B</q1\><q2\>utterance\_or\_NA</q2\>Figure 15:Prompt for testing SocialOmni Level2\.Prompt for testing LVOmniBenchYou are evaluating a long audio\-video benchmark item in an isolated local workspace\.Work only inside the current workspace\.Do not access, read, inspect, rely on, or search for any files, directories, or other content outside the current workspace\.Never open absolute paths or parent\-directory paths outside the workspace, and never use symlinks, environment discovery, shell expansion, or search commands to reach files beyond the workspace root\.Treat any host file, cache, home\-directory content, sibling directory, mounted dataset source, or path outside the current workspace as strictly forbidden, even if it appears readable from the shell\.Use the staged video file\(s\) listed below and any local tools available in the workspace\.Do not assume access to any original benchmark directories outside this workspace\.Network access may be available for shell commands in this run\.You may use the network only when it materially helps process the current sample, including downloading tools or packages into the current workspace only\.Do not use the network to search for benchmark answers, leaked annotations, dataset\-specific solutions, or existing evaluation outputs\.Available staged files:\[STAGED\_VIDEO\_FILES\]Use the benchmark’s official answering format below\.Question:\[QUESTION\_TEXT\]Options:A\. \.\.\.B\. \.\.\.C\. \.\.\.D\. \.\.\.Select the best answer from the options above\. Directly provide the letter representing your choice \(A/B/C/D\) and nothing else\. Do not include the full text of the option, do not provide any explanation\.Figure 16:Prompt for testing LVOmniBench\.Prompt for testing VideoZeroBenchYou are evaluating VideoZeroBench questions in a sandboxed local workspace\.The model can access only the staged video file and files it creates during the current run\. Do not access, read, inspect, rely on, or search for files outside the sandbox, including benchmark annotations, gold answers, evidence windows, evidence boxes, previous evaluation outputs, cached results, or any other files that may reveal the answer\.Network access is allowed only for general\-purpose assistance that does not reveal benchmark\-specific information\. Do not use the network to search for the benchmark name, dataset, question text, video id, video filename, leaked annotations, gold answers, existing evaluation outputs, or papers/pages that evaluate or discuss the same sample\. Do not retrieve any content that directly or indirectly reveals the answer\.If video inspection is needed, use information derived only from the provided staged video and the question\. Do not use external sources or prior knowledge to retrieve or infer benchmark\-specific answers\.All questions below refer to the same video and must be answered in a single response\.Available staged video:\-\{VIDEO\_FILENAME\}\.mp4Video id / filename:\{VIDEO\_FILENAME\}\.mp4Requested evaluation level: Level\-3\.Return ONLY valid JSON\.Return a JSON array with exactly one object perquestion\_idshown below\.Do not include markdown fences, explanations, extra keys, chain\-of\-thought, or commentary\.The entire final response must be the JSON array itself\.Your first output character must be\[and your final output character must be\]\.If you cannot determine an answer, use an empty string\.Required output object schema:\{
"question\_id": "string",
"level3\_answer": "string"
\}Question bundle:\[
\{
"question\_id": "\{QUESTION\_ID\_1\}",
"question": "\{QUESTION\_TEXT\_1\}"
\},
\{
"question\_id": "\{QUESTION\_ID\_2\}",
"question": "\{QUESTION\_TEXT\_2\}"
\}
\]Level\-3 QA rule:\- This is a question answering task without benchmark\-provided temporal or spatial evidence\.\- Answer according to the video content and the question\.\- Return only the final answer text inlevel3\_answer\.Figure 17:Prompt for testing VideoZeroBench, where all questions for the same video are answered together in a single response\.
### H\.3Experimental Configuration Details

This subsection summarizes the evaluation configurations used in our experiments for each agent and omnimodal model setting, including controller settings, model assignments, modality\-processing rules, and benchmark\-specific runtime defaults\.

#### H\.3\.1Agent\-Omni

Controller and graph configuration\.Agent\-Omni is evaluated with a fixed multi\-agent graph consisting of modality summarization, master reasoning, modality\-specific execution, and master decision\. The default reasoning budget is two rounds, i\.e\.,max\_round\_num=2\. The global retry budget isretry\_times=10\.

Model assignment\.The committed benchmark configuration uses the following model split\.We use even stronger models compared to the configuration in their paper:

- •orchestrate agent:openai/gpt\-5\.4;
- •text agent:openai/gpt\-5\.4;
- •image agent:openai/gpt\-5\.4;
- •video agent: localQwen/Qwen2\.5\-Omni\-7B;
- •audio agent: localQwen/Qwen2\.5\-Omni\-7B\.

Generation settings\.The orchestrate, text, and image agents usemax\_tokens=4096andtemperature=0\. The video agent usesmax\_tokens=2048,temperature=0,max\_image\_input=12,max\_frames\_num=448, andimage\_size=448 448\. The audio agent usesmax\_tokens=4096andtemperature=0\.

Modality processing\.For video benchmarks, the audio track is extracted from each video and provided jointly with the original video input, so the effective input to the graph is audio\-video rather than video only\. All four modality branches remain part of the graph, while the master reasoning module determines which agents receive non\-empty questions\.

Output normalization\.For multiple\-choice evaluation, the graph output is not used directly\. Instead, a separate selector or finalizer model is applied after graph execution\. By default, this selector/finalizer uses the same model as the master agent\.

LVOmniBench configuration\.The default runtime configuration is:

- •concurrent\_limit=1,
- •max\_round\_num=2,
- •request\_timeout=600,
- •ffmpeg\_timeout=300,
- •empty\_answer\_retry\_rounds=3\.

SocialOmni configuration\.The default runtime configuration is:

- •max\_concurrency=1,
- •max\_round\_num=2,
- •request\_timeout=600,
- •ffmpeg\_timeout=300,
- •empty\_answer\_retry\_rounds=3\.

For Level\-2 evaluation, a prefix video is first cropped up to the target timestamp\. The model predicts Q1 before Q2, and Q2 is executed only when the gold Q1 label isAand the Q1 prediction is correct\. The default judge model for Level\-2 isgpt\-5\.4\.

VideoZeroBench configuration\.The default runtime configuration is:

- •level=level3,
- •group\_by\_video=True,
- •concurrent\_limit=1,
- •max\_round\_num=2,
- •request\_timeout=600,
- •ffmpeg\_timeout=300,
- •empty\_answer\_retry\_rounds=2\.

When grouped by video, all questions attached to the same video are answered from a single graph run\.

#### H\.3\.2OmniAgent

Controller configuration\.OmniAgent is evaluated as a single tool\-using controller\. The default controller budget ismax\_iterations=30\. The controller model usestemperature=1, and the default reasoning effort is set tohighwhen supported by the serving API\.

Model assignment\.The default released configuration is listed below\.We use even stronger models compared to the configuration in their paper:

- •controller model:openai/gpt\-5\.4;
- •video tool model: localQwen/Qwen2\.5\-Omni\-7B;
- •audio tool model:openai/gpt\-4o\-audio\-preview\.

Tool set\.The default core tools are:

- •audio\_global\_caption,
- •audio\_qa,
- •audio\_ASR,
- •Audio\_EventList,
- •Audio\_EventLocation,
- •video\_global\_qa,
- •video\_clip\_qa,
- •video\_metadata\.

External web and code tools are implemented but disabled by default in the benchmark runs\.

Video sampling configuration\.The default multimodal settings are:

- •VIDEO\_INPUT\_MODE=frames,
- •VIDEO\_TARGET\_FPS=0\.5,
- •VIDEO\_CLIP\_TARGET\_FPS=2\.0,
- •VIDEO\_MAX\_FRAMES=24,
- •VIDEO\_CLIP\_MAX\_FRAMES=40,
- •VIDEO\_MAX\_EDGE=1024,
- •MULTIMODAL\_MAX\_TOKENS=4096\.

For oversized or empty multimodal responses, the video tool falls back to smaller payloads using16/12/8frames and768/640/512maximum edge size\.

Audio processing\.The audio tool can consume audio files directly and can also extract audio from video inputs\. Audio event localization and timeline generation are performed with chunk\-based prompting followed by a merge step\.

LVOmniBench configuration\.The default runtime configuration is:

- •concurrent\_limit=1,
- •max\_iterations=30,
- •item\_timeout=5400,
- •empty\_answer\_retry\_rounds=3,
- •external tools disabled by default\.

SocialOmni configuration\.The default runtime configuration is:

- •concurrent\_limit=2,
- •max\_iterations=30,
- •item\_timeout=5400,
- •empty\_answer\_retry\_rounds=3,
- •external tools disabled by default\.

For Level\-2 evaluation, a prefix video is cached up to the target timestamp\. Q1 is executed before Q2, and Q2 is run only when the gold Q1 label isAand the Q1 prediction is correct\. When available, the benchmark ASR field is injected into the prompt\. The Level\-2 judge uses the evaluation endpoint specified by the evaluation model settings, which default to the controller\-model endpoint unless overridden\.

VideoZeroBench configuration\.The default runtime configuration is:

- •level=level3,
- •request\_unit=video,
- •concurrent\_limit=1,
- •max\_iterations=30,
- •item\_timeout=5400,
- •empty\_answer\_retry\_rounds=3,
- •external tools disabled by default\.

Underrequest\_unit=video, all questions corresponding to the same video are answered in one batched response, and the final answers are parsed from a JSON answer map\.

## Appendix ILicenses and Use Terms of Existing Assets

We list below all third\-party benchmarks, datasets, models, and software libraries used in this work, together with their licenses to the best of our knowledge\. All assets are used in a manner consistent with their stated terms; we used them only for non\-commercial academic research\.

### Benchmarks for Evaluation

- •OmniGAIA\(li2026omnigaia\): an open\-ended, tool\-mediated factual and multi\-hop omni\-modal reasoning benchmark\. Released under the MIT License for the code and CC\-BY\-4\.0 for the data annotations; the underlying media inherits the licenses of its original public sources\.
- •SocialOmni\(xie2026socialomni\): an audio\-visual social interaction benchmark for omni models\. Released under the MIT License for the code; data is provided for non\-commercial research use, with the underlying videos governed by their original platform licenses\.
- •LVOmniBench\(tao2026lvomnibench\): a long audio\-video understanding benchmark\. Released under the MIT License for the code and Apache\-2\.0 for the released annotations; underlying videos retain their original licenses\.
- •VideoZeroBench\(meng2026videozerobench\): a video MLLM benchmark with spatio\-temporal evidence verification\. Released under the MIT License for the code; data annotations are provided under CC\-BY\-4\.0 for academic use\.

### Datasets Used inOmniCoding

- •OmniGAIA\-SFT\-2K\(li2026omnigaia\): the supervised\-trajectory subset distributed with OmniGAIA\. Released alongside OmniGAIA under the same MIT/CC\-BY\-4\.0 terms\.
- •OmniVideoBench\(li2025omnivideobench\): an audio\-visual understanding benchmark for omni MLLMs\. Released under the MIT License; underlying videos retain their original public licenses\.
- •AVUTBenchmark\(yang2025avut\): an audio\-centric video understanding benchmark designed to remove text shortcuts\. Released under the MIT License for code and CC\-BY\-4\.0 for annotations\.
- •Video\-MME\-v2\(audio\-visual / temporal subset\)\(fu2026videommev2\): a comprehensive video understanding benchmark\. Released under the MIT License \(code\) and CC\-BY\-NC\-4\.0 \(data\) for academic, non\-commercial use\.

### Source Material forTerminalBench\-O

All raw media inTerminalBench\-Ois drawn from authentic public sources:

- •Internet Archive: content used under the licenses indicated on each Internet Archive item page \(typically Public Domain, CC\-BY, or CC0\)\.
- •HuggingFace public datasets: each used under the dataset\-specific license declared on its HuggingFace card \(the specific licenses are listed in our benchmark release card\)\.
- •Public web videos: incorporated only when the source platform’s terms permit research use, with attribution preserved in our task metadata\.

### Models

- •Qwen\-3\.5\-9BandQwen\-3\.6\-27B\(qwen35Omni\): open\-weight base models we fine\-tune\. Released by Alibaba under the Tongyi Qianwen / Apache\-2\.0\-style license, which permits research and commercial use subject to the model’s published terms\.
- •Whisper \(large\-v3\): ASR model used inside agent rollouts\. Released by OpenAI under the MIT License\.
- •Closed\-source models\(GPT\-5\.4, Claude Opus 4\.6 / Sonnet 4\.6, Gemini 3\.1 Pro, Gemini 3 Flash, etc\.\) are accessed exclusively through their official APIs in compliance with each provider’s terms of service\.

### Software and Libraries

- •ffmpeg / ffprobe: LGPL\-2\.1\+ \(with GPL components in some builds\)\.
- •Tesseract OCR: Apache\-2\.0\.
- •OpenCV: Apache\-2\.0\.
- •Librosa: ISC License\.
- •ImageMagick: ImageMagick License \(Apache\-2\.0\-compatible\)\.
- •yt\-dlp: Unlicense \(public domain\)\.
- •Pythonand standard scientific stack \(NumPy, SciPy, etc\.\): each under their respective permissive licenses \(BSD/PSF/MIT\)\.

### I\.1Omni\-terminal Harness

This appendix details our terminal\-agent harness, adapted from Terminus\-KIRA\(terminuskira2026\)and designed around a single operation: turning raw multimodal files into small, checkable evidence\. Each episode begins with an isolated workspace containing the question and the staged media files\. Rather than packing media into the model context, files remain on disk, and the agent must decide how to inspect them: extracting audio, sampling frames, running OCR or ASR, reading metadata, writing scripts, and verifying intermediate results before answering\. The agent operates through a small action space \(execute\_commands,image\_read,task\_complete\) and expresses higher\-level media operations through code and command\-line tools \(ffmpeg/ffprobe, ASR, OCR, Python, OpenCV, NumPy, and self\-authored scripts\) rather than benchmark\-specific APIs\. Beyond execution, the harness is centered on trajectory collection: every tool call, terminal observation, image observation, final answer, exit reason, and reward field is serialized so the same rollout can be reused for SFT filtering, RL replay, and failure analysis\.

Concretely, the harness has four properties that are important for training:

- •Filesystem\-grounded media processing\.Raw video, audio, images, and documents are workspace state\. The agent learns to transform them into transcripts, key frames, OCR text, timestamps, tables, and other compact evidence\.
- •Persistent terminal interaction\.Commands run in a stateful bash session with marker\-based completion, bounded observations, timeout recovery, and support for scripts written during the episode\.
- •Replay\-faithful multimodal observations\.image\_readinserts the actual image into the main conversation, rather than replacing it with a caption from a separate model\. The saved rollout therefore matches the observation distribution used for multimodal SFT/RL\.
- •Qwen\-compatible rollout collection\.The tool schema is flat, malformed tool\-call blocks are conservatively recovered, and requests are routed through sticky SGLang endpoints to preserve prefix\-cache locality during large parallel rollouts\.

Episode state\.Each task starts from an isolated workspace

s0=\(q,W0,U0\),s\_\{0\}=\(q,W\_\{0\},U\_\{0\}\),whereqqis the task instruction,W0W\_\{0\}contains the staged input files, andU0U\_\{0\}contains the available command\-line tools and Python libraries\. The model does not receive the raw video or audio stream as context\. Instead, it interacts with files through the terminal\. During a rollout, the workspace may accumulate derived artifacts such as extracted frames, audio clips, ASR transcripts, OCR outputs, metadata files, temporary scripts, and numerical tables\. The final prediction is accepted only from the benchmark\-normalized<answer\>\.\.\.</answer\>wrapper\.

Tool interface\.The model sees three tools\. The small tool set is deliberate: the terminal is the compositional interface, and the agent decides how to combine existing utilities with code it writes during the episode\.

ToolFunctionexecute\_commandsRuns one shell command in a persistent bash session\. The schema is flat: the call containsanalysis,plan, andkeystrokes\. Multi\-step work is expressed with shell syntax, scripts, or heredocs\.image\_readLoads an image file as a native multimodal observation for the main agent\. This is used for frames, crops, figures, screenshots, OCR targets, and visual verification\.task\_completeEnds the rollout after the model has emitted the final answer wrapper\. If the wrapper is missing, the harness issues one format reminder before scoring the run as empty\.Table 11:Model\-visible tools in theOmniCodingharness\.Agent loop\.At each step, the harness sends the current conversation and tool schema to the model\. If the model callsexecute\_commands, the command is executed in the persistent shell and the bounded output is returned as a tool observation\. If it callsimage\_read, the harness first records a short tool acknowledgement and then appends the image itself as a multimodal observation in the main conversation\. If it callstask\_complete, the harness checks whether a wrapped answer already appears in the trajectory\. The loop stops on successful completion, step limit, no\-tool\-call exhaustion, or runtime error\.

Shell runtime\.The shell is a persistent bash subprocess\. This preserves state across turns, including the current directory, environment variables, installed packages, and files created by the agent\. Command completion is detected by appending a hidden marker after each command and polling until the marker appears\. Terminal observations are capped with head–tail truncation, which keeps both the beginning of long outputs and the end where errors or final answers often appear\. When a command times out, the harness terminates the direct child processes while keeping the shell alive\. If the shell itself dies, it is restarted and the next observation warns the model that shell\-local state has been lost\.

Native multimodal replay\.The main change from a standard terminal harness is the nativeimage\_readpath\. A caption\-only image reader is acceptable for evaluation, but it is not acceptable for multimodal policy training: the saved trajectory would contain text produced by a separate model rather than the pixels observed during rollout\. Our harness instead stores image observations as multimodal content blocks in the main conversation\. For Qwen\-style chat templates, these messages are folded into the provider\-specific tool\-response format only at send time\. The saved artifact remains provider\-neutral and can be replayed for SFT or RL without changing the observation sequence\.

Qwen serving and tool\-call robustness\.Open\-weight models served through SGLang can emit partially malformed tool\-call blocks, especially when shell commands contain quotes, heredocs, or escape characters\. We therefore use a flatexecute\_commandsschema and a conservative recovery layer\. If the backend returns no structured tool call but the assistant text contains a recognizable<tool\_call\>block, the harness reconstructs the call only when the function name is in the declared tool list and the arguments can be parsed without inventing content\. Recovered calls are marked in the trajectory, so downstream data filtering can keep or drop them explicitly\.

Trajectory artifacts\.Each rollout stores the full message sequence, parsed tool calls, shell outputs, image\-observation references, final extracted answer, exit reason, and evaluation metadata\. The same artifact supports three downstream uses\. First, successful trajectories can be filtered into SFT examples\. Second, RL training can replay the same observation distribution seen during rollout\. Third, failures can be inspected at the process level: whether the agent sampled the wrong frames, trusted a bad ASR transcript, stopped too early, chose the wrong external fact, or made a calculation error\.

Table 12:Trajectory fields saved by the harness\.Artifact fieldPurposeMessagesReplayable conversation with tool calls and observations\.Workspace outputsDerived files such as frames, transcripts, OCR text, scripts, and tables\.Image referencesNative multimodal payloads used byimage\_read\.Answer fieldExtracted text inside<answer\>\.\.\.</answer\>\.Exit reasonCompletion, step limit, no\-tool\-call exit, or runtime error\.Reward metadataAutomatic verification result and benchmark\-specific scoring fields\.Prompt organization\.The prompt is factored into reusable pieces\. A static system prompt defines the tool contract\. A shared workspace prefix defines sandbox, network, and answer\-format rules\. Benchmark\-specific prompt bodies supply the actual question, options, files, and output schema\. Runtime reminders are used only when the model stops without a tool call or callstask\_completebefore producing a wrapped answer\. The exact prompt templates are shown in Figures[18](https://arxiv.org/html/2606.00579#A9.F18)–[23](https://arxiv.org/html/2606.00579#A9.F23)\.

`Common workspace and answer\-format prefix`Figure 18:Shared workspace, leakage\-prevention, network\-use, and final\-answer rules\.`Runtime reminder prompts`Figure 19:Runtime reminders used when the model stops without a tool call or callstask\_completebefore producing a wrapped answer\.`SocialOmni prompt bodies`Figure 20:Benchmark\-specific prompt bodies for SocialOmni Level 1 and Level 2\.`LVOmniBench prompt body`Figure 21:Benchmark\-specific prompt body for LVOmniBench\.`VideoZeroBench grouped\-video prompt body`Figure 22:Benchmark\-specific grouped\-video prompt body for VideoZeroBench\.`Calibration\-set self\-iteration prompt`Figure 23:Prompt used for calibration\-set self\-iteration\.

## Appendix JAdditional Experiment Results

Skill SettingLowMediumHighAvg\.No Skills70\.460\.050\.061\.4Human\-in\-the\-loop Skills80\.368\.859\.070\.5Log\-driven Self\-distillation86\.075\.065\.476\.7Calibration\-set Self\-iteration83\.671\.959\.073\.0

Table 13:Performance of differentSkillssettings onOmniGAIAacross three difficulty levels\. We report accuracy on theLow,Medium, andHighsplits, together with the overall average accuracy\.Table[13](https://arxiv.org/html/2606.00579#A10.T13)reports the performance of differentSkillssettings onOmniGAIAacross theLow,Medium, andHighdifficulty splits, together with the overall average accuracy\.

## Appendix KCase Study

### K\.1Representative Failure Case per Error Category

To make the failure taxonomy concrete, we present one representative case from each category, drawn from our 100 manually annotated incorrect trajectories\. Each case shows the question, ground\-truth answer, the agent’s final prediction, and a brief analysis of where the trajectory went wrong\.

Audio Perception and Extraction Error \(OmniGAIA, sonnet\-4\.6\)Question:Based on two audio clips and an image of toys\-to\-life gaming platforms \(Disney Infinity and Skylanders\), identify the fictional universes referenced in each audio clip, determine which universe was officially integrated into Disney Infinity \(and how\), identify which other toys\-to\-life platform integrated the non\-Disney universe, and provide the relevant release dates\.Labeled Answer:Clip 1 references both*Kingdom Hearts*and*Sonic the Hedgehog*; clip 2 references*Sonic the Hedgehog*\. Kingdom Hearts was integrated into Disney Infinity 3\.0 \(released August 30, 2015\) via the King Mickey Power Disc and the Sora Keyblade Ultimate Unlock\. Sonic was integrated into LEGO Dimensions via the Sonic Level Pack \(released November 18, 2016\)\.Model Output by OmniAtlas\-sonnet\-4\.6\.I’ll transcribe both audio clips with Whisper and then cross\-reference the YouTube IDs to identify the source material\.tool\_call"name": "Bash", "arguments": \{"command": "whisper …/audios/hYOcABfDmtk\.wav –model tiny"\}/tool\_calltool\_response\[…\] “Eggman’s army has everyone tear off\.” \[…\] “May your heart be your guiding key\.” \[…\]/tool\_responsetool\_call"name": "Bash", "arguments": \{"command": "whisper …/audios/m7z29\_PohpY\.wav –model tiny"\}/tool\_calltool\_response\[…\] “psychotic robot doctor chasing a supersonic blue head job” \[…\]/tool\_responseInstead of reasoning about the transcripts directly, let me reverse\-lookup the YouTube video IDs to confirm the source\.tool\_call"name": "WebFetch", "arguments": \{"url": "https://www\.youtube\.com/watch?v=hYOcABfDmtk&format=json"\}/tool\_calltool\_responseThe title of the YouTube video is “Team Robot’s First Adventure Begins Cinematic Trailer \(Fan\-Made\)\.”/tool\_responsetool\_call"name": "WebSearch", "arguments": \{"query": "Kingdom Hearts Disney Infinity integration Power Disc figure"\}/tool\_calltool\_response\[D23 King Mickey Power Disc; Disney Infinity 3\.0 Ultimate Unlock is Kingdom Hearts Keyblade …\]/tool\_responsefinal\_answerAudio clip 1 \(hYOcABfDmtk, fan\-made “Team Robot” trailer\) references only the*Kingdom Hearts*universe\. Clip 2 references*Sonic the Hedgehog*\. Kingdom Hearts was integrated into Disney Infinity 3\.0 \(August 30, 2015\); Sonic was integrated into LEGO Dimensions \(November 18, 2016\)\./final\_answerFailure Analysis\.The Whisper transcript of clip 1 already contained the explicit Sonic signal “Eggman’s army\.” By routing source identification through the YouTube\-ID lookup \(which returned only the*Kingdom Hearts*\-flavoured fan title\) the agent overwrote its own audio evidence, and the aggregation step locked clip 1 to a single universe\.Figure 24:Representative failure case: audio perception and extraction\.Video Perception and Extraction Error \(OmniGAIA, sonnet\-4\.6\)Question:Given an image of a riverside restaurant and an audio clip discussing duck sourcing, determine the restaurant’s founding year, the duck farm’s founding year, the maximum weight per duck \(converted to pounds, one decimal place\), and the number of ducks clearly visible swimming in the river\. Compute \(weightduck count\)\+\+\(farm year−\-restaurant year\)\.Labeled Answer:52\.0Model Output by OmniAtlas\-sonnet\-4\.6\.I’ll transcribe the audio for the farm and weight information, then carefully inspect the image to count the ducks in the river\.tool\_call"name": "Bash", "arguments": \{"command": "whisper inputs/audios/bIlloRTNKhU\.wav –model base –language en"\}/tool\_calltool\_response\[…\] “I brought in this Irish duck …Silver Hill …” \[…\] duck weight “from 2\.2 to 2\.7 kilos\.” \[…\]/tool\_responsetool\_call"name": "WebSearch", "arguments": \{"query": "Silver Hill Farm duck Ireland founding year established"\}/tool\_calltool\_response\[Silver Hill Duck, Emyvale, Co\. Monaghan, founded 1962 …\]/tool\_responsetool\_call"name": "WebSearch", "arguments": \{"query": "Casa Rio restaurant San Antonio River Walk founding year"\}/tool\_calltool\_response\[Casa Rio, 430 E Commerce St, San Antonio, opened 1946 …\]/tool\_responseNow I’ll crop and enhance the water region repeatedly to count ducks\.tool\_call"name": "Bash", "arguments": \{"command": "python3 \-c ’Image\.open\(…\)\.crop\(\(0, \.55H, \.4W, H\)\)\.save\(/tmp/left\_water\.jpg\); …’"\}/tool\_calltool\_response\[Sections saved: left\_water\.jpg, mid\_water\.jpg, right\_water\.jpg, enhanced\_water\.jpg, far\_right\_zoom\.jpg, far\_left\_zoom\.jpg\]/tool\_responseAfter several crop/enhance passes, I count 2 ducks on the left and 3 on the right, for a total of 5 ducks visible\.tool\_call"name": "Bash", "arguments": \{"command": "python3 \-c ’w=2\.7\*2\.2046; print\(round\(w,1\)\*5 \+ \(1962\-1946\)\)’"\}/tool\_calltool\_responseProduct: 6\.0 5 = 30\.0; Difference: 16; Final sum: 46\.0/tool\_responsefinal\_answer46/final\_answerFailure Analysis\.All non\-visual evidence \(restaurant identity, duck farm, per\-duck weight\) is correct, and the arithmetic chain is clean\. The single perceptual miss – counting 5 rather than 7 swimming ducks after repeated crop/enhance passes – fully determines the final error\.Figure 25:Representative failure case: video perception and extraction\.Insufficient Exploration of Modal Content \(OmniGAIA, gpt\-5\.4\_low\)Question:Godson’s video demonstrates a Clash of Clans wall\-breaking technique where two P\.E\.K\.K\.A\.s simultaneously attack a wall segment\. Using max\-level P\.E\.K\.K\.A\. damage and the Level 13 wall HP \(as of late 2024\), how many such simultaneous volleys are needed to destroy the wall?Labeled Answer:3 simultaneous volleysModel Output by OmniAtlas\-gpt\-5\.4\_low\.I’ll verify the two quantities needed – max\-level P\.E\.K\.K\.A\. per\-hit damage and Level 13 Wall HP – using web search before computing the volley count\.tool\_call"name": "web\_search", "arguments": \{"query": "Clash of Clans P\.E\.K\.K\.A max level damage per hit late 2024 Home Village official"\}/tool\_calltool\_response\{"status": null, …\}/tool\_responsetool\_call"name": "web\_search", "arguments": \{"query": "https://clashofclans\.fandom\.com/wiki/Wall"\}/tool\_calltool\_response\{"status": null, …\}/tool\_responsetool\_call"name": "web\_search", "arguments": \{"query": "’13\|\|’ in https://clashofclans\.fandom\.com/wiki/Wall"\}/tool\_calltool\_response\{"status": null, …\}/tool\_responsefinal\_answer4/final\_answerFailure Analysis\.After two shallow lookups returned no numeric content, the agent did not retry with different queries, try fetching the wiki directly, or derive the quantity from the video\. It emitted “4” with no retrieval evidence and no derivation – the failure is premature termination of exploration, not a retrieval\-quality or calculation error\.Figure 26:Representative failure case: insufficient exploration of modal content\.Knowledge Retrieval and Factual Error \(OmniGAIA, kimi\)Question:From two wetland\-bird images identify each species and family, then use Table 1 of a cited Alabama marsh\-bird study to obtain mean nest\-site water depths\. Given that spinnerbaits sink at 1 ft/s, compute the count\-down time for each depth, pick the weight from the audio’s depth–weight recommendations, and identify the optimal month for pre\-spawn bass fishing in Purple Gallinule habitat\.Labeled Answer:Common Gallinule: 1\.0 s count\-down, 3/8 oz spinnerbait\. Purple Gallinule: 1\.4 s count\-down, 3/8 oz spinnerbait\. Optimal month: April\.Model Output by OmniAtlas\-kimi\.I’ll identify both species from the images and transcribe the audio for the weight–depth rule, then link to the Alabama study\.tool\_call"name": "Read", "arguments": \{"file\_path": "…/images/000000156858\.jpg"\}/tool\_calltool\_response\{"type": "image"\}*\[Common Gallinule\]*/tool\_responsetool\_call"name": "Read", "arguments": \{"file\_path": "…/images/000000189159\.jpg"\}/tool\_calltool\_response\{"type": "image"\}*\[Purple Gallinule\]*/tool\_responsetool\_call"name": "Bash", "arguments": \{"command": "whisper …/audios/XTnFUc2WneU\.wav –model tiny"\}/tool\_calltool\_response\[…\] “three\-eighths, half, three\-quarter, and the one ounce …” \[rules linking depth to weight\] \[…\]/tool\_responseI have species and the weight rule\. For the Alabama Table 1 depths I’ll rely on general knowledge of marsh\-bird nesting ecology\.final\_answerSpecies 1: Common Gallinule \(Rallidae\), nest depth0\.8\\approx 0\.8ft, count\-down 0\.8 s, 3/8 oz\. Species 2: Purple Gallinule \(Rallidae\), nest depth1\.5\\approx 1\.5ft, count\-down 1\.5 s, 1/2 oz\. Optimal month: May\./final\_answerFailure Analysis\.Image and audio evidence were correctly obtained\. Without ever locating or reading the cited Alabama study, the agent substituted hallucinated nest\-depth values \(0\.8 ft, 1\.5 ft\) for the actual Table 1 entries \(1\.0 ft, 1\.4 ft\) and mis\-selected the optimal month\. Every derived quantity \(count\-down, weight, month\) is conditioned on this factual substitution\.Figure 27:Representative failure case: knowledge retrieval and factual error\.Logical Reasoning and Calculation Error \(OmniGAIA, sonnet\-4\.6\)Question:Observing the image of Grand Central Terminal’s four\-faced clock and listening to the audio from a video about Andon lamps, determine the number of years between the terminal’s opening and the lamps’ estimated creation date\.Labeled Answer:292Model Output by OmniAtlas\-sonnet\-4\.6\.I can see the four\-faced opal clock at Grand Central Terminal; the terminal opened in 1913\. Now I’ll transcribe the audio to recover the Andon\-lamp date reference\.tool\_call"name": "Bash", "arguments": \{"command": "whisper …/audios/VrcfGK\-\_iIo\.wav –model base"\}/tool\_calltool\_response\[…\] “They’re called Andon …that design was created 400 years ago\.” \[…\]/tool\_responseI have both pieces of information\. Grand Central opened in 1913\. The audio says Andon lamps were designed “400 years ago” – I’ll anchor that to the current year \(2026\) to get the lamp design year\.tool\_call"name": "Bash", "arguments": \{"command": "python3 \-c ’print\(1913 \- \(2026\-400\)\)’"\}/tool\_calltool\_response287/tool\_responsefinal\_answer287/final\_answerFailure Analysis\.All perceptual and retrieval evidence is correct \(opening year 1913; “400 years ago” in transcript\) and there are no infrastructure issues\. The decisive error is temporal anchoring: “400 years ago” should be taken relative to the video’s 2021 publication date \(giving 1621, difference 292\), not to 2026 \(giving 1626, difference 287\)\. One wrong anchor fully determines the error\.Figure 28:Representative failure case: logical reasoning and calculation\.Tool and Environment Infrastructure Failure \(OmniGAIA, gpt\-5\.4\_low\)Question:A multi\-part question pairs a train image \(with a visible unit number\) and a Jakarta\-region audio clip describing a commuter journey\. Identify the train, the audio’s departure station and destination, the great\-circle and rail distances, the expected versus actual KRL fare, and the average journey speeds\.Labeled Answer:TrainON 5937\(Sydney Trains H\-set “Oscar”\); departure Cikini, destination Bogor; straight\-line44\\approx 44km; rail distance46\.846\.8km; expected fare Rp 6,000 matches actual; average speeds28\.128\.1and35\.135\.1km/h\.Model Output by OmniAtlas\-gpt\-5\.4\_low\.The train unit numberON 5937is clearly visible; the livery and numbering format are consistent with the Sydney Trains H\-set\. Now I need the audio transcript for the departure station\.tool\_call"name": "command\_execution", "arguments": \{"command": "python \-c ’import whisper, faster\_whisper, speech\_recognition, vosk, transformers’"\}/tool\_calltool\_responsewhisper NO; faster\_whisper NO; speech\_recognition NO; vosk NO; transformers NO/tool\_responsetool\_call"name": "command\_execution", "arguments": \{"command": "which whisper && ffmpeg \-filters\|\|rg ’asr\|\|whisper\|\|pocket’"\}/tool\_calltool\_responsewhisper: command not found; ffmpeg has no ASR filters\./tool\_responsetool\_call"name": "command\_execution", "arguments": \{"command": "which uv && uv pip install transformers accelerate tokenizers"\}/tool\_calltool\_responseuv: not found; dependencies cannot be self\-installed; no GPU/CUDA available\./tool\_responseWith no available ASR, I’ll infer the departure station from web search using the KRL Commuterline context\.tool\_call"name": "web\_search", "arguments": \{"query": "Manggarai Station coordinates"\}/tool\_calltool\_response\[Manggarai railway station, Jakarta:−6\.2099\-6\.2099,106\.8502106\.8502…\]/tool\_responsefinal\_answerTrainON 5937\(Sydney H\-set/Oscar\)\. Audio journey: Manggarai Station to Bogor\. Straight\-line43\\approx 43km; rail distance44\.144\.1km\. Expected fare Rp 5,000 matches actual\. Average speeds44\.144\.1and35\.335\.3km/h\./final\_answerFailure Analysis\.Every local ASR route failed at the environment level \(whisper/faster\-whisper/vosk/speech\_recognitionabsent;transformers/accelerate/tokenizersmissing; GPU/CUDA unavailable;uvnot installed so dependencies cannot be self\-installed\)\. The audio modality was blocked before the agent could produce a transcript, and the downstream “Manggarai” guess \(and every distance/fare/speed value built on it\) is a consequence of that infrastructure failure, not a separate reasoning error\.Figure 29:Representative failure case: tool and environment infrastructure failure\.
Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

Similar Articles

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

Coding Agent Is Good As World Simulator

SWE Context Bench just proved something I think a lot of coding agent users already feel

@markokraemer: Yes we are building SandboxAgent which is just opencode based runtime running in sandboxes 1 random thing is remote ses…

How We Built Secure, Scalable Agent Sandbox Infrastructure (8 minute read)

Submit Feedback

Similar Articles

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
Coding Agent Is Good As World Simulator
SWE Context Bench just proved something I think a lot of coding agent users already feel
@markokraemer: Yes we are building SandboxAgent which is just opencode based runtime running in sandboxes 1 random thing is remote ses…
How We Built Secure, Scalable Agent Sandbox Infrastructure (8 minute read)