Orchestra-o1: Omnimodal Agent Orchestration

arXiv cs.CL 06/15/26, 04:00 AM Papers
agent-orchestration multimodal llm-agent reinforcement-learning open-source benchmark
Summary
Orchestra-o1 is an omnimodal agent orchestration framework that supports efficient agent collaboration across text, image, audio, and video. It introduces decision-aligned group relative policy optimization (DA-GRPO) and achieves state-of-the-art performance on the OmniGAIA benchmark.
arXiv:2606.13707v1 Announce Type: cross
# Orchestra-o1: Omnimodal Agent Orchestration
Source: [https://arxiv.org/html/2606.13707](https://arxiv.org/html/2606.13707)

LUMIA Lab

Vireo Zhang∗, Shengju Qian²,†, Haoxuan Li³, Hao Wu⁴, Jinyang Wu⁴, Donghao Zhou¹, Zhihong Zhu³, Zheng Lian⁵, Xin Wang², Pheng-Ann Heng¹,†

¹CUHK, ²LIGHTSPEED, ³PKU, ⁴THU, ⁵Tongji University

∗Equal Contribution, †Corresponding Author

Corresponding email: fzhang25@cse.cuhk.edu.hk

GitHub: https://github.com/zfkarl/Orchestra-o1

Hugging Face: https://huggingface.co/Karl28/Orchestra-o1-8B

###### Abstract

The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance against all existing open-source omnimodal agents. The source code is publicly available at the above links.

![Refer to caption](https://arxiv.org/html/2606.13707v1/x1.png)

Figure 1: Comparison among three types of omnimodal agents.

## 1 Introduction

Large language model (LLM)-based agents [luo2025large,wang2024survey] have recently emerged as a powerful paradigm for building intelligent systems that can reason, plan, use tools, and interact with external environments. By augmenting LLMs with harness mechanisms [pan2026natural,meng2026agent], agent systems have substantially expanded the boundary of what language models can accomplish. Representative applications such as code generation and execution [zhang2024codeagent,huang2023agentcoder], autonomous web research [team2025tongyi,qiao2025webresearcher], interactive problem solving [yu2026webanchor,tao2025webshaper], and open-ended computer-use tasks [agashe2025agent,wangcomputer] have demonstrated the potential of LLM agents to reshape human productivity and information access. More recently, the success of agent swarms [team2026kimi] has further shifted the research focus from single-agent workflows to multi-agent systems, where a main agent coordinates multiple specialized agents to decompose complex tasks, execute sub-tasks, and aggregate intermediate results. This paradigm highlights the importance of agent orchestration, which determines how agents are created, specialized, scheduled, and coordinated during task solving.

Despite this progress, most existing LLM-based agent systems are still designed for a limited range of modalities, typically focusing on either pure-text tasks [zhang2024cut] or vision-language tasks [geng2025webwatcher]. This creates a clear gap between current agent research and real-world scenarios, where information is inherently omnimodal and often involves the coexistence and interaction of text, image, audio, and video. In everyday situations, humans naturally process heterogeneous sensory signals in a unified manner. For example, during face-to-face communication, people simultaneously interpret spoken language, facial expressions, gestures, and environmental cues, and then decide how to respond accordingly. Such omnimodal understanding and decision-making are natural for humans but remain highly challenging for existing agents. To solve omnimodal tasks, an agent must not only perceive information from diverse modalities, but also reason over their interactions, decide which specialized capabilities are needed, and coordinate actions across multiple tools or sub-agents. This requires a unified framework that supports both omnimodal perception and high-level agentic decision-making.

As shown in Figure [1](https://arxiv.org/html/2606.13707#S0.F1), current omnimodal agents can be broadly categorized into two types. The first category is native omnimodal agents [team2026qwen3], which directly employ an omnimodal large language model (OLLM) as the agentic backend and equip it with various action tools. In this design, the same model is expected to perform perception, reasoning, planning, and tool use simultaneously. However, existing OLLMs still exhibit limited capability in jointly handling perception and action, especially when tasks require long-horizon reasoning, external information seeking, code execution, or fine-grained cross-modal understanding. As a result, even strong proprietary omnimodal models such as Gemini-3-Pro [gemini3pro] achieve only 62.5% accuracy on the challenging OmniGAIA benchmark [li2026omnigaia]. The second category is orchestration-based agents [ruan2026aorchestra], which decouple perception and action from high-level reasoning. In such systems, a text-based language model usually serves as the main agent or orchestrator, while perception and action are delegated to specialized sub-agents equipped with corresponding tools. This design separates high-level decision-making from low-level modality processing, making the system more modular, extensible, and potentially more scalable for complex omnimodal tasks.

In this paper, we focus on orchestration-based omnimodal agents. Designing an effective omnimodal agent swarm, however, is non-trivial for the following reasons. First, many powerful closed-source agent swarm frameworks, such as Kimi [team2026kimi] and Claude [claudeopus46], are hidden behind proprietary APIs, making it difficult to extend them for omnimodal research. Second, existing open-source agent orchestration frameworks [ruan2026aorchestra,su2025toolorchestra] are often limited by incomplete perception and action toolsets, as well as relatively rigid and linear sub-agent workflows. These limitations restrict both the scalability and efficiency of agent systems when handling complex tasks involving heterogeneous modalities. To this end, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. At the model level, Orchestra-o1 supports flexible agentic backends, allowing both the main agent and sub-agents to be instantiated with different models, including open-source models and proprietary models. At the tool level, we provide a unified tool ecosystem consisting of perception tools and action tools, enabling the system to understand and coordinate diverse inputs such as text, image, audio, and video, while also supporting external information seeking and code execution. At the scaffold level, Orchestra-o1 introduces a collaborative orchestration mechanism based on agent skills and context memory, enabling modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. Together, these designs make Orchestra-o1 both effective and efficient for solving complex omnimodal agent tasks. When using GPT-5 [openaigpt5] as the main agent, Orchestra-o1 establishes a new state-of-the-art (SOTA) on the OmniGAIA benchmark and substantially outperforms competing baselines, achieving a 32.8% improvement over AOrchestra [ruan2026aorchestra] and a 10.3% improvement over Gemini-3-Pro [gemini3pro].

In addition to the orchestration framework, we further explore how to train an open-source model to serve as the main agent in Orchestra-o1. To this end, we propose decision-aligned group relative policy optimization (DA-GRPO), an efficient offline agentic reinforcement learning algorithm for enhancing orchestration decision-making. DA-GRPO extends GRPO [guo2025deepseek] with a design specifically tailored for agent orchestration. Unlike the original GRPO, which focuses solely on final-answer correctness, DA-GRPO explicitly aligns the main agent's step-level decisions with high-quality reference trajectories, covering key decisions such as task delegation, sub-agent selection, tool usage, and answer generation. Leveraging high-quality synthetic trajectories and a multi-dimensional rubric-based reward design, we train Orchestra-o1-8B based on Qwen3-8B [yang2025qwen3] to serve as an open-source main agent within the Orchestra-o1 framework. Experimental results demonstrate that Orchestra-o1-8B significantly improves the performance of open-source omnimodal agents on OmniGAIA, increasing the previous best accuracy from 20.8% to 30.0%.

In summary, the main contributions of this paper are as follows:

- Omnimodal Agent Orchestration Framework. We propose Orchestra-o1, an omnimodal agent orchestration framework for complex real-world agent tasks. Through modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution, Orchestra-o1 decouples high-level orchestration from specialized perception and action execution, serving as a scalable open-source framework for building omnimodal agent swarms.
- Efficient Agent Orchestration Training Recipe. We develop DA-GRPO, an efficient agentic reinforcement learning algorithm for orchestration training. DA-GRPO aligns the main agent's step-level orchestration decisions with high-quality reference trajectories based on a multi-dimensional rubric reward design, enabling open-source models to acquire stronger delegation, planning, and decision-making capabilities in omnimodal agent systems.
- Multifaceted Experimental Validation. Extensive experiments demonstrate that Orchestra-o1 significantly outperforms existing omnimodal agents. With a strong proprietary main agent, it achieves a new state-of-the-art on OmniGAIA, surpassing the second-best approach by 10.3% accuracy. Compared to AOrchestra, Orchestra-o1 further achieves faster inference and better cost-effectiveness, benefiting from its parallelizable orchestration design. Moreover, when trained with DA-GRPO, Orchestra-o1-8B consistently outperforms existing open-source omnimodal agents by a large margin.

## 2 Related Work

### 2.1 LLM-based Agent Orchestration

Recent advances in LLM-based agents have shifted from single-agent reasoning systems to multi-agent orchestration frameworks. Early efforts primarily focus on enhancing tool use and planning capabilities within a single agent [yao2022react,schick2023toolformer], where the model iteratively interacts with external tools to solve complex tasks. More recently, multi-agent systems have emerged as a promising direction, where a central orchestrator coordinates multiple specialized agents to improve scalability and task decomposition. Representative works such as AutoGen-style systems [wu2024autogen] and agent swarms [team2026kimi] demonstrate that dividing responsibilities across agents can significantly improve performance on complex reasoning and interactive tasks. However, existing orchestration frameworks are mostly designed for text-based or limited vision-language settings [ruan2026aorchestra,zhang2026flowsteer], and often rely on linear or heuristic-driven workflows. In contrast, real-world tasks require more flexible coordination strategies that can dynamically adapt agent roles, parallelize execution, and integrate heterogeneous tools. Our work differs from prior studies by focusing on a unified orchestration framework that supports modality-aware decomposition and scalable multi-agent collaboration in omnimodal environments.

### 2.2 Omnimodal Agent Intelligence

Omnimodal intelligence extends traditional vision-language or audio-language systems to handle heterogeneous inputs such as text, image, audio, and video within a unified framework. Early multimodal models mainly focus on bimodal settings, such as vision-language understanding [li2023blip,liu2023visual], which demonstrate strong capabilities in aligning visual and textual representations. With the development of large-scale multimodal models, recent works have begun exploring omnimodal agents [gemini3pro,team2026qwen3,ai2025ming,team2025longcat]. These models aim to unify perception and reasoning across multiple modalities, enabling more general interaction capabilities. However, their performance remains limited in complex agentic scenarios that require long-horizon reasoning, tool use, and multi-step decision-making. To address these limitations, recent approaches introduce external tool augmentation or modular decomposition to improve omnimodal reasoning [li2026omnigaia]. Nevertheless, these methods often lack systematic orchestration mechanisms for coordinating multiple specialized components. In contrast, our work focuses on an explicit omnimodal agent orchestration paradigm, where perception, reasoning, and action are decoupled and coordinated through a structured multi-agent system, enabling more scalable and efficient omnimodal intelligence.

## 3 Methodology

In this section, we first review the background of agent orchestration and introduce the necessary preliminaries \(Section[3\.1](https://arxiv.org/html/2606.13707#S3.SS1)\)\. We then present our proposed omnimodal agent orchestration framework, Orchestra\-o1 \(Section[3\.2](https://arxiv.org/html/2606.13707#S3.SS2)\), followed by the training recipe for deriving an open\-source main agent within the framework \(Section[3\.3](https://arxiv.org/html/2606.13707#S3.SS3)\)\.

![Refer to caption](https://arxiv.org/html/2606.13707v1/x2.png)

Figure 2: An overview of the Orchestra-o1 framework.

### 3.1 Preliminary

##### Problem Definition\.

We formulate omnimodal agent orchestration as a multi-round decision-making problem over heterogeneous inputs. A task instance is $x=(q,\mathcal{M})$, where $q$ denotes the natural-language question and $\mathcal{M}=\{m_{i}\}_{i=1}^{N}$ denotes a set of auxiliary modality inputs such as images, audio clips, and videos. The goal is to produce a concise final answer $\hat{a}$ that maximizes the task reward $R(\hat{a},a^{*})$ with respect to the ground-truth answer $a^{*}$.

##### System Formulation\.

An orchestration-based agent system consists of a main agent, a set of sub-agent backends, and a tool ecosystem. The main agent $\pi_{\theta}$ acts as an orchestrator rather than directly operating on every modality. At orchestration round $t$, it observes a state:

$$s_{t}=\big(q,\mathcal{M},c_{t},H_{t},\mathcal{B},\mathcal{T}\big),\tag{1}$$

where $c_{t}$ is the accumulated context, $H_{t}$ is the structured sub-task history, $\mathcal{B}$ is the set of available sub-agent models, and $\mathcal{T}$ is the set of tools available to sub-agents. The main agent outputs a structured decision $y_{t}$ from two action types: $y_{t}\in\{\mathtt{delegate},\mathtt{complete}\}$. If $y_{t}=\mathtt{complete}$, the main agent terminates the trajectory and returns $\hat{a}$. If $y_{t}=\mathtt{delegate}$, it generates a batch of $K_{t}$ sub-tasks:

$$\mathcal{U}_{t}=\{u_{t,j}\}_{j=1}^{K_{t}},\quad u_{t,j}=(I_{t,j},C_{t,j},b_{t,j},\mathcal{T}_{t,j}),\tag{2}$$

where $I_{t,j}$ is a sub-task instruction, $C_{t,j}$ is the context passed from previous rounds, $b_{t,j}\in\mathcal{B}$ is the selected sub-agent backend, and $\mathcal{T}_{t,j}\subseteq\mathcal{T}$ is the assigned tool subset. Each sub-task is executed by an independent sub-agent, producing a result tuple $z_{t,j}$ that contains its status, answer-like result, summary, and execution trace. The results are summarized and appended to $H_{t+1}$, after which the main agent either launches another delegation round or produces the final answer.

This formulation highlights two key requirements for omnimodal orchestration. First, the main agent must make modality-aware decisions: it needs to identify which inputs and tools are relevant before dispatching sub-tasks. Second, it must make dependency-aware scheduling decisions: independent sub-tasks should be executed in parallel, while dependent sub-tasks should be delayed until prerequisite results become available.
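As a rough illustration of the structured decision in Eqs. (1) and (2), the sketch below encodes a sub-task tuple and a `delegate`/`complete` decision as plain Python dataclasses. The class names, field names, and the `validate` helper are hypothetical and are not taken from the released code; they only mirror the symbols defined above.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Set


@dataclass
class SubTask:
    """One u_{t,j} = (I, C, b, T) from Eq. (2); field names are illustrative."""
    instruction: str   # I_{t,j}: sub-task instruction
    context: str       # C_{t,j}: context passed from previous rounds
    backend: str       # b_{t,j}: selected sub-agent backend
    tools: List[str]   # T_{t,j}: assigned tool subset


@dataclass
class Decision:
    """One y_t: either delegate a batch of sub-tasks or complete with an answer."""
    action: Literal["delegate", "complete"]
    sub_tasks: List[SubTask] = field(default_factory=list)
    answer: str = ""


def validate(decision: Decision, backends: Set[str], tools: Set[str]) -> bool:
    """Check that a decision respects the constraints b in B and T_{t,j} subset of T."""
    if decision.action == "complete":
        return not decision.sub_tasks          # completing carries no sub-tasks
    if decision.action != "delegate" or not decision.sub_tasks:
        return False                           # delegating needs at least one sub-task
    return all(
        u.backend in backends and set(u.tools) <= tools
        for u in decision.sub_tasks
    )
```

Keeping the decision as a typed record rather than free text makes it straightforward to reject malformed delegations before any sub-agent is launched.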

### 3.2 The Orchestra-o1 Framework

Figure [2](https://arxiv.org/html/2606.13707#S3.F2) presents the overall architecture of Orchestra-o1. The framework is designed as a hierarchical policy that factorizes complex omnimodal problem solving into high-level orchestration and low-level specialized execution. Let $\mathcal{B}=\{b_{\ell}\}_{\ell=1}^{L}$ denote the candidate sub-agent backends and $\mathcal{T}=\mathcal{T}^{\mathrm{perc}}\cup\mathcal{T}^{\mathrm{act}}$ represent the unified tool set. In Orchestra-o1, the perception tool set $\mathcal{T}^{\mathrm{perc}}$ consists of tools for image analysis, audio analysis, and video analysis. The action tool set $\mathcal{T}^{\mathrm{act}}$ contains tools for web search, page visit, and code execution. At round $t$, the main agent implements a stochastic orchestration policy:

$$y_{t}\sim\pi_{\theta}(\cdot\mid s_{t}),\quad y_{t}=(a_{t},\xi_{t}),\quad a_{t}\in\{\mathtt{delegate},\mathtt{complete}\},\tag{3}$$

where $a_{t}$ is the high-level action and $\xi_{t}$ denotes its structured arguments. The system-level trajectory is therefore $\tau=\big(s_{1},y_{1},Z_{1},s_{2},y_{2},Z_{2},\ldots,s_{T},y_{T}\big)$, where $Z_{t}$ is the set of sub-agent results returned after a delegation action. The objective of Orchestra-o1 is to maximize expected task utility while penalizing monetary cost and latency:

$$\max_{\pi_{\theta}}\;\mathbb{E}_{\tau\sim\pi_{\theta}}\Big[R(\hat{a},a^{*})-\lambda_{c}\operatorname{Cost}(\tau)-\lambda_{l}\operatorname{Latency}(\tau)\Big],\tag{4}$$

where $\lambda_{c}$ and $\lambda_{l}$ weight the cost and latency penalties.

##### Flexible Agentic Backends.

Orchestra-o1 supports heterogeneous model backends for both the main agent and sub-agents. Each backend $b\in\mathcal{B}$ is represented by a skill vector and a cost-latency profile $\phi(b)=\big(\phi^{\mathrm{txt}}_{b},\phi^{\mathrm{img}}_{b},\phi^{\mathrm{aud}}_{b},\phi^{\mathrm{vid}}_{b},\phi^{\mathrm{code}}_{b},\kappa_{b},\delta_{b}\big)$, where the first five terms encode capability scores, while $\kappa_{b}$ and $\delta_{b}$ denote unit cost and expected latency. For a candidate sub-task $u$, the main agent predicts a requirement vector $r(u)=\big(r^{\mathrm{txt}}_{u},r^{\mathrm{img}}_{u},r^{\mathrm{aud}}_{u},r^{\mathrm{vid}}_{u},r^{\mathrm{code}}_{u}\big)\in[0,1]^{5}$. The model assignment can be viewed as maximizing a cost-aware matching score:

$$b^{*}(u)=\operatorname*{arg\,max}_{b\in\mathcal{B}}\;\underbrace{\langle r(u),\phi_{b}\rangle}_{\text{capability match}}-\lambda_{c}\kappa_{b}\,\ell(u)-\lambda_{l}\delta_{b},\tag{5}$$

where $\phi_{b}$ collects the five capability scores of backend $b$ and $\ell(u)$ is the estimated token or step length of the sub-task. This formulation captures the practical backend selection strategy in Orchestra-o1: easy extraction and search sub-tasks are routed to cheaper models, while difficult omnimodal reasoning sub-tasks are routed to stronger backends.
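The matching score in Eq. (5) can be sketched in a few lines of Python. The backend names, skill values, costs, and penalty weights below are made-up numbers chosen only to show the routing behavior described above; they are assumptions, not measurements from the paper.

```python
def select_backend(req, backends, est_len, lam_c, lam_l):
    """Return the name of the backend maximizing the cost-aware score of Eq. (5).

    req      : requirement vector r(u) over (txt, img, aud, vid, code)
    backends : list of dicts with hypothetical keys "name", "skills",
               "unit_cost" (kappa_b) and "latency" (delta_b)
    est_len  : estimated token or step length l(u)
    """
    def score(b):
        match = sum(r * s for r, s in zip(req, b["skills"]))  # <r(u), phi_b>
        return match - lam_c * b["unit_cost"] * est_len - lam_l * b["latency"]

    return max(backends, key=score)["name"]


# Illustrative profiles (assumed values).
BACKENDS = [
    {"name": "cheap", "skills": (0.8, 0.3, 0.2, 0.2, 0.5), "unit_cost": 0.1, "latency": 1.0},
    {"name": "strong", "skills": (0.9, 0.9, 0.8, 0.8, 0.9), "unit_cost": 1.0, "latency": 3.0},
]

easy_text = (1, 0, 0, 0, 0)    # pure text extraction
hard_multi = (1, 1, 0, 1, 0)   # text + image + video reasoning

print(select_backend(easy_text, BACKENDS, est_len=2, lam_c=0.05, lam_l=0.05))   # cheap
print(select_backend(hard_multi, BACKENDS, est_len=2, lam_c=0.05, lam_l=0.05))  # strong
```

With these numbers the text-only sub-task scores 0.74 on the cheap backend against 0.65 on the strong one, while the multimodal sub-task scores 1.24 against 2.35, so the cost and latency penalties only dominate when the capability gap is small.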

##### Unified Omnimodal Tool Ecosystem\.

The tool assignment problem is also formulated as requirement matching. Let each tool $g\in\mathcal{T}$ have a capability vector $\psi(g)=\big(\psi^{\mathrm{txt}}_{g},\psi^{\mathrm{img}}_{g},\psi^{\mathrm{aud}}_{g},\psi^{\mathrm{vid}}_{g},\psi^{\mathrm{web}}_{g},\psi^{\mathrm{code}}_{g}\big)\in\{0,1\}^{6}$. For a sub-task $u$, the selected tool subset is:

$$\mathcal{T}^{*}(u)=\{g\in\mathcal{T}:p_{\theta}(g\mid u,s_{t})>\gamma\},\tag{6}$$

or equivalently the solution of a sparse coverage objective:

$$\mathcal{T}^{*}(u)=\operatorname*{arg\,max}_{\mathcal{S}\subseteq\mathcal{T}}\Big[\Big\langle r_{T}(u),\sum_{g\in\mathcal{S}}\psi(g)\Big\rangle-\lambda_{s}|\mathcal{S}|\Big],\tag{7}$$

where $r_{T}(u)$ denotes the tool-side requirement vector. In this view, image, audio, and video tools supply modality evidence, while web search, page visit, and code execution tools supply external knowledge and computation. The complete low-level executor for sub-task $u$ is thus:

$$e(u)=\big(b^{*}(u),\mathcal{T}^{*}(u)\big).\tag{8}$$
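Because the objective in Eq. (7) is additive over tools, each tool $g$ contributes $\langle r_{T}(u),\psi(g)\rangle-\lambda_{s}$ independently of the others, so the maximizer simply keeps every tool whose contribution is positive. The sketch below implements that rule; the tool names and one-hot capability vectors are hypothetical placeholders for the six tool types listed above.

```python
# Capability vectors over (txt, img, aud, vid, web, code); names are assumed.
TOOLS = {
    "image_analysis": (0, 1, 0, 0, 0, 0),
    "audio_analysis": (0, 0, 1, 0, 0, 0),
    "video_analysis": (0, 0, 0, 1, 0, 0),
    "web_search":     (0, 0, 0, 0, 1, 0),
    "page_visit":     (0, 0, 0, 0, 1, 0),
    "code_execution": (0, 0, 0, 0, 0, 1),
}


def select_tools(req, tools, lam_s):
    """Maximize <r_T(u), sum of psi(g)> - lam_s * |S| from Eq. (7).

    The objective is a sum of per-tool gains, so the optimum keeps exactly
    the tools whose individual gain <r_T(u), psi(g)> exceeds lam_s.
    """
    selected = []
    for name, psi in tools.items():
        gain = sum(r * p for r, p in zip(req, psi))
        if gain > lam_s:
            selected.append(name)
    return selected


# A sub-task that mostly needs image evidence plus web lookup (assumed scores).
req = (0.2, 0.9, 0.0, 0.1, 0.7, 0.0)
print(select_tools(req, TOOLS, lam_s=0.5))
# ['image_analysis', 'web_search', 'page_visit']
```

Seen this way, the sparsity weight $\lambda_{s}$ plays the same role as the threshold $\gamma$ in Eq. (6): both prune tools whose estimated relevance is too low to justify the extra context they occupy.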

##### Modality\-aware Task Decomposition\.

At each round, the main agent first induces a latent dependency graph over unsolved sub-goals, $\mathcal{G}_{t}=(\mathcal{V}_{t},\mathcal{E}_{t})$ with $\mathcal{V}_{t}=\{v_{t,1},\ldots,v_{t,n_{t}}\}$, where a directed edge $(v_{i},v_{j})\in\mathcal{E}_{t}$ means that $v_{j}$ depends on the result of $v_{i}$. Each node is associated with a modality mask $\mu(v)\in\{0,1\}^{4}$ and a tool mask $\alpha(v)\in\{0,1\}^{|\mathcal{T}|}$, where $\mu(v)$ indicates whether text, image, audio, or video evidence is required, and $\alpha(v)$ indicates candidate tools. A sub-goal is executable if all of its predecessors have already been completed. Therefore, the ready set at round $t$ is:

$$\mathcal{R}_{t}=\{v\in\mathcal{V}_{t}\setminus\mathcal{C}_{t}:\operatorname{Pred}(v)\subseteq\mathcal{C}_{t}\},\tag{9}$$

where $\mathcal{C}_{t}$ denotes completed sub-goals. The main agent selects a parallel batch from this ready set:

$$\mathcal{P}_{t}=\operatorname*{arg\,max}_{\mathcal{P}\subseteq\mathcal{R}_{t}}\sum_{v\in\mathcal{P}}U_{\theta}(v\mid s_{t})\quad\text{s.t.}\quad|\mathcal{P}|\leq K_{\max},\;\sum_{v\in\mathcal{P}}\operatorname{cost}(v)\leq B_{t}.\tag{10}$$

For each selected node $v_{t,j}\in\mathcal{P}_{t}$, Orchestra-o1 materializes a concrete sub-task:

$$u_{t,j}=\Gamma_{\theta}(v_{t,j},s_{t})=\big(I_{t,j},C_{t,j},b_{t,j},\mathcal{T}_{t,j}\big),\tag{11}$$

where $b_{t,j}=b^{*}(u_{t,j})$ and $\mathcal{T}_{t,j}=\mathcal{T}^{*}(u_{t,j})$. The delegated action is therefore a structured batch decision:

$$y_{t}=\mathtt{delegate}(\mathcal{U}_{t}),\quad\mathcal{U}_{t}=\{u_{t,j}\}_{j=1}^{K_{t}},\quad K_{t}=|\mathcal{P}_{t}|.\tag{12}$$

This formulation makes the decomposition strategy explicit: Orchestra-o1 not only splits the task into text strings but also predicts dependency structure, modality requirements, tool requirements, and backend assignments.
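The ready set of Eq. (9) and the batch selection of Eq. (10) can be illustrated with a small sketch. Eq. (10) is a cardinality-constrained knapsack, so the greedy rule used here (highest utility first, skipping nodes that do not fit) is only a simple heuristic stand-in, not the learned policy. The sub-goal names, utilities, and costs are invented for the example.

```python
def ready_set(preds, completed):
    """Eq. (9): unfinished sub-goals whose predecessors are all completed.

    preds maps each sub-goal to the set of sub-goals it depends on.
    """
    return [v for v in preds if v not in completed and preds[v] <= completed]


def select_batch(ready, utility, cost, k_max, budget):
    """Greedy heuristic for Eq. (10): take high-utility nodes that still fit
    within the batch size limit K_max and the remaining budget B_t."""
    batch, spent = [], 0.0
    for v in sorted(ready, key=lambda v: utility[v], reverse=True):
        if len(batch) < k_max and spent + cost[v] <= budget:
            batch.append(v)
            spent += cost[v]
    return batch


# Hypothetical dependency graph for a question mixing audio and image evidence.
PREDS = {
    "transcribe_audio": set(),
    "describe_image": set(),
    "search_speaker": {"transcribe_audio"},
    "final_compare": {"describe_image", "search_speaker"},
}

print(ready_set(PREDS, completed=set()))
# ['transcribe_audio', 'describe_image']
print(ready_set(PREDS, completed={"transcribe_audio"}))
# ['describe_image', 'search_speaker']
```

In the first round both perception sub-goals are ready and can be dispatched together, whereas `search_speaker` only becomes eligible once the transcript exists, which is the dependency-aware behavior the formulation is meant to capture.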

##### Parallel Sub\-task Execution\.

Each delegated sub-task is executed by an independent ReAct-style [yao2022react] sub-agent. For sub-task $u_{t,j}$, the sub-agent trajectory is:

$$\zeta_{t,j}=\big\{(\rho_{t,j}^{(\ell)},a_{t,j}^{(\ell)},o_{t,j}^{(\ell)})\big\}_{\ell=1}^{L_{t,j}},\quad a_{t,j}^{(\ell)}\in\mathcal{T}_{t,j}\cup\{\mathtt{Finish}\},\tag{13}$$

where $\rho_{t,j}^{(\ell)}$ is the reasoning state, $a_{t,j}^{(\ell)}$ is the selected tool action, and $o_{t,j}^{(\ell)}$ is the observation. The final sub-agent output is summarized as:

$$z_{t,j}=\Omega(\zeta_{t,j})=(\sigma_{t,j},\eta_{t,j},\omega_{t,j},c_{t,j},\delta_{t,j}),\tag{14}$$

where $\sigma_{t,j}$ is the execution status, $\eta_{t,j}$ is the answer-like result, $\omega_{t,j}$ is a compact trace summary, $c_{t,j}$ is the cost, and $\delta_{t,j}$ is the latency. Since all sub-tasks in $\mathcal{U}_{t}$ are conditionally independent given $s_{t}$ and do not share mutable environment states, their execution factorizes as:

$$p(Z_{t}\mid\mathcal{U}_{t},s_{t})=\prod_{j=1}^{K_{t}}p(z_{t,j}\mid u_{t,j},s_{t}),\quad Z_{t}=\operatorname{AsyncExecute}(u_{t,1},\ldots,u_{t,K_{t}}).\tag{15}$$

This factorization yields a formal latency advantage for parallel orchestration, as stated below. The proof can be found in Section [A.1](https://arxiv.org/html/2606.13707#A1.SS1).
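One way to picture the asynchronous launch in Eq. (15) is with Python's `asyncio`. In the sketch below each sub-agent is replaced by a stub coroutine that merely sleeps for a given duration; `run_sub_agent`, `async_execute`, and `run_round` are hypothetical names, and a real sub-agent would run a tool-calling loop instead of sleeping.

```python
import asyncio
import time


async def run_sub_agent(name, seconds):
    """Stub sub-agent: waits `seconds` to mimic tool calls, then reports back."""
    await asyncio.sleep(seconds)
    return {"task": name, "status": "success", "latency": seconds}


async def async_execute(sub_tasks):
    """Launch all sub-tasks concurrently and wait for every result.

    asyncio.gather returns the results in the same order as the inputs.
    """
    return await asyncio.gather(*(run_sub_agent(n, s) for n, s in sub_tasks))


def run_round(sub_tasks):
    """Run one delegation round and measure its wall-clock time."""
    start = time.perf_counter()
    results = asyncio.run(async_execute(sub_tasks))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    batch = [("describe_image", 0.1), ("transcribe_audio", 0.2), ("web_search", 0.1)]
    results, elapsed = run_round(batch)
    # Sequential execution would take about 0.4 s; the concurrent round
    # finishes in roughly the time of the slowest sub-task.
    print([r["task"] for r in results], round(elapsed, 2))
```

The sketch relies on the same independence assumption as Eq. (15): the stub sub-agents share no mutable state, so their order of completion does not affect the collected results.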

###### Proposition 1 (Round-level Latency Advantage).

Consider an orchestration round $t$ with $K_{t}\geq 2$ ready sub-tasks whose execution times are $\delta_{t,1},\ldots,\delta_{t,K_{t}}>0$. Assume these sub-tasks are conditionally independent given $s_{t}$, do not share mutable environment states during execution, and the only additional overhead of parallel execution is a nonnegative synchronization cost $\delta^{\mathrm{sync}}_{t}\geq 0$. If a linear orchestrator executes the sub-tasks sequentially, while Orchestra-o1 launches them asynchronously and waits for all outputs, then we have:

$$\operatorname{Latency}^{\mathrm{linear}}(t)=\sum_{j=1}^{K_{t}}\delta_{t,j},\quad\operatorname{Latency}^{\mathrm{parallel}}(t)=\max_{1\leq j\leq K_{t}}\delta_{t,j}+\delta^{\mathrm{sync}}_{t}.\tag{16}$$

Moreover, parallel execution is no slower than linear execution if and only if:

$$\delta^{\mathrm{sync}}_{t}\leq\sum_{j=1}^{K_{t}}\delta_{t,j}-\max_{1\leq j\leq K_{t}}\delta_{t,j}.\tag{17}$$

Under this condition, the round-level speedup satisfies:

$$1\leq S_{t}=\frac{\operatorname{Latency}^{\mathrm{linear}}(t)}{\operatorname{Latency}^{\mathrm{parallel}}(t)}=\frac{\sum_{j=1}^{K_{t}}\delta_{t,j}}{\max_{1\leq j\leq K_{t}}\delta_{t,j}+\delta^{\mathrm{sync}}_{t}}\leq K_{t}.\tag{18}$$

The upper bound $S_{t}=K_{t}$ is attainable only when $\delta^{\mathrm{sync}}_{t}=0$ and all sub-task runtimes are equal.
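A quick numerical check of Eqs. (16) to (18) may help make the bounds concrete. The runtimes and synchronization costs below are arbitrary illustrative values, and `round_latency` is a helper introduced only for this example.

```python
def round_latency(times, sync=0.0):
    """Return (linear latency, parallel latency, speedup S_t) per Eqs. (16) and (18)."""
    linear = sum(times)              # sequential execution
    parallel = max(times) + sync     # slowest sub-task plus synchronization
    return linear, parallel, linear / parallel


# Three sub-tasks with unequal runtimes and a 1-second synchronization cost.
print(round_latency([4.0, 2.0, 2.0], sync=1.0))   # (8.0, 5.0, 1.6)

# Equal runtimes and no synchronization cost reach the upper bound S_t = K_t = 3.
print(round_latency([3.0, 3.0, 3.0], sync=0.0))   # (9.0, 3.0, 3.0)

# Violating Eq. (17): sync = 5 > 8 - 4, so parallel execution is slower.
linear, parallel, speedup = round_latency([4.0, 2.0, 2.0], sync=5.0)
print(speedup < 1)                                # True
```

In the first case the condition of Eq. (17) holds because $1\leq 8-4$, giving a speedup of 1.6, which sits strictly between 1 and $K_{t}=3$ as Eq. (18) predicts.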

##### Context Memory and Iterative Refinement\.

After each delegation round, Orchestra-o1 updates a structured memory that stores the evidence returned by all sub-agents. Let the memory be $H_{t}=\{h_{1},\ldots,h_{m_{t}}\}$, where each entry has the form $h=(I,b,\mathcal{T},\sigma,\eta,\omega)$. The update after round $t$ is:

$$H_{t+1}=H_{t}\cup\{\operatorname{Summarize}(u_{t,j},z_{t,j})\}_{j=1}^{K_{t}}.\tag{19}$$

To keep the main-agent context within the token budget $L_{\mathrm{ctx}}$, Orchestra-o1 constructs a compressed context by solving:

$$C_{t+1}=\operatorname*{arg\,max}_{C:|C|\leq L_{\mathrm{ctx}}}\Big[I(C;q)+\sum_{h\in H_{t+1}}w(h)I(C;h)\Big],\tag{20}$$

where $I(\cdot;\cdot)$ denotes information relevance and $w(h)$ up-weights successful or recently produced evidence. The next orchestration state and budget are:

$$s_{t+1}=\big(q,\mathcal{M},C_{t+1},H_{t+1},\mathcal{B},\mathcal{T},B_{t+1}\big),\quad B_{t+1}=B_{t}-\sum_{j=1}^{K_{t}}c_{t,j}.\tag{21}$$

The main agent terminates when its evidence sufficiency score exceeds a threshold:

$$p^{\mathrm{stop}}_{\theta}(s_{t})=p_{\theta}(a_{t}=\mathtt{complete}\mid s_{t}),\quad p^{\mathrm{stop}}_{\theta}(s_{t})>\tau_{\mathrm{stop}}.\tag{22}$$

In that case, the final answer is generated from the compressed evidence state, $\hat{a}=A_{\theta}(q,\mathcal{M},C_{t},H_{t})$. Otherwise, the main agent refines the dependency graph according to new evidence, $\mathcal{G}_{t+1}=\operatorname{Refine}_{\theta}(\mathcal{G}_{t},H_{t+1})$, and continues delegation. Overall, Orchestra-o1 differs from linear orchestration frameworks by explicitly modeling omnimodal agent collaboration as a dependency-aware parallel scheduling process with learnable decomposition, model selection, tool selection, evidence aggregation, and stopping decisions.
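The control flow implied by Eqs. (19), (21), and (22) can be summarized as a short loop. This is a minimal sketch under strong simplifications: the policy and executor are stub callables supplied by the caller, sub-tasks run sequentially for brevity, and context compression (Eq. (20)) and graph refinement are omitted. All function and key names are hypothetical.

```python
def orchestrate(policy, execute, budget, tau_stop=0.5, max_rounds=8):
    """Delegate until the stop score exceeds tau_stop or the budget runs out.

    policy(history, budget) -> (p_stop, sub_tasks)   stands in for pi_theta
    execute(sub_task)       -> dict with a "cost" key, standing in for z_{t,j}
    """
    history = []                                   # H_t
    for _ in range(max_rounds):
        p_stop, sub_tasks = policy(history, budget)
        if p_stop > tau_stop or not sub_tasks or budget <= 0:
            break                                  # Eq. (22): complete
        results = [execute(u) for u in sub_tasks]  # sequential here for brevity
        history.extend({"task": u, **z} for u, z in zip(sub_tasks, results))  # Eq. (19)
        budget -= sum(z["cost"] for z in results)  # Eq. (21)
    return history, budget


# Toy stand-ins: stop once two pieces of evidence have been collected.
def toy_policy(history, budget):
    if len(history) >= 2:
        return 0.9, []
    return 0.1, ["inspect image", "transcribe audio"]


def toy_execute(sub_task):
    return {"status": "success", "result": f"done: {sub_task}", "cost": 1.0}


history, remaining = orchestrate(toy_policy, toy_execute, budget=5.0)
print(len(history), remaining)   # 2 3.0
```

In a fuller version the list comprehension over `sub_tasks` would be replaced by an asynchronous launch, and the history passed to the policy would first be compressed to fit the context budget $L_{\mathrm{ctx}}$.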

##### Theoretical Advantage over Native Omnimodal Agents\.

We next provide an information-theoretic justification for why agent orchestration can be preferable to a native omnimodal agent design in heterogeneous tasks. The proof can be found in Section [A.2](https://arxiv.org/html/2606.13707#A1.SS2).

###### Proposition 2 (Information Gain from Omnimodal Orchestration).

Let $Y$ denote the latent task answer and let $\mathcal{M}=(M_1,\ldots,M_R)$ denote $R$ modality sources. A native omnimodal agent compresses all modalities into a single internal evidence variable $E_0=f_0(q,\mathcal{M})$ under a fixed context and computation budget. An orchestration-based system assigns modality-aware sub-tasks to specialized sub-agents and obtains evidence variables $E_{\mathrm{orch}}=(E_1,\ldots,E_R)$, where $E_r=f_r(q,M_r,C_r)$ is produced by a backend/tool pair specialized for modality $M_r$. Suppose that: (i) the main agent aggregates all returned evidence without losing information relevant to $Y$; (ii) the native evidence admits modality-wise components $(E^0_1,\ldots,E^0_R)$ whose joint information upper-bounds the information retained by $E_0$; and (iii) specialized execution is at least as informative as native processing at every modality step, with a strict gain for at least one modality:

$$I(Y;E_r\mid q,E_{<r})\geq I(Y;E^0_r\mid q,E^0_{<r}),\quad r=1,\ldots,R, \tag{23}$$

and the inequality is strict for some $r$. Then we have:

$$I(Y;E_{\mathrm{orch}}\mid q)>I(Y;E_0\mid q). \tag{24}$$

Moreover, under Bayes-optimal prediction with log loss, whose minimal risk is $\mathcal{R}_{\log}(E)=H(Y\mid q,E)$, orchestration has strictly smaller expected risk:

$$\mathcal{R}_{\log}(E_{\mathrm{orch}})<\mathcal{R}_{\log}(E_0). \tag{25}$$

##### Framework Summary\.

In summary, Orchestra-o1 implements omnimodal agent orchestration as a closed-loop decision process that separates high-level planning from specialized perception and action execution. The main agent maintains a structured memory, decomposes the task into dependency-aware sub-goals, selects suitable backends and tools for each sub-task, executes independent sub-tasks in parallel, and iteratively compresses returned evidence until the answer is sufficiently supported. This design makes the system both modular and scalable: new modalities, tools, or sub-agent models can be integrated through the same requirement-matching interface, while the dependency-aware scheduler improves latency whenever multiple independent sub-tasks can be solved concurrently. Algorithm [1](https://arxiv.org/html/2606.13707#alg1) summarizes the overall workflow of Orchestra-o1.
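The closed-loop workflow described above can be sketched in a few lines of Python. This is an illustrative skeleton under stated assumptions, not the authors' implementation: `run_subagent` and `toy_policy` are hypothetical placeholders for a ReAct-style sub-agent and the learned policy $\pi_\theta$, and the `plan` argument stands in for the batches that the dependency graph would yield.

```python
import asyncio


async def run_subagent(subtask):
    # Hypothetical stand-in for a ReAct-style sub-agent; returns a summary string.
    await asyncio.sleep(0)
    return f"evidence for {subtask}"


def toy_policy(question, memory, plan):
    # Hypothetical policy: delegate each planned batch once, then complete.
    round_idx = len(memory["rounds"])
    if round_idx < len(plan):
        return "delegate", plan[round_idx]
    return "complete", []


async def orchestrate(question, plan, max_rounds=10):
    memory = {"rounds": [], "evidence": []}
    for _ in range(max_rounds):
        action, batch = toy_policy(question, memory, plan)
        if action == "complete":
            break
        # Independent sub-tasks in the ready batch run concurrently.
        results = await asyncio.gather(*(run_subagent(u) for u in batch))
        memory["rounds"].append(batch)
        memory["evidence"].extend(results)
    # Answer from accumulated evidence; this is also the fallback after max_rounds.
    answer = f"answer({question}) from {len(memory['evidence'])} evidence items"
    return answer, memory
```

For example, `asyncio.run(orchestrate("q", [["audio", "image"], ["convert"]]))` runs two delegation rounds, the first with two concurrent sub-tasks. Context compression and budget tracking are omitted to keep the sketch short.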

**Algorithm 1** Workflow of Orchestra-o1

**Input:** Question $q$, Modality Inputs $\mathcal{M}$, Backend Pool $\mathcal{B}$, Tool Set $\mathcal{T}$, Maximum Rounds $T_{\max}$

**Output:** Final Answer $\hat{a}$

1. Initialize $H_1\leftarrow\emptyset$, $C_1\leftarrow\emptyset$, and $s_1\leftarrow(q,\mathcal{M},C_1,H_1,\mathcal{B},\mathcal{T})$;
2. **for** $t=1,\ldots,T_{\max}$ **do**
   1. Sample orchestration decision $y_t=(a_t,\xi_t)\sim\pi_{\theta}(\cdot\mid s_t)$;
   2. **if** $a_t=\mathtt{complete}$ **then** generate $\hat{a}=A_{\theta}(q,\mathcal{M},C_t,H_t)$ and **return** $\hat{a}$;
   3. Induce or refine dependency graph $\mathcal{G}_t=(\mathcal{V}_t,\mathcal{E}_t)$ and compute ready set $\mathcal{R}_t$;
   4. Select parallel batch $\mathcal{P}_t\subseteq\mathcal{R}_t$ under dependency and budget constraints;
   5. **foreach** $v_{t,j}\in\mathcal{P}_t$ **in parallel do**
      1. Materialize sub-task $u_{t,j}=\Gamma_{\theta}(v_{t,j},s_t)$;
      2. Assign backend $b_{t,j}=b^{*}(u_{t,j})$ and tools $\mathcal{T}_{t,j}=\mathcal{T}^{*}(u_{t,j})$;
      3. Execute sub-agent trajectory $\zeta_{t,j}$ and summarize $z_{t,j}=\Omega(\zeta_{t,j})$;
   6. Update memory $H_{t+1}\leftarrow H_t\cup\{\operatorname{Summarize}(u_{t,j},z_{t,j})\}_{j=1}^{K_t}$;
   7. Compress context $C_{t+1}$ under $L_{\mathrm{ctx}}$ and update remaining budget;
   8. Set $s_{t+1}\leftarrow(q,\mathcal{M},C_{t+1},H_{t+1},\mathcal{B},\mathcal{T})$;
3. Generate fallback answer $\hat{a}=A_{\theta}(q,\mathcal{M},C_{T_{\max}},H_{T_{\max}})$;
4. **return** $\hat{a}$;

![Refer to caption](https://arxiv.org/html/2606.13707v1/x3.png)

Figure 3: An overview of our training recipe, including (a) data curation pipeline and (b) DA-GRPO training process.

### 3.3 Training Recipe

Although Orchestra-o1 can use strong proprietary models as the main agent, a practical open-source agent system requires an open-source model that can make reliable orchestration decisions. We therefore develop a training recipe for deriving Orchestra-o1-8B from Qwen3-8B [yang2025qwen3]. The data curation and post-training process are illustrated in Figure [3](https://arxiv.org/html/2606.13707#S3.F3).

#### 3.3.1 Training Data Curation

A central challenge in training an open-source orchestrator is the lack of diverse omnimodal tasks with reliable answers and explicit evidence chains. We therefore build a seed-based data curation pipeline on top of public datasets such as FineVideo [Farré2024FineVideo], LongVideoBench [wu2024longvideobench], and COCO 2017 [lin2014microsoft]. Given the original seed set:

$$\mathcal{D}_0=\{x_i=(q_i,\mathcal{M}_i,a_i,\mathcal{T}_i)\}_{i=1}^{N}, \tag{26}$$

where $q_i$ is the question, $\mathcal{M}_i$ denotes the image/audio/video inputs, $a_i$ is the answer, and $\mathcal{T}_i$ is the required tool set, our goal is to create new examples while keeping the original modality files unchanged. We then use this seed set to collect successful orchestration trajectories under Orchestra-o1 (GPT-5 [openaigpt5]) and transform each trajectory into an annotated reasoning solution $r_i$ with a difficulty level $\ell_i$.

The curation pipeline contains three stages. First, we extract modality-grounded anchor facts from the annotated solution and evidence sources. For each seed $x_i$, an LLM extractor produces $A_i=\{(f_{i,m},\mu_{i,m},e_{i,m})\}_{m=1}^{M_i}$, where $f_{i,m}$ is an anchor fact, $\mu_{i,m}$ is its source modality, and $e_{i,m}$ records the supporting step in the annotated solution. These anchors identify the non-bypassable perceptual facts that every valid rewrite must preserve. Second, conditioned on $(x_i,A_i)$, we generate $K_i$ candidate rewrites using five strategy families: pivot swapping, temporal shifting, numerical recombination, entity-sibling querying, and multi-hop reordering. Formally, each candidate is sampled as:

$$\tilde{x}_{i,k}\sim G_{\omega}(\cdot\mid x_i,A_i,z_{i,k}),\quad z_{i,k}\in\{\mathtt{A},\mathtt{B},\mathtt{C},\mathtt{D},\mathtt{E}\}, \tag{27}$$

subject to the invariants $\tilde{\mathcal{M}}_{i,k}=\mathcal{M}_i$, $\tilde{\ell}_{i,k}=\ell_i$, $\tilde{\mathcal{T}}_{i,k}\subseteq\mathcal{T}_i$, and $|\tilde{s}_{i,k}-s_i|\leq 2$, where $\tilde{s}_{i,k}$ and $s_i$ denote the total reasoning steps of the rewritten and original examples. In practice, easy seeds mainly use pivot swapping and entity-sibling querying, medium seeds additionally use temporal or numerical variants, and hard seeds emphasize numerical recombination and multi-hop reordering. Third, we verify each candidate with a cascade of quality gates. Let $V(\tilde{x}_{i,k})\in\{0,1\}$ denote the final verification decision. We define:

$$\begin{aligned}V(\tilde{x}_{i,k})={}&\mathbb{I}\{\operatorname{AnchorCov}(\tilde{q}_{i,k},A_i)\geq 1\}\cdot\mathbb{I}\{\operatorname{Sim}(q_i,\tilde{q}_{i,k})\leq 0.85\}\cdot\mathbb{I}\{\operatorname{Bypass}(\tilde{q}_{i,k},\tilde{a}_{i,k})=0\}\\&\cdot\mathbb{I}\{\operatorname{NumCheck}(\tilde{r}_{i,k},\tilde{a}_{i,k})=1\}\cdot\mathbb{I}\{\operatorname{Judge}(\tilde{x}_{i,k})=1\}.\end{aligned} \tag{28}$$

The first two gates enforce anchor coverage and remove near-duplicates by normalized lexical similarity. The third gate performs a modal-bypass test by asking a strong language model to answer without access to $\mathcal{M}_i$: if the answer can still be recovered, the candidate is rejected. The fourth gate executes numerical solutions in a restricted Python sandbox when code execution or numeric answers are involved. The last gate uses an LLM judge to check factual consistency, difficulty preservation, and peer-level duplication among rewrites from the same seed. All LLM-based processors in our curation pipeline, including the anchor extractor, question rewriter, and verification judge, are implemented with Claude-Opus-4.6 [claudeopus46].
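The gate cascade in Eq. (28) is a product of indicator functions, so a candidate passes only if every gate passes. A minimal sketch follows, assuming `difflib.SequenceMatcher` as a stand-in for the unspecified lexical similarity metric and substring matching as a crude proxy for anchor coverage. The callables `bypass_fn`, `num_check_fn`, and `judge_fn` are hypothetical hooks for the LLM-based and sandbox-based gates.

```python
from difflib import SequenceMatcher


def lexical_similarity(a, b):
    # Stand-in for normalized lexical similarity; the exact metric is an assumption.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def verify(seed_question, candidate, anchors, bypass_fn, num_check_fn, judge_fn,
           sim_max=0.85):
    """Return 1 if the candidate passes all five gates, else 0."""
    q = candidate["question"]
    gates = [
        # Gate 1: at least one anchor fact is referenced (crude substring proxy).
        lambda: sum(a.lower() in q.lower() for a in anchors) >= 1,
        # Gate 2: reject near-duplicates of the seed question.
        lambda: lexical_similarity(seed_question, q) <= sim_max,
        # Gate 3: reject if answerable without the modality inputs.
        lambda: not bypass_fn(q, candidate["answer"]),
        # Gate 4: numerical solution check.
        lambda: num_check_fn(candidate),
        # Gate 5: LLM judge.
        lambda: judge_fn(candidate),
    ]
    # all() short-circuits, so later (more expensive) gates run only if needed.
    return int(all(gate() for gate in gates))
```

Ordering the cheap lexical gates first means the expensive LLM calls are skipped for candidates that already fail, which is one plausible reason to describe the checks as a cascade.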

Finally, all verified rewrites are merged to form the task set $\mathcal{D}=\{\tilde{x}_{i,k}:V(\tilde{x}_{i,k})=1\}$. Our implementation extracts valid anchors for 300 seeds, generates about 1500 raw rewrite candidates, and retains around 1200 verified examples after filtering. For a trajectory $\tau=\big\{(s_t,y_t^{*},Z_t)\big\}_{t=1}^{N}$, we create $N$ decision-level examples, where $s_t$ reconstructs the exact main-agent state before the expert decision and $y_t^{*}$ stores the reference orchestration action. This gives dense supervision for delegation, tool assignment, backend selection, parallel scheduling, and stopping decisions.

#### 3.3.2 Decision-aligned Group Relative Policy Optimization

We propose decision-aligned group relative policy optimization (DA-GRPO), a GRPO-style training objective tailored for main-agent orchestration. Standard GRPO [guo2025deepseek] samples a group of responses for the same prompt and normalizes rewards within the group to form relative advantages. However, for agent orchestration, final-answer reward is sparse and expensive because it requires executing the whole multi-agent system. DA-GRPO instead evaluates each sampled main-agent decision directly at the current orchestration state, using expert trajectories and a rubric reward that measures whether the decision is well-formed, valid, tool-aware, and strategically useful.

For each prompt $s_i$, the policy samples a group of $G$ candidate decisions $\{y_{i,j}\}_{j=1}^{G}$. Each decision is scored by a multi-dimensional reward:

$$r_{i,j}=\alpha_1\,r^{\mathrm{format}}_{i,j}+\alpha_2\,r^{\mathrm{action}}_{i,j}+\alpha_3\,r^{\mathrm{tool}}_{i,j}+\alpha_4\,r^{\mathrm{decision}}_{i,j}, \tag{29}$$

where $r^{\mathrm{format}}$ measures whether the output is a valid JSON decision, $r^{\mathrm{action}}$ measures whether the action is valid with appropriate parameters, $r^{\mathrm{tool}}$ measures whether the selected tools and sub-task assignments are reasonable, and $r^{\mathrm{decision}}$ measures the overall orchestration decision quality. The first two dimensions are binary, while the latter two are graded and normalized to $[0,1]$. In our implementation, Claude-Haiku-4.5 [claudehaiku45] serves as a lightweight reward model and scores all four dimensions in a single call. The judge is given the current question, ground-truth answer, sub-task history, expert decision, and model output. Importantly, the expert decision is used as a reference rather than the only correct answer: alternative decompositions are rewarded if they are reasonable and likely to solve the task, while a $\mathtt{complete}$ decision receives the highest decision-quality score when its answer matches the ground truth. The coefficients for each reward term are empirically set as $\alpha_1=\alpha_2=0.1$, $\alpha_3=0.2$, and $\alpha_4=0.6$, prioritizing the tool reward and decision reward, since the two format-related rewards exhibit relatively good initial values.

Given group rewards, DA-GRPO computes the relative advantage of each sampled decision by normalizing within the group:

$$\hat{A}_{i,j}=\frac{r_{i,j}-\mathrm{Mean}(\{r_{i,k}\}_{k=1}^{G})}{\mathrm{Std}(\{r_{i,k}\}_{k=1}^{G})+\epsilon}. \tag{30}$$

The policy is then optimized with a clipped policy-gradient objective and a KL regularizer to the reference model:

$$\mathcal{L}_{\mathrm{DA\text{-}GRPO}}(\theta)=-\mathbb{E}_{i,j}\Big[\min\big(\rho_{i,j}(\theta)\hat{A}_{i,j},\mathrm{clip}(\rho_{i,j}(\theta),1-\epsilon,1+\epsilon)\hat{A}_{i,j}\big)-\beta\,D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot|s_i)\,\|\,\pi_{\mathrm{ref}}(\cdot|s_i)\big)\Big], \tag{31}$$

where $\rho_{i,j}(\theta)=\pi_{\theta}(y_{i,j}|s_i)/\pi_{\mathrm{old}}(y_{i,j}|s_i)$ and $\pi_{\mathrm{ref}}$ is the reference model. This objective encourages the open-source main agent to prefer decisions that are not only syntactically valid but also strategically aligned with successful orchestration behavior. Compared with outcome-only reinforcement learning, DA-GRPO offers two advantages. First, it avoids repeatedly executing expensive sub-agent trajectories during training, since each decision can be scored offline from the reconstructed state. Second, it provides dense feedback on the main agent's core responsibilities: decomposing tasks, selecting tools, scheduling parallel sub-tasks, and deciding when evidence is sufficient for final answering. We train Orchestra-o1-8B with this recipe and deploy it as the main agent in Orchestra-o1.
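The reward aggregation of Eq. (29), the group normalization of Eq. (30), and the clipped surrogate inside Eq. (31) can be illustrated numerically. The sketch below uses the stated weights 0.1, 0.1, 0.2, and 0.6; the use of the population standard deviation, the smoothing constant `eps`, and the clip range `clip_eps=0.2` are assumptions, since this section does not specify them. The KL term is omitted for brevity.

```python
import math

# Reward weights as stated in the text: format, action, tool, decision.
WEIGHTS = {"format": 0.1, "action": 0.1, "tool": 0.2, "decision": 0.6}


def total_reward(scores):
    """Weighted sum of the four rubric dimensions (Eq. 29)."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages (Eq. 30); population std is an assumption."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate for one sample; clip_eps=0.2 is a hypothetical value."""
    clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
    return min(ratio * advantage, clipped * advantage)
```

Because advantages are centered within each group, a decision is reinforced only when it scores above its siblings sampled from the same orchestration state, which is what makes the feedback usable without running the full multi-agent system.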

## 4 Experiments

Table 1: Category-wise accuracy (%) on OmniGAIA. The non-orchestration-based models are implemented under the standard ReAct framework. The highest value in each category within each model group is highlighted in bold.

| Method | Geo. | Tech. | Hist. | Fin. | Sport | Art | Movie | Sci. | Food | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| **Open-Source Agentic Models** | | | | | | | | | | |
| Qwen2.5-Omni-3B | 0.0 | 2.0 | 4.5 | 0.0 | 0.0 | 0.0 | 0.0 | 3.9 | 0.0 | 1.4 |
| Qwen2.5-Omni-7B | 1.5 | 4.1 | 7.5 | 4.0 | 0.0 | 2.8 | 0.0 | 7.7 | 5.6 | 3.6 |
| Baichuan-Omni-1.5-8B | 2.9 | 4.1 | 3.0 | 4.0 | 2.7 | 0.0 | 3.0 | 3.8 | 0.0 | 2.8 |
| MiniCPM-O-2.6-8B | 2.9 | 2.0 | 1.5 | 0.0 | 2.7 | 8.3 | 3.0 | 3.8 | 5.6 | 3.1 |
| Ming-Lite-Omni-1.5-20B-A3B | 2.9 | 6.1 | 1.5 | 4.0 | 5.4 | 2.8 | 6.1 | 7.7 | 5.6 | 3.9 |
| Qwen3-Omni-30B-A3B | 8.7 | 14.3 | 11.9 | 28.0 | 10.8 | 13.9 | 9.1 | 15.4 | 22.2 | 13.3 |
| Ming-Flash-Omni-100B-A6B | 5.8 | 8.2 | 10.4 | 12.0 | 8.1 | 5.6 | 6.1 | 11.5 | 11.1 | 8.3 |
| LongCat-Flash-Omni-560B-A27B | 8.7 | 10.2 | 16.4 | 12.0 | 10.8 | 8.3 | 6.1 | 11.5 | 16.7 | 11.1 |
| OmniAtlas-Qwen2.5-3B | 4.4 | 12.2 | 16.7 | 4.0 | 16.2 | 11.1 | 3.0 | 11.5 | 11.1 | 10.3 |
| OmniAtlas-Qwen2.5-7B | 8.7 | 18.4 | 16.4 | 4.0 | 16.2 | **22.2** | 3.0 | 7.7 | 22.2 | 13.3 |
| OmniAtlas-Qwen3-30B-A3B | 10.1 | 30.6 | 29.9 | **32.0** | 18.9 | 16.7 | 12.1 | 11.5 | 27.8 | 20.8 |
| Orchestra-o1-8B (Ours) | **21.7** | **32.7** | **37.9** | 12.0 | **29.7** | 16.7 | **45.5** | **38.5** | **38.9** | **30.0** |
| **Proprietary Agentic Models** | | | | | | | | | | |
| Gemini-2.5-Flash-Lite | 5.8 | 8.2 | 14.9 | 4.0 | 10.8 | 8.3 | 6.1 | 3.9 | 11.1 | 8.6 |
| Gemini-2.5-Pro | 23.2 | 28.6 | 32.8 | 20.0 | 32.4 | 41.7 | 42.4 | 26.9 | 33.3 | 30.8 |
| Gemini-3-Flash | 50.7 | 57.1 | 44.8 | 48.0 | 59.5 | 55.6 | 54.6 | 38.5 | 61.1 | 51.7 |
| Gemini-3-Pro | 65.2 | 59.2 | 62.1 | **72.0** | 78.4 | 52.8 | 48.5 | 42.3 | **88.9** | 62.5 |
| AOrchestra-GPT-5 | 34.8 | 40.8 | 56.1 | 32.0 | 51.4 | 25.0 | 42.4 | 30.8 | 22.2 | 40.0 |
| Orchestra-o1-GPT-5 (Ours) | **72.5** | **69.4** | **75.8** | 64.0 | **83.8** | **63.9** | **69.7** | **73.1** | 83.3 | **72.8** |

![Refer to caption](https://arxiv.org/html/2606.13707v1/x16.png)

Figure 4: Difficulty-level comparison among open-source and proprietary agentic models on OmniGAIA. The non-orchestration-based models are implemented under the standard ReAct framework.

![Refer to caption](https://arxiv.org/html/2606.13707v1/x17.png)

Figure 5: Efficiency comparison between Orchestra-o1 and AOrchestra.

### 4.1 Experimental Setup

##### Benchmark and Baselines\.

We evaluate all methods on OmniGAIA [li2026omnigaia], a challenging omnimodal agent benchmark that covers heterogeneous inputs including text, image, audio, and video. Each task requires a concise final answer and is associated with a difficulty level (Easy, Medium, or Hard) and a topical category. We report accuracy as the primary metric. For detailed analysis, we additionally break down the results by category and by difficulty level. We compare Orchestra-o1 with three groups of baselines. First, we evaluate native open-source omnimodal models, including Qwen2.5-Omni [Qwen2.5-Omni], Baichuan-Omni [li2025baichuan], MiniCPM-O [yao2024minicpm], Ming-Lite-Omni [Mingomni2025], Qwen3-Omni [Qwen3-Omni], Ming-Flash-Omni [ai2025ming], LongCat-Flash-Omni [team2025longcat], and OmniAtlas [li2026omnigaia] variants. Second, we compare with proprietary omnimodal models, including Gemini-2.5 [gemini25flashlite,gemini25pro] and Gemini-3 [gemini3flash,gemini3pro] variants. Third, we compare with orchestration-based agent systems, especially AOrchestra [ruan2026aorchestra], which is the strongest open-source orchestration baseline in our experiments. Non-orchestration-based models are implemented under a standard ReAct-style agent framework.

##### Implementation Details\.

For the proprietary setting, we use GPT-5 as the main agent of Orchestra-o1. For the open-source setting, we train Orchestra-o1-8B from Qwen3-8B [yang2025qwen3] and deploy it as the main agent. The maximum number of main-agent orchestration attempts is set to 10. All sub-tasks within the same delegation call are executed asynchronously by separate ReAct-style sub-agents with cloned environments, and each sub-agent can use the assigned subset of tools. Each sub-agent is limited to at most 30 steps. The tool ecosystem contains six tools: image analysis, audio analysis, video analysis, web search, page visit, and code execution. For the open-source training experiments, DA-GRPO is trained on a single node with 8$\times$H20 GPUs. We use a train batch size of 24, a rollout group size of 8, a learning rate of $5\times 10^{-6}$, a KL coefficient of 0.01, and cosine learning-rate decay. The maximum prompt length and response length are set to 24,576 and 4,096, respectively. Training is stopped after 5 epochs. The reward is a weighted sum of format correctness, action validity, tool reasonableness, and decision quality, with weights 0.1, 0.1, 0.2, and 0.6, respectively.

### 4.2 Main Results

##### Category\-wise Comparison\.

Table [1](https://arxiv.org/html/2606.13707#S4.T1) reports category-wise accuracy on OmniGAIA. Orchestra-o1-GPT-5 achieves the best overall accuracy of 72.8%, outperforming the strongest native proprietary model, Gemini-3-Pro, by 10.3% absolute accuracy and outperforming AOrchestra-GPT-5 by 32.8% absolute accuracy. The improvement is consistent across most categories. In particular, Orchestra-o1-GPT-5 obtains strong gains in geography, technology, history, sport, art, movie, and science, showing that explicit omnimodal orchestration is broadly useful rather than being specialized to a single topic domain. The open-source results further demonstrate the effectiveness of our training recipe. Orchestra-o1-8B achieves 30.0% overall accuracy, substantially improving over the strongest open-source baseline OmniAtlas-Qwen3-30B-A3B (20.8%), despite using a smaller 8B main-agent backbone. The gains are especially large in categories that benefit from structured evidence gathering and tool use, such as geography, history, movie, science, and food. These results suggest that a compact language model can become a competitive omnimodal orchestrator when trained with DA-GRPO.

##### Difficulty\-level Comparison\.

Figure [4](https://arxiv.org/html/2606.13707#S4.F4) compares methods under easy, medium, and hard difficulty levels. Across all difficulty groups, Orchestra-o1 consistently ranks first among methods in the corresponding model family. In the proprietary setting, Orchestra-o1-GPT-5 reaches 80.3%, 75.0%, and 56.4% accuracy on easy, medium, and hard tasks, respectively. Compared with AOrchestra-GPT-5, the gains are 35.2%, 35.0%, and 24.3% absolute accuracy. The improvement on hard tasks is particularly important because these tasks usually require multi-step reasoning over several pieces of heterogeneous evidence. The result indicates that dependency-aware decomposition and iterative evidence aggregation help the main agent avoid premature answering and better exploit specialized sub-agents. In the open-source setting, Orchestra-o1-8B also improves over the previous best results across all difficulty levels, reaching 36.1% on easy, 26.9% on medium, and 26.9% on hard tasks. The relatively strong hard-task performance shows that DA-GRPO does not merely teach surface-level JSON formatting; instead, it improves the strategic quality of orchestration decisions, such as when to delegate, which tools to assign, and when to stop.

##### Efficiency Analysis\.

Figure [5](https://arxiv.org/html/2606.13707#S4.F5) compares the efficiency of Orchestra-o1 and AOrchestra with GPT-5 as the main agent. Orchestra-o1 achieves higher accuracy while using lower cost across easy, medium, hard, and overall splits. Overall, Orchestra-o1 reaches 72.8% accuracy with a cost of 341.6, while AOrchestra obtains 40.0% accuracy with a cost of 565.7. This means that Orchestra-o1 is not only more accurate but also more cost-effective. The efficiency advantage comes from two design choices. First, Orchestra-o1 executes independent sub-tasks in parallel within a single orchestration round, reducing latency compared with linear sub-agent workflows. Second, the main agent explicitly selects tools and sub-agent backends for each sub-task, which prevents unnecessary use of expensive or irrelevant capabilities. The observed cost-accuracy trade-off is consistent with Proposition [1](https://arxiv.org/html/2606.13707#Thmtheorem1): when several independent perception or information-seeking sub-tasks can be executed simultaneously, parallel orchestration reduces the effective round-level latency and improves resource utilization.

### 4.3 Ablation Study

##### Ablation on Agent Harness\.

![Refer to caption](https://arxiv.org/html/2606.13707v1/x18.png)

Figure 6: Ablation on the agent harness design.

Figure [6](https://arxiv.org/html/2606.13707#S4.F6) studies whether the gains come from the orchestration framework rather than only from the GPT-5 backend. We compare a standard ReAct-GPT-5 agent with Orchestra-o1-GPT-5 under the same perception and action tools. Orchestra-o1 improves the overall accuracy from 53.9% to 72.8%, with consistent gains in all categories. The largest gains appear in categories such as art, food, geography, science, movie, and sport, where tasks often require specialized omnimodal perception or external information retrieval before final reasoning. This confirms that the proposed harness design, including task decomposition and sub-agent specialization, provides substantial benefits beyond a strong single-agent ReAct loop.

##### Ablation on Post\-training Recipe\.

Table 2: Ablation on the post-training recipe.

| Framework | Model | Tools | Post-training | Accuracy (%) |
|---|---|---|---|---|
| ReAct | Qwen3-8B | Omni | None | 12.5 |
| Orchestra-o1 | Qwen3-8B | Omni | None | 26.3 |
| Orchestra-o1 | Qwen3-8B | Omni | SFT | 28.6 |
| Orchestra-o1 | Qwen3-8B | Omni | Vanilla GRPO | 27.7 |
| Orchestra-o1 | Qwen3-8B | Omni | DA-GRPO | 30.0 |

Table [2](https://arxiv.org/html/2606.13707#S4.T2) evaluates the contribution of the post-training recipe for Qwen3-8B. A direct ReAct-style Qwen3-8B agent achieves only 12.5% accuracy. Simply placing the same model into the Orchestra-o1 framework without post-training improves accuracy to 26.3%, showing that the orchestration scaffold itself provides a strong inductive bias. Supervised fine-tuning (SFT) further improves performance to 28.6%. Vanilla GRPO [guo2025deepseek] reaches 27.7%, which is slightly worse than SFT, suggesting that sparse or weakly aligned reinforcement learning is insufficient for main-agent orchestration. In contrast, DA-GRPO achieves the best accuracy of 30.0%. This validates our design choice of directly rewarding decision-level alignment, tool reasonableness, and strategic orchestration quality.

![Refer to caption](https://arxiv.org/html/2606.13707v1/x19.png)

Figure 7: Case study of Orchestra-o1's response to a representative sample on OmniGAIA.

### 4.4 Case Study

Figure [7](https://arxiv.org/html/2606.13707#S4.F7) presents a representative OmniGAIA example. The question provides an audio clip and an image. The audio states that Mabon falls upon the equinox on September 23 at 7:49 AM, while the image depicts the Prague Astronomical Clock in Prague, Czech Republic. Solving the task requires fusing these two independently obtained facts and then applying the timezone conversion for Prague in September, when the city observes CEST (UTC+2). Therefore, the correct UTC time is 5:49 AM. Orchestra-o1-GPT-5 decomposes the task according to evidence dependencies. In its first orchestration round, the main agent launches an audio sub-task to extract the event, date, and local time, and an image sub-task to identify the landmark and timezone. The returned evidence is compact and complementary: the audio sub-agent extracts "equinox on the 23rd of September at 7:49 AM", and the image sub-agent identifies "Prague Astronomical Clock" with timezone Europe/Prague. The main agent then aggregates these facts and performs the final conversion, producing 05:49 UTC. This example highlights the central advantage of Orchestra-o1: it improves reliability not merely by adding more tool calls, but by coordinating specialized evidence acquisition, maintaining a structured context memory, and delaying the final answer until all necessary evidence has been grounded.
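The final conversion step in this case study can be reproduced with the Python standard library. The sketch below is illustrative only: the year is an assumption, because the case study gives only "September 23 at 7:49 AM", and the helper name `local_to_utc` is hypothetical. A fixed UTC+2 offset is used as a fallback when no IANA timezone database is installed.

```python
from datetime import datetime, timedelta, timezone

try:
    from zoneinfo import ZoneInfo

    PRAGUE = ZoneInfo("Europe/Prague")
except Exception:
    # Fallback if no IANA tz database is available: CEST is UTC+2 in September.
    PRAGUE = timezone(timedelta(hours=2))


def local_to_utc(year, month, day, hour, minute, tz=PRAGUE):
    """Interpret a wall-clock time in the given timezone and convert it to UTC."""
    local = datetime(year, month, day, hour, minute, tzinfo=tz)
    return local.astimezone(timezone.utc)


# The year 2024 is an assumption chosen for illustration.
utc_time = local_to_utc(2024, 9, 23, 7, 49)
print(utc_time.strftime("%H:%M UTC"))  # prints "05:49 UTC"
```

Resolving the zone by name rather than hard-coding the offset matters because Prague switches between CET and CEST, so the same local time maps to a different UTC time outside the daylight-saving period.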

## 5 Conclusion

In this paper, we introduced Orchestra\-o1, an omnimodal agent orchestration framework that separates high\-level orchestration from low\-level tool\-augmented action execution\. The main agent dynamically decomposes a complex task into dependency\-aware sub\-tasks, dispatches independent sub\-tasks to specialized sub\-agents in parallel, maintains a compact context memory, and decides when the accumulated evidence is sufficient to produce the final answer\. We further proposed Orchestra\-o1\-8B, an open\-source instantiation of the main agent trained with DA\-GRPO\. By rewarding format correctness, action validity, tool reasonableness, and decision quality, DA\-GRPO directly optimizes the strategic behaviors required by orchestration\. Comprehensive experiments demonstrate that Orchestra\-o1 achieves strong gains over both native omnimodal agents and orchestration baselines\. In the proprietary setting, Orchestra\-o1 reaches the best overall accuracy while using lower cost than AOrchestra\. In the open\-source setting, Orchestra\-o1\-8B substantially improves over strong open\-source omnimodal baselines despite using a compact 8B main\-agent backbone\. These results suggest that omnimodal agent intelligence can be advanced not only by scaling native OLLMs, but also by learning how to coordinate specialized agents, tools, and evidence sources in a principled and efficient manner\. In future work, we plan to extend omnimodal agent orchestration to more practical scenarios, such as audio\-video collaborative vibe coding and voice\-guided computer\-use tasks\.

## References

## Appendix A Proof of Theorems

### A.1 Proof of Proposition [1](https://arxiv.org/html/2606.13707#Thmtheorem1)

###### Proof of Proposition [1](https://arxiv.org/html/2606.13707#Thmtheorem1).

For a linear orchestrator, the sub-tasks are executed one after another, hence the total round latency is the sum of their runtimes $\sum_{j=1}^{K_t}\delta_{t,j}$. For Orchestra-o1, conditional independence and the absence of shared mutable states allow all ready sub-tasks to be launched simultaneously. The round completes after the slowest sub-task finishes and the main agent aggregates the returned results, giving $\max_{1\leq j\leq K_t}\delta_{t,j}+\delta^{\mathrm{sync}}_t$. Parallel execution is no slower than linear execution exactly when:

$$\max_{1\leq j\leq K_t}\delta_{t,j}+\delta^{\mathrm{sync}}_t\leq\sum_{j=1}^{K_t}\delta_{t,j}, \tag{32}$$

which is equivalent to the stated condition on $\delta^{\mathrm{sync}}_t$. When this condition holds, the denominator of $S_t$ is at most the numerator, so $S_t\geq 1$. Since $\delta^{\mathrm{sync}}_t\geq 0$, we also have:

$$S_t\leq\frac{\sum_{j=1}^{K_t}\delta_{t,j}}{\max_{1\leq j\leq K_t}\delta_{t,j}}\leq K_t, \tag{33}$$

where the last inequality follows because each $\delta_{t,j}\leq\max_{1\leq j\leq K_t}\delta_{t,j}$. Equality requires both $\delta^{\mathrm{sync}}_t=0$ and $\sum_j\delta_{t,j}=K_t\max_j\delta_{t,j}$, which holds only when all sub-task runtimes are equal. This proves the proposition. ∎
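The bounds in this proof can be checked numerically. The sketch below assumes, as the proof implies, that the round-level speedup $S_t$ is the linear latency divided by the parallel latency. The function name `round_speedup` is hypothetical.

```python
def round_speedup(runtimes, sync_overhead=0.0):
    """Speedup of one parallel round over linear execution of the same sub-tasks."""
    linear = sum(runtimes)                    # sub-tasks executed one after another
    parallel = max(runtimes) + sync_overhead  # slowest sub-task plus aggregation
    return linear / parallel


# Equal runtimes and zero sync overhead reach the upper bound K_t = 3.
print(round_speedup([2.0, 2.0, 2.0]))       # 3.0
# Unequal runtimes with some overhead fall strictly between 1 and K_t.
print(round_speedup([1.0, 2.0, 3.0], 0.5))
# A single sub-task with positive overhead gives a speedup below 1.
print(round_speedup([4.0], 1.0))            # 0.8
```

The last call illustrates why the condition in Eq. (32) is needed: when the synchronization overhead exceeds the time saved by overlapping sub-tasks, parallel orchestration is slower than linear execution.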

### A.2 Proof of Proposition [3.2](https://arxiv.org/html/2606.13707#S3.SS2.SSS0.Px6)

###### Proof of Proposition [3.2](https://arxiv.org/html/2606.13707#S3.SS2.SSS0.Px6).

By the chain rule of mutual information, we have:

$$I(Y;E_{\mathrm{orch}}\mid q)=I(Y;E_1,\ldots,E_R\mid q)=\sum_{r=1}^{R}I(Y;E_r\mid q,E_{<r}).\tag{34}$$

The modality-wise components of the native agent satisfy:

$$I(Y;E_0\mid q)\leq I(Y;E^0_1,\ldots,E^0_R\mid q)=\sum_{r=1}^{R}I(Y;E^0_r\mid q,E^0_{<r}),\tag{35}$$

where the inequality follows because the component tuple is assumed to contain all task-relevant information retained by $E_0$. By the specialization assumption, every conditional information term of orchestration is no smaller than its native counterpart, and at least one term is strictly larger. Therefore, we have:

$$I(Y;E_{\mathrm{orch}}\mid q)>I(Y;E^0_1,\ldots,E^0_R\mid q)\geq I(Y;E_0\mid q),\tag{36}$$

which proves the strict information gain.

For log loss, the Bayes-optimal predictor is the posterior distribution $p(Y\mid q,E)$ and the minimum achievable expected loss equals the conditional entropy:

$$\mathcal{R}_{\log}(E)=H(Y\mid q,E)=H(Y\mid q)-I(Y;E\mid q).\tag{37}$$

Since $H(Y\mid q)$ is fixed for the task distribution, the strict inequality $I(Y;E_{\mathrm{orch}}\mid q)>I(Y;E_0\mid q)$ implies $\mathcal{R}_{\log}(E_{\mathrm{orch}})<\mathcal{R}_{\log}(E_0)$. Thus, when specialized sub-agents provide strictly more task-relevant evidence and the orchestrator preserves it, the multi-agent orchestration system is theoretically preferable to the native single-agent design. ∎
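The identity between Bayes risk under log loss and conditional entropy can be illustrated on a toy distribution. This is a minimal sketch, assuming a fixed query $q$ (so conditioning on $q$ is dropped), a uniform binary answer $Y$, and evidence modeled as a noisy copy of $Y$; the flip probabilities 0.3 and 0.1 are hypothetical stand-ins for native and orchestrated evidence, not values from the paper. Entropies are in bits.

```python
import math
from typing import Dict, Iterable, Tuple

Joint = Dict[Tuple[int, int], float]  # maps (y, e) to probability


def entropy(probs: Iterable[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def marginal(joint: Joint, axis: int) -> Dict[int, float]:
    """Marginalize the joint onto Y (axis=0) or E (axis=1)."""
    out: Dict[int, float] = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def conditional_entropy(joint: Joint) -> float:
    """H(Y | E) = H(Y, E) - H(E): the minimum expected log loss."""
    return entropy(joint.values()) - entropy(marginal(joint, 1).values())


def mutual_information(joint: Joint) -> float:
    """I(Y; E) computed directly from its definition."""
    py, pe = marginal(joint, 0), marginal(joint, 1)
    return sum(
        p * math.log2(p / (py[y] * pe[e])) for (y, e), p in joint.items() if p > 0
    )


def noisy_channel(flip: float) -> Joint:
    """Uniform binary Y; evidence E disagrees with Y with probability `flip`."""
    return {
        (y, e): 0.5 * (flip if y != e else 1.0 - flip)
        for y in (0, 1)
        for e in (0, 1)
    }


native = noisy_channel(0.3)        # hypothetical: noisier native evidence
orchestrated = noisy_channel(0.1)  # hypothetical: sharper specialized evidence
h_y = entropy(marginal(native, 0).values())  # H(Y), identical for both systems
```

In this toy setting the more informative evidence has the larger mutual information and, by the identity in (37), the smaller minimum log loss.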

## Appendix B More Experimental Details

### B.1 Tool Configurations

The proposed Orchestra\-o1 incorporates a unified tool ecosystem shared by all sub\-agents\. The main agent does not directly call these tools\. Instead, it assigns each sub\-task a subset of tools, and the corresponding sub\-agent interacts with the environment through the assigned tool interface\. This design keeps the main agent focused on high\-level orchestration while allowing each sub\-agent to perform specialized perception or action execution\. A brief introduction of the incorporated tools is as follows:

##### Web Search\.

This tool performs web search for external information seeking\. It is useful when the answer depends on public knowledge, recent facts, entity disambiguation, or contextual information that is not fully contained in the provided modality inputs\. We use the Serper API \[serper\] to perform web search\.

##### Page Visit\.

This tool visits and extracts readable content from a given web page\. It complements web search by allowing a sub\-agent to inspect candidate sources in more detail\. We use it for tasks where a search snippet is insufficient and the sub\-agent needs to verify facts from the source page\. Web pages are crawled by the Jina Reader API \[jina\]\.

##### Code Execution\.

This tool executes Python code in a controlled workspace\. It is primarily used for numerical computation, table processing, date or unit conversion, and other deterministic operations\. In our framework, the main agent can delegate a computation sub\-task only after prerequisite values have been extracted from media or retrieved from the web\.

##### Image Analysis\.

This tool analyzes image inputs with a vision\-capable backend\. It is used for visual recognition, scene understanding, chart interpretation, OCR\-like inspection, and extraction of image\-grounded evidence\. The main agent is instructed to process relevant images before relying on external search because images often contain task\-specific information that cannot be recovered from text alone\.

##### Audio Analysis\.

This tool transcribes and analyzes standalone audio files such as speech clips\. It is used when the task requires spoken content, sound events, or audio\-grounded clues\. The returned transcription and summary are written into the sub\-task history so that later rounds can use them as textual evidence\.

##### Video Analysis\.

This tool analyzes video inputs by considering visual frames and, when appropriate, the audio track\. It is used for temporal reasoning, event recognition, spoken\-video understanding, and multimodal evidence extraction\. Since video analysis can be expensive, Orchestra\-o1 encourages the main agent to formulate specific video\-analysis instructions rather than asking for an overly broad description\.
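The per-sub-task tool assignment described in this section can be sketched as a shared registry from which each sub-agent receives only its assigned subset. This is an illustrative sketch, not the authors' implementation: the first four tool names appear in the sub-agent prompt of Appendix B.3, while `AudioAnalysisAction`, `VideoAnalysisAction`, the one-line descriptions, and the helper `build_action_space` are assumptions introduced here.

```python
from typing import Dict, Iterable

# Shared tool ecosystem: name -> short description shown to the sub-agent.
TOOL_REGISTRY: Dict[str, str] = {
    "GoogleSearchAction": "Search the web for external information.",
    "ExtractUrlContentAction": "Visit a page and extract readable content.",
    "ExecuteCodeAction": "Run Python code in a controlled workspace.",
    "ImageAnalysisAction": "Analyze an image with a vision-capable backend.",
    "AudioAnalysisAction": "Transcribe and analyze an audio file.",  # hypothetical name
    "VideoAnalysisAction": "Analyze video frames and audio track.",  # hypothetical name
}


def build_action_space(assigned_tools: Iterable[str]) -> str:
    """Render only the tools the main agent assigned to one sub-task."""
    names = list(dict.fromkeys(assigned_tools))  # drop duplicates, keep order
    unknown = [name for name in names if name not in TOOL_REGISTRY]
    if unknown:
        raise KeyError(f"unknown tools: {unknown}")
    return "\n".join(f"- {name}: {TOOL_REGISTRY[name]}" for name in names)


# A search-oriented sub-task sees only the two web tools.
action_space = build_action_space(["GoogleSearchAction", "ExtractUrlContentAction"])
```

Restricting the rendered action space in this way keeps a sub-agent from calling tools outside its assignment, which matches the separation between orchestration and execution described above.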

### B.2 System Prompt for Main Agent

System Prompt for Main Agent

    You are the MainAgent (Orchestrator) for OmniGAIA benchmark tasks. Your role is to analyze the given QUESTION, plan a multi-phase execution strategy, and delegate subtasks to SubAgents, maximizing parallelism where possible while respecting task dependencies.

    ==== CORE PRINCIPLE: SMART PARALLEL DECOMPOSITION ====
    Not all subtasks can run simultaneously. Some depend on others' results. Your job is to:
    1. Identify which subtasks are INDEPENDENT and can run in parallel NOW
    2. Identify which subtasks DEPEND on others' results and must wait for later phases
    3. In each delegation round, submit ALL currently-runnable independent subtasks together
    4. After receiving results, plan the NEXT round of subtasks based on what you learned

    KEY RULES:
    - Each subtask runs as an independent SubAgent with its own environment
    - All subtasks within ONE delegation call execute simultaneously in parallel
    - Always use the "tasks" list format (even for a single subtask)
    - Each delegation (regardless of how many parallel subtasks) counts as ONE attempt

    DECOMPOSITION STRATEGY:
    Phase 1: Identify ALL sub-goals needed to answer the question
    Phase 2: Classify each sub-goal:
      - INDEPENDENT: Can start immediately without any prior results (run in parallel NOW)
      - DEPENDENT: Needs results from other sub-goals first (plan for a LATER round)
    Phase 3: Submit all INDEPENDENT sub-goals as parallel subtasks in this round
    Phase 4: After receiving results, re-evaluate:
      - Are the results sufficient to answer the question? Use 'complete'
      - Are there DEPENDENT sub-goals now unblocked? Submit them as the next parallel batch
      - Do results reveal NEW sub-goals? Add them to the plan

    DECISION PROCESS:
    1. REVIEW the SUBTASK HISTORY below - check status, result, and key findings of each attempt
    2. EVALUATE: Do the results SUFFICIENTLY answer the QUESTION?
       - If any subtask returned a valid result with status "done": Consider using 'complete'
       - If subtask status is "incomplete": Review its key findings to see what was accomplished
    3. PLAN next action:
       - Results sufficient: Use 'complete' with the answer
       - Need more work: Identify what subtasks are NOW unblocked by previous results
       - Subtask FAILED or INCOMPLETE: You can RETRY the failed/incomplete subtask in the next round. Adjust the instruction, context, or model if needed to improve the chance of success
       - Submit all currently-runnable subtasks in parallel as the next batch (including retries of failed subtasks alongside newly unblocked subtasks)
       - Think ahead: what will you need AFTER this batch? Plan accordingly with your remaining budget

    BUDGET AWARENESS:
    - You have LIMITED attempts (see Progress below)
    - Each delegation (regardless of how many parallel subtasks) counts as ONE attempt
    - Maximize parallelism within each round to get the most done per attempt
    - Plan your phases wisely: with N remaining attempts, you can run N rounds of parallel subtasks
    - If a result looks correct and was verified, trust it and complete

    ==== MODEL SELECTION GUIDE ====
    {model_pricing_table}

    Note: Higher-priced models are generally more capable. Price correlates with model strength.

    Model Selection Strategy:
    - Choose cheaper models for simple tasks (e.g., straightforward web search)
    - Choose more capable models for complex reasoning, video analysis, or multi-step tasks
    - You can assign DIFFERENT models to different parallel subtasks based on their complexity

    ==== Progress ====
    [Attempt {attempt_index}/{max_attempts}] Remaining {remaining_attempts} attempts
    Budget is limited. Maximize parallelism to get the most done per attempt.

    ==== QUESTION ====
    {instruction}

    ==== SUBTASK HISTORY ====
    {subtask_history if subtask_history else "No subtasks completed yet."}

    ==== AVAILABLE TOOLS (for SubAgents) ====
    {tools_description}

    ==== OUTPUT FORMAT ====
    ANSWER FORMAT: requires precise, concise answers (single word, number, or short phrase). Do NOT include explanations in the answer field.

    Return JSON:

    If results are SUFFICIENT:
    {{
      "action": "complete",
      "reasoning": "The subtask results show [X], which answers the question",
      "params": {{ "answer": "concise answer" }}
    }}

    If more work is NEEDED, submit all currently-runnable subtasks in parallel:
    {{
      "action": "delegate_task",
      "reasoning": "Based on previous results, [X] and [Y] can now run independently in parallel. [Z] still needs to wait for their results, so I'll handle it in the next round.",
      "params": {{
        "tasks": [
          {{
            "task_instruction": "A SPECIFIC, ACTIONABLE subtask (e.g., 'Analyze the video to identify the main topic discussed')",
            "context": "Relevant findings from previous attempts that this subtask can build on",
            "model": "one of {sub_models}",
            "tools": ["tool1", "tool2"]
          }},
          {{
            "task_instruction": "Another INDEPENDENT subtask that can run at the same time (e.g., 'Search for background information about X')",
            "context": "Relevant context",
            "model": "one of {sub_models}",
            "tools": ["tool3"]
          }}
        ]
      }}
    }}

    If only ONE subtask can run right now (others depend on its result):
    {{
      "action": "delegate_task",
      "reasoning": "I need to first [X] before I can determine [Y]. So this round only has one subtask.",
      "params": {{
        "tasks": [
          {{
            "task_instruction": "The prerequisite subtask that must complete first",
            "context": "Relevant context",
            "model": "one of {sub_models}",
            "tools": ["tool1"]
          }}
        ]
      }}
    }}

    IMPORTANT RULES:
    1. ALWAYS use the "tasks" list format (even for a single subtask)
    2. Within each round, subtasks must be INDEPENDENT of each other, don't make one subtask depend on another subtask's result IN THE SAME ROUND
    3. Subtasks CAN and SHOULD depend on results from PREVIOUS rounds, pass relevant findings via the "context" field
    4. Maximize parallelism WITHIN each round: if two things CAN run independently NOW, they SHOULD be parallel subtasks
    5. Select relevant tools from AVAILABLE TOOLS section for each subtask
    6. Think in phases: what can I do now in parallel? What must wait for next round?
    7. If a subtask returns status "failed" or "incomplete", you MAY retry it in the next delegation round. When retrying, consider: adjusting the task instruction to be more specific, providing additional context from other completed subtasks, or switching to a more capable model. Retried subtasks can run in parallel with other new subtasks.
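The round structure this prompt prescribes (one JSON decision per attempt, all tasks of a `delegate_task` decision run in parallel, a bounded attempt budget) can be sketched as a small driver loop. This is a hedged illustration rather than the authors' code: `orchestrate`, `fake_decide`, and `fake_subagent` are hypothetical names, the model call is replaced by a stub, and a thread pool stands in for whatever execution backend the framework actually uses. The doubled braces in the prompt's JSON examples are consistent with Python format-string escaping, which suggests the template is filled in by Python string formatting, though the paper does not state this.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional

Task = Dict[str, object]
Result = Dict[str, str]


def orchestrate(
    decide: Callable[[List[Result], int, int], str],
    run_subagent: Callable[[Task], Result],
    max_attempts: int = 5,
) -> Optional[str]:
    """Run decision rounds until the main agent completes or the budget ends."""
    history: List[Result] = []
    for attempt in range(1, max_attempts + 1):
        decision = json.loads(decide(history, attempt, max_attempts))
        action = decision["action"]
        if action == "complete":
            return decision["params"]["answer"]
        if action != "delegate_task":
            raise ValueError(f"unsupported action: {action!r}")
        tasks = decision["params"]["tasks"]
        if not tasks:
            raise ValueError("delegate_task requires at least one task")
        # One delegation counts as one attempt, however many tasks it holds.
        with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
            history.extend(pool.map(run_subagent, tasks))
    return None  # budget exhausted without a final answer


def fake_decide(history: List[Result], attempt: int, max_attempts: int) -> str:
    """Stub main agent: delegate two tasks first, then complete."""
    if not history:
        tasks = [
            {"task_instruction": "transcribe the audio", "context": "", "model": "m", "tools": []},
            {"task_instruction": "search background", "context": "", "model": "m", "tools": []},
        ]
        return json.dumps({"action": "delegate_task", "reasoning": "", "params": {"tasks": tasks}})
    answer = f"{len(history)} results"
    return json.dumps({"action": "complete", "reasoning": "", "params": {"answer": answer}})


def fake_subagent(task: Task) -> Result:
    """Stub sub-agent that reports success immediately."""
    return {"task": str(task["task_instruction"]), "status": "done", "result": "ok"}


answer = orchestrate(fake_decide, fake_subagent, max_attempts=3)
```

Here `pool.map` returns results in task order, so the history stays deterministic even though the sub-tasks run concurrently.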

### B.3 System Prompt for Sub-agent

System Prompt for Sub-agent

    You are a specialized SubAgent.
    Complete the assigned task efficiently.

    ==== Progress ====
    [Step {current_step}/{max_steps}] Remaining {remaining_steps} steps
    {budget_warning}

    ==== Your Task (from MainAgent) ====
    {task_instruction}

    ==== Context ====
    {context}

    ==== Original Question (for reference) ====
    {original_question}

    ==== Available Tools ====
    {action_space}

    ==== Guidelines ====
    1. Focus on completing YOUR TASK above
    2. Think step by step before outputting an action
    3. Write key observations to the "memory" field
    4. Use print() in ExecuteCodeAction to see computation results
    5. Once done, use 'finish' IMMEDIATELY
    6. **IMAGE ANALYSIS RULE:** You may ONLY use ImageAnalysisAction on image URLs that are explicitly provided in your TASK or CONTEXT from the MainAgent. Do NOT use ImageAnalysisAction on any image URLs you encounter during web search or browsing (e.g., thumbnails, page images, search result images). These external image URLs are often inaccessible and will waste your steps.
    7. EFFICIENCY RULE - Avoid Repetitive Attempts:
       - Count your attempts by behavior pattern, not just individual tool names. A "search-then-extract" cycle (e.g., GoogleSearchAction - ExtractUrlContentAction) counts as ONE search attempt, not two separate tool uses.
       - If you have performed the same behavior pattern 5 times without finding the target information, STOP immediately. Use 'finish' with whatever partial results you have gathered so far.
       - Examples of behavior patterns that count as the SAME attempt:
         GoogleSearchAction alone (one search attempt)
         GoogleSearchAction and ExtractUrlContentAction (one search-and-read attempt)
         ExtractUrlContentAction alone on different URLs (one URL extraction attempt each)
       - Do NOT keep trying different keyword variants or URLs endlessly. After 5 rounds of the same behavior pattern, you have likely exhausted what can be found.
       - When finishing with partial results, set status to "partial" and clearly describe what you DID find and what you could NOT find. The MainAgent can decide how to proceed.
    8. **COMPLETENESS vs PERFECTION:** It is better to return partial results quickly than to waste all your steps searching for information that may not exist. The MainAgent can assign follow-up tasks if needed.
    9. **FORBIDDEN IMAGE SOURCES:** Never attempt ImageAnalysisAction on URLs you discovered through GoogleSearchAction or ExtractUrlContentAction. Only analyze images that were part of the ORIGINAL task assignment.

    BUDGET: When remaining_steps <= 5, use 'finish' NOW with your best available results!
    EFFICIENCY: After 5 rounds of the same behavior pattern (e.g., repeated search and extract cycles), use 'finish' NOW with partial results!

    ==== Output Format ====
    CRITICAL: You MUST reply with ONLY a valid JSON object. No markdown, no extra text.
    The "action" field MUST be one of the exact tool names listed in Available Tools (e.g., "ImageAnalysisAction"), or "finish".
    Do NOT use "execute" as the action. Do NOT pass tool names via a "command" field.
    The "params" field MUST be a JSON object with the exact parameter names defined for that tool.

    ```json
    {{
        "action": "<EXACT_TOOL_NAME>",
        "params": {{ <tool-specific parameters as key-value pairs> }},
        "memory": "<your key observations>"
    }}
    ```

    ==== Memory ====
    {memory}

    ==== Current Observation ====
    {obs}
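The output contract in this prompt (a bare JSON object, an `action` that is either an assigned tool name or `finish`, a `params` object, no `command` field) lends itself to a strict parser on the receiving side. The following is a minimal sketch under that reading; `parse_subagent_reply` is a hypothetical helper, and the `query` parameter in the example reply is an assumed parameter name, since the paper does not list tool parameters.

```python
import json
from typing import Any, Dict, Set


def parse_subagent_reply(raw: str, tool_names: Set[str]) -> Dict[str, Any]:
    """Parse one sub-agent reply and enforce the prompt's output rules."""
    reply = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")
    action = reply.get("action")
    if action != "finish" and action not in tool_names:
        raise ValueError(f"action {action!r} is neither an assigned tool nor 'finish'")
    if "command" in reply:
        raise ValueError("tool names must not be passed via a 'command' field")
    if not isinstance(reply.get("params"), dict):
        raise ValueError("'params' must be a JSON object")
    reply.setdefault("memory", "")  # treat a missing memory field as empty
    return reply


assigned = {"GoogleSearchAction", "ExtractUrlContentAction"}
good = parse_subagent_reply(
    '{"action": "GoogleSearchAction", "params": {"query": "example"}, "memory": "start"}',
    assigned,
)
```

Rejecting malformed replies before dispatch keeps a sub-agent from spending steps on tool calls that cannot be executed.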

### B.4 Prompt for Rubric Rewards

Prompt for Rubric Rewards[⬇](data:text/plain;base64,WW91IGFyZSBhbiBleHBlcnQganVkZ2UgZXZhbHVhdGluZyBhbiBBSSBhZ2VudCdzIG91dHB1dCBpbiBhIG11bHRpLXN0ZXAgdGFzay1zb2x2aW5nIHBpcGVsaW5lLgoKVGhlIGFnZW50IChNYWluIEFnZW50KSBvcmNoZXN0cmF0ZXMgc3ViLWFnZW50cyB0byBzb2x2ZSBjb21wbGV4IHRhc2tzLiBBdCBlYWNoIHN0ZXAsIGl0IG91dHB1dHMgYSBKU09OIGRlY2lzaW9uIHRoYXQgZWl0aGVyOgotICoqZGVsZWdhdGVfdGFzayoqOiBCcmVhayB0aGUgcHJvYmxlbSBpbnRvIHN1Yi10YXNrcyBhbmQgYXNzaWduIHRoZW0gdG8gc3ViLWFnZW50cyAoZWFjaCBzdWItdGFzayBzaG91bGQgaGF2ZSB0YXNrX2luc3RydWN0aW9uLCBtb2RlbCwgYW5kIG9wdGlvbmFsbHkgdG9vbHMpCi0gKipjb21wbGV0ZSoqOiBQcm92aWRlIHRoZSBmaW5hbCBhbnN3ZXIgKHNob3VsZCBoYXZlIHBhcmFtcy5hbnN3ZXIpCgpZb3Ugd2lsbCBldmFsdWF0ZSB0aGUgYWdlbnQncyBvdXRwdXQgb24gNCBkaW1lbnNpb25zLiBGT1JNQVRfQ09SUkVDVCBhbmQgQUNUSU9OX1ZBTElEIGFyZSBzY29yZWQgMCBvciAxIChiaW5hcnkpLiBUT09MX1JFQVNPTkFCTEUgYW5kIERFQ0lTSU9OX1FVQUxJVFkgYXJlIHNjb3JlZCAwLTMgKGludGVnZXIgb25seSkuCgojIyBPcmlnaW5hbCBRdWVzdGlvbgp7cXVlc3Rpb259CgojIyBHcm91bmQgVHJ1dGggQW5zd2VyCntncm91bmRfdHJ1dGh9CgojIyBDdXJyZW50IFN0ZXAgQ29udGV4dCAoU3VidGFzayBIaXN0b3J5KQp7c3VidGFza19oaXN0b3J5fQoKIyMgRXhwZXJ0J3MgRGVjaXNpb24gKHJlZmVyZW5jZSwgTk9UIHRoZSBvbmx5IHZhbGlkIGFwcHJvYWNoKQotIEFjdGlvbjoge2V4cGVydF9hY3Rpb259Ci0gRXhwZXJ0IE91dHB1dDoKYGBganNvbgp7ZXhwZXJ0X2pzb259CicnJwoKIyMgQWdlbnQncyBSYXcgT3V0cHV0ICh0byBiZSBldmFsdWF0ZWQpCmBgYAp7cHJlZF9yYXd9CicnJwoKIyMgQWdlbnQncyBQYXJzZWQgRGVjaXNpb24KYGBganNvbgp7cHJlZF9qc29ufQonJycKCiMjIFNjb3JpbmcgRGltZW5zaW9ucwoKIyMjIDEuIEZPUk1BVF9DT1JSRUNUICgwIG9yIDEpCklzIHRoZSBhZ2VudCdzIG91dHB1dCBhIHZhbGlkIEpTT04gZGVjaXNpb24gd2l0aCByZXF1aXJlZCBmaWVsZHM/Ci0gMTogVmFsaWQgSlNPTiB3aXRoICJhY3Rpb24iIGZpZWxkIHByZXNlbnQgYW5kIGNvcnJlY3RseSBzdHJ1Y3R1cmVkCi0gMDogTm90IHZhbGlkIEpTT04sIG9yIG1pc3NpbmcgImFjdGlvbiIgZmllbGQsIG9yIGNvbXBsZXRlbHkgdW5wYXJzZWFibGUKCiMjIyAyLiBBQ1RJT05fVkFMSUQgKDAgb3IgMSkKSXMgdGhlIGNob3NlbiBhY3Rpb24gdmFsaWQgYW5kIHByb3Blcmx5IHBhcmFtZXRlcml6ZWQ/Ci0gMTogQWN0aW9uIGlzIHZhbGlkICgiZGVsZWdhdGVfdGFzayIgb3IgImNvbXBsZXRlIikgd2l0aCAicGFyYW1zIiBmaWVsZCBwcmVzZW50Ci0gMDogQWN0aW9uIGlzIG5vdCBpbiB0
aGUgdmFsaWQgc2V0LCBvciAicGFyYW1zIiBmaWVsZCBpcyBtaXNzaW5nL2ludmFsaWQKCiMjIyAzLiBUT09MX1JFQVNPTkFCTEUgKDAtMykKQXJlIHRoZSB0b29sIGNob2ljZXMgYW5kIHN1Yi10YXNrIGFzc2lnbm1lbnRzIHJlYXNvbmFibGU/IChGb3IgImNvbXBsZXRlIiBhY3Rpb24sIGV2YWx1YXRlIHdoZXRoZXIgY29tcGxldGluZyBhdCB0aGlzIHBvaW50IGlzIGFwcHJvcHJpYXRlKQotIDM6IEV4Y2VsbGVudCB0b29sL21vZGVsIHNlbGVjdGlvbiwgc3ViLXRhc2tzIGFyZSB3ZWxsLXNjb3BlZCBhbmQgY2xlYXJseSBpbnN0cnVjdGVkCi0gMjogQWNjZXB0YWJsZSB0b29sIHNlbGVjdGlvbiBidXQgY291bGQgYmUgaW1wcm92ZWQgKGUuZy4sIG1pc3NpbmcgYSB1c2VmdWwgdG9vbCwgb3Zlcmx5IGJyb2FkIGluc3RydWN0aW9ucykKLSAxOiBRdWVzdGlvbmFibGUgb3IgbW9zdGx5IGluYXBwcm9wcmlhdGUgdG9vbCBjaG9pY2VzLCBwb29ybHkgZGVmaW5lZCBzdWItdGFza3MKLSAwOiBObyB0b29scyBzcGVjaWZpZWQgd2hlbiBuZWVkZWQsIG9yIGNvbXBsZXRlbHkgaXJyZWxldmFudCBhc3NpZ25tZW50cwoKIyMjIDQuIERFQ0lTSU9OX1FVQUxJVFkgKDAtMykgKipNb3N0IEltcG9ydGFudCoqCk92ZXJhbGwgZGVjaXNpb24gcXVhbGl0eTogZG9lcyB0aGlzIGRlY2lzaW9uIG1ha2UgZ29vZCBwcm9ncmVzcyB0b3dhcmQgc29sdmluZyB0aGUgcHJvYmxlbT8KCioqS2V5IHByaW5jaXBsZTogV2UgZW5jb3VyYWdlIGV4cGxvcmF0aW9uLiBUaGUgYWdlbnQgZG9lcyBOT1QgbmVlZCB0byBjb3B5IHRoZSBleHBlcnQncyBleGFjdCBzdHJhdGVneS4qKgoKLSAzOiBFeGNlbGxlbnQgZGVjaXNpb24gLSBjbG9zZWx5IGFsaWduZWQgd2l0aCBleHBlcnQncyBhcHByb2FjaCwgT1IgdGFrZXMgYSBkaWZmZXJlbnQgYnV0IGVxdWFsbHkgdmFsaWQvY3JlYXRpdmUgYXBwcm9hY2gsIE9SIGRpcmVjdGx5IHByb3ZpZGVzIHRoZSBjb3JyZWN0IGFuc3dlcgotIDI6IEFjY2VwdGFibGUgZGVjaXNpb24gLSByZWFzb25hYmxlIHN0cmF0ZWd5IGJ1dCB3aXRoIG5vdGFibGUgaW5lZmZpY2llbmNpZXMgb3IgZGlmZmVyZW5jZXMgZnJvbSBvcHRpbWFsCi0gMTogUG9vciBkZWNpc2lvbiAtIHBhcnRpYWxseSByZWxldmFudCBidXQgdW5saWtlbHkgdG8gbGVhZCB0byB0aGUgY29ycmVjdCBhbnN3ZXIsIG9yIGZ1bmRhbWVudGFsbHkgZmxhd2VkCi0gMDogQ29tcGxldGVseSB3cm9uZyAtIGlycmVsZXZhbnQgb3V0cHV0LCBub25zZW5zaWNhbCwgb3IgaGFybWZ1bCB0byBzb2x2aW5nIHRoZSB0YXNrCgoqKldoZW4gc2NvcmluZyBERUNJU0lPTl9RVUFMSVRZLCBjb25zaWRlcjoqKgotIElmIHRoZSBhZ2VudCdzIGFwcHJvYWNoIGRpZmZlcnMgZnJvbSB0aGUgZXhwZXJ0IGJ1dCBpcyBzdGlsbCByZWFzb25hYmxlIGFuZCBjb3VsZCBsZWFkIHRvIHRoZSBjb3JyZWN0IGFuc3dlcjogc2NvcmUgMi0zCi0gSWYgdGhlIGFnZW50IGNob3NlICJjb21wbGV0ZSIgYW5kIHRoZSBhbnN3
ZXIgbWF0Y2hlcyB0aGUgZ3JvdW5kIHRydXRoOiAgc2NvcmUgMyByZWdhcmRsZXNzIG9mIGV4cGVydCBhY3Rpb24KLSBJZiB0aGUgYWdlbnQgY2hvc2UgImNvbXBsZXRlIiBidXQgdGhlIGFuc3dlciBpcyB3cm9uZyB3aGVuIGV4cGVydCBzYXlzIGRlbGVnYXRlOiAgc2NvcmUgMAotIElmIHRoZSBhZ2VudCBjaG9zZSAiZGVsZWdhdGVfdGFzayIgd2l0aCByZWFzb25hYmxlIHN1Yi10YXNrcyB3aGVuIGV4cGVydCBzYXlzIGNvbXBsZXRlOiAgc2NvcmUgMS0yIChpbmVmZmljaWVudCBidXQgbm90IHdyb25nKQoKIyMgWW91ciBUYXNrCkV2YWx1YXRlIHRoZSBhZ2VudCdzIG91dHB1dCBhbmQgcHJvdmlkZSBzY29yZXMgZm9yIGVhY2ggZGltZW5zaW9uLgoKKipJTVBPUlRBTlQ6IE91dHB1dCBPTkxZIHRoZSA0IHNjb3JlcyBiZWxvdy4gRG8gTk9UIGluY2x1ZGUgYW55IGV4cGxhbmF0aW9uLCBhbmFseXNpcywgb3IgcmVhc29uaW5nLiBKdXN0IHRoZSBzY29yZXMuKioKCkZPUk1BVF9DT1JSRUNUOiA8c2NvcmU+CkFDVElPTl9WQUxJRDogPHNjb3JlPgpUT09MX1JFQVNPTkFCTEU6IDxzY29yZT4KREVDSVNJT05fUVVBTElUWTogPHNjb3JlPg==)

You are an expert judge evaluating an AI agent's output in a multi\-step task\-solving pipeline\.

The agent \(Main Agent\) orchestrates sub\-agents to solve complex tasks\. At each step, it outputs a JSON decision that either:
\- \*\*delegate\_task\*\*: Break the problem into sub\-tasks and assign them to sub\-agents \(each sub\-task should have task\_instruction, model, and optionally tools\)
\- \*\*complete\*\*: Provide the final answer \(should have params\.answer\)

You will evaluate the agent's output on 4 dimensions\. FORMAT\_CORRECT and ACTION\_VALID are scored 0 or 1 \(binary\)\. TOOL\_REASONABLE and DECISION\_QUALITY are scored 0\-3 \(integer only\)\.

\#\# Original Question
\{question\}

\#\# Ground Truth Answer
\{ground\_truth\}

\#\# Current Step Context \(Subtask History\)
\{subtask\_history\}

\#\# Expert's Decision \(reference, NOT the only valid approach\)
\- Action: \{expert\_action\}
\- Expert Output:
\`\`\`json
\{expert\_json\}
\`\`\`

\#\# Agent's Raw Output \(to be evaluated\)
\`\`\`
\{pred\_raw\}
\`\`\`

\#\# Agent's Parsed Decision
\`\`\`json
\{pred\_json\}
\`\`\`

\#\# Scoring Dimensions

\#\#\# 1\. FORMAT\_CORRECT \(0 or 1\)
Is the agent's output a valid JSON decision with required fields?
\- 1: Valid JSON with "action" field present and correctly structured
\- 0: Not valid JSON, or missing "action" field, or completely unparseable

\#\#\# 2\. ACTION\_VALID \(0 or 1\)
Is the chosen action valid and properly parameterized?
\- 1: Action is valid \("delegate\_task" or "complete"\) with "params" field present
\- 0: Action is not in the valid set, or "params" field is missing/invalid

\#\#\# 3\. TOOL\_REASONABLE \(0\-3\)
Are the tool choices and sub\-task assignments reasonable? \(For "complete" action, evaluate whether completing at this point is appropriate\)
\- 3: Excellent tool/model selection, sub\-tasks are well\-scoped and clearly instructed
\- 2: Acceptable tool selection but could be improved \(e\.g\., missing a useful tool, overly broad instructions\)
\- 1: Questionable or mostly inappropriate tool choices, poorly defined sub\-tasks
\- 0: No tools specified when needed, or completely irrelevant assignments

\#\#\# 4\. DECISION\_QUALITY \(0\-3\) \*\*Most Important\*\*
Overall decision quality: does this decision make good progress toward solving the problem?

\*\*Key principle: We encourage exploration\. The agent does NOT need to copy the expert's exact strategy\.\*\*

\- 3: Excellent decision \- closely aligned with expert's approach, OR takes a different but equally valid/creative approach, OR directly provides the correct answer
\- 2: Acceptable decision \- reasonable strategy but with notable inefficiencies or differences from optimal
\- 1: Poor decision \- partially relevant but unlikely to lead to the correct answer, or fundamentally flawed
\- 0: Completely wrong \- irrelevant output, nonsensical, or harmful to solving the task

\*\*When scoring DECISION\_QUALITY, consider:\*\*
\- If the agent's approach differs from the expert but is still reasonable and could lead to the correct answer: score 2\-3
\- If the agent chose "complete" and the answer matches the ground truth: score 3 regardless of expert action
\- If the agent chose "complete" but the answer is wrong when expert says delegate: score 0
\- If the agent chose "delegate\_task" with reasonable sub\-tasks when expert says complete: score 1\-2 \(inefficient but not wrong\)

\#\# Your Task
Evaluate the agent's output and provide scores for each dimension\.

\*\*IMPORTANT: Output ONLY the 4 scores below\. Do NOT include any explanation, analysis, or reasoning\. Just the scores\.\*\*

FORMAT\_CORRECT: <score\>
ACTION\_VALID: <score\>
TOOL\_REASONABLE: <score\>
DECISION\_QUALITY: <score\>
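
The judge prompt above fixes a four-line output format with stated score ranges, so its replies can be checked mechanically. The following is a minimal sketch of such a parser, not the authors' implementation: the function name `parse_judge_scores`, the regular expression, and the policy of ignoring non-score lines are all assumptions made for illustration. Only the dimension names and their ranges come from the prompt. How the paper aggregates the four scores into a reward is not shown here, so the sketch stops at validation.

```python
import re

# Ranges taken from the judge prompt: two binary dimensions, two integer 0-3 dimensions.
SCORE_RANGES = {
    "FORMAT_CORRECT": (0, 1),
    "ACTION_VALID": (0, 1),
    "TOOL_REASONABLE": (0, 3),
    "DECISION_QUALITY": (0, 3),
}

# Hypothetical line pattern: "NAME: <integer>" with optional surrounding whitespace.
_SCORE_LINE = re.compile(r"^\s*([A-Z_]+)\s*:\s*(-?\d+)\s*$")


def parse_judge_scores(text: str) -> dict:
    """Return {dimension: score}; raise ValueError if a score is missing or out of range."""
    scores = {}
    for line in text.splitlines():
        match = _SCORE_LINE.match(line)
        if match is None or match.group(1) not in SCORE_RANGES:
            continue  # assumed policy: skip anything that is not a known score line
        name, value = match.group(1), int(match.group(2))
        low, high = SCORE_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        scores[name] = value
    missing = sorted(set(SCORE_RANGES) - set(scores))
    if missing:
        raise ValueError(f"missing scores: {missing}")
    return scores


if __name__ == "__main__":
    reply = "FORMAT_CORRECT: 1\nACTION_VALID: 1\nTOOL_REASONABLE: 2\nDECISION_QUALITY: 3"
    print(parse_judge_scores(reply))
```

Raising on a malformed reply, rather than silently defaulting to zero, keeps judge failures distinguishable from genuinely low scores; a caller could catch the exception and retry the judge call.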

## Appendix C Limitations

Although Orchestra\-o1 achieves strong omnimodal agentic intelligence, several limitations remain\. First, orchestration introduces additional system complexity\. Compared with a single native omnimodal agent, Orchestra\-o1 requires maintaining sub\-agent histories, tool schemas, backend configurations, cost accounting, and asynchronous execution\. While this design improves modularity and efficiency, it also creates more implementation components that must be carefully engineered and monitored\. Second, the current training recipe focuses on the main agent rather than jointly optimizing all sub\-agents and tools\. DA\-GRPO improves decision\-level orchestration, but the sub\-agent backends remain fixed during training\. A more complete learning system could jointly adapt the main agent, sub\-agent policies, and tool\-selection behavior from end\-to\-end task outcomes\.