INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

arXiv cs.AI 06/11/26, 04:00 AM Papers
Summary
InfraMind introduces an infrastructure-aware multi-agent LLM orchestration framework that uses reinforcement learning to dynamically select models and topologies based on real-time system load, achieving up to 7x lower latency and 99.9% SLO compliance under high load.
arXiv:2606.11440v1 Announce Type: new Abstract: Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi-agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi-agent stack infrastructure-aware. An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model's queue so that urgent requests are served first. Cast as a hierarchical constrained MDP and solved end-to-end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, INFRAMIND delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7x lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.
Cached at: 06/11/26, 01:47 PM
# InfraMind: Infrastructure-Aware Multi-Agent Orchestration
Source: [https://arxiv.org/html/2606.11440](https://arxiv.org/html/2606.11440)
Ahasan Kabir, Jiaqi Xue, Mengxin Zheng, Qian Lou. University of Central Florida. {ahasan.kabir, jiaqi.xue, mengxin.zheng, qian.lou}@ucf.edu

###### Abstract

Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this *infrastructure blindness* causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi-agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce InfraMind, a framework that makes the entire multi-agent stack infrastructure-aware. An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model’s queue so that urgent requests are served first. Cast as a hierarchical constrained MDP and solved end-to-end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, InfraMind delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7× lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.

## 1 Introduction

Multi-agent LLM systems, where multiple models collaborate through debate, review, or sequential chains (Wu et al., [2024](https://arxiv.org/html/2606.11440#bib.bib1); Hong et al., [2023](https://arxiv.org/html/2606.11440#bib.bib2); Li et al., [2023](https://arxiv.org/html/2606.11440#bib.bib3)), are the dominant paradigm for complex tasks, and recent work has focused on learning the orchestration itself: which models to call, in what topology, and with what roles (Wang et al., [2024a](https://arxiv.org/html/2606.11440#bib.bib6); Zhuge et al., [2024](https://arxiv.org/html/2606.11440#bib.bib5); Yue et al., [2025](https://arxiv.org/html/2606.11440#bib.bib4)). Yet every existing method selects models from static task features alone, ignoring the *runtime state* of the serving infrastructure. As multi-agent workloads move onto shared GPU clusters serving pools of open-weight models (Grattafiori et al., [2024](https://arxiv.org/html/2606.11440#bib.bib7); Guo et al., [2025](https://arxiv.org/html/2606.11440#bib.bib29)) via vLLM (Kwon et al., [2023](https://arxiv.org/html/2606.11440#bib.bib10)) and SGLang (Zheng et al., [2024](https://arxiv.org/html/2606.11440#bib.bib11)), this becomes a critical blind spot. A model that is “fast” in isolation may have hundreds of requests queued while an idle alternative could respond instantly, and in multi-agent pipelines a bottleneck at any step delays every downstream agent. We call this failure mode *infrastructure blindness*.

Figure [1](https://arxiv.org/html/2606.11440#S1.F1) exposes the symptoms of infrastructure blindness. We profile MasRouter (Yue et al., [2025](https://arxiv.org/html/2606.11440#bib.bib4)), the state-of-the-art task-adaptive router, on a shared pool of five models under Poisson load, and observe three failure patterns that recur across regimes. First, static routing produces extreme load imbalance: preferred small models accumulate queues exceeding 130 requests while equally capable large models sit nearly idle (Figure [1](https://arxiv.org/html/2606.11440#S1.F1)a). Second, this imbalance translates directly into avoidable latency: congested models incur >30 s end-to-end delays on queries that an idle alternative could answer in under 10 s (Figure [1](https://arxiv.org/html/2606.11440#S1.F1)b). Third, the failure inverts at low load: 67% of large-model GPU capacity goes unused, leaving quality on the table that deeper reasoning could otherwise harvest (Figure [1](https://arxiv.org/html/2606.11440#S1.F1)c). In both regimes, the orchestrator commits to routing decisions in a training-time information regime that is fundamentally disconnected from runtime conditions.

![Refer to caption](https://arxiv.org/html/2606.11440v1/x1.png) Figure 1: Load-agnostic routing in practice (MasRouter on MATH, Poisson arrivals). (a) Per-model queue depth. (b) Per-model end-to-end latency. (c) Large-model GPU utilization at low load.

Solving these problems is hard for three reasons. First, *planning the reasoning structure* (topology, agent count, and roles) from the current runtime state is challenging because the infrastructure state is dynamic and changes during execution. Prior work sidesteps this and decides the structure from task features alone. Second, *picking which model to call and how deeply it should reason at each step* is a fine-grained decision over noisy, fast-changing runtime signals (queue depths, KV pressure, latencies). Prior routers ignore these signals: they choose the model from the query alone, overloading preferred ones (Figure [1](https://arxiv.org/html/2606.11440#S1.F1)a,b), and never adapt reasoning depth to resource availability, leaving idle capacity untapped (Figure [1](https://arxiv.org/html/2606.11440#S1.F1)c). Third, *prioritizing among the many multi-agent steps that arrive together* is hard because each step carries its own remaining budget and urgency. Prior work defaults to FCFS and ignores both, so tight-budget requests wait behind relaxed ones and miss their SLOs. The three decisions are also coupled: a choice at any layer reshapes the runtime state the others must respond to. Heuristics tuned one layer at a time therefore leave the cross-layer interactions on the table.

![Refer to caption](https://arxiv.org/html/2606.11440v1/x2.png) Figure 2: InfraMind reads live system metrics and routes around congestion while adapting reasoning depth (Flash/Concise/DeepThink) to current capacity.

We propose InfraMind, which places an infrastructure-aware component at each of these decision points. At *query arrival*, an infrastructure-aware *planner* conditions topology, agent-count, and role choices on a summary of current load and remaining budget, biasing toward simpler graphs when the system is congested and richer ones when capacity is available. At *each agent step*, an infrastructure-aware *executor* reads per-model queue depths, KV cache utilization, and end-to-end latencies, then jointly selects the target model and reasoning depth (Flash / Concise / DeepThink). Figure [2](https://arxiv.org/html/2606.11440#S1.F2) contrasts this with a load-agnostic router: the baseline (left) picks by quality alone and stacks requests onto the same preferred large model, while our executor (right) sees that the large model’s queue is saturated, redirects the call to an idle smaller model, and invests in DeepThink. The time saved by skipping the queue compensates for the smaller model’s lower headline quality, so the answer arrives at comparable accuracy and a fraction of the latency. At *each model’s queue*, a budget-aware *Earliest-Deadline-First scheduler* reorders pending requests so tight-budget queries are not blocked behind relaxed ones. The three components are cast as a hierarchical constrained MDP and trained end-to-end via reinforcement learning, automatically discovering the quality–latency trade-off across load levels.

Contributions. (1) We identify *infrastructure blindness* as a systematic failure of multi-agent LLM systems and quantify it empirically (§[1](https://arxiv.org/html/2606.11440#S1)). (2) We propose InfraMind, the first end-to-end infrastructure-aware multi-agent orchestrator, comprising an infra-aware planner, executor, and EDF scheduler trained jointly as a single hierarchical RL policy under a shared budget constraint (§[3](https://arxiv.org/html/2606.11440#S3), §[4](https://arxiv.org/html/2606.11440#S4)). (3) Across five benchmarks, InfraMind delivers up to +7.6 pp accuracy over the strongest baseline at low load with up to 7× lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50% (§[5](https://arxiv.org/html/2606.11440#S5)).

## 2 Related Work

#### Multi\-agent LLM orchestration\.

Multi\-agent systems improve task performance by orchestrating multiple LLM instances in structured collaboration topologies\. Research in this area has progressively introduced more intelligence into the orchestration layer, but a critical dimension, runtime infrastructure state, remains entirely unaddressed\.

Three representative systems illustrate increasing task-level sophistication. Mixture-of-Agents (MoA) (Wang et al., [2024a](https://arxiv.org/html/2606.11440#bib.bib6)) runs every model in the pool in parallel and synthesizes via a fixed aggregator, providing zero routing intelligence: the slowest model bottlenecks every response and queue congestion on any single model degrades the entire system. GPTSwarm (Zhuge et al., [2024](https://arxiv.org/html/2606.11440#bib.bib5)) models multi-agent collaboration as a directed graph with REINFORCE-learned edge weights, but the graph is frozen at test time: edge weights are fixed after training, so the system cannot reroute when a preferred model becomes congested during deployment. MasRouter (Yue et al., [2025](https://arxiv.org/html/2606.11440#bib.bib4)) introduces the most sophisticated task-adaptive orchestration to date: a VAE-based cascaded controller that jointly determines topology, agent count, role assignments, and per-role model from the query embedding, enabling task-specific routing. Yet its decisions are based entirely on static task features, with no mechanism to distinguish an idle model from a saturated one and a fixed prompting strategy regardless of whether the budget is tight or generous.

#### LLM routing and cost\-aware serving\.

Outside the multi-agent setting, a growing body of work addresses the cost and quality of single-model routing. RouteLLM (Ong et al., [2024](https://arxiv.org/html/2606.11440#bib.bib8)) learns a quality-based router that directs queries to either a strong or weak model based on predicted difficulty, achieving cost savings without significant quality loss. TREACLE (Zhang et al., [2024](https://arxiv.org/html/2606.11440#bib.bib9)) extends this to budget-constrained LLM cascades with joint model and prompt selection. R2-Router (Xue et al., [2026](https://arxiv.org/html/2606.11440#bib.bib28)) further refines query-conditioned routing by incorporating reasoning-aware difficulty signals to choose between models. However, these systems operate over single-turn, single-model calls; they do not handle the multi-step, multi-agent workflows where routing decisions at each step affect the quality and latency of downstream agents.

On the serving infrastructure side, vLLM (Kwon et al., [2023](https://arxiv.org/html/2606.11440#bib.bib10)) introduced PagedAttention for efficient KV cache management and supports continuous batching, priority-based scheduling, and detailed telemetry via Prometheus endpoints. Sarathi-Serve (Agrawal et al., [2024](https://arxiv.org/html/2606.11440#bib.bib12)) further optimizes prefill–decode scheduling with chunked prefills. These systems expose precisely the signals (queue depth, cache utilization, per-request latency) that an infrastructure-aware orchestrator needs, but they optimize inference *within* a single model and make no cross-model routing decisions. InfraMind sits at the intersection: it consumes the telemetry that serving systems expose to make the cross-model routing decisions that those systems do not.

Table 1: How each orchestration system behaves at decision time and under stress.

## 3 Problem Formulation

Multi-agent orchestration involves decisions at two timescales: *what reasoning structure to use* (topology, roles, agent count), chosen once per query primarily from query semantics, and *how to execute each step* (which model, how much reasoning), chosen repeatedly under fast-changing per-model queues, latencies, and remaining budget. We formalize this as a hierarchical Constrained Markov Decision Process (CMDP) (Altman, [2021](https://arxiv.org/html/2606.11440#bib.bib19)) in which both levels see infrastructure state, but at appropriate granularities.

#### State\.

Consider $N$ LLM services $\mathcal{M}=\{m_1,\ldots,m_N\}$ on shared GPUs and prompting strategies $\mathcal{S}=\{\text{Flash},\text{Concise},\text{DeepThink}\}$. A query $q$ arrives with a time budget $\beta$. At each agent step $k$, the executor observes:

$$s_k=\big[\underbrace{\mathbf{e}_q,\;\mathbf{e}_{r_k}}_{\text{what to solve}},\;\;\underbrace{b_k}_{\text{time left}},\;\;\underbrace{\mathbf{d}^{\text{queue}},\;\mathbf{d}^{\text{e2e}},\;\mathbf{d}^{\text{kv}}}_{\text{system load}}\big] \tag{1}$$

where $\mathbf{e}_q,\mathbf{e}_{r_k}\in\mathbb{R}^{384}$ are Sentence-BERT (Reimers and Gurevych, [2019](https://arxiv.org/html/2606.11440#bib.bib23)) query and role embeddings, $b_k$ is the normalized remaining budget, and $\mathbf{d}^{\text{queue}},\mathbf{d}^{\text{e2e}},\mathbf{d}^{\text{kv}}\in\mathbb{R}^{N}$ are per-model queue depth, end-to-end latency, and KV cache utilization polled from vLLM’s `/metrics` endpoint.
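
As a minimal sketch of how the state in Eq. 1 could be assembled, the snippet below concatenates the six components into one flat vector. The function name `build_state` and the use of plain Python lists are illustrative assumptions, not the authors' implementation; only the component order and the 384-dimensional embeddings come from the text.

```python
from typing import List, Sequence

EMBED_DIM = 384  # Sentence-BERT embedding size stated in the text


def build_state(
    e_q: Sequence[float],
    e_r: Sequence[float],
    b_k: float,
    d_queue: Sequence[float],
    d_e2e: Sequence[float],
    d_kv: Sequence[float],
) -> List[float]:
    """Concatenate the components of Eq. 1 (hypothetical helper)."""
    if len(e_q) != EMBED_DIM or len(e_r) != EMBED_DIM:
        raise ValueError("query and role embeddings must have 384 entries")
    n_models = len(d_queue)
    if len(d_e2e) != n_models or len(d_kv) != n_models:
        raise ValueError("per-model metric vectors must share length N")
    # Order follows Eq. 1: what to solve, time left, system load.
    return [*e_q, *e_r, b_k, *d_queue, *d_e2e, *d_kv]


# With N = 5 models the state has 2 * 384 + 1 + 3 * 5 = 784 entries.
state = build_state(
    [0.0] * EMBED_DIM, [0.0] * EMBED_DIM, 0.75,
    [3, 0, 12, 1, 0], [4.2, 0.9, 31.0, 1.5, 0.7], [0.4, 0.1, 0.9, 0.2, 0.05],
)
```

Under these assumptions the executor input for the five-model pool is a 784-dimensional vector, of which only 16 entries carry budget and load information.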

#### Action\.

The executor selects a joint action $a_k\in\{0,\ldots,N|\mathcal{S}|-1\}$, decoded as model $m_k=\lfloor a_k/|\mathcal{S}|\rfloor$ and strategy $\sigma_k=a_k \bmod |\mathcal{S}|$.
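
The decoding rule above amounts to integer division and remainder. The sketch below illustrates it; the index order of the strategies (Flash = 0, Concise = 1, DeepThink = 2) and the function names are assumptions made for illustration, since the text does not fix them.

```python
from typing import Tuple

# Assumed index order; the paper lists the strategies but not their indices.
STRATEGIES = ("Flash", "Concise", "DeepThink")


def decode_action(a_k: int, num_models: int) -> Tuple[int, str]:
    """Split a joint action into (model index, strategy name)."""
    if not 0 <= a_k < num_models * len(STRATEGIES):
        raise ValueError("action index out of range")
    model_idx, strategy_idx = divmod(a_k, len(STRATEGIES))
    return model_idx, STRATEGIES[strategy_idx]


def encode_action(model_idx: int, strategy: str) -> int:
    """Inverse mapping, useful when logging or replaying decisions."""
    return model_idx * len(STRATEGIES) + STRATEGIES.index(strategy)


print(decode_action(7, num_models=5))   # (2, 'Concise')
print(decode_action(14, num_models=5))  # (4, 'DeepThink')
```

A flat index keeps the policy head a single softmax over all (model, strategy) pairs, so the two choices are made jointly rather than one after the other.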

#### Objective\.

We maximize quality subject to a latency budget. With quality reward $R_k\in\{0,1\}$ and cost $C_k=\ell_k/\beta$ (step latency as a fraction of the budget):

$$\pi^{*}=\arg\max_{\pi}\;\mathbb{E}_{\pi}\Big[\textstyle\sum_{k}R_{k}\Big]\quad\text{s.t.}\quad\mathbb{E}_{\pi}\Big[\textstyle\sum_{k}C_{k}\Big]\leq 1 \tag{2}$$

A single Lagrange multiplier $\lambda$ converts this constraint into a learnable quality–latency trade-off:

$$\mathcal{L}(\pi,\lambda)=\mathbb{E}_{\pi}\Big[\textstyle\sum_{k}R_{k}-\lambda\cdot C_{k}\Big]+\lambda \tag{3}$$

The dual update (Eq. [7](https://arxiv.org/html/2606.11440#S4.E7), §[4.4](https://arxiv.org/html/2606.11440#S4.SS4)) drives $\lambda$ up when the policy overspends and down when it has slack, automatically discovering the trade-off across budget tiers and load levels.
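
For readers who want the intermediate step, one way to arrive at Eq. 3 is the standard Lagrangian relaxation of the constraint in Eq. 2. The derivation below assumes the sum in Eq. 3 covers both the reward and the cost term; it is a reading of the formula, not a statement taken from the paper.

```latex
\begin{aligned}
\mathcal{L}(\pi,\lambda)
  &= \mathbb{E}_{\pi}\Big[\textstyle\sum_{k} R_{k}\Big]
     - \lambda\Big(\mathbb{E}_{\pi}\Big[\textstyle\sum_{k} C_{k}\Big] - 1\Big) \\
  &= \mathbb{E}_{\pi}\Big[\textstyle\sum_{k}\big(R_{k} - \lambda\, C_{k}\big)\Big] + \lambda .
\end{aligned}
```

The trailing $+\lambda$ does not depend on the policy, so for a fixed $\lambda$ the policy effectively maximizes the per-step penalized reward $R_k-\lambda C_k$.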

#### Hierarchical decomposition\.

We split the policy into a planner $\pi_{\text{plan}}(\tau,K,\mathbf{r}\mid\mathbf{e}_q,\mathbf{z}_0)$ that selects topology $\tau$, agent count $K$, and roles $\mathbf{r}$ at $t=0$ from query semantics plus a low-dimensional summary $\mathbf{z}_0=[b_0,\mathbf{d}^{\text{queue}},\mathbf{d}^{\text{e2e}},\mathbf{d}^{\text{kv}}]$, and an executor $\pi_{\text{exec}}(a_k\mid s_k)$ that selects the (model, strategy) pair at each step from the full state. The planner commits once to a feasible reasoning structure; the executor adapts execution as queues build and drain.

## 4 Method

InfraMind threads infrastructure awareness through every layer of orchestration (Figure [3](https://arxiv.org/html/2606.11440#S4.F3)): an *infra-aware planner* chooses a topology, agent count, and roles conditioned on a coarse summary of system load and budget (§[4.1](https://arxiv.org/html/2606.11440#S4.SS1)); an *infra-aware executor* then makes per-step model and strategy decisions from the full real-time state (§[4.2](https://arxiv.org/html/2606.11440#S4.SS2)); and a *budget-aware scheduler* manages request priority within each model’s queue (§[4.3](https://arxiv.org/html/2606.11440#S4.SS3)). All three components are trained jointly as a single hierarchical RL policy under a shared budget constraint (§[4.4](https://arxiv.org/html/2606.11440#S4.SS4)).

![Refer to caption](https://arxiv.org/html/2606.11440v1/x3.png) Figure 3: InfraMind architecture. (1) Planner: given the query, budget, and a System Monitor snapshot (queue depth, KV cache, E2E latency), it selects topology and roles, with FiLM modulation biasing toward simpler graphs under congestion and richer ones under slack. (2) Executor: at each step, a dual-pathway RL policy reads the role, query, remaining budget, and live metrics to jointly select a target model and reasoning strategy (Flash/Concise/DeepThink). (3) EDF scheduler: reorders each model’s queue by deadline, preventing tight-budget requests from being blocked behind loose ones.

A System Monitor continuously polls per-model queue depth, KV cache utilization, and end-to-end latency, exposing the state vector consumed by the planner (once at $t=0$) and the executor (every step).

### 4.1 Infrastructure-Aware Planner

When planning the multi-agent topology and role allocation for an incoming query, the planner considers the current state of the system (load and budget), so that the topology it chooses is one that downstream stages can actually carry out. We adopt a cascaded controller (task classifier → collaboration → agent count → roles) following MasRouter (Yue et al., [2025](https://arxiv.org/html/2606.11440#bib.bib4)), and inject infrastructure awareness by conditioning every head on a system summary $\mathbf{z}_0$ (budget plus per-model queue, E2E latency, and KV cache vectors; cf. Eq. [1](https://arxiv.org/html/2606.11440#S3.E1)) via Feature-wise Linear Modulation:

$$\tilde{\mathbf{e}}_q=\boldsymbol{\gamma}(\mathbf{z}_0)\odot\mathbf{e}_q+\boldsymbol{\beta}(\mathbf{z}_0) \tag{4}$$

A small MLP maps $\mathbf{z}_0$ to the scale and shift $(\boldsymbol{\gamma},\boldsymbol{\beta})$, which modulate the query embedding before it flows through the cascade, so all four heads inherit a single, coherent view of system state and learn to bias toward simpler chains under tight budgets or congestion and richer debate topologies under slack.
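
The modulation in Eq. 4 can be sketched in a few lines of dependency-free Python. Everything beyond the formula itself is an assumption made for illustration: the class name `FiLMGenerator`, the single hidden layer of width 32, the random initialization, and the choice to start with scale near 1 and shift near 0.

```python
import random
from typing import List, Tuple

Vector = List[float]


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def film_modulate(e_q: Vector, gamma: Vector, beta: Vector) -> Vector:
    """Eq. 4: element-wise scale and shift of the query embedding."""
    return [g * e + b for g, e, b in zip(gamma, e_q, beta)]


class FiLMGenerator:
    """Hypothetical one-hidden-layer MLP mapping z_0 to (gamma, beta)."""

    def __init__(self, z_dim: int, embed_dim: int, hidden_dim: int = 32, seed: int = 0):
        rng = random.Random(seed)
        self.embed_dim = embed_dim
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(z_dim)] for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
                   for _ in range(2 * embed_dim)]
        # Assumed init: bias the output so gamma starts near 1 and beta near 0.
        self.b2 = [1.0] * embed_dim + [0.0] * embed_dim

    def __call__(self, z0: Vector) -> Tuple[Vector, Vector]:
        hidden = [max(0.0, v) for v in linear(z0, self.w1, self.b1)]
        out = linear(hidden, self.w2, self.b2)
        return out[: self.embed_dim], out[self.embed_dim:]


# z_0 = [b_0, d_queue, d_e2e, d_kv] for N = 5 models has 1 + 3 * 5 = 16 entries.
generator = FiLMGenerator(z_dim=16, embed_dim=384)
z0 = [0.8] + [0.1] * 5 + [0.2] * 5 + [0.3] * 5
gamma, beta = generator(z0)
e_q_tilde = film_modulate([0.5] * 384, gamma, beta)
```

Because the modulation is applied once to the shared embedding, every downstream head sees the same load-adjusted representation without needing its own infrastructure input.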

### 4.2 Infrastructure-Aware Executor

At every agentic step, the executor selects a model and reasoning strategy for the next call, adapting both to current conditions for better resource utilization. It reads two complementary signals: a *semantic signal* (query plus current role; what the task needs) and a *resource signal* (remaining budget plus live system metrics; what the system can currently deliver). Each is routed through its own pathway before merging, so the joint action head conditions on both.

#### Semantic pathway\.

$\mathbf{h}_{\text{sem}}=\text{LN}(\text{ReLU}(\mathbf{W}_{\text{sem}}[\mathbf{e}_q\,\|\,\mathbf{e}_{r_k}]))\in\mathbb{R}^{128}$.

#### Resource pathway\.

A budget head $\mathbf{h}_{\text{bud}}=\text{ReLU}(\mathbf{W}_{\text{bud}}b_k)$ and a system head $\mathbf{h}_{\text{sys}}=\text{ReLU}(\mathbf{W}_{\text{sys}}[\mathbf{d}^{\text{queue}}\,\|\,\mathbf{d}^{\text{e2e}}\,\|\,\mathbf{d}^{\text{kv}}])$ ($\in\mathbb{R}^{16}$ each) combine into $\mathbf{h}_{\text{res}}=\text{LN}(\text{ReLU}(\mathbf{W}_{\text{res}}[\mathbf{h}_{\text{bud}}\,\|\,\mathbf{h}_{\text{sys}}]))\in\mathbb{R}^{64}$.

#### Decision\.

Pathways merge into $\mathbf{h}=\text{ReLU}(\mathbf{W}_{\text{merge}}[\mathbf{h}_{\text{sem}}\,\|\,\mathbf{h}_{\text{res}}])\in\mathbb{R}^{128}$, yielding the policy $\pi(a_k\mid s_k)=\text{softmax}(\mathbf{W}_{\text{act}}\mathbf{h})$ over $N\times|\mathcal{S}|$ joint actions (5 models × 3 strategies) and the value $V(s_k)=\mathbf{w}_{\text{val}}^{\top}\mathbf{h}$. Strategies modulate reasoning depth on top of CoT prompting (Wei et al., [2022](https://arxiv.org/html/2606.11440#bib.bib22)): Flash (direct answer), Concise (2–3 steps), and DeepThink (thorough reasoning with verification). The executor learns when each is worth the latency cost.
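
The three pathway equations above can be traced end to end with a small forward pass. The sketch below uses the layer widths given in the text (128, 16, 16, 64, 128) with randomly initialized weights; the class name, the bias-free layers, the layer norm without learned affine parameters, and the initialization are assumptions for illustration only.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def make_linear(rng: random.Random, in_dim: int, out_dim: int) -> List[Vector]:
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def matvec(w: List[Vector], x: Vector) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def softmax(x: Vector) -> Vector:
    peak = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


class DualPathwayExecutor:
    """Forward pass only; training is out of scope for this sketch."""

    def __init__(self, num_models: int = 5, num_strategies: int = 3,
                 embed_dim: int = 384, seed: int = 0):
        rng = random.Random(seed)
        self.w_sem = make_linear(rng, 2 * embed_dim, 128)
        self.w_bud = make_linear(rng, 1, 16)
        self.w_sys = make_linear(rng, 3 * num_models, 16)
        self.w_res = make_linear(rng, 16 + 16, 64)
        self.w_merge = make_linear(rng, 128 + 64, 128)
        self.w_act = make_linear(rng, 128, num_models * num_strategies)
        self.w_val = make_linear(rng, 128, 1)

    def forward(self, e_q: Vector, e_r: Vector, b_k: float,
                d_queue: Vector, d_e2e: Vector, d_kv: Vector) -> Tuple[Vector, float]:
        h_sem = layer_norm(relu(matvec(self.w_sem, e_q + e_r)))        # semantic pathway
        h_bud = relu(matvec(self.w_bud, [b_k]))                        # budget head
        h_sys = relu(matvec(self.w_sys, d_queue + d_e2e + d_kv))       # system head
        h_res = layer_norm(relu(matvec(self.w_res, h_bud + h_sys)))    # resource pathway
        h = relu(matvec(self.w_merge, h_sem + h_res))                  # merge
        return softmax(matvec(self.w_act, h)), matvec(self.w_val, h)[0]


executor = DualPathwayExecutor()
probs, value = executor.forward(
    [0.1] * 384, [0.2] * 384, 0.6,
    [3.0, 0.0, 12.0, 1.0, 0.0], [4.2, 0.9, 31.0, 1.5, 0.7], [0.4, 0.1, 0.9, 0.2, 0.05],
)
```

Keeping the 768-dimensional semantic input and the 16-dimensional resource input in separate pathways gives the low-dimensional load signals their own representation before the merge, rather than appending them to a much larger embedding.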

### 4.3 Budget-Aware Priority Scheduling

Even with optimal cross-model routing, a tight-budget request can stall behind a relaxed one inside a single model’s queue, so urgency must propagate from the user’s contract into the serving layer itself. We attach a deadline $t_{\text{arrive}}+\beta$ to each query; every LLM call its agents generate inherits that deadline, and each model’s queue serves these calls Earliest-Deadline-First (EDF) (Liu and Layland, [1973](https://arxiv.org/html/2606.11440#bib.bib21)). Routing then handles *cross-model* load balancing while EDF handles *within-model* deadline ordering; the two layers cover orthogonal axes of waiting.
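
The per-model queue described above behaves like a priority queue keyed on the inherited deadline. The following is a minimal sketch using Python's `heapq`; the names `EDFQueue`, `PendingCall`, and `query_deadline`, and the first-in-first-out tie-break for equal deadlines, are illustrative assumptions rather than details from the paper.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List


def query_deadline(t_arrive: float, budget: float) -> float:
    """Deadline attached to a query and inherited by all of its LLM calls."""
    return t_arrive + budget


@dataclass(order=True)
class PendingCall:
    deadline: float
    seq: int                       # tie-break: earlier submissions first
    payload: str = field(compare=False)


class EDFQueue:
    """Earliest-Deadline-First queue for a single model (hypothetical helper)."""

    def __init__(self) -> None:
        self._heap: List[PendingCall] = []
        self._counter = itertools.count()

    def submit(self, deadline: float, payload: str) -> None:
        heapq.heappush(self._heap, PendingCall(deadline, next(self._counter), payload))

    def pop_next(self) -> str:
        return heapq.heappop(self._heap).payload

    def __len__(self) -> int:
        return len(self._heap)


queue = EDFQueue()
queue.submit(query_deadline(0.0, 300.0), "relaxed")  # deadline 300
queue.submit(query_deadline(5.0, 30.0), "tight")     # deadline 35
queue.submit(query_deadline(1.0, 100.0), "medium")   # deadline 101
order = [queue.pop_next() for _ in range(len(queue))]
print(order)  # ['tight', 'medium', 'relaxed']
```

Under FCFS the same three calls would be served in arrival order, so the tight-budget call would wait behind the relaxed one even though it arrived only five seconds later.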

### 4.4 Joint Hierarchical RL Training

The planner and executor make decisions at different timescales (one per query vs. one per step) and so receive different learning signals, but they share an outcome. We train them jointly with a hierarchical policy gradient: the executor uses the PPO clipped surrogate (Schulman et al., [2017](https://arxiv.org/html/2606.11440#bib.bib18)) on its per-step trajectories, the planner uses baseline-normalized REINFORCE (Williams, [1992](https://arxiv.org/html/2606.11440#bib.bib20)) on the episode return, and a single Lagrange multiplier $\lambda$ enforces the budget constraint across both:

$$\mathcal{L}_{\text{exec}}=-\mathbb{E}\left[\min\left(\rho_k\hat{A}_k,\;\text{clip}(\rho_k,1-\epsilon,1+\epsilon)\hat{A}_k\right)\right]+0.5\,\mathcal{L}_{\text{value}}-\alpha_H\mathcal{H}[\pi], \tag{5}$$

$$\mathcal{L}_{\text{plan}}=-\log\pi_{\text{plan}}\cdot\frac{U_i-\bar{U}}{\sigma_U}+\mathcal{L}_{\text{task}}+\alpha_{\text{VAE}}\mathcal{L}_{\text{VAE}}, \tag{6}$$

where the executor reward $r_k=\mathbb{1}[\text{solved}]-\lambda\ell_k/\beta$ and planner utility $U_i=\mathbb{1}[\text{solved}]-\lambda L_{\text{total}}/\beta$ share the same penalty term. The planner additionally inherits a task-classification loss $\mathcal{L}_{\text{task}}$ and a VAE regularizer $\mathcal{L}_{\text{VAE}}$ from its cascaded controller, providing a stable initialization signal that the policy gradient then refines under the budget constraint.
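
To make the pieces of Eqs. 5 and 6 concrete, the sketch below computes the penalized executor reward, the clipped policy term of Eq. 5, and the normalized planner utility used in Eq. 6. It covers only these terms; the value loss, entropy bonus, task loss, and VAE regularizer are omitted. The function names, the clip range of 0.2, and the small constant added to the standard deviation are assumptions, since the paper defers hyperparameters to its appendix.

```python
import math
from typing import List, Sequence


def executor_reward(solved: bool, step_latency_s: float, budget_s: float, lam: float) -> float:
    """r_k = 1[solved] - lambda * l_k / beta."""
    return float(solved) - lam * step_latency_s / budget_s


def ppo_policy_loss(ratios: Sequence[float], advantages: Sequence[float],
                    epsilon: float = 0.2) -> float:
    """Clipped-surrogate term of Eq. 5, averaged over the given samples."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped = min(max(rho, 1.0 - epsilon), 1.0 + epsilon)
        terms.append(min(rho * adv, clipped * adv))
    return -sum(terms) / len(terms)


def normalized_utilities(utilities: Sequence[float], eps: float = 1e-8) -> List[float]:
    """(U_i - mean) / std, the baseline-normalized weight in Eq. 6."""
    mean = sum(utilities) / len(utilities)
    std = math.sqrt(sum((u - mean) ** 2 for u in utilities) / len(utilities))
    return [(u - mean) / (std + eps) for u in utilities]


# A ratio of 1.5 with positive advantage is clipped to 1.2, capping the update.
loss_up = ppo_policy_loss([1.5], [1.0])
# A ratio of 0.5 with negative advantage uses the clipped (more pessimistic) term.
loss_down = ppo_policy_loss([0.5], [-1.0])
reward = executor_reward(True, step_latency_s=30.0, budget_s=300.0, lam=0.5)
```

Because both the per-step reward and the per-episode utility subtract latency scaled by the same multiplier, tightening the constraint shifts both policies in the same direction.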

After each batch, $\lambda$ adapts to the average constraint violation $\bar{C}$ (where $\bar{C}=\sum_k \ell_k/\beta$ over the batch):

$$\lambda\leftarrow\text{clip}\left(\lambda+\eta_{\lambda}(\bar{C}-1),\;0,\;\lambda_{\max}\right). \tag{7}$$

The dual update closes the loop: persistent overspend pushes both policies toward faster choices, while slack lets them invest in quality. Episodes sweep across budget tiers and Poisson arrival rates with inter-sweep queue draining, so the executor sees the full distribution of congestion regimes it must adapt to at deployment. Hyperparameters are in Appendix [A](https://arxiv.org/html/2606.11440#A1).
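
The dual update in Eq. 7 is a clipped gradient step on the multiplier. A minimal sketch follows; the function names and the numeric values in the example are illustrative, and `batch_mean_cost` assumes that the batch statistic is the mean over episodes of each episode's summed normalized latency, which is one plausible reading of the text.

```python
from typing import Sequence


def batch_mean_cost(step_latencies_s: Sequence[Sequence[float]],
                    budgets_s: Sequence[float]) -> float:
    """Mean over episodes of sum_k l_k / beta (assumed batch statistic)."""
    costs = [sum(steps) / budget for steps, budget in zip(step_latencies_s, budgets_s)]
    return sum(costs) / len(costs)


def dual_update(lam: float, mean_cost: float, lr: float, lam_max: float) -> float:
    """Eq. 7: raise lambda on overspend, lower it on slack, clip to [0, lam_max]."""
    return min(max(lam + lr * (mean_cost - 1.0), 0.0), lam_max)


# Two episodes with a 100 s budget: one uses 30% of it, the other 150%.
c_bar = batch_mean_cost([[10.0, 20.0], [50.0, 100.0]], [100.0, 100.0])  # about 0.9
lam = dual_update(0.5, c_bar, lr=0.1, lam_max=5.0)  # slight slack, so lambda shrinks
```

A mean cost above 1 means the batch overspent its budget on average, so the multiplier grows and latency is penalized more heavily in the next round of policy updates.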

## 5 Experiments

### 5.1 Setup

#### Benchmarks\.

To probe routing behavior across distinct reasoning regimes, we evaluate on five established benchmarks spanning code generation, mathematical reasoning, and knowledge-intensive QA: MBPP (Austin et al., [2021](https://arxiv.org/html/2606.11440#bib.bib13)), HumanEval (Chen et al., [2021](https://arxiv.org/html/2606.11440#bib.bib14)), GSM-Hard (Gao et al., [2023](https://arxiv.org/html/2606.11440#bib.bib15)), MATH (Hendrycks et al., [2021](https://arxiv.org/html/2606.11440#bib.bib16)), and MMLU-Pro (Wang et al., [2024b](https://arxiv.org/html/2606.11440#bib.bib17)). Dataset sizes and splits are in Appendix [B](https://arxiv.org/html/2606.11440#A2).

#### Model pool\.

We assemble a deliberately heterogeneous pool (a reasoning specialist, a code specialist, a general-purpose mid-size model, and two small generalists) spanning a $10\times$ parameter range so the executor faces a non-trivial capability/cost trade-off: DeepSeek-R1-Distill-Qwen-32B (Guo et al., [2025](https://arxiv.org/html/2606.11440#bib.bib29)), Mistral-Small-24B (Jiang et al., [2023](https://arxiv.org/html/2606.11440#bib.bib31)), Qwen2.5-Coder-14B (Hui et al., [2024](https://arxiv.org/html/2606.11440#bib.bib30)), Llama-3.1-8B and Llama-3.2-3B (Grattafiori et al., [2024](https://arxiv.org/html/2606.11440#bib.bib7)), all served via vLLM on two NVIDIA B200 GPUs.

#### Baselines\.

We compare against representative multi-agent orchestrators that span the design spectrum from no routing to learned task-adaptive routing: MoA (Wang et al., [2024a](https://arxiv.org/html/2606.11440#bib.bib6)) (brute-force ensemble, no routing), GPTSwarm (Zhuge et al., [2024](https://arxiv.org/html/2606.11440#bib.bib5)) (learned topology, fixed at test time), and MasRouter (Yue et al., [2025](https://arxiv.org/html/2606.11440#bib.bib4)) (task-adaptive routing, no infrastructure awareness).

#### Evaluation protocol\.

To stress-test behavior across load regimes rather than a single operating point, all methods are evaluated under Poisson arrivals at low, moderate, and high rates ($\{10,50,100\}$ req/min) on the same shared model pool. We report accuracy (solve rate), mean latency (seconds), and SLO compliance (% of queries completing within 300 s).
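
The protocol above can be mimicked with a small load generator and an SLO metric. The sketch below draws Poisson arrivals by sampling exponential inter-arrival gaps and computes compliance as the share of queries finishing within the SLO; the function names, the fixed seed, and the 10-minute duration are illustrative assumptions, not the authors' harness.

```python
import random
from typing import List, Sequence


def poisson_arrival_times(rate_per_min: float, duration_s: float, seed: int = 0) -> List[float]:
    """Arrival timestamps (seconds) of a Poisson process with the given rate."""
    rng = random.Random(seed)
    rate_per_s = rate_per_min / 60.0
    now, times = 0.0, []
    while True:
        now += rng.expovariate(rate_per_s)  # exponential inter-arrival gap
        if now >= duration_s:
            return times
        times.append(now)


def slo_compliance(latencies_s: Sequence[float], slo_s: float = 300.0) -> float:
    """Percentage of queries that complete within the SLO."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    return 100.0 * sum(lat <= slo_s for lat in latencies_s) / len(latencies_s)


# One 10-minute trace per load level used in the evaluation.
traces = {rate: poisson_arrival_times(rate, duration_s=600.0) for rate in (10, 50, 100)}
print(slo_compliance([10.0, 299.0, 301.0, 1000.0]))  # 50.0
```

Note that mean latency and SLO compliance can diverge: a few very slow queries inflate the mean while barely moving compliance, which is one reason to report both.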

### 5.2 Main Results

Table [2](https://arxiv.org/html/2606.11440#S5.T2) and Figure [4](https://arxiv.org/html/2606.11440#S5.F4) present the central comparison. The figure visualizes the quality–latency trade-off: each subplot shows one dataset at one arrival rate, with background shading indicating SLO compliance zones.

Table 2: Accuracy (%), mean latency (s), and SLO compliance (%, budget ≤ 300 s) across three load levels. Bold = best per column.

#### Latency under load: the central result.

Figure [4](https://arxiv.org/html/2606.11440#S5.F4) reveals the core pattern. As arrival rate increases (top to bottom rows), baseline points migrate rightward into the “over budget” and “severely congested” zones (MoA and GPTSwarm exceed 1,000 s on several benchmarks at 100 req/min), while InfraMind stays under 300 s and remains in or near the SLO-compliant zone across all five benchmarks. This is the queue imbalance from §[1](https://arxiv.org/html/2606.11440#S1) manifesting at scale: baselines keep routing to the same models regardless of congestion, and the queuing delay compounds with load.

The practical impact is stark: at 100 req/min with a 300 s SLO budget, InfraMind achieves up to 99.9% SLO compliance (HumanEval), while baselines collapse, with MoA and GPTSwarm falling below 12% on most benchmarks. MasRouter fares slightly better but still drops below 50% SLO compliance on four of five benchmarks.

#### Quality holds at every load\.

At low load, InfraMind achieves the highest accuracy on all five benchmarks, with margins of +7.6 pp on MATH (82.0 vs. 74.4 for MoA) and +7.4 pp on GSM-Hard (62.0 vs. 54.6 for MasRouter), while running up to 14× faster than MasRouter on HumanEval (5 s vs. 70 s) and 6.3× faster on MBPP (40 s vs. 253 s). This turns available slack into both quality (DeepThink on capable models) and headroom. As load rises, accuracy degrades gracefully rather than collapsing alongside latency: at 100 req/min, InfraMind remains the most accurate method on MATH, GSM-Hard, HumanEval, and MMLU-Pro, narrowly trailing MasRouter by 0.4 pp on MBPP; that 0.4 pp comes at 985 s mean latency and 26% SLO compliance, a trade no production system would accept.

![Refer to caption](https://arxiv.org/html/2606.11440v1/x4.png) Figure 4: Quality–latency trade-off across datasets (*columns*) and arrival rates (*rows*). Each point is one method; the $x$-axis is mean latency (log scale). Background shading indicates latency zones: blue = SLO compliant (<100 s), peach = over budget (100–500 s), salmon = severely congested (>500 s). InfraMind (vermillion diamonds) consistently occupies the SLO-compliant zone with competitive accuracy, while baselines migrate into the congested zone as load increases.

### 5\.3Analysis and Ablations

Table 3: Ablation of InfraMind’s three mechanisms. Each row disables one mechanism in isolation; we report the change vs. InfraMind on the indicated workload.

Figure [5](https://arxiv.org/html/2606.11440#S5.F5) shows the executor is budget-aware: under low load on MATH, accuracy rises monotonically with the time budget as the executor progressively shifts to larger models and DeepThink reasoning. This emerges from end-to-end constrained RL; no budget-specific rules are hand-coded.

Table [3](https://arxiv.org/html/2606.11440#S5.T3) disables each of InfraMind’s three mechanisms in turn while holding the rest fixed, and each contributes a distinct, non-overlapping gain. Without infra-aware routing, traffic concentrates on preferred models: queues build on the favorites while alternatives sit idle, inflating step latency in a way no quality-only router can avoid on shared infrastructure. Replacing EDF with FCFS lets tight-budget requests stall behind relaxed ones under mixed-budget workloads, sharply increasing tail latency where routing alone cannot help. Forcing a single reasoning depth (Flash on every step) collapses accuracy on knowledge tasks like MMLU-Pro, confirming that adaptive depth is a genuine quality lever rather than just a latency knob. Removing any one mechanism degrades the corresponding axis (mean latency, tail latency, or accuracy) while leaving the others intact, evidence that the three components fix different failure modes rather than redundantly fixing the same one.
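The EDF-versus-FCFS effect can be illustrated with a toy single-server queue. The sketch below is not the paper's scheduler: the `Request` type, the `simulate` helper, and all service times and deadlines are hypothetical, chosen only to show how a tight-budget request stalls behind relaxed ones under first-come-first-served ordering and is rescued by earliest-deadline-first ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    service_s: float   # time the model needs to serve this request
    deadline_s: float  # absolute deadline implied by the remaining budget


def simulate(queue):
    """Serve requests back-to-back in the given order.

    Returns the names of requests that complete after their deadline.
    """
    clock, missed = 0.0, []
    for req in queue:
        clock += req.service_s
        if clock > req.deadline_s:
            missed.append(req.name)
    return missed


# Hypothetical mixed-budget queue, listed in arrival order.
arrivals = [
    Request("relaxed-1", service_s=30.0, deadline_s=500.0),
    Request("relaxed-2", service_s=30.0, deadline_s=500.0),
    Request("tight", service_s=10.0, deadline_s=20.0),
]

fcfs_order = arrivals                                      # arrival order
edf_order = sorted(arrivals, key=lambda r: r.deadline_s)   # earliest deadline first

print(simulate(fcfs_order))  # ['tight']: finishes at 70 s, deadline was 20 s
print(simulate(edf_order))   # []: every request meets its deadline
```

In this toy case reordering costs the relaxed requests only 10 s each, which their budgets absorb, while the tight request goes from a miss to a hit.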

![Refer to caption](https://arxiv.org/html/2606.11440v1/x5.png)

Figure 5: Accuracy vs. time budget on MATH ($\mu = 10$ req/min). Accuracy rises from 62.6% to 82.0% (+19.4 pp) as the executor shifts to larger models and DeepThink, emergent from end-to-end RL with no hand-coded rules.

### 5.4 Extension to Blackbox and Hybrid Pools

The same infrastructure-awareness principle extends to API-only and mixed deployments where serving internals are not observable. We construct two client-side proxies that require no server-side access: an exponential moving average of observed end-to-end latency, and an artificial congestion signal $d^{\text{queue}}_{i} = \text{recent\_requests}_{i} / \text{RPM\_limit}_{i}$ derived from each provider’s rate limit. A model at 90% of its RPM limit behaves as “congested” as a whitebox model with a deep queue. The executor architecture is unchanged.
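A minimal sketch of the two client-side proxies follows. The class name `BlackboxProxy`, its method names, and the smoothing factor `alpha` are assumptions introduced for illustration; the section specifies only that latency is tracked with an exponential moving average and that congestion is the ratio of recent requests to the RPM limit.

```python
class BlackboxProxy:
    """Client-side signals for one API model (illustrative sketch)."""

    def __init__(self, rpm_limit, alpha=0.2):
        if rpm_limit <= 0:
            raise ValueError("rpm_limit must be positive")
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.rpm_limit = rpm_limit
        self.alpha = alpha            # assumed smoothing factor, not from the paper
        self.ema_latency_s = None     # no estimate until the first observation

    def observe_latency(self, latency_s):
        """Fold one observed end-to-end latency into the moving average."""
        if self.ema_latency_s is None:
            self.ema_latency_s = latency_s
        else:
            self.ema_latency_s = (
                self.alpha * latency_s + (1.0 - self.alpha) * self.ema_latency_s
            )
        return self.ema_latency_s

    def congestion(self, recent_requests):
        """d_queue = recent_requests / RPM_limit."""
        return recent_requests / self.rpm_limit


proxy = BlackboxProxy(rpm_limit=60, alpha=0.5)
proxy.observe_latency(2.0)
proxy.observe_latency(4.0)
print(proxy.ema_latency_s)   # 3.0
print(proxy.congestion(54))  # 0.9, i.e. 90% of the RPM limit
```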

Table [4](https://arxiv.org/html/2606.11440#S5.T4) reports both settings on GSM-Hard. In a *hybrid* pool (three whitebox + two API models), InfraMind prefers the whitebox models at low load and overflows to APIs only as whitebox queues build up; in a *pure-API* pool with heterogeneous RPM limits, it redistributes across providers to keep latency near budget. In both settings, baselines collapse under load while InfraMind sustains accuracy and SLO compliance, confirming that the gains from infrastructure awareness do not require privileged server access; they require only a signal that correlates with current responsiveness.

Table 4: Hybrid (3 whitebox + 2 OpenRouter API, budget 100 s) and pure-API (5 OpenRouter, budget 500 s) pools on GSM-Hard. Acc (%), Lat (s), SLO (%). Bold = best per metric per load per pool.

## 6 Conclusion

Existing multi-agent orchestration makes routing decisions without observing the queues, cache pressure, or latencies of the models it routes to, a blindness that no amount of task-level intelligence can close on shared GPU infrastructure. InfraMind addresses this by making infrastructure awareness first-class at every layer: an infra-aware planner adapts topology and roles to current load, an infra-aware executor redistributes traffic and reasoning depth per step, and a deadline-aware scheduler resolves head-of-line blocking within each model’s queue, all jointly trained as a single hierarchical RL policy under a shared budget constraint. Across five benchmarks, InfraMind delivers up to +7.6 pp accuracy over the strongest baseline at low load with up to 7× lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%. The same principle extends to blackbox API pools via client-side proxies.

#### Limitations and future work\.

InfraMind commits to a topology at planning time and assumes a fixed model pool. Two natural extensions follow from the same formulation: *runtime topology revision*, where the planner re-decides collaboration structure mid-workflow as load or partial outputs evolve, and *dynamic hardware configurations*, where the system adapts to elastic pools (autoscaling, hot model swaps) rather than a static set of replicas. Both relax assumptions of the present work without changing the underlying state and action spaces.

#### Broader impacts\.

Multi-agent LLM systems are reaching production scale across the enterprise. JPMorgan Chase’s internal LLM Suite serves ∼200K employees Wilkinson ([2024](https://arxiv.org/html/2606.11440#bib.bib24)), Bloomberg’s AskB routes queries across internal and open-weight models within Bloomberg’s own infrastructure Kahn ([2026](https://arxiv.org/html/2606.11440#bib.bib26)), and web-scale companies such as Uber standardize on open-source serving stacks like vLLM in production Ling et al. ([2026](https://arxiv.org/html/2606.11440#bib.bib25)). Gartner projects this trajectory will reach 40% of enterprise applications by 2026, up from under 5% in 2025 Gartner ([2025](https://arxiv.org/html/2606.11440#bib.bib27)). These deployments increasingly run on operator-managed GPU pools, where queue contention and cache pressure surface as tail latency that existing multi-agent routers ignore. InfraMind drops into any such deployment to make routing infra-aware. Because the blackbox/hybrid extension also covers API-only and mixed pools, the approach is orthogonal to the choice of serving backend, offering an immediate path to lower tail latency and higher SLO compliance for production multi-agent workflows.

## References

- Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX symposium on operating systems design and implementation (OSDI 24), pp. 117–134. Cited by: [§2](https://arxiv.org/html/2606.11440#S2.SS0.SSS0.Px2.p2.1).
- E. Altman (2021) Constrained markov decision processes. Routledge. Cited by: [§3](https://arxiv.org/html/2606.11440#S3.p1.1).
- J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. (2021) Program synthesis with large language models. arXiv preprint arXiv:2108.07732. Cited by: [Table 6](https://arxiv.org/html/2606.11440#A2.T6.4.3.2.1), [§5.1](https://arxiv.org/html/2606.11440#S5.SS1.SSS0.Px1.p1.1).
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. Cited by: [Table 6](https://arxiv.org/html/2606.11440#A2.T6.4.5.4.1), [§5.1](https://arxiv.org/html/2606.11440#S5.SS1.SSS0.Px1.p1.1).
- L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig (2023) Pal: program-aided language models. In International conference on machine learning, pp. 10764–10799. Cited by: [Table 6](https://arxiv.org/html/2606.11440#A2.T6.4.4.3.1), [§5.1](https://arxiv.org/html/2606.11440#S5.SS1.SSS0.Px1.p1.1).
- Gartner (2025) Gartner Predicts 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026, Up from Less Than 5% in 2025. Note: Gartner Press Release. [https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025](https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025) Accessed: 2026-04-29. Cited by: [§6](https://arxiv.org/html/2606.11440#S6.SS0.SSS0.Px2.p1.3).
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: [§1](https://arxiv.org/html/2606.11440#S1.p1.1), [§5.1](https://arxiv.org/html/2606.11440#S5.SS1.SSS0.Px2.p1.1).
- D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: [§1](https://arxiv.org/html/2606.11440#S1.p1.1), [§5.1](https://arxiv.org/html/2606.11440#S5.SS1.SSS0.Px2.p1.1).
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874. Cited by: [Table 6](https://arxiv.org/html/2606.11440#A2.T6.4.2.1.1), [§5.1](https://arxiv.org/html/2606.11440#S5.SS1.SSS0.Px1.p1.1).
- S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, et al. (2023) MetaGPT: meta programming for a multi-agent collaborative framework. In The twelfth international conference on learning representations. Cited by: [§1](https://arxiv.org/html/2606.11440#S1.p1.1).
- B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, et al. (2024) Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186. Cited by: [§5.1](https://arxiv.org/html/2606.11440#S5.SS1.SSS0.Px2.p1.1).
- A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed (2023) Mistral 7b. ArXiv abs/2310.06825. External Links: [Link](https://api.semanticscholar.org/CorpusID:263830494). Cited by: [§5.1](https://arxiv.org/html/2606.11440#S5.SS1.SSS0.Px2.p1.1).
- J. Kahn (2026) Bloomberg, the OG of financial data firms, has a potent new AI agent. How it built it holds lessons for other companies. Note: Fortune. [https://fortune.com/2026/04/28/bloomberg-askb-ai-agents-lessons-from-bloomberg-cto-shawn-edwards-eye-on-ai/](https://fortune.com/2026/04/28/bloomberg-askb-ai-agents-lessons-from-bloomberg-cto-shawn-edwards-eye-on-ai/) Accessed: 2026-04-29. Cited by: [§6](https://arxiv.org/html/2606.11440#S6.SS0.SSS0.Px2.p1.3).
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pp. 611–626. Cited by: [§1](https://arxiv.org/html/2606.11440#S1.p1.1), [§2](https://arxiv.org/html/2606.11440#S2.SS0.SSS0.Px2.p2.1).
- G. Li, H. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem (2023) Camel: communicative agents for "mind" exploration of large language model society. Advances in neural information processing systems 36, pp. 51991–52008. Cited by: [§1](https://arxiv.org/html/2606.11440#S1.p1.1).
- B. Ling, J. Huang, B. Liu, et al. (2026) Open Source and In-House: How Uber Optimizes LLM Training. Note: Uber Engineering Blog. [https://www.uber.com/us/en/blog/open-source-and-in-house-how-uber-optimizes-llm-training/](https://www.uber.com/us/en/blog/open-source-and-in-house-how-uber-optimizes-llm-training/) Accessed: 2026-04-29. Cited by: [§6](https://arxiv.org/html/2606.11440#S6.SS0.SSS0.Px2.p1.3).
- C. L. Liu and J. W. Layland (1973) Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the ACM (JACM) 20(1), pp. 46–61. Cited by: [§4.3](https://arxiv.org/html/2606.11440#S4.SS3.p1.1).
- I. Ong, A. Almahairi, V. Wu, W. Chiang, T. Wu, J. E. Gonzalez, M. W. Kadous, and I. Stoica (2024) Routellm: learning to route llms with preference data. arXiv preprint arXiv:2406.18665. Cited by: [§2](https://arxiv.org/html/2606.11440#S2.SS0.SSS0.Px2.p1.1).
- N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp. 3982–3992. Cited by: [§3](https://arxiv.org/html/2606.11440#S3.SS0.SSS0.Px1.p1.9).
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: [§4.4](https://arxiv.org/html/2606.11440#S4.SS4.p1.1).
- J. Wang, J. Wang, B. Athiwaratkun, C. Zhang, and J. Zou (2024a) Mixture-of-agents enhances large language model capabilities. arXiv preprint arXiv:2406.04692. Cited by: [§1](https://arxiv.org/html/2606.11440#S1.p1.1), [§2](https://arxiv.org/html/2606.11440#S2.SS0.SSS0.Px1.p2.1), [§5.1](https://arxiv.org/html/2606.11440#S5.SS1.SSS0.Px3.p1.1).
- Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. (2024b) Mmlu-pro: a more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems 37, pp. 95266–95290. Cited by: [Table 6](https://arxiv.org/html/2606.11440#A2.T6.4.6.5.1), [§5.1](https://arxiv.org/html/2606.11440#S5.SS1.SSS0.Px1.p1.1).
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, pp. 24824–24837. Cited by: [§4.2](https://arxiv.org/html/2606.11440#S4.SS2.SSS0.Px3.p1.5).
- L. Wilkinson (2024) JPMorgan Chase to equip 140K workers with generative AI tool. Note: CIO Dive. [https://www.ciodive.com/news/JPMorgan-Chase-LLM-Suite-generative-ai-employee-tool/726772/](https://www.ciodive.com/news/JPMorgan-Chase-LLM-Suite-generative-ai-employee-tool/726772/) Accessed: 2026-04-29. Cited by: [§6](https://arxiv.org/html/2606.11440#S6.SS0.SSS0.Px2.p1.3).
- R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3), pp. 229–256. Cited by: [§4.4](https://arxiv.org/html/2606.11440#S4.SS4.p1.1).
- Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, et al. (2024) Autogen: enabling next-gen llm applications via multi-agent conversations. In First conference on language modeling. Cited by: [§1](https://arxiv.org/html/2606.11440#S1.p1.1).
- J. Xue, Q. Lou, J. Xing, and H. Huang (2026) R2-router: a new paradigm for llm routing with reasoning. arXiv preprint arXiv:2602.02823. Cited by: [§2](https://arxiv.org/html/2606.11440#S2.SS0.SSS0.Px2.p1.1).
- Y. Yue, G. Zhang, B. Liu, G. Wan, K. Wang, D. Cheng, and Y. Qi (2025) Masrouter: learning to route llms for multi-agent systems. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15549–15572. Cited by: [§1](https://arxiv.org/html/2606.11440#S1.p1.1), [§1](https://arxiv.org/html/2606.11440#S1.p2.1), [§2](https://arxiv.org/html/2606.11440#S2.SS0.SSS0.Px1.p2.1), [§4.1](https://arxiv.org/html/2606.11440#S4.SS1.p1.4), [§5.1](https://arxiv.org/html/2606.11440#S5.SS1.SSS0.Px3.p1.1).
- X. Zhang, Z. Huang, E. O. Taga, C. Joe-Wong, S. Oymak, and J. Chen (2024) Efficient contextual llm cascades through budget-constrained policy learning. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37, pp. 91691–91722. External Links: [Document](https://dx.doi.org/10.52202/079017-2910), [Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/a6deba3b2408af45b3f9994c2152b862-Paper-Conference.pdf). Cited by: [§2](https://arxiv.org/html/2606.11440#S2.SS0.SSS0.Px2.p1.1).
- L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, et al. (2024) Sglang: efficient execution of structured language model programs. Advances in neural information processing systems 37, pp. 62557–62583. Cited by: [§1](https://arxiv.org/html/2606.11440#S1.p1.1).
- M. Zhuge, W. Wang, L. Kirsch, F. Faccio, D. Khizbullin, and J. Schmidhuber (2024) Gptswarm: language agents as optimizable graphs. In Forty-first International Conference on Machine Learning. Cited by: [§1](https://arxiv.org/html/2606.11440#S1.p1.1), [§2](https://arxiv.org/html/2606.11440#S2.SS0.SSS0.Px1.p2.1), [§5.1](https://arxiv.org/html/2606.11440#S5.SS1.SSS0.Px3.p1.1).

## Appendix A Training Hyperparameters

Table 5: Training hyperparameters for InfraMind.

## Appendix B Dataset Details

Table 6: Dataset splits used in experiments.

## Appendix C Blackbox Extension Details

The blackbox extension is described in §[5\.4](https://arxiv.org/html/2606.11440#S5.SS4)\. Here we provide additional implementation details\.

#### Sliding\-window RPM tracking\.

We use a 60\-second sliding window per model\. Each outgoing request is timestamped; the congestion signal is the count of timestamps within the window divided by the model’s RPM limit\. When this ratio exceeds 1\.0, excess requests are held in a local queue and dispatched as capacity frees up, creating backpressure visible to the executor\.
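The tracking logic above can be sketched as follows. This is an illustrative reading, not the released implementation: the class name `SlidingWindowLimiter` and its methods are hypothetical, timestamps are passed in explicitly so the example is deterministic (a real client would likely use a monotonic clock), and counting held requests toward the congestion ratio is an assumption made so that the ratio can exceed 1.0 as the text describes.

```python
from collections import deque


class SlidingWindowLimiter:
    """Per-model sliding-window RPM tracker with a local hold queue."""

    def __init__(self, rpm_limit, window_s=60.0):
        if rpm_limit <= 0:
            raise ValueError("rpm_limit must be positive")
        self.rpm_limit = rpm_limit
        self.window_s = window_s
        self._sent = deque()  # timestamps of dispatched requests
        self._held = deque()  # requests waiting for capacity

    def _evict(self, now):
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()

    def submit(self, request, now):
        """Dispatch if under the limit (True); otherwise hold locally (False)."""
        self._evict(now)
        if len(self._sent) < self.rpm_limit:
            self._sent.append(now)
            return True
        self._held.append(request)
        return False

    def release(self, now):
        """Dispatch held requests, oldest first, as capacity frees up."""
        self._evict(now)
        released = []
        while self._held and len(self._sent) < self.rpm_limit:
            released.append(self._held.popleft())
            self._sent.append(now)
        return released

    def congestion(self, now):
        """In-window plus held requests, divided by the RPM limit (assumed)."""
        self._evict(now)
        return (len(self._sent) + len(self._held)) / self.rpm_limit


limiter = SlidingWindowLimiter(rpm_limit=2)
limiter.submit("a", now=0.0)
limiter.submit("b", now=1.0)
limiter.submit("c", now=2.0)         # over the limit, held locally
print(limiter.congestion(now=2.0))   # 1.5: backpressure visible to the executor
print(limiter.release(now=60.0))     # ['c']: the t=0 timestamp has expired
```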

#### Dual Lagrange multipliers\.

For hybrid pools, we maintain $\lambda_{\text{time}}$ and $\lambda_{\text{money}}$ with independent dual updates:

$$
\begin{aligned}
\lambda_{\text{time}} &\leftarrow \text{clip}\big(\lambda_{\text{time}} + \eta_{\lambda} \cdot (\bar{C}_{\text{time}} - 1),\; 0,\; \lambda_{\max}\big) && \text{(8)} \\
\lambda_{\text{money}} &\leftarrow \text{clip}\big(\lambda_{\text{money}} + \eta_{\lambda} \cdot (\bar{C}_{\text{money}} - 1),\; 0,\; \lambda_{\max}\big) && \text{(9)}
\end{aligned}
$$

where $\bar{C}_{\text{time}} = \sum \ell_{k} / \beta_{\text{time}}$ and $\bar{C}_{\text{money}} = \sum c_{k} / \beta_{\text{money}}$. With $\beta_{\text{money}} = \infty$ (no monetary budget) and no blackbox models, $\lambda_{\text{money}}$ stays at zero and the formulation reduces exactly to the whitebox-only system in the main text.
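A minimal sketch of one dual update step follows, applying the same clipped rule to either multiplier. The function name `dual_update` and the default values of `eta` and `lam_max` are placeholders chosen for illustration, not the paper's hyperparameters.

```python
def dual_update(lam, costs, budget, eta=0.05, lam_max=10.0):
    """One clipped dual ascent step: lam <- clip(lam + eta * (C_bar - 1), 0, lam_max).

    costs  : per-step costs (latencies l_k or monetary costs c_k)
    budget : the matching budget (beta_time or beta_money)
    eta, lam_max : placeholder values, not taken from the paper
    """
    c_bar = sum(costs) / budget  # normalized cost; equals 0.0 when budget is infinite
    return min(max(lam + eta * (c_bar - 1.0), 0.0), lam_max)


# Over budget (120 s spent against a 100 s budget): the multiplier grows.
lam_time = dual_update(1.0, costs=[80.0, 40.0], budget=100.0, eta=0.5)

# No monetary budget: the normalized cost is zero, so lambda_money stays at zero.
lam_money = dual_update(0.0, costs=[1.5, 2.0], budget=float("inf"))
print(lam_money)  # 0.0
```

Because the normalized cost is compared against 1, a multiplier rises only while its budget is being exceeded and decays toward zero otherwise, which matches the reduction to the whitebox-only case when there is no monetary budget.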