# Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents
Source: [https://arxiv.org/html/2605.14241](https://arxiv.org/html/2605.14241)
Kexin Chu, Dawei Xiang, and Wei Zhang. University of Connecticut. {kexin.chu, ieb24002, wei.13.zhang}@uconn.edu
###### Abstract
Tool-augmented LLM agents increasingly access the same tool type through multiple functionally equivalent providers, such as web-search APIs, retrievers, or LLM backends exposed behind a shared interface. This creates a provider-routing problem under runtime load: the router must choose among providers that differ in latency, reliability, and answer quality, often without gold labels at deployment time. We introduce LQM-ContextRoute, a contextual bandit router for same-function tool providers. Its key design is latency-quality matching: instead of letting low latency offset poor answers in an additive reward, the router ranks providers by expected answer quality per service cycle. It combines this capacity-aware score with query-specific quality estimation and LLM-as-judge feedback, allowing it to adapt online to both load changes and provider-quality differences. On the main web-search load benchmark, LQM-ContextRoute improves F1 by +2.18 pp over SW-UCB while staying on the latency-quality frontier. In a high-heterogeneity StrategyQA setting, LQM-ContextRoute avoids additive-reward collapse and improves accuracy by up to +18 pp over SW-UCB; on heterogeneous retriever pools, it improves NDCG by +2.91 to +3.22 pp over SW-UCB. These results show that same-function tool routing benefits from treating latency as service capacity, especially when runtime pressure and provider-quality heterogeneity coexist.
## 1 Introduction

Tool-augmented LLM agents increasingly call external services through shared tool interfaces. A single `web_search` call, for example, may be served by Tavily, Brave, or DuckDuckGo behind an MCP (Anthropic, [2024](https://arxiv.org/html/2605.14241#bib.bib2)) or function-calling interface. These providers are *functionally equivalent* in the API sense: each accepts a query string and returns ranked snippets. They are not operationally equivalent: latency, rate limits, and answer contribution vary across providers and across load regimes. Recent industry stress tests of MCP servers and search APIs (Digital Applied, [2026](https://arxiv.org/html/2605.14241#bib.bib1); Gupta, [2026](https://arxiv.org/html/2605.14241#bib.bib10)) report runtime failures from timeouts, quota exhaustion, upstream errors, and rate limiting. Our live profile of 270 calls across three search providers shows distinct latency shapes even when all calls succeed (App. [C](https://arxiv.org/html/2605.14241#A3.SS0.SSS0.Px5)). Provider choice affects the agent's runtime behavior; it is not just a deployment detail.

Prior work on tool-augmented agents (Toolformer (Schick et al., [2023](https://arxiv.org/html/2605.14241#bib.bib17)), Gorilla (Patil et al., [2023](https://arxiv.org/html/2605.14241#bib.bib18)), ToolLLM (Qin et al., [2023](https://arxiv.org/html/2605.14241#bib.bib19)), ReAct (Yao et al., [2022](https://arxiv.org/html/2605.14241#bib.bib20))) primarily studies *which tool type* an agent should invoke. We study the gateway decision that follows tool selection: which provider of that tool should receive the call under current load, when online gold labels are unavailable? This provider-level view matches how gateways are configured in practice: routing is usually scoped within a selected interface, so the candidate set is the provider pool for that interface rather than the agent's full tool inventory. Production routers such as Portkey (Portkey, [2026](https://arxiv.org/html/2605.14241#bib.bib5)) and LiteLLM (BerriAI, [2026](https://arxiv.org/html/2605.14241#bib.bib7)) address availability with priority lists, cooldowns, and fallbacks. These mechanisms are useful, but they are largely reactive and quality-blind. In a K=10 priority-router replay, a production-style priority+cooldown policy reaches 0.610 F1 with a hand-tuned priority order but falls to 0.005 F1 under a mismatched order, indicating that availability signals alone do not recover provider quality.

A natural learning alternative is to make a bandit load-aware with an additive latency-quality reward, $r=\alpha u-(1-\alpha)\tilde{\tau}$. Such rewards are simple and reasonable when providers have similar answer quality, but they can fail when quality is heterogeneous: low latency can compensate for a provider that contributes little to the answer. In that case, the router may satisfy an SLA while silently degrading task performance. The central question is not only whether the router adapts to load, but how latency should enter the learning objective.

We propose LQM-ContextRoute, a contextual bandit router for same-function providers. Its main design choice is *latency-quality matching*: instead of treating latency as an additive penalty, the router scores each provider by a renewal-reward rate, $\hat{u}_i/(1+\hat{\tau}_i/L_{\text{ref}})$. This treats latency as service-cycle cost. One call consumes one accounting unit plus a service-time penalty proportional to $\tau/L_{\text{ref}}$, so a fast low-quality provider is not rewarded for being fast alone. The router combines this objective with a LinUCB quality head and LLM-as-judge feedback, allowing it to learn query-specific quality estimates without online gold labels.
#### Contributions.

- We formalize same-function provider routing under runtime load, separating provider choice from the upstream decision of which tool type to invoke.
- We identify additive latency-quality compensation as a concrete failure mode and propose a throughput-style objective that treats latency as service-cycle cost rather than as an additive reward penalty.
- We evaluate LQM-ContextRoute across web-search load traces, high-heterogeneity question answering, retriever pools, and LLM-provider pools. It cuts static-provider latency by 50–67% on the main load benchmark, improves F1 by +2.18 pp over SW-UCB, improves StrategyQA accuracy by up to +18 pp, and adds +2.91 to +3.22 pp NDCG on retriever pools.
## 2 Background and Problem Formulation

#### Task.

A selected function category $C$ is served by a pool $\mathcal{T}_C=\{T_1,\ldots,T_K\}$ of $K$ functionally equivalent tools: each arm implements the same external interface, but may differ in answer quality, latency, rate limits, or backend modality. At round $t$, a query $q_t$ arrives; the router selects $i_t\in[K]$, observes latency $\tau_{i_t}(t)\geq 0$ immediately, and later receives a quality scalar $u_{i_t}(t)\in[0,1]$ from an online evaluator (an LLM judge in our experiments). The goal is to choose providers that deliver high expected quality while respecting an operator-set service-time budget $L_{\text{ref}}$.

#### Why the setting is different.

This provider-routing problem differs from both tool selection and a vanilla contextual bandit. The upstream agent has already selected the tool type; the gateway only chooses among providers of that interface. Feedback is partial because the router observes latency and quality only for the provider it calls. Load is non-stationary, quality labels are not available online, and provider preferences can depend on the query. Providers may be heterogeneous: a low-latency provider can be a poor answer source for some tasks, while a slower provider can be substantially more useful.

#### Why the setting matters.

External stress reports motivate the load side of the problem: Digital Applied ([2026](https://arxiv.org/html/2605.14241#bib.bib1)) find that 62% of MCP-server failures are runtime/load-related and that reliability drops by 18 percentage points as concurrency increases from 1 to 32 requests, while Gupta ([2026](https://arxiv.org/html/2605.14241#bib.bib10)) report rate limiting as the most damaging injected fault. Although our live search profile is smaller than production traffic, it shows the provider-level variation that a gateway sees even without failures: across 270 calls, Tavily has a warm $p_{50}$ of 76 ms with a cold-start tail, Brave has a warm $p_{50}$ of 316 ms and a tighter spread, and DDG remains around 2.5 s (App. [C](https://arxiv.org/html/2605.14241#A3.SS0.SSS0.Px5)). These observations motivate a router that tracks service state while still accounting for answer quality.
#### Why additive rewards are misaligned.

A standard load-aware reward (Poon et al., [2025](https://arxiv.org/html/2605.14241#bib.bib25); Chadderwala, [2025](https://arxiv.org/html/2605.14241#bib.bib29)) is $r=\alpha u-(1-\alpha)\tilde{\tau}$ with $\tilde{\tau}$ normalized to $[0,1]$. The two axes *compensate*: at $\alpha=0.4$, $\tilde{\tau}=0,\,u=0.1$ scores $0.04$, while $\tilde{\tau}=1,\,u=0.65$ scores $-0.34$, so the lower-quality low-latency provider is ranked higher. We instead frame routing as the constrained problem

$$\pi^{\star}=\arg\max_{\pi}\;\mathbb{E}_{i\sim\pi}[u_i]\quad\text{s.t.}\quad\mathbb{E}_{i\sim\pi}[\tau_i]\leq L_{\text{ref}},\qquad(1)$$

where $L_{\text{ref}}$ is an operator-set service-time budget (the SLA threshold in our experiments). This objective makes the trade-off explicit: latency constrains how often useful answers can be delivered, but it should not directly compensate for low answer quality.
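The compensation arithmetic is easy to check directly. The minimal sketch below recomputes the additive scores from the example above and contrasts them with the renewal-rate score introduced in §3.2; the numbers are the ones from the text, and the two helper functions are ours.

```python
# Recompute the compensation example: at alpha = 0.4, the additive reward
# ranks a fast near-useless provider above a slower useful one, while the
# renewal-rate score u / (1 + tau_tilde) does not. tau_tilde = tau / L_ref.

def additive(u, tau_tilde, alpha=0.4):
    return alpha * u - (1 - alpha) * tau_tilde

def renewal_rate(u, tau_tilde):
    return u / (1 + tau_tilde)

fast_weak = (0.10, 0.0)   # u = 0.1, zero normalized latency
slow_good = (0.65, 1.0)   # u = 0.65, latency at the budget

print(additive(*fast_weak), additive(*slow_good))          # 0.04 vs -0.34
print(renewal_rate(*fast_weak), renewal_rate(*slow_good))  # 0.10 vs 0.325
```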
## 3 Method: LQM-ContextRoute

### 3.1 Router Overview

LQM-ContextRoute is a router for a fixed pool of functionally equivalent providers. At round $t$, it receives a query representation $\mathbf{x}_t$, chooses one provider $i_t$, observes its latency $\tau_{i_t}(t)$, receives an online quality score $u_{i_t}(t)$, and updates only the selected provider. The router maintains three quantities per provider: a query-conditional quality estimate, a provider-level latency estimate, and an uncertainty estimate for exploration. The selection rule combines them into one score: predicted quality is divided by current service cost, and an optimism bonus encourages exploration when quality is uncertain.

The separation mirrors the deployment setting. Load is primarily a property of an upstream service at a given time, so latency is tracked at the provider level. Quality is query-conditional because two providers that share an API can still differ by question type, retrieval coverage, or reasoning fit. The renewal-rate term is the objective; the LinUCB head estimates query-specific quality; the EMA tracks current service cost. Table [1](https://arxiv.org/html/2605.14241#S3.T1) summarises these roles.

Table 1: Per-provider state used by the router.
### 3.2 Latency-Quality Matching

The core design choice is the selection objective. Additive rewards $\alpha u-(1-\alpha)\tilde{\tau}$ allow a low-quality provider to be rescued by low latency. We instead score a provider by the renewal-reward rate

$$V_i=\frac{u_i}{1+\tau_i/L_{\text{ref}}},\qquad(2)$$

where $L_{\text{ref}}$ is the operator-set latency budget. This score implements the constrained goal in Eq. [1](https://arxiv.org/html/2605.14241#S2.E1): quality is the reward, while latency consumes service capacity. Under the linear-cycle calibration, a call takes one accounting unit plus a service-time penalty $\tau_i/L_{\text{ref}}$. The renewal-reward theorem gives the long-run reward rate $u_i/(1+\tau_i/L_{\text{ref}})$ (Ross, [1996](https://arxiv.org/html/2605.14241#bib.bib39), Thm. 3.6.1). Appendix [B](https://arxiv.org/html/2605.14241#A2) gives the formal characterisation and discusses alternative calibrations. The important property for routing is non-compensation: if $u_i$ is near zero, the score is near zero regardless of how fast the provider is.

This objective also gives $L_{\text{ref}}$ a direct operational meaning: it is the latency scale at which one call consumes roughly one extra unit of service time. A smaller value makes the router more latency-sensitive, while a larger value approaches quality-first routing. In the experiments we set $L_{\text{ref}}$ to the SLA threshold and check sensitivity in Appendix [C](https://arxiv.org/html/2605.14241#A3.SS0.SSS0.Px3).
### 3.3 Contextual Quality and Latency Estimates

LQM-ContextRoute estimates quality with one LinUCB head per provider and estimates latency with an exponential moving average. For provider $i$, let $\hat{u}_i(\mathbf{x}_t)=\mathbf{x}_t^{\top}A_i^{-1}\mathbf{b}_i$ be the contextual quality estimate and $\hat{\tau}_i$ the current latency estimate. After selecting provider $i_t$, the router updates $\hat{\tau}_{i_t}$ using the observed latency and updates the selected LinUCB head with the pair $(\mathbf{x}_t,u_{i_t}(t))$. Unselected providers are not updated, matching the bandit feedback available in the gateway.
### 3.4 Selection Rule

The router selects the provider with the largest score

$$s_i(t,\mathbf{x}_t)=\frac{\hat{u}_i(\mathbf{x}_t)}{1+\hat{\tau}_i/L_{\text{ref}}}+\frac{\alpha_{\text{ucb}}\sqrt{\mathbf{x}_t^{\top}A_i^{-1}\mathbf{x}_t}}{1+\lambda\,\Delta_i(\mathbf{x}_t)},\qquad(3)$$

where $\Delta_i(\mathbf{x}_t)=\max(0,\max_j\hat{u}_j(\mathbf{x}_t)-\hat{u}_i(\mathbf{x}_t))$. The first term is the latency-quality matching score applied to contextual quality; the second is the LinUCB exploration bonus, deflated for arms whose estimated quality is already dominated. We use $\lambda=1$ in all experiments. Sliding-window Sherman–Morrison updates maintain $A_i^{-1}$ efficiently; Algorithm [1](https://arxiv.org/html/2605.14241#alg1) summarises the online loop.

The deflation term is deliberately asymmetric. It does not punish a slow provider merely for being slow, because a slow provider may be the only one with useful evidence for a query. It only reduces exploration when the current contextual model already estimates that another provider has higher quality. This is the mechanism that distinguishes LQM-ContextRoute from a pure load balancer: the router may spend latency when the expected quality return is large, while reducing exploration of arms whose predicted quality is already dominated for the current query distribution.
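The paper names sliding-window Sherman–Morrison maintenance of $A_i^{-1}$ without spelling it out; the sketch below shows one standard realisation under our assumptions: a fixed window size, a rank-one update when an observation enters the window, and a rank-one downdate when the oldest observation leaves. The class and parameter names are ours.

```python
import numpy as np
from collections import deque

# One LinUCB head with sliding-window Sherman-Morrison maintenance of
# A^{-1}, where A = lam_r * I + sum of x x^T over the current window.
# Window size W is an illustrative choice, not the paper's setting.

class SlidingLinUCBHead:
    def __init__(self, dim, lam_r=1.0, window=100):
        self.A_inv = np.eye(dim) / lam_r
        self.b = np.zeros(dim)
        self.window = deque()
        self.W = window

    def _rank1(self, x, sign):
        # Sherman-Morrison: (A +/- x x^T)^{-1} from A^{-1}, in O(d^2).
        Ax = self.A_inv @ x
        denom = 1.0 + sign * float(x @ Ax)
        self.A_inv -= sign * np.outer(Ax, Ax) / denom

    def update(self, x, u):
        self.window.append((x, u))
        self._rank1(x, +1.0)          # add the new observation
        self.b += u * x
        if len(self.window) > self.W: # evict the oldest observation
            x_old, u_old = self.window.popleft()
            self._rank1(x_old, -1.0)  # downdate: remove x_old x_old^T
            self.b -= u_old * x_old

    def predict(self, x):
        u_hat = float(x @ self.A_inv @ self.b)      # contextual quality
        width = float(np.sqrt(x @ self.A_inv @ x))  # exploration width
        return u_hat, width
```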
Algorithm 1: LQM-ContextRoute online routing

1: init $A_i\leftarrow\lambda_r I$, $\mathbf{b}_i\leftarrow\mathbf{0}$, $\hat{\tau}_i\leftarrow\tau_0$ for each provider $i$
2: for $t=1,\ldots,T$ do
3:   Embed query $q_t$ as $\mathbf{x}_t$
4:   for each active provider $i$ do
5:     $\hat{u}_i\leftarrow\mathbf{x}_t^{\top}A_i^{-1}\mathbf{b}_i$
6:     $c_i\leftarrow\alpha_{\text{ucb}}\sqrt{\mathbf{x}_t^{\top}A_i^{-1}\mathbf{x}_t}$
7:     $\Delta_i\leftarrow\max(0,\max_j\hat{u}_j-\hat{u}_i)$
8:     $s_i\leftarrow\hat{u}_i/(1+\hat{\tau}_i/L_{\text{ref}})+c_i/(1+\lambda\Delta_i)$
9:   end for
10:  Select $i_t\leftarrow\arg\max_i s_i$ and call provider $i_t$
11:  Observe latency $\tau_t$ and quality score $u_t$
12:  Update $\hat{\tau}_{i_t}$ by EMA; update $A_{i_t}^{-1},\mathbf{b}_{i_t}$ with $(\mathbf{x}_t,u_t)$
13: end for
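Algorithm 1 is compact enough to render directly in code. The sketch below is our Python rendering under stated assumptions: `embed`, the provider callables, and `judge` stand in for the deployment components, and the hyperparameter defaults are placeholders rather than the paper's tuned values.

```python
import numpy as np

# A compact rendering of Algorithm 1. `providers` are callables returning
# (answer, latency_ms); `judge` maps (query, answer) -> score in [0, 1].
# Defaults below (alpha_ucb, ema, tau0) are placeholders, not tuned values.

def lqm_context_route(queries, providers, embed, judge, dim,
                      L_ref=1500.0, alpha_ucb=0.5, lam=1.0,
                      lam_r=1.0, ema=0.2, tau0=500.0):
    K = len(providers)
    A_inv = [np.eye(dim) / lam_r for _ in range(K)]
    b = [np.zeros(dim) for _ in range(K)]
    tau_hat = [tau0] * K
    for q in queries:
        x = embed(q)
        u_hat = [float(x @ A_inv[i] @ b[i]) for i in range(K)]
        best = max(u_hat)
        scores = []
        for i in range(K):
            c = alpha_ucb * np.sqrt(float(x @ A_inv[i] @ x))
            delta = max(0.0, best - u_hat[i])           # dominance gap
            scores.append(u_hat[i] / (1 + tau_hat[i] / L_ref)  # Eq. (2)
                          + c / (1 + lam * delta))             # deflated bonus
        i_t = int(np.argmax(scores))
        answer, tau = providers[i_t](q)
        u = judge(q, answer)
        tau_hat[i_t] = (1 - ema) * tau_hat[i_t] + ema * tau    # EMA latency
        Ax = A_inv[i_t] @ x                          # Sherman-Morrison update
        A_inv[i_t] -= np.outer(Ax, Ax) / (1.0 + float(x @ Ax))
        b[i_t] += u * x
        yield i_t, answer
```

Only the selected provider's state is touched in the update step, matching the bandit feedback described in §3.3.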
### 3.5 Online Update and Feedback

After a provider returns, the latency observation updates $\hat{\tau}_{i_t}$ and an online evaluator supplies $u_{i_t}(t)\in[0,1]$ from the chosen provider's response. In our experiments this evaluator is a local Llama-3.2-1B judge (mean 64 ms, P95 115 ms). The judge is a deployment-time reward proxy, not an oracle gold label; we validate it against a 30B reference in App. [C](https://arxiv.org/html/2605.14241#A3.SS0.SSS0.Px1).

The proposed router throughout the paper is LQM-ContextRoute. We use one ablation, LQM-only, only to isolate the scoring objective from the contextual quality model. It removes $\mathbf{x}_t$ and replaces the LinUCB quality head with EMA quality:

$$s_i(t)=\frac{\hat{u}_i}{1+\hat{\tau}_i/L_{\text{ref}}}+\frac{\beta\sqrt{\log(t)/(n_{w,i}+1)}}{1+\lambda\,\Delta_i}.\qquad(4)$$

When the pool is quality-homogeneous, Eq. [4](https://arxiv.org/html/2605.14241#S3.E4) behaves like load-greedy SW-UCB; when one provider dominates on both axes, the quality term drives selection. These limits are tested in §[4.4](https://arxiv.org/html/2605.14241#S4.SS4) and §[4.3](https://arxiv.org/html/2605.14241#S4.SS3).
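Eq. (4) needs only per-provider scalars, which is what makes LQM-only a clean scoring ablation. A minimal sketch, with $n_{w,i}$ as the in-window pull count; the default $\beta$ is a placeholder of ours.

```python
import math

# Eq. (4): the LQM-only ablation score. EMA quality replaces the LinUCB
# head, and a count-based sliding-window UCB bonus replaces the contextual
# one. beta's default is illustrative, not the paper's tuned value.

def lqm_only_score(u_ema, tau_ema, n_window, t,
                   L_ref=1500.0, beta=1.0, lam=1.0, delta=0.0):
    rate = u_ema / (1 + tau_ema / L_ref)                    # renewal-rate term
    bonus = beta * math.sqrt(math.log(t) / (n_window + 1))  # window UCB bonus
    return rate + bonus / (1 + lam * delta)
```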
### 3.6 Properties

Theorem [2](https://arxiv.org/html/2605.14241#Thmtheorem2) (App. [B](https://arxiv.org/html/2605.14241#A2)) states the main score-level distinction: for every $\alpha\in(0,1)$, there are 2-provider instances where an additive composite ranks the lower-quality fast provider above the higher-quality provider, while Eq. [2](https://arxiv.org/html/2605.14241#S3.E2) ranks them correctly. This theorem explains the failure mode targeted by the experiments; bandit-level performance still depends on exploration.

For the unmodulated optimistic bonus ($\lambda=0$), the plug-in throughput estimator concentrates at the EMA rate with Lipschitz constants $c_u\leq 1$ and $c_{\tau}\leq L_{\text{ref}}^{-1}$, yielding

$$R_T\leq\sum_{i:\Delta_i^V>0}\frac{C(1+L_{\text{ref}}^{-1})^2\log T}{\Delta_i^V}+o(\log T).$$

Appendix [B](https://arxiv.org/html/2605.14241#A2) gives the proof sketch and states the additional assumptions; the $\lambda>0$ deflation used in Eq. [3](https://arxiv.org/html/2605.14241#S3.E3) is treated as an empirical exploration control rather than as a separate optimism guarantee.
## 4 Evaluation

The method in §[3](https://arxiv.org/html/2605.14241#S3) makes two linked claims: same-function provider routing should improve service behavior under runtime load, and latency should affect selection through a capacity-aware quality rate rather than an additive penalty. A complete evaluation must also identify the boundary cases where this design should be neutral. We test three claims:

- **C1: Load-aware routing.** Under non-stationary provider load, routing should reduce latency and SLA misses without sacrificing answer quality.
- **C2: Latency-quality matching.** When fast providers are not also high-quality providers, LQM-ContextRoute should avoid the additive-compensation failure predicted by Theorem [2](https://arxiv.org/html/2605.14241#Thmtheorem2).
- **C3: Transfer and limits.** The latency-quality objective should transfer beyond web search when provider quality differs, and it should remain stable rather than report spurious gains when one provider is already dominant.

Web search is the main testbed because it naturally exposes multiple interchangeable providers with different latency and quality profiles. We use it as the deployment-shaped benchmark, then place it in a broader same-function provider matrix with QA, LLM-backend, and retriever pools that vary in provider-quality heterogeneity. We report both task quality and latency: a latency-only router can satisfy the SLA while returning poor answers, whereas a quality-only router can exceed the service budget.
### 4.1 Setup

#### Main benchmark.

The main benchmark uses 200 questions (100 HotpotQA + 100 TriviaQA) and three web-search providers. For each (query, provider) pair, we execute the provider, run a Qwen3-30B reader over the returned evidence, and score the resulting answer with F1. The resulting provider-response score table gives every router the same observed answer surface, so differences come from routing decisions rather than reader noise.
#### Runtime load.

Latency is drawn online from calibrated service profiles anchored by vendor medians and our live Tavily/Brave/DDG measurements. The profiles define warm, loaded, and overloaded states with measured-scale latency distributions and failure probabilities. We then evaluate four non-stationary load patterns, which are reused in the web-search and retriever tables. In *step*, the preferred provider becomes overloaded at $T/2$ and recovers at $3T/4$. In *rotation*, the overloaded provider changes every $T/K$ rounds. In *spike*, a provider experiences random 15–40-round overload bursts. In *gradual*, provider latency follows a smooth sinusoidal drift. We treat SLA as the fraction of calls below 1.5 seconds, matching $L_{\text{ref}}$.
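For concreteness, the four schedules can be expressed as maps from the round index to the set of overloaded providers (or a drift level for *gradual*). The sketch below follows the timings in the text; the burst count, seed, and exact drift formula are illustrative assumptions, and the mapping from a load state to a sampled latency comes from the calibrated profiles, which are not reproduced here.

```python
import math
import random

# The four non-stationary schedules described above, as round -> overloaded
# provider set (step / rotation / spike) or per-provider drift level in
# [0, 1] (gradual). Timings follow the text; other details are illustrative.

def step_overloaded(t, T, preferred=0):
    return {preferred} if T // 2 <= t < 3 * T // 4 else set()

def rotation_overloaded(t, T, K):
    return {(t // max(1, T // K)) % K}

def make_spike_schedule(T, K, n_bursts=4, seed=0):
    rng = random.Random(seed)
    bursts = [(rng.randrange(T), rng.randint(15, 40), rng.randrange(K))
              for _ in range(n_bursts)]  # random 15-40 round overload bursts
    return lambda t: {k for (s, w, k) in bursts if s <= t < s + w}

def gradual_levels(t, T, K):
    # Smooth sinusoidal latency drift, phase-shifted per provider.
    return [0.5 + 0.5 * math.sin(2 * math.pi * (t / T + k / K))
            for k in range(K)]
```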
#### Baselines.

The baselines cover three prior-work styles. First, Static-T1 (always Tavily), Round-Robin, and Reactive-Cooldown represent production routing policies: fixed priority, load spreading, and LiteLLM-style priority+cooldown failover. Second, EMA-Greedy, SW-UCB (Garivier and Moulines, [2011](https://arxiv.org/html/2605.14241#bib.bib35)), and ContextRoute represent academic online routing policies with load-aware or contextual selection. Third, additive contextual variants in Table [7](https://arxiv.org/html/2605.14241#S4.T7) isolate the common prior objective that combines answer quality and latency as an additive reward. We compare these policies with LQM-ContextRoute, and include LQM-only only as a scoring ablation ($L_{\text{ref}}=1500$ ms, $\lambda=1$). Latency-Oracle and F1-Oracle bracket the attainable latency and quality. Unless otherwise noted, each router runs for $T=200$ rounds over 50 seeds. Recent LLM endpoint routers such as RouteLLM, PILOT, OmniRouter, BEST-Route, and ParetoBandit optimize model selection, dollar cost, or global budget allocation rather than load-aware routing among providers behind one selected tool interface. We therefore compare to their relevant policy classes: contextual selection, additive cost/reward objectives, production fallback, and oracle frontiers.
#### Unified provider-pool matrix.

Table [2](https://arxiv.org/html/2605.14241#S4.T2) gives the cross-pool view. It uses the same router family, four non-stationary load patterns, and the same quality/latency/SLA metrics across six same-function provider pools. The first row is the deployment-shaped search benchmark; the next three rows vary provider-quality heterogeneity in QA and LLM-backend pools; the last two rows test whether the latency-quality matching objective transfers to retriever providers. All rows report LQM-ContextRoute, the final contextual router.

Table 2: Unified same-function provider-pool results. Deltas are quality points relative to SW-UCB and ContextRoute.
### 4.2 Q1: Does routing help under load?

The WebSearch row of Table [2](https://arxiv.org/html/2605.14241#S4.T2) is the deployment-shaped load benchmark: three web-search providers behind one logical interface, measured provider-response scores, and non-stationary load. Table [3](https://arxiv.org/html/2605.14241#S4.T3) breaks this result down by load pattern. The Tavily/Brave quality gap is only about 1 pp, so this panel is primarily a latency/SLA test rather than the largest expected quality gain. All learning routers address the load problem: against Static-T1, they cut mean latency by 50–67% on *step*, lift SLA from 89% to ≥98%, and stay within 1 pp F1 of the latency oracle. Within the learning routers, LQM-ContextRoute adds +0.85 pp F1 over ContextRoute averaged over patterns and +2.18 pp over SW-UCB, although individual load patterns still expose the expected latency-quality trade-off.

Figure [1](https://arxiv.org/html/2605.14241#S4.F1) makes the latency-quality trade-off explicit. LQM-ContextRoute sits on the empirical Pareto frontier of the learning routers: relative to SW-UCB, it spends latency for +2.18 pp F1; relative to ContextRoute, it gains +0.85 pp F1 with slightly lower mean latency. A sweep over $L_{\text{ref}}\in\{750,1500,3000\}$ changes LQM-ContextRoute by at most 0.29 pp F1 and keeps SLA in 97.9–98.0% (App. [C](https://arxiv.org/html/2605.14241#A3.SS0.SSS0.Px3)).

The *spike* pattern, omitted from Table [3](https://arxiv.org/html/2605.14241#S4.T3) for space, follows the same reading: LQM-ContextRoute obtains F1 0.591 at 387 ms mean latency, compared with ContextRoute at 0.583 and 415 ms, SW-UCB at 0.579 and 263 ms, and Static-T1 at 0.572 and 580 ms. These results support the load-aware routing claim: learning routers reduce latency under non-stationary service states, and LQM-ContextRoute remains on the quality/latency frontier among learnable policies. The comparison with static, cooldown, and load-aware bandit baselines also shows why latency alone is insufficient.

Table 3: Main web-search benchmark by load pattern. *spike* is discussed in the text.

Figure 1: Latency-quality Pareto view of the main benchmark. Marker size encodes SLA@1.5s.
### 4.3 Q2: When does latency-quality matching matter?

The latency-quality matching score should matter most when the same interface hides large provider-quality differences. In the main benchmark, a slice by per-query cross-provider F1 gap tests this mechanism: among 106 high-gap questions, LQM-ContextRoute gains +4.42 pp over SW-UCB and +1.33 pp over ContextRoute while cutting DDG traffic by 14.4 and 8.0 pp, respectively. On 85 zero-gap questions, where no router can exploit provider-quality heterogeneity, F1 deltas shrink to within ±0.5 pp.

Table 4: Main-benchmark heterogeneity slice for LQM-ContextRoute vs. SW-UCB. F1 and DDG-share deltas are percentage points.

The DDG-share reduction is diagnostic: additive or load-greedy policies can overuse the fast but weak provider under stress, while LQM-ContextRoute reduces that traffic most in the high-gap slice.
StrategyQA (Geva et al., [2021](https://arxiv.org/html/2605.14241#bib.bib33)) requires implicit multi-hop boolean reasoning (e.g. "Was the director of *E.T.* born before the founding of Pixar?"). We use it to instantiate the high-heterogeneity regime in Theorem [2](https://arxiv.org/html/2605.14241#Thmtheorem2): three arms share the same question-answering interface but have sharply different quality/latency trade-offs. The pool contains Qwen-7B (Qwen2.5-7B-Instruct, direct parametric answer; accuracy = 0.643), Llama-1B (Llama-3.2-1B-Instruct; accuracy = 0.520), and DDG+Judge (DuckDuckGo snippets judged by Qwen-7B; accuracy = 0.123). DDG is functionally valid but poorly matched to this task because snippets rarely contain the multi-step inference chain StrategyQA needs. This creates a 52 pp quality gap, large enough to make additive compensation visible in aggregate.

Table [5](https://arxiv.org/html/2605.14241#S4.T5) reports the result. SW-UCB accuracy drops to 0.39–0.46 (−19 to −26 pp below Static-T0) because DDG's low $\tilde{\tau}$ compensates for its near-zero $u$ and draws traffic toward the lower-quality provider. ContextRoute mitigates this partly (0.521–0.580). LQM-ContextRoute changes the ranking by treating latency as service-cycle cost and reaches 0.574–0.631 (+18 pp over SW-UCB). The LQM-only ablation recovers 0.547–0.588, which confirms that the scoring objective accounts for most of the gain before context is added. Static-T0 has the highest pure-quality ceiling because it always calls Qwen-7B, but it pays for that choice with 1238 ms mean latency and only 70% SLA. The largest gain appears in the case predicted by the theory: provider quality and latency are misaligned.

Table 5: StrategyQA high-heterogeneity routing.

The no-DDG companions in Table [6](https://arxiv.org/html/2605.14241#S4.T6) replace the weak DDG arm with plausible providers. With a smaller 12.3 pp gap, the gains shrink but remain positive on the three-provider companion.

Table [7](https://arxiv.org/html/2605.14241#S4.T7) asks which part of the method produces the gain. On web search, adding the latency-quality matching score to the contextual router improves F1 from 0.573 to 0.581 without increasing mean latency; on StrategyQA, where heterogeneity is stronger, the full router reaches the highest accuracy. The renewal score addresses latency-quality compensation, while context decides when quality differences are query-specific. An additive $\alpha$ sweep further checks that this is not merely a mis-tuned baseline. Across $\alpha\in\{0.1,0.3,0.5,0.7,0.9\}$, the strongest additive contextual router remains below LQM-ContextRoute on both the main benchmark (0.5806 vs. 0.5834 F1) and StrategyQA (0.5894 vs. 0.5978 accuracy), while SW-UCB remains farther behind (App. [C](https://arxiv.org/html/2605.14241#A3.SS0.SSS0.Px4)).

Table 6: StrategyQA companion pools with no weak DDG arm.

Table 7: Component ablation; the full router is LQM-ContextRoute.
### 4.4 Q3: Does the objective transfer beyond web search?

Q3 asks whether the objective transfers beyond web search and then checks stable-provider limits. We run two retriever-pool checks with provider-quality differences large enough for the objective to matter. On SciFact (Wadden et al., [2020](https://arxiv.org/html/2605.14241#bib.bib31)), the providers are gte-base (NDCG 0.762), bge-base (0.740), and minilm (0.645); per-retriever NDCG@10 on 300 test claims is from BEIR (Thakur et al., [2021](https://arxiv.org/html/2605.14241#bib.bib32)). On NFCorpus, providers retrieve from the same corpus using a slower TF-IDF fusion index (0.302), a word TF-IDF index (0.295), and a fast title-only index (0.215). Table [8](https://arxiv.org/html/2605.14241#S4.T8) gives the SciFact breakdown, and Table [2](https://arxiv.org/html/2605.14241#S4.T2) reports both retriever aggregates. The retriever results support the same mechanism as Q2: LQM-ContextRoute spends more latency on the stronger retrievers and improves NDCG by +2.91 pp on SciFact and +3.22 pp on NFCorpus over SW-UCB, while remaining above 95% SLA. The LQM-only ablation is slightly stronger on these retriever pools, which suggests that generic text embeddings are not always the best context for retrieval routing; the proposed router still avoids the latency-greedy failure. On NFCorpus, the always-fusion static provider has the highest raw NDCG but much worse service behavior (1104 ms mean latency, 73% SLA), so the routing result should be read as a quality-latency trade-off rather than a pure-quality win.

Table 8: SciFact heterogeneous retriever pool.

We also run a small live ReAct loop with Tavily/Brave/DDG ($n=30$, max_steps $=3$, full fallback). This is a stable-provider regime: Tavily is dominant, and LQM-ContextRoute ties Static-T1 (0.3690 vs. 0.3681, $p=0.33$). Bootstrapping the main benchmark from 270 live-measured calls gives LQM-ContextRoute +4.03 pp over Static-T1 (App. [C](https://arxiv.org/html/2605.14241#A3.SS0.SSS0.Px5)); this uses measured tail profiles while reusing the measured provider-response score table. We treat the live ReAct result as a stability check, not as the main evidence for LQM-ContextRoute's advantage.
Outage replay tests a failure regime (Table [9](https://arxiv.org/html/2605.14241#S4.T9)). With full agent fallback, the static strong provider remains hard to beat; without fallback, the same static policy collapses and the routers recover roughly twice its F1. This result is a failure-regime stress test rather than a head-to-head comparison of routing objectives.

Table 9: ReAct outage replay: aggregate F1 under Brave outage.

Table [10](https://arxiv.org/html/2605.14241#S4.T10) checks the last deployment ingredient: online reward. On K=2 (Brave + DDG), ContextRoute trained on a 1B LLM-judge reward recovers 88–98% of the F1 obtained when training on a 30B-oracle reward, evaluated against the same held-out 30B reference. This experiment isolates reward-proxy sufficiency; it is not a comparison of routing objectives. These checks define the operating regime: LQM-ContextRoute helps most when load pressure and quality heterogeneity appear together. A non-search LLM-provider ladder in App. [D](https://arxiv.org/html/2605.14241#A4.SS0.SSS0.Px2) adds one more same-function pool: when the same answer interface hides slower but much higher-quality LLM providers, LQM-ContextRoute improves F1 by +8.89 pp over SW-UCB and +4.64 pp over ContextRoute by spending latency where the quality return is large.

Table 10: Closed-loop reward-proxy validation for ContextRoute.
## 5 Related Work

#### Tool and model routing.

Tool-augmented agent work studies which tool an agent should call and how tool outputs enter reasoning (Schick et al., [2023](https://arxiv.org/html/2605.14241#bib.bib17); Patil et al., [2023](https://arxiv.org/html/2605.14241#bib.bib18); Qin et al., [2023](https://arxiv.org/html/2605.14241#bib.bib19); Yao et al., [2022](https://arxiv.org/html/2605.14241#bib.bib20)). Our setting begins after that decision: a gateway routes one selected tool type among functionally equivalent providers behind the same interface, a pattern made common by MCP-style tool interfaces (Anthropic, [2024](https://arxiv.org/html/2605.14241#bib.bib2)). LLM-routing benchmarks and systems (Hu et al., [2024](https://arxiv.org/html/2605.14241#bib.bib13); Li et al., [2026](https://arxiv.org/html/2605.14241#bib.bib14); Ong et al., [2024](https://arxiv.org/html/2605.14241#bib.bib15); Jitkrittum et al., [2025](https://arxiv.org/html/2605.14241#bib.bib16)) choose among model endpoints, often trading answer quality against inference cost. Recent online and budgeted routers use bandit feedback, test-time compute, or constrained allocation (Zhang et al., [2024](https://arxiv.org/html/2605.14241#bib.bib22); Anonymous, [2025b](https://arxiv.org/html/2605.14241#bib.bib24), [a](https://arxiv.org/html/2605.14241#bib.bib23); Poon et al., [2025](https://arxiv.org/html/2605.14241#bib.bib25); Ding et al., [2025](https://arxiv.org/html/2605.14241#bib.bib26); Mei et al., [2025](https://arxiv.org/html/2605.14241#bib.bib27); Taberner-Miller, [2026](https://arxiv.org/html/2605.14241#bib.bib28)). We route tool providers instead of LLM endpoints, and the key runtime signal is service load rather than only model cost. Production gateways such as LiteLLM and Portkey provide priority, cooldown, and fallback policies (BerriAI, [2026](https://arxiv.org/html/2605.14241#bib.bib7); Portkey, [2026](https://arxiv.org/html/2605.14241#bib.bib5), [2025](https://arxiv.org/html/2605.14241#bib.bib6)); LQM-ContextRoute adds a quality-aware selection layer for same-function pools.

#### Cost-aware bandits.

Non-stationary and contextual bandits adapt to changing rewards and query context (Garivier and Moulines, [2011](https://arxiv.org/html/2605.14241#bib.bib35); Kocsis and Szepesvári, [2006](https://arxiv.org/html/2605.14241#bib.bib36); Russac et al., [2019](https://arxiv.org/html/2605.14241#bib.bib37); Li et al., [2010](https://arxiv.org/html/2605.14241#bib.bib34)). Bandits with knapsacks, constrained bandits, and CVaR bandits model resource limits or risk (Badanidiyuru et al., [2018](https://arxiv.org/html/2605.14241#bib.bib41); Agrawal and Devanur, [2019](https://arxiv.org/html/2605.14241#bib.bib42); Combes et al., [2015](https://arxiv.org/html/2605.14241#bib.bib43); Tamkin et al., [2019](https://arxiv.org/html/2605.14241#bib.bib44)). These tools do not by themselves address the failure studied here: additive utility-cost rewards can rank a fast poor provider above a slow useful one. Our renewal score $u/(1+\tau/L_{\text{ref}})$ follows the reward-per-service-cycle view (Ross, [1996](https://arxiv.org/html/2605.14241#bib.bib39)) and targets that compensation failure directly.
## 6 Conclusion

We study routing among functionally equivalent tool providers for LLM agents under runtime load. Additive latency-quality rewards can prefer a fast low-quality provider over a slower useful one (Thm. [2](https://arxiv.org/html/2605.14241#Thmtheorem2)). LQM-ContextRoute replaces reward mixing with the renewal-reward score $u_i/(1+\tau_i/L_{\text{ref}})$ inside a contextual bandit with online judge feedback. Across search, QA, and retriever pools, the main gains appear when runtime pressure and provider-quality heterogeneity coexist: LQM-ContextRoute cuts latency and SLA misses, avoids additive-reward collapse, and remains neutral in stable-dominant pools.
## 7 Limitations

The main benchmark records provider-response scores before the routing runs and then varies runtime load, so it isolates routing decisions rather than full production traffic. Non-search evidence comes from retriever pools and an LLM-provider ladder, not broad live traffic across many tool categories. The method also assumes a same-function provider pool and a reliable online quality proxy; it does not address upstream tool selection, multi-tool planning, or provider schema mismatch. The results should be read as evidence about when provider routing matters rather than as an estimate of average production lift.
## References
- J. Aczél (1966). Lectures on functional equations and their applications. Mathematics in Science and Engineering, Vol. 19, Academic Press.
- S. Agrawal and N. R. Devanur (2019). Bandits with concave rewards and convex knapsacks. In ACM EC.
- Anonymous (2025a). Learning to route LLMs from bandit feedback (BaRP). arXiv preprint arXiv:2510.07429.
- Anonymous (2025b). PILOT: adaptive LLM routing under budget constraints. arXiv preprint arXiv:2508.21141. EMNLP 2025 Findings.
- Anthropic (2024). Model Context Protocol Specification. https://modelcontextprotocol.io/. Accessed: 2026-05-02.
- A. Badanidiyuru, R. Kleinberg, and A. Slivkins (2018). Bandits with knapsacks. Journal of the ACM 65(3).
- BerriAI (2026). LiteLLM Routing and Load Balancing Documentation. https://docs.litellm.ai/docs/routing. Accessed: 2026-05-02.
- Chadderwala (2025). Optimizing life sciences agents in real-time using reinforcement learning. arXiv preprint arXiv:2512.03065.
- R. Combes, C. Jiang, and R. Srikant (2015). Bandits with budgets: regret lower bounds and optimal algorithms. In ACM SIGMETRICS.
- Digital Applied (2026). MCP Server Reliability: A 100-Server Stress Test Study. https://www.digitalapplied.com/blog/mcp-server-reliability-100-server-stress-test-study. Accessed: 2026-05-02.
- D. Ding, A. Mallick, S. Zhang, C. Wang, D. Madrigal, M. H. Garcia, M. Xia, L. Lakshmanan, Q. Wu, and V. Ruehle (2025). BEST-Route: adaptive LLM routing with test-time optimal compute. arXiv preprint arXiv:2506.22716.
- A. Garivier and E. Moulines (2011). On upper-confidence bound policies for switching bandit problems. In ALT.
- M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, and J. Berant (2021). Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics 9, pp. 346–361.
- A. Gupta (2026). ReliabilityBench: evaluating LLM agent reliability under production-like stress conditions. arXiv preprint arXiv:2601.06112.
- Q. J. Hu et al. (2024). RouterBench: a benchmark for multi-LLM routing system. arXiv preprint arXiv:2403.12031.
- W. Jitkrittum, H. Narasimhan, A. S. Rawat, J. Juneja, C. Wang, Z. Wang, A. Go, C. Lee, P. Shenoy, R. Panigrahy, A. K. Menon, and S. Kumar (2025). Universal model routing for efficient LLM inference. arXiv preprint arXiv:2502.08773.
- L. Kocsis and C. Szepesvári (2006). Discounted UCB. In ECML.
- H. Li, Y. Zhang, Z. Guo, C. Wang, S. Tang, Q. Zhang, Y. Chen, B. Qi, P. Ye, L. Bai, Z. Wang, and S. Hu (2026). LLMRouterBench: a massive benchmark and unified framework for LLM routing. arXiv preprint arXiv:2601.07206.
- L. Li, W. Chu, J. Langford, and R. E. Schapire (2010). A contextual-bandit approach to personalized news article recommendation. In WWW.
- K. Mei, W. Xu, M. Guo, S. Lin, and Y. Zhang (2025). OmniRouter: budget and performance controllable multi-LLM routing. SIGKDD Explorations.
- I. Ong et al. (2024). RouteLLM: learning to route LLMs with preference data. arXiv preprint arXiv:2406.18665.
- S. G. Patil et al. (2023). Gorilla: large language model connected with massive APIs. arXiv preprint arXiv:2305.15334.
- M. Poon, X. Dai, X. Liu, F. Kong, J. C. S. Lui, and J. Zuo (2025). Online multi-LLM selection via contextual bandits under unstructured context evolution. arXiv preprint arXiv:2506.17670.
- Portkey (2025). Failover Routing Strategies for LLMs in Production. https://portkey.ai/blog/failover-routing-strategies-for-llms-in-production/. Accessed: 2026-05-02.
- Portkey (2026). The Most Reliable AI Gateway for Production Systems. https://portkey.ai/blog/the-most-reliable-ai-gateway-for-production-systems/. Accessed: 2026-05-02.
- Y. Qin et al. (2023). ToolLLM: facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789.
- S. M. Ross (1996). Stochastic Processes. 2nd edition, Wiley. Renewal-reward theorem (Theorem 3.6.1).
- Y. Russac, C. Vernade, and O. Cappé (2019). Weighted linear bandits for non-stationary environments. In NeurIPS.
- T. Schick et al. (2023). Toolformer: language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.
- A. Taberner-Miller (2026). ParetoBandit: budget-paced adaptive routing for non-stationary LLM serving. arXiv preprint arXiv:2604.00136.
- A. Tamkin, D. Sahoo, et al. (2019). Distributional reinforcement learning for risk-sensitive policies. In NeurIPS.
- N. Thakur et al. (2021). BEIR: a heterogeneous benchmark for zero-shot evaluation of information retrieval models. In NeurIPS Datasets and Benchmarks.
- D. Wadden et al. (2020). Fact or fiction: verifying scientific claims. In EMNLP.
- S. Yao et al. (2022). ReAct: synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
- B. Zhang, G. Wang, Q. Chen, and A. van den Hengel (2024). How do we select right LLM for each query? MAR: multi-armed recommender for online LLM selection. OpenReview preprint. https://openreview.net/forum?id=AfA3qNY0Fq.
## Appendix A Positioning vs. prior LLM-routing work

Table 11: Positioning by the problem axes used in this paper.

#### Constrained, knapsack, and Pareto routing.

Constrained bandits and bandits with knapsacks model resource limits as budgets over a sequence of decisions (Badanidiyuru et al., [2018](https://arxiv.org/html/2605.14241#bib.bib41); Agrawal and Devanur, [2019](https://arxiv.org/html/2605.14241#bib.bib42); Combes et al., [2015](https://arxiv.org/html/2605.14241#bib.bib43)). This view is related to our use of a service-time budget, but it answers a different question. A knapsack formulation controls aggregate resource consumption; it does not specify how latency and answer quality should be ranked for two same-function providers on a single request. An additive Lagrangian or scalarized reward can still prefer a fast low-quality arm when latency compensates for low utility, which is the score-level failure in Theorem [2](https://arxiv.org/html/2605.14241#Thmtheorem2).

Pareto routing methods and budget-paced LLM routers expose a quality-cost frontier or allocate traffic under a global budget (Mei et al., [2025](https://arxiv.org/html/2605.14241#bib.bib27); Taberner-Miller, [2026](https://arxiv.org/html/2605.14241#bib.bib28)). LQM-ContextRoute instead provides a single online selection rule for a gateway that has already selected the tool type and must choose a provider under current load. The renewal-rate score $u/(1+\tau/L_{\text{ref}})$ can be viewed as one operating point on the frontier, with $L_{\text{ref}}$ setting the latency scale. This makes the method compatible with budgeted routing layers while keeping the provider-level objective focused on avoiding latency-quality compensation.
## Appendix B Proofs and regret sketch

We give the deferred details for the transformed-reward regret sketch and the two score-level theorems. In the stationary $K$-armed setting, let $u_i\in[0,1]$ be sub-Gaussian with proxy $\sigma_u^2$ and $\tau_i\geq 0$ be sub-Gaussian with proxy $\sigma_{\tau}^2$. Define $V_i^{\star}=u_i^{\star}/(1+\tau_i^{\star}/L_{\text{ref}})$ and $\Delta_i^V=V_{i^{\star}}^{\star}-V_i^{\star}$.

#### Concentration and regret.

For the plug-in estimator $\hat{V}_i=\hat{u}_i/(1+\hat{\tau}_i/L_{\text{ref}})$, a first-order Taylor expansion gives, with probability at least $1-\delta$,

$$|\hat{V}_i-V_i^{\star}|\leq(1+L_{\text{ref}}^{-1})\sqrt{\frac{2\sigma^2\log(1/\delta)}{n_i}}+O(1/n_i),$$

where $\sigma^2=\max(\sigma_u^2,\sigma_{\tau}^2)$. The unmodulated optimistic version of the scoring ablation (Eq. [4](https://arxiv.org/html/2605.14241#S3.E4) with $\lambda=0$) satisfies

$$R_T\leq\sum_{i:\Delta_i^V>0}\frac{C(1+L_{\text{ref}}^{-1})^2\sigma^2\log T}{\Delta_i^V}+o(\log T).$$

Sliding-window concentration gives the usual non-stationary additive term $O(\sqrt{T\log T\cdot V_T})$. The implemented $\lambda>0$ quality modulation is not covered by this optimism guarantee: because $\Delta_i$ is estimated online, it can suppress exploration after an early quality-estimation error. We use it as an empirical deflation heuristic for clearly dominated arms.
#### Characterisation under linear-cycle calibration.

Let $S(u,\tau)=T(u,z)$ with $z=\tau/L_{\text{ref}}$. A1–A4 from §[3.2](https://arxiv.org/html/2605.14241#S3.SS2) imply non-compensation, scale invariance, monotonicity, and boundedness, but they do not uniquely specify a score; for example $u(1+z)^{-\alpha}$ satisfies them for any $\alpha>0$. A5 is the additional structural calibration: a call is a renewal cycle of duration $1+z$, so the reward rate is $u/(1+z)$ by the renewal-reward theorem (Ross, [1996](https://arxiv.org/html/2605.14241#bib.bib39), Thm. 3.6.1).

###### Theorem 1 (Characterisation under linear-cycle calibration).

Under A1–A4 plus A5, the selected member of the admissible score family (up to monotone transformation of the quality scale) is $V_i=u_i/(1+\tau_i/L_{\text{ref}})$.

Formally, A5 makes the latency ratio $T(u,z_1)/T(u,z_2)$ depend only on $(1+z_2)/(1+z_1)$; standard multiplicative functional-equation arguments yield $T(u,z)=u(1+z)^{-\alpha}$ (Aczél, [1966](https://arxiv.org/html/2605.14241#bib.bib40), Ch. 3), and the linear renewal cycle fixes $\alpha=1$.
#### Separation from additive composites.

###### Theorem 2 (Renewal-reward vs. additive separation).

Fix $\alpha\in(0,1)$ and let $r_i^{\mathrm{add}}(\alpha)=\alpha u_i-(1-\alpha)\tilde{\tau}_i$ with $\tilde{\tau}=\min\{\tau/L_{\text{ref}},1\}$. There exists a two-arm instance where the additive score chooses the lower-quality faster arm while $V_i=u_i/(1+\tilde{\tau}_i)$ chooses the higher-quality arm whenever

$$\frac{u_2\Delta_{\tilde{\tau}}}{1+\tilde{\tau}_2}<\Delta_u<\frac{1-\alpha}{\alpha}\Delta_{\tilde{\tau}}.$$

The interval is non-empty when $u_2/(1+\tilde{\tau}_2)<(1-\alpha)/\alpha$.

Proof sketch: in a two-arm pool with $\tilde{\tau}_1>\tilde{\tau}_2$ and $\Delta_u=u_1-u_2$, the additive score chooses the faster lower-quality arm 2 iff $\Delta_u<\frac{1-\alpha}{\alpha}\Delta_{\tilde{\tau}}$. The renewal score chooses arm 1 iff $\Delta_u>u_2\Delta_{\tilde{\tau}}/(1+\tilde{\tau}_2)$. Hence the two rankings disagree exactly when

$$\frac{u_2\Delta_{\tilde{\tau}}}{1+\tilde{\tau}_2}<\Delta_u<\frac{1-\alpha}{\alpha}\Delta_{\tilde{\tau}},$$

which is non-empty whenever $u_2/(1+\tilde{\tau}_2)<(1-\alpha)/\alpha$.
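The disagreement interval can be verified numerically. The check below instantiates $\alpha=0.4$, $u_2=0.1$, $\tilde{\tau}_1=1$, $\tilde{\tau}_2=0$ (an instance of our choosing that satisfies the non-emptiness condition) and confirms that the two scores disagree for a $\Delta_u$ inside the interval.

```python
# Numeric check of Theorem 2 on one instance. The non-emptiness condition
# u2/(1+tau2) < (1-alpha)/alpha holds here (0.1 < 1.5), so the interval
# (0.1, 1.5) is non-empty and any Delta_u inside it separates the scores.

alpha, u2, t1, t2 = 0.4, 0.10, 1.0, 0.0   # arm 1 slow/useful, arm 2 fast/weak
d_tau = t1 - t2
lo = u2 * d_tau / (1 + t2)              # renewal prefers arm 1 above this
hi = (1 - alpha) / alpha * d_tau        # additive prefers arm 2 below this
assert lo < hi                          # interval is non-empty

u1 = u2 + (lo + hi) / 2                 # pick Delta_u = 0.8 inside the interval
add = lambda u, t: alpha * u - (1 - alpha) * t
ren = lambda u, t: u / (1 + t)
assert add(u2, t2) > add(u1, t1)        # additive picks the fast weak arm
assert ren(u1, t1) > ren(u2, t2)        # renewal picks the useful arm
```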
## Appendix C Deployment and robustness details

#### LLM-as-judge closed-loop validation.

On K=2 (Brave + DDG, the two providers with stored raw responses), ContextRoute trained on the 1B LLM-judge reward recovers 88–98% of the F1 obtained when training on the 30B-oracle reward (Table [10](https://arxiv.org/html/2605.14241#S4.T10)). Static-T1 (always Brave) wins absolute F1 in this response set because Brave is unusually strong; the point of this experiment is judge sufficiency, not router dominance.
#### ReAct/outage replay.

For each of 200 HotpotQA dev questions and each of {Brave, DDG}, we ran the Qwen-2.5-7B agent with up to 3 ReAct steps and stored the final answer, F1, and tool latency. The forced-search replay in §[4.4](https://arxiv.org/html/2605.14241#S4.SS4) uses $n=100$ questions, 30 seeds, and a sustained Brave outage at $T/2$. Table [9](https://arxiv.org/html/2605.14241#S4.T9) reports the aggregate F1. Routing helps most when the agent cannot recover through its own retry budget; with full fallback, the static strong arm remains difficult to beat.
#### Reference-latency sensitivity.

The main experiments use $L_{\text{ref}}=1500$ ms, matching the SLA threshold. To check whether this choice drives the result, we rerun only LQM-ContextRoute on the main HotpotQA+TriviaQA benchmark with $L_{\text{ref}}\in\{750,1500,3000\}$ ms. Mean F1 varies by at most 0.29 pp and SLA remains within 97.9–98.0% (Table [12](https://arxiv.org/html/2605.14241#A3.T12)).

Table 12: LQM-ContextRoute sensitivity to the reference latency on the main benchmark.
#### Additive-reward alpha sweep.

To test additive tuning, we sweep the quality weight $\alpha\in\{0.1,0.3,0.5,0.7,0.9\}$ for both SW-UCB and the contextual additive router. Table [13](https://arxiv.org/html/2605.14241#A3.T13) reports the strongest additive variants by quality. On the main benchmark, the best additive contextual router nearly matches LQM-ContextRoute but has slightly lower F1 and higher latency. On StrategyQA, where the low-latency provider has much lower quality, increasing $\alpha$ helps but does not remove the additive-compensation failure.

Table 13: Best additive baselines under an $\alpha$ sweep.
#### Live latency profile and real-bootstrap setup.

We profile Tavily, Brave, and DuckDuckGo on 30 HotpotQA queries under three conditions: idle (sequential), moderate (200 ms inter-call sleep), and stressed (3 concurrent threads with 800 ms inter-batch sleep). All 270 calls succeeded. The real-bootstrap replay samples latencies with replacement from the corresponding empirical pool and maps the load bins to idle, moderate, or stressed without interpolation.
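Under our reading of this setup, the replay's latency sampling is a per-(provider, bin) bootstrap. The sketch below shows the mechanism only; the pool contents shown are illustrative stand-ins for the measured 270-call data.

```python
import random

# Real-bootstrap latency sampling as described: each (provider, load-bin)
# pair has an empirical latency pool from the live calls, and replay
# latencies are drawn with replacement from the matching pool, with no
# interpolation. Pool values below are illustrative, not measured data.

POOLS = {  # provider -> {load bin -> measured latencies (ms)}
    "tavily": {"idle": [76, 81, 90], "moderate": [95, 110], "stressed": [140, 600]},
    # ... brave and ddg pools filled from the live profile ...
}
BIN = {"warm": "idle", "loaded": "moderate", "overloaded": "stressed"}

def sample_latency(provider, load_state, rng=random.Random(0)):
    # Map the simulated load state to a measured bin, then resample.
    return rng.choice(POOLS[provider][BIN[load_state]])
```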
Table 14: Live latency profile.

On HotpotQA K=3 ($T=200$, 50 seeds, 4 load patterns), the real-bootstrap run gives LQM-ContextRoute F1 0.5823 (Static-T1 0.5420, $\Delta=+4.03$ pp), SW-UCB 0.5809, ContextRoute 0.5739, Latency-Oracle 0.5418, and F1-Oracle 0.6750. This replay combines measured latency samples with the same provider-response scores used in the main benchmark, strengthening the load evidence without claiming a full production deployment.
## Appendix D Additional diagnostics

#### StrategyQA no-DDG companions.

Table [6](https://arxiv.org/html/2605.14241#S4.T6) reports mean accuracy over four load patterns. Per-pattern results follow the same trend: on Pool-synth-3B, LQM-ContextRoute wins all four patterns over ContextRoute; on Pool-noddg, it wins all four over ContextRoute and three of four over SW-UCB.

#### LLM-provider ladder.

As a non-search same-function pool, we route among 1B, 7B, and 30B pre-run answer providers over the same 200 questions. This setting reverses the web-search trade-off: the higher-quality providers are slower. LQM-ContextRoute improves F1 by +8.89 pp over SW-UCB and +4.64 pp over ContextRoute, with +526 and +90 ms higher mean latency, respectively. We treat this as a non-search same-function check showing that LQM-ContextRoute spends latency budget when the quality return is large.
#### Reactive-Cooldown.

We also compare against a LiteLLM-style priority+cooldown router. On K=10 *step*, its F1 varies by 0.605 across operator priority orderings; ContextRoute matches the priority-tuned setting without manual ordering. This supports the claim that quality-aware online routing complements reactive fallback mechanisms.