Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

arXiv cs.CL 06/17/26, 04:00 AM Papers
enterprise agent-routing tool-calling scaling retrieval shortlisting llm
Summary
This paper studies how routing accuracy degrades as the number of agents scales from 10 to 110 in an enterprise productivity assistant, finding F1 drops of 16–23 percentage points. It diagnoses retrieval and confusion gaps and shows that embedding-based shortlisting recovers 10–11pp F1.
arXiv:2606.17519v1 Announce Type: new
# Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery
Source: [https://arxiv.org/html/2606.17519](https://arxiv.org/html/2606.17519)
###### Abstract

Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single\-step routing on a 110\-agent, 584\-tool catalog from a deployed enterprise productivity assistant, evaluating three frontier models from 10 to 110 agents\. Routing F1 on under\-specified requests drops 16–23 percentage points across models\. An oracle analysis decomposes the degradation into a *retrieval* gap \(the model cannot surface the right tool\) and a *confusion* gap \(even with perfect retrieval, the oracle ceiling drops 10pp\)\. Embedding\-based shortlisting recovers \+10–11pp F1 at full scale across all three models and two providers\. A production annotation study \(1,435 human\-labeled utterances, three annotators\) confirms the recovery on real traffic at \+10–17pp despite 10–15pp lower absolute performance\.


Kellen Gillespie and Robyn Perry

Superhuman, Inc\.

\{kellen\.gillespie, robyn\.perry\}@grammarly\.com

## 1 Introduction

LLM\-based assistants increasingly serve as orchestration layers that route user requests to specialized agents for email, project tracking, scheduling, and more\. As organizations add agents to these systems, the routing decision becomes harder and the model must select from a growing catalog of semantically overlapping options\.

This scaling challenge is already driving platform\-level responses\. OpenAI introduced namespace\-based tool search, Anthropic provides BM25 retrieval over tool descriptions, and MCP server registries are growing beyond what flat tool lists can support\. Prior work shows that tool\-calling performance degrades with catalog size Kate et al\. \([2025](https://arxiv.org/html/2606.17519#bib.bib1)\) and that retrieval errors dominate agent failures Mo et al\. \([2025](https://arxiv.org/html/2606.17519#bib.bib2)\), but the mechanism \(what breaks, at what scale, and what levers exist\) remains undercharacterized\.

We present a controlled study of single\-step routing accuracy across 10–110 agents on a production\-sourced catalog from a deployed enterprise productivity assistant\. Our analysis has two parts:

1. Scaling diagnosis \(§[4\.1](https://arxiv.org/html/2606.17519#S4.SS1)\)\. F1 drops 16–23pp, driven primarily by recall\. An oracle analysis decomposes this into a *retrieval* gap \(the model cannot surface the right tool\) and a *confusion* gap \(the oracle ceiling drops from 79% to 69%\)\. The confusion gap is amplified by the enterprise productivity domain, where functionally similar tools \(Gmail/Outlook for email, Improve/Paraphraser/Proofreader for writing, Jira/Asana for project management\) grow naturally with the catalog Qin et al\. \([2024](https://arxiv.org/html/2606.17519#bib.bib14)\); Shi et al\. \([2025](https://arxiv.org/html/2606.17519#bib.bib8)\); Patil et al\. \([2025](https://arxiv.org/html/2606.17519#bib.bib19)\)\.
2. Shortlisting as intervention \(§[5](https://arxiv.org/html/2606.17519#S5), §[5\.3](https://arxiv.org/html/2606.17519#S5.SS3)\)\. Embedding shortlisting recovers \+10–11pp F1 at full scale across three models from two providers\. The recovery holds on 1,435 human\-annotated production utterances \(\+10–17pp\)\. Tool\-level retrieval outperforms all pack\-level approaches \(hierarchical LLM routing, pack\-level embedding, platform tool search\) by 2–4pp\. An error composition analysis \(§[5\.5](https://arxiv.org/html/2606.17519#S5.SS5)\) shows that shortlisting cuts routing misses from 31% to 10% at the cost of a stable 9% shortlister miss rate\.

## 2 Related Work

#### Tool\-count scaling\.

Kate et al\. \([2025](https://arxiv.org/html/2606.17519#bib.bib1)\) stress\-test tool calling at 49–741 tools, reporting 7–85% performance drops\. LiveMCPBench Mo et al\. \([2025](https://arxiv.org/html/2606.17519#bib.bib2)\) finds retrieval errors account for ∼50% of agent failures across 527 tools\. Toolshed Lumer et al\. \([2025b](https://arxiv.org/html/2606.17519#bib.bib5)\), ScaleMCP Lumer et al\. \([2025a](https://arxiv.org/html/2606.17519#bib.bib6)\), MonoScale Shao et al\. \([2026](https://arxiv.org/html/2606.17519#bib.bib12)\), and RAG\-MCP Gan and Sun \([2025](https://arxiv.org/html/2606.17519#bib.bib13)\) document performance collapse with growing tool and agent pools\. We add a precision/recall decomposition and controlled mitigations to this line of work\.

#### Tool retrieval and selection\.

Toolformer Schick et al\. \([2023](https://arxiv.org/html/2606.17519#bib.bib22)\) teaches LMs to insert tool calls during generation\. Retrieve\-then\-route approaches include document retrieval over API catalogs Patil et al\. \([2024](https://arxiv.org/html/2606.17519#bib.bib20)\); Qin et al\. \([2024](https://arxiv.org/html/2606.17519#bib.bib14)\), fine\-tuned retrievers Shi et al\. \([2025](https://arxiv.org/html/2606.17519#bib.bib8)\); Zheng et al\. \([2026](https://arxiv.org/html/2606.17519#bib.bib10)\), reranking and query rewriting Zheng et al\. \([2024](https://arxiv.org/html/2606.17519#bib.bib3)\); Chen et al\. \([2024](https://arxiv.org/html/2606.17519#bib.bib4)\), and token\-level tool encoding Hao et al\. \([2023](https://arxiv.org/html/2606.17519#bib.bib21)\); Wang et al\. \([2025](https://arxiv.org/html/2606.17519#bib.bib9)\)\. Benchmarks and data generation pipelines Liu et al\. \([2024](https://arxiv.org/html/2606.17519#bib.bib24)\); Wu et al\. \([2024](https://arxiv.org/html/2606.17519#bib.bib25)\); Lu et al\. \([2025](https://arxiv.org/html/2606.17519#bib.bib17)\) complement these with evaluation methodology\. ToolScope Liu et al\. \([2026](https://arxiv.org/html/2606.17519#bib.bib11)\) tackles semantic overlap by merging similar tools at the catalog level\. We show that dense embedding retrieval outperforms both platform approaches and a fine\-tuned retriever Shi et al\. \([2025](https://arxiv.org/html/2606.17519#bib.bib8)\) without an LLM call\.

#### Agent system scaling\.

Kim et al\. \([2026](https://arxiv.org/html/2606.17519#bib.bib7)\) study when multi\-agent coordination outperforms single agents, while AgentArch Bogavelli et al\. \([2025](https://arxiv.org/html/2606.17519#bib.bib18)\) varies agent architecture on fixed tool sets\. HuggingGPT Shen et al\. \([2023](https://arxiv.org/html/2606.17519#bib.bib23)\) and AnyTool Du et al\. \([2024](https://arxiv.org/html/2606.17519#bib.bib15)\) route via hierarchical dispatch over models and API tiers respectively\. ScaleCall Osuagwu et al\. \([2025](https://arxiv.org/html/2606.17519#bib.bib16)\) evaluates hybrid retrieval for enterprise tool selection\. We hold architecture constant and vary catalog size, comparing hierarchical dispatch against flat embedding retrieval\.

## 3 Experimental Setup

### 3\.1 Agent Catalog

Our catalog comprises 110 agents and 584 individual tools from a deployed enterprise productivity assistant, ranging from single\-purpose agents \(Weather, Google Translate\) to multi\-tool suites \(Gmail with 15\+ actions, Jira with 20\+\)\. The catalog has natural semantic overlap \(multiple email clients, writing tools, project trackers, and document editors\), creating routing ambiguity uncommon in API\-centric benchmarks Qin et al\. \([2024](https://arxiv.org/html/2606.17519#bib.bib14)\); Shi et al\. \([2025](https://arxiv.org/html/2606.17519#bib.bib8)\); Patil et al\. \([2025](https://arxiv.org/html/2606.17519#bib.bib19)\)\. We evaluate *minimal* \(name and description\) and *rich* \(name, description, examples, semantic tags, enriched descriptions\) metadata variants\.

### 3\.2 Evaluation Data

#### Synthetic queries\.

The synthetic set contains 4,105 queries generated by GPT\-4o across difficulty levels: *explicit* \(names the tool: “send a Gmail”\) and *implicit* \(describes the need without naming it: “email the team about Monday’s deadline”\)\. Each query has a target tool and *also\-valid* labels, enabling dynamic ground truth that adjusts to each sampled tool set\. The routing models \(GPT\-5\.x, Sonnet\) differ from the generation model, though GPT\-4o and GPT\-5\.x share a provider, creating potential distributional affinity\. Cross\-provider replication with Sonnet and the production validation \(§[5\.3](https://arxiv.org/html/2606.17519#S5.SS3)\) on human\-written utterances mitigate this concern\.

#### Production queries\.

The production set contains 1,435 utterances sampled from production traffic of the deployed system, stratified across agents with sufficient traffic \(capped at 100 utterances per agent\) and filtered for quality \(language, safety\)\. Three trained linguists independently annotated each utterance with top\-5 candidates from an LLM\-based shortlister \(GPT\-5\.4\), independent of the embedding retriever evaluated in §[5](https://arxiv.org/html/2606.17519#S5)\. Each annotator rated every candidate as *best option*, *also valid*, or *not applicable*, and could nominate tools outside the shortlisted set; fewer than 1% of gold labels \(13 cases\) required out\-of\-pool nominations\. Gold labels use majority vote \(≥2/3 annotators\)\. Per\-candidate Krippendorff’s $\alpha$ is 0\.68 \(ordinal\), reflecting the catalog’s semantic overlap: annotators agree on which tools are relevant but often disagree on which is *best* among near\-equivalents\. At the item level, 94% of utterances have at least one shared valid tool across annotators\. All production queries are implicit\.
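
The majority-vote rule can be sketched as follows. This is a minimal illustration, not the authors' annotation pipeline: the function name `gold_labels` and the tool names are hypothetical, and pooling *best option* with *also valid* into a single "valid" vote is an assumption, since the text does not say how the two ratings are combined.

```python
from collections import Counter

# Assumption: both positive ratings count as a vote that the tool is valid.
VALID_RATINGS = {"best option", "also valid"}


def gold_labels(ratings_by_annotator, min_votes=2):
    """Return the tools rated valid by at least `min_votes` annotators.

    ratings_by_annotator: one dict per annotator, mapping tool name -> rating.
    """
    votes = Counter()
    for ratings in ratings_by_annotator:
        for tool, rating in ratings.items():
            if rating in VALID_RATINGS:
                votes[tool] += 1
    return {tool for tool, n in votes.items() if n >= min_votes}


# Hypothetical ratings from three annotators for one utterance.
ratings = [
    {"gmail_send": "best option", "outlook_send": "also valid", "jira_create": "not applicable"},
    {"gmail_send": "also valid", "outlook_send": "not applicable"},
    {"gmail_send": "best option", "outlook_send": "also valid", "slack_post": "also valid"},
]
print(sorted(gold_labels(ratings)))
```

With three annotators, `min_votes=2` corresponds to the ≥2/3 threshold; a tool nominated by a single annotator does not enter the gold set.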

#### Implicit queries as primary metric\.

Explicit queries are near\-ceiling \(>90% F1\) at all scales\. We report on implicit queries throughout, as they represent realistic production traffic where the user does not name the target tool\.

### 3\.3 Models and Routing

We evaluate three frontier models from two providers: GPT\-5\.1 and GPT\-5\.4 \(OpenAI, function calling via Responses API\) and Claude Sonnet 4\.5 \(Anthropic, native tool use\)\. All use function\-calling interfaces where the tool catalog is provided as callable function definitions\.

### 3\.4 Scale Points and Sampling

We evaluate at scale points of 10, 20, 30, 40, 60, 80, 100, and 110 agents \(51–584 tools\)\. At each non\-endpoint scale, we sample $k=3$ agent subsets \(folds\) and report fold\-averaged metrics with bootstrap 95% confidence intervals\. The 110\-agent endpoint is the full catalog \(single subset\), so CIs there are query\-level only\. Infrastructure agents \(general assistant, knowledge search, web search\) are always present\.
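
The fold construction described above can be sketched as below. This is an illustration under stated assumptions: `sample_folds`, the agent identifiers, and the seed are hypothetical, and the text does not say whether infrastructure agents count toward the scale or how subsets are drawn; here they count toward it and the remaining slots are filled uniformly at random.

```python
import random


def sample_folds(agents, infrastructure, scale, n_folds=3, seed=0):
    """Sample `n_folds` agent subsets of size `scale`.

    Infrastructure agents are included in every subset; the remaining
    slots are drawn without replacement from the other agents.
    """
    rng = random.Random(seed)
    optional = sorted(a for a in agents if a not in infrastructure)
    n_extra = scale - len(infrastructure)
    if n_extra < 0 or n_extra > len(optional):
        raise ValueError("scale is incompatible with the catalog size")
    return [sorted(infrastructure) + rng.sample(optional, n_extra)
            for _ in range(n_folds)]


# Hypothetical 110-agent catalog with three always-present agents.
infra = {"general_assistant", "knowledge_search", "web_search"}
agents = sorted(infra) + [f"agent_{i}" for i in range(107)]
folds = sample_folds(agents, infra, scale=10)
print([len(fold) for fold in folds])
```

At the 110-agent endpoint every fold would be the full catalog, which is why only a single subset exists there.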

### 3\.5 Metrics

We report multi\-label precision, recall, and F1, computed per\-query against the dynamic valid set \(target tool plus any also\-valid tools present in the current fold\)\.
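
A minimal sketch of the per-query metric, assuming set-based matching of predicted tools against the dynamic valid set. The function names and tool names are hypothetical, and how per-query scores are aggregated (for example, a plain mean over queries) is not specified in the text.

```python
def dynamic_valid_set(target, also_valid, fold_tools):
    """Target tool plus the also-valid tools that are present in this fold."""
    return {target} | (set(also_valid) & set(fold_tools))


def precision_recall_f1(predicted, valid):
    """Multi-label precision, recall, and F1 for a single query."""
    predicted, valid = set(predicted), set(valid)
    hits = len(predicted & valid)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(valid) if valid else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical query: "yahoo_send" is also valid but absent from this fold.
fold_tools = ["gmail_send", "outlook_send", "jira_create"]
valid = dynamic_valid_set("gmail_send", ["outlook_send", "yahoo_send"], fold_tools)
p, r, f1 = precision_recall_f1(["gmail_send", "jira_create"], valid)
print(p, r, f1)
```

Because the valid set is rebuilt for each fold, the same query can have a different recall denominator at different catalog sizes.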

## 4 How Routing Degrades at Scale

### 4\.1 Scaling Curves

Figure [1](https://arxiv.org/html/2606.17519#S4.F1) shows routing F1 on implicit queries as catalog size grows from 51 to 584 tools\. Flat tool\-level routing \(GPT\-5\.4\) drops from 58\.2% to 42\.1%\. The degradation is *recall\-driven*: precision drops moderately \(68% → 60%\) while recall drops more than twice as fast \(55% → 37%\)\. As the catalog grows, the model's errors shift toward missing valid tools rather than selecting invalid ones\. A fixed cohort of 731 queries present at every scale point shows the same degradation \(14\.9pp\), confirming it is driven by catalog growth rather than query composition \(Appendix [D](https://arxiv.org/html/2606.17519#A4)\)\.

#### Two\-component decomposition\.

An oracle shortlister \(all dynamically\-valid tools plus random distractors to fill 20 slots\) establishes an upper bound on what routing can achieve with perfect retrieval\. The oracle drops from 79\.0% to 68\.8%, a 10pp decline even with the correct tool always present\. This reveals two independent degradation sources: \(a\) a *retrieval* gap, the 16pp difference between oracle and practical shortlisting at full scale, recoverable by better retrievers; and \(b\) a *confusion* gap, the oracle’s own 10pp decline\. This decline reflects both incomplete coverage of growing equivalence classes \(valid\-set size increases from 1\.6 to 3\.2 tools per query at scale\) and genuine inter\-tool confusion\. Valid\-set growth affects recall but not precision, so the 8pp precision drop \(68% → 60%\) confirms confusion independent of the coverage effect\. The effective confusion gap in practice is likely larger than 10pp, since the oracle’s random distractors are less confusable than a real retriever’s semantically similar candidates\.
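
The oracle construction and the two gaps can be sketched as follows. The function `oracle_shortlist` and the tool names are hypothetical, and shuffling the final list is an assumption made here so that valid tools do not sit at a fixed position; the text does not state how the oracle orders its slots. The F1 values are the ones reported in this paper (oracle 79.0% and 68.8%, embedding shortlisting 52.5% at full scale).

```python
import random


def oracle_shortlist(valid_tools, catalog, size=20, seed=0):
    """All valid tools plus random distractors, filled up to `size` slots.

    If there are more valid tools than slots, all valid tools are kept.
    """
    rng = random.Random(seed)
    valid = sorted(set(valid_tools) & set(catalog))
    distractor_pool = sorted(set(catalog) - set(valid))
    n_fill = max(0, min(size - len(valid), len(distractor_pool)))
    shortlist = valid + rng.sample(distractor_pool, n_fill)
    rng.shuffle(shortlist)  # assumption: randomize position of valid tools
    return shortlist


catalog = [f"tool_{i}" for i in range(50)]
shortlist = oracle_shortlist(["tool_3", "tool_7"], catalog)

# Gap arithmetic from the reported F1 values (percentage points).
oracle_small, oracle_full, embedding_full = 79.0, 68.8, 52.5
confusion_gap = round(oracle_small - oracle_full, 1)    # oracle's own decline
retrieval_gap = round(oracle_full - embedding_full, 1)  # oracle vs. embedding
print(len(shortlist), confusion_gap, retrieval_gap)
```

The two gaps are measured against different references: the confusion gap compares the oracle with itself across scale, while the retrieval gap compares the oracle with a practical shortlister at the same scale.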

#### Cross\-model reproducibility\.

All three models exhibit the same recall\-driven degradation pattern with an elbow at 40–60 agents \(Appendix [A](https://arxiv.org/html/2606.17519#A1)\)\. GPT\-5\.4 outperforms GPT\-5\.1 by 4–8pp across scale\. Sonnet starts higher \(66\.3% at 51 tools\) but degrades faster \(−20pp vs\. −16pp for GPT\-5\.4\)\. Stronger models provide a roughly constant offset but follow the same curve\.

#### Metadata and tool search\.

Rich metadata \(examples, tags, enriched descriptions\) provides a scale\-increasing benefit at the agent level \(\+1\.2pp at 20, \+4\.2pp at 110\) but near\-zero effect at the tool level \(<1pp\)\. Metadata quality complements architectural changes but does not substitute for them\. OpenAI’s namespace\-based tool search provides partial relief at moderate scale but plateaus at larger catalogs \(§[5\.4](https://arxiv.org/html/2606.17519#S5.SS4)\)\.

![Figure 1](https://arxiv.org/html/2606.17519v1/x1.png)

Figure 1: Routing F1 \(implicit queries\) across catalog scale \(GPT\-5\.4\)\. Shaded bands: fold standard deviation\. Flat routing degrades from 58% to 42%\. Embedding shortlisting \($k=20$\) recovers \+10pp at full scale\. Oracle ceiling \(dashed\) drops 10pp, indicating confusion independent of retrieval quality\. Tool search helps above ∼180 tools but plateaus\.

## 5 Shortlisting Across Scale

We ask whether pre\-filtering the catalog to a small shortlist can close the retrieval gap responsible for most of the scaling loss\.

### 5\.1 Shortlisting Comparison

Figure [1](https://arxiv.org/html/2606.17519#S4.F1) compares four approaches across 51–584 tools\. Embedding shortlisting \(text\-embedding\-3\-large, $k=20$ tools\) outperforms flat routing at every scale point \(\+6–11pp, paired bootstrap $p<0.01$ at all points\) and matches or exceeds platform tool search throughout\. We fix $k=20$ based on a sensitivity analysis \(Appendix [C](https://arxiv.org/html/2606.17519#A3)\) showing that F1 plateaus at $k \geq 10$ and is statistically indistinguishable from $k=20$ to $k=50$\. At 584 tools embedding reaches 52\.5%, while tool search reaches 50\.3% and flat \(no shortlisting\) reaches 42\.1%\.
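
A minimal sketch of tool-level embedding shortlisting, assuming tool and query embeddings have already been computed (for example with text-embedding-3-large) and that candidates are ranked by cosine similarity. The similarity measure, the function names, and the toy two-dimensional vectors are assumptions for illustration; the text does not specify which tool fields are embedded.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_shortlist(query_vec, tool_vecs, k=20):
    """Return the k tool names most similar to the query, best first."""
    ranked = sorted(tool_vecs,
                    key=lambda name: cosine(query_vec, tool_vecs[name]),
                    reverse=True)
    return ranked[:k]


# Toy vectors standing in for real embeddings.
tool_vecs = {
    "gmail_send": [1.0, 0.0],
    "jira_create": [0.0, 1.0],
    "outlook_send": [0.9, 0.1],
}
top = embedding_shortlist([1.0, 0.0], tool_vecs, k=2)
print(top)
```

Only the shortlisted tool definitions would then be passed to the router, in ranked order.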

The 16pp gap between embedding and oracle at full scale reflects retrieval quality, since embedding returns semantically similar distractors that are harder for the router than random ones\. Across approaches, even oracle recall drops 15pp \(Figure [3](https://arxiv.org/html/2606.17519#A1.F3), Appendix [A](https://arxiv.org/html/2606.17519#A1)\)\.

### 5\.2 Cross\-Model Reproducibility

Table [1](https://arxiv.org/html/2606.17519#S5.T1) shows the shortlisting recovery across models and providers\. All three converge to ∼\+10pp at full scale despite different baselines \(GPT\-5\.4: 42\.1%, GPT\-5\.1: 40\.6%, Sonnet: 45\.9%\)\. Sonnet’s smaller delta at 120 tools \(\+2\.7pp\) reflects its stronger baseline, leaving less recall to recover\. As Sonnet’s baseline degrades at larger catalogs, the shortlisting benefit grows, matching the GPT models at full scale\.
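
The confidence intervals on shortlisting deltas are paired bootstrap intervals. A minimal sketch of one common way to compute such an interval follows; resampling at the query level, the percentile method, the number of resamples, and the toy scores are all assumptions, since the exact procedure is not described here.

```python
import random


def paired_bootstrap_ci(baseline, treated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-query difference (treated - baseline).

    Queries are resampled with replacement; each resample keeps the
    baseline and treated scores of a query together (paired design).
    """
    if len(baseline) != len(treated):
        raise ValueError("scores must be paired per query")
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy per-query F1 scores for flat routing and shortlisted routing.
flat_f1 = [0.0, 0.5, 1.0, 0.0, 0.5, 0.0, 1.0, 0.5]
shortlisted_f1 = [0.5, 0.5, 1.0, 0.5, 1.0, 0.0, 1.0, 1.0]
lower, upper = paired_bootstrap_ci(flat_f1, shortlisted_f1)
print(lower, upper)
```

An interval that excludes zero indicates that the delta is unlikely to be explained by query sampling alone.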

Table 1: Cross\-model shortlisting recovery \(F1 on implicit queries\)\. GPT\-5\.4 at 325 tools is fold\-averaged \($k=3$, $\sigma=2.0$\); all other entries are fold 0\. Sonnet on fold 0 only \(API cost\)\. Paired bootstrap 95% CIs on full\-scale deltas: GPT\-5\.4 \[9\.2, 11\.5\], GPT\-5\.1 \[10\.1, 12\.5\], Sonnet \[8\.9, 11\.1\]; all exclude zero\.

### 5\.3 Production Validation

Table [2](https://arxiv.org/html/2606.17519#S5.T2) validates the synthetic findings on 1,435 human\-annotated production utterances\. Despite flat baselines ranging from 28% to 36% F1 at full scale, all three models land within a 2pp band \(44–46%\) once shortlisting is applied\.

Absolute F1 is 10–15pp lower than synthetic, consistent with production traffic being entirely implicit and containing misspellings, fragments, and ambiguous intent absent from synthetic generation\. The synthetic\-production gap is comparable for GPT\-5\.4 \(14pp\) and Sonnet \(14pp\), suggesting it reflects query difficulty rather than distributional affinity between GPT\-4o query generation and GPT routing models\. At the query level, with shortlisting at full scale 80% of synthetic queries and 60% of production queries receive at least one correct tool\.

Table 2: Production validation: F1 on 1,435 human\-annotated production utterances \(implicit only\)\. Scale 325 is fold\-averaged \($k=3$\); endpoints are single\-fold\. Shortlisting recovery replicates the synthetic pattern \(Table [1](https://arxiv.org/html/2606.17519#S5.T1)\): \+10–17pp at full scale, growing with catalog size\. Paired bootstrap 95% CIs on full\-scale deltas exclude zero: \[12\.8, 20\.2\], \[5\.7, 14\.0\], \[9\.9, 16\.8\]\.

### 5\.4 Retrieval Method Comparison

#### Tool\-level beats pack\-level\.

For both embedding models, tool\-level retrieval \($k=20$\) outperforms pack\-level \($k=5$ packs/agents, expanded to ∼30 tools\) by 2–4pp consistently, since pack expansion loads irrelevant sibling tools that dilute the candidate set\. Part of this edge \(∼2pp\) comes from ranked retrieval exploiting the router’s positional bias \(Appendix [F](https://arxiv.org/html/2606.17519#A6)\)\. All pack\-level approaches converge at full scale, with platform tool search \(50\.3%\), pack\-level embedding \(49\.1%\), and hierarchical LLM routing \(47\.9%\) falling within 2\.4pp at 584 tools\. The hierarchical baseline selects only 1\.2 packs on average \(83% hit rate\), making the pack intermediate step a source of unrecoverable error\.
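
The dilution effect of pack expansion can be illustrated with a short sketch. The function `pack_level_candidates`, the pack names, and the similarity scores are hypothetical; the scores stand in for whatever pack-level retriever is used.

```python
def pack_level_candidates(pack_scores, pack_to_tools, k_packs=5):
    """Select the top-scoring packs and expand each into all of its tools."""
    top_packs = sorted(pack_scores, key=pack_scores.get, reverse=True)[:k_packs]
    return [tool for pack in top_packs for tool in pack_to_tools[pack]]


# Hypothetical packs with toy query-to-pack similarity scores.
pack_to_tools = {
    "gmail": ["gmail_send", "gmail_search", "gmail_label", "gmail_archive"],
    "outlook": ["outlook_send", "outlook_search"],
    "jira": ["jira_create", "jira_comment"],
}
pack_scores = {"gmail": 0.9, "outlook": 0.8, "jira": 0.2}

candidates = pack_level_candidates(pack_scores, pack_to_tools, k_packs=2)
print(candidates)
```

For a query about sending email, only two of the six expanded candidates are relevant here; the rest are sibling tools pulled in by their pack, whereas tool-level retrieval ranks each tool on its own.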

#### Large dense retriever outperforms lexical and fine\-tuned alternatives\.

Text\-embedding\-3\-large \(52\.5%\) outperforms ToolRet\-e5 Shi et al\. \([2025](https://arxiv.org/html/2606.17519#bib.bib8)\), a 335M retriever fine\-tuned on 200k tool\-retrieval pairs \(48\.7%\), by 4pp\. The fine\-tuned model was trained on API\-centric data and may face an out\-of\-domain penalty on enterprise productivity tools\. A domain\-matched fine\-tune at comparable scale could close or reverse this gap\. The fine\-tuned retriever runs locally \(∼2ms/query\) vs\. an API call \(∼50ms\)\. BM25 shortlisting \($k=20$\) falls below flat routing at every scale point \(32\.8% vs\. 42\.1% at 584 tools\), as enterprise productivity tools share vocabulary across agents, making lexical matching ineffective\.

![Figure 2](https://arxiv.org/html/2606.17519v1/x2.png)

Figure 2: Error analysis across catalog scale \(GPT\-5\.4\)\. \(a\) Query\-level outcomes: shortlisting converts routing misses \(red\) into correct predictions \(green\) at the cost of bounded shortlister misses \(purple, ∼9%\)\. At full scale, 80% of queries receive at least one correct tool with shortlisting vs\. 69% flat\. \(b\) Prediction\-level accuracy: catch\-all absorption \(red\) drops from 19% to 4% with shortlisting while cross\-cluster confusion is largely unchanged\.

### 5\.5 Error Analysis

#### Query outcomes\.

Figure [2](https://arxiv.org/html/2606.17519#S5.F2)a decomposes query outcomes into five categories\. As the catalog scales, fully correct predictions drop from 39% to 17% in flat routing and from 46% to 22% with shortlisting, while partial matches grow to fill the gap\. Shortlisting cuts routing misses from 31% to 10% at full scale at the cost of 9% shortlister misses\. The trade is favorable: correct\-or\-partial coverage rises from 69% to 80%, and the shortlister miss rate is stable across scale\.
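
One way to assign a query to an outcome category is sketched below. This is a hedged illustration only: the text names fully correct predictions, partial matches, routing misses, and shortlister misses but does not define them or name the fifth category, so the definitions used here (exact set match for "fully correct", no valid tool in the shortlist for "shortlister miss") and the function name `query_outcome` are assumptions.

```python
def query_outcome(predicted, valid, shortlist=None):
    """Classify one query under assumed category definitions."""
    predicted, valid = set(predicted), set(valid)
    # Shortlister miss: no valid tool survived the pre-filter (assumed definition).
    if shortlist is not None and not (valid & set(shortlist)):
        return "shortlister miss"
    # Routing miss: a valid tool was available, but none was predicted.
    if not (predicted & valid):
        return "routing miss"
    # Fully correct: predicted set equals the valid set (assumed definition).
    if predicted == valid:
        return "fully correct"
    return "partial match"


outcome = query_outcome(["gmail_send"], ["gmail_send", "outlook_send"])
print(outcome)
```

Under these definitions a shortlister miss cannot be recovered by the router, which matches the description of shortlister misses as the cost paid for fewer routing misses.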

#### Prediction\-level accuracy\.

At the prediction level \(Figure [2](https://arxiv.org/html/2606.17519#S5.F2)b\), shortlisting improves accuracy from 61% to 75% correct at full scale\. The largest error in flat routing is catch\-all absorption \(general\-purpose agents absorbing specific queries, 19% of predictions\), which shortlisting reduces to 4%\. Cross\-cluster confusion \(routing to a wrong semantic cluster, 14% → 17%\) and same\-cluster overlap \(routing to a similar tool in the correct cluster, ∼3%\) are largely unaffected, and intra\-pack errors are negligible \(<3%\)\. Shortlisting helps by narrowing the candidate set, reducing the opportunity for general\-purpose agents to absorb specific queries\.

#### Positional bias in routing\.

The router exhibits primacy bias: ranked shortlisters provide ∼2pp of free accuracy from retrieval ordering, and oracle upper bounds with fixed ground\-truth position overstate the routing ceiling by ∼4pp \(Appendix [F](https://arxiv.org/html/2606.17519#A6)\)\.

## 6 Discussion

The retrieval gap \(16pp\) is addressable with better retrievers\. Embedding shortlisting at $k=20$ adds ∼50ms \(API\) or ∼2ms \(local\) while shrinking the routing prompt from 584 to 20 tool definitions\. For catalogs beyond ∼30 agents \(∼180 tools\), the accuracy gain outweighs this cost\.

The confusion gap \(at least 10pp\) is not recoverable by retrieval alone\. Shortlisting narrows the candidate set but the router still confuses semantically similar tools\. Promising directions include description deduplication Liu et al\. \([2026](https://arxiv.org/html/2606.17519#bib.bib11)\), dynamic tool exclusion at routing time, and multi\-turn clarification for ambiguous intent\.

These findings are measured on a single enterprise productivity catalog with dense semantic overlap\. Domains with more functionally distinct tools may degrade more slowly, and the ∼30\-agent threshold is catalog\-specific\. The qualitative pattern \(recall\-driven degradation, recoverable by retrieval\) is more likely to transfer than the specific numbers\.

## 7 Conclusion

On a production catalog of 110 agents \(584 tools\), single\-step routing degrades 16–23pp as the catalog scales\. The degradation is recall\-driven and decomposes into a retrieval gap recoverable by better candidate selection and a confusion gap \(at least 10pp on this catalog\) that better retrieval alone cannot close\. Embedding shortlisting recovers \+10–11pp F1 across three models and two providers, with tool\-level retrieval consistently outperforming all pack\-level approaches including platform tool search and hierarchical LLM routing\. A production annotation study \(1,435 human\-labeled utterances, three\-way annotation\) validates the synthetic findings, with shortlisting recovery replicating at \+10–17pp on real traffic despite 10–15pp lower absolute performance\.

## Limitations

Our primary evaluation uses synthetic queries; the production validation \(§[5\.3](https://arxiv.org/html/2606.17519#S5.SS3)\) confirms the pattern with fold\-averaging at intermediate scale and moderate inter\-annotator agreement \($\alpha=0.68$\)\. Production queries are entirely implicit and 10–15pp harder than synthetic, likely reflecting both distributional differences and annotation strictness\.

All experiments use a single enterprise productivity catalog\. Domains with less semantic overlap \(e\.g\., distinct API services\) may degrade more slowly, while domains with more overlap \(e\.g\., multiple coding assistants\) may degrade faster\.

GPT models use OpenAI function calling \(Responses API\); Sonnet uses Anthropic native tool use\. The cross\-model comparison reflects both model and interface differences\. We verified that Sonnet’s native tool use and text\-in\-prompt routing produce similar F1 \(<1pp difference at scale 20\), suggesting the interface effect is small for this model\.

## References

- T\. Bogavelli, R\. Sharma, and H\. Subramani \(2025\) Benchmark of agentic configurations for enterprise tasks\. arXiv preprint arXiv:2509\.10769\.
- Y\. Chen, J\. Yoon, D\. S\. Sachan, Q\. Wang, V\. Cohen\-Addad, M\. Bateni, C\. Lee, and T\. Pfister \(2024\) Re\-invoke: tool invocation rewriting for zero\-shot tool retrieval\. In Findings of EMNLP 2024\.
- Y\. Du, F\. Wei, and H\. Zhang \(2024\) AnyTool: self\-reflective, hierarchical agents for large\-scale API calls\. In Proceedings of the International Conference on Machine Learning \(ICML\)\. [Link](https://arxiv.org/abs/2402.04253)
- T\. Gan and Q\. Sun \(2025\) RAG\-MCP: mitigating prompt bloat in LLM tool selection via retrieval\-augmented generation\. arXiv preprint arXiv:2505\.03275\.
- S\. Hao, T\. Liu, Z\. Wang, and Z\. Hu \(2023\) ToolkenGPT: augmenting frozen language models with massive tools via tool embeddings\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- K\. Kate, T\. Pedapati, K\. Basu, Y\. Rizk, V\. Chenthamarakshan, S\. Chaudhury, M\. Agarwal, and I\. Abdelaziz \(2025\) LongFuncEval: measuring the effectiveness of long context models for function calling\. arXiv preprint arXiv:2505\.10570\.
- Y\. Kim, K\. Gu, C\. Park, C\. Park, S\. Schmidgall, A\. A\. Heydari, Y\. Yan, Z\. Zhang, Y\. Zhuang, Y\. Liu, M\. Malhotra, P\. P\. Liang, H\. W\. Park, Y\. Yang, X\. Xu, Y\. Du, S\. Patel, T\. Althoff, D\. McDuff, and X\. Liu \(2026\) Towards a science of scaling agent systems\. arXiv preprint arXiv:2512\.08296\.
- M\. M\. Liu, D\. Garcia, F\. Parllaku, V\. Upadhyay, S\. F\. A\. Shah, and D\. Roth \(2026\) ToolScope: enhancing LLM agent tool use through tool merging and context\-aware filtering\. In Proceedings of ACL 2026\. [Link](https://arxiv.org/abs/2510.20036)
- Z\. Liu, T\. Hoang, J\. Zhang, M\. Zhu, T\. Lan, S\. Kokane, J\. Tan, W\. Yao, Z\. Liu, Y\. Feng, R\. Murthy, L\. Yang, S\. Savarese, J\. C\. Niebles, H\. Wang, S\. Heinecke, and C\. Xiong \(2024\) APIGen: automated pipeline for generating verifiable and diverse function\-calling datasets\. arXiv preprint arXiv:2406\.18518\.
- X\. Lu, H\. Huang, R\. Meng, Y\. Jin, W\. Zeng, and X\. Shen \(2025\) Tools are under\-documented: simple document expansion boosts tool retrieval\. arXiv preprint arXiv:2510\.22670\.
- E\. Lumer, A\. Gulati, V\. K\. Subbiah, P\. Honaganahalli Basavaraju, and J\. A\. Burke \(2025a\) Auto\-synchronizing MCP tool indexing with agentic\-RAG retrieval\. arXiv preprint arXiv:2505\.06416\.
- E\. Lumer, V\. K\. Subbiah, J\. A\. Burke, P\. Honaganahalli Basavaraju, and A\. Huber \(2025b\) Toolshed: advanced RAG\-tool fusion\. In Proceedings of ICAART 2025\. [Link](https://arxiv.org/abs/2410.14594)
- G\. Mo, W\. Zhong, J\. Chen, Q\. Yuan, X\. Chen, Y\. Lu, H\. Lin, B\. He, X\. Han, and L\. Sun \(2025\) LiveMCPBench: evaluating tool\-using agents with real\-world MCP servers\. arXiv preprint arXiv:2508\.01780\.
- R\. Osuagwu, T\. Cook, M\. Masoud, K\. Ghosal, and R\. Mattivi \(2025\) ScaleCall: agentic tool calling at scale for fintech\. arXiv preprint arXiv:2511\.00074\.
- S\. G\. Patil, H\. Mao, F\. Yan, C\. C\. Ji, V\. Suresh, I\. Stoica, and J\. E\. Gonzalez \(2025\) Berkeley function calling leaderboard\. In Proceedings of the International Conference on Machine Learning \(ICML\)\.
- S\. G\. Patil, T\. Zhang, X\. Wang, and J\. E\. Gonzalez \(2024\) Gorilla: large language model connected with massive APIs\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- Y\. Qin, S\. Liang, Y\. Ye, K\. Zhu, L\. Yan, Y\. Lu, Y\. Lin, X\. Cong, X\. Tang, B\. Qian, S\. Zhao, L\. Hong, R\. Tian, R\. Xie, J\. Zhou, M\. Gerstein, D\. Li, Z\. Liu, and M\. Sun \(2024\) ToolLLM: facilitating large language models to master 16000\+ real\-world APIs\. In Proceedings of the International Conference on Learning Representations \(ICLR\)\. Spotlight\. [Link](https://arxiv.org/abs/2307.16789)
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\) Toolformer: language models can teach themselves to use tools\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- S\. Shao, Y\. Liu, B\. Lu, and W\. Zhang \(2026\) Scaling multi\-agent system with monotonic improvement\. arXiv preprint arXiv:2601\.23219\.
- Y\. Shen, K\. Song, X\. Tan, D\. Li, W\. Lu, and Y\. Zhuang \(2023\) HuggingGPT: solving AI tasks with ChatGPT and its friends in Hugging Face\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- Z\. Shi, Y\. Wang, L\. Yan, P\. Ren, S\. Wang, D\. Yin, and Z\. Ren \(2025\) Retrieval models aren’t tool\-savvy: benchmarking tool retrieval for large language models\. In Findings of ACL 2025\. [Link](https://arxiv.org/abs/2503.01763)
- R\. Wang, X\. Han, L\. Ji, S\. Wang, T\. Baldwin, and H\. Li \(2025\) ToolGen: unified tool retrieval and calling via generation\. In Proceedings of the International Conference on Learning Representations \(ICLR\)\. [Link](https://arxiv.org/abs/2410.03439)
- M\. Wu, T\. Zhu, H\. Han, C\. Tan, X\. Zhang, and W\. Chen \(2024\) Seal\-tools: self\-instruct tool learning dataset for agent tuning and detailed benchmark\. arXiv preprint arXiv:2405\.08355\.
- Y\. Zheng, Z\. Zhang, C\. Ma, Y\. Yu, J\. Zhu, Y\. Wu, T\. Xu, B\. Dong, H\. Zhu, R\. Huang, and G\. Yu \(2026\) SkillRouter: skill routing for LLM agents at scale\. arXiv preprint arXiv:2603\.22455\.
- Y\. Zheng, P\. Li, W\. Liu, Y\. Liu, J\. Luan, and B\. Wang \(2024\) Adaptive and hierarchy\-aware reranking for tool retrieval\. In Proceedings of LREC\-COLING 2024\. [Link](https://arxiv.org/abs/2403.06551)

## Appendix A Full Scaling Results by Model

Figure [3](https://arxiv.org/html/2606.17519#A1.F3) shows precision and recall curves across scale for all four approaches\. Tables [3](https://arxiv.org/html/2606.17519#A1.T3)–[5](https://arxiv.org/html/2606.17519#A1.T5) report precision, recall, and F1 on implicit queries for flat routing and embedding shortlisting \($k=20$ tools, text\-embedding\-3\-large\) at all scale points\.

![Refer to caption](https://arxiv.org/html/2606.17519v1/x3.png) Figure 3: Precision \(left\) and recall \(right\) on implicit queries across scale\. Precision is relatively stable across scale\. Recall drives the degradation, dropping 15pp even for the oracle\. GPT\-5\.4 and GPT\-5\.1 flat results are fold\-averaged at intermediate scales \($k=3$ folds\); all other entries are fold 0\. Scale 110 \(584 tools\) is the full catalog and has a single fold\.

Table 3: GPT\-5\.4 scaling results \(implicit queries\)\. Intermediate scales are fold\-averaged \($k=3$\)\.

Table 4: GPT\-5\.1 scaling results \(implicit queries\)\. Flat results are fold\-averaged at intermediate scales\. Embedding shortlisting on fold 0\.

Table 5: Claude Sonnet 4\.5 scaling results \(implicit queries\)\. Flat intermediate scales are fold\-averaged \($k=3$\); all other entries are fold 0\.

## Appendix B Retriever Comparison

Figure [4](https://arxiv.org/html/2606.17519#A2.F4) isolates the effects of retrieval method and retrieval granularity\. Panel \(a\) compares four tool\-level retrievers\. BM25 and base e5\-large\-v2 both fall below flat routing, demonstrating that low\-quality retrieval is counterproductive\. Panel \(b\) shows that tool\-level retrieval outperforms pack\-level retrieval by 2–4pp for both embedding models\.

![Refer to caption](https://arxiv.org/html/2606.17519v1/x4.png) Figure 4: Retriever comparison across scale \(GPT\-5\.4\)\. \(a\) Tool\-level retrievers ranked by end\-to\-end F1\. General\-purpose text\-embedding\-3\-large outperforms fine\-tuned ToolRet\-e5 and base e5\-large\-v2\. BM25 falls below flat routing \(Figure [1](https://arxiv.org/html/2606.17519#S4.F1)\) at all scales\. \(b\) Same retriever, tool\-level \(solid\) vs\. pack\-level \(dotted\)\. Tool\-level consistently wins by 2–4pp\.

## Appendix C K\-Sensitivity

F1 plateaus at $k \geq 10$ tools and is statistically indistinguishable from $k=20$ to $k=50$ \(bootstrap $p=0.78$ for $k=20$ vs\. $k=35$\)\. We fix $k=20$ for all scaling experiments as the cheapest point on the plateau \(93% embedding recall at 3\.4% of catalog\)\.
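
As a rough illustration of what top-$k$ shortlisting involves, the sketch below ranks tools by cosine similarity to a query embedding and keeps the first `k`. It is a minimal sketch, not the paper's implementation: the function names, the toy two-dimensional vectors, and the tool names are all hypothetical, and a real system would obtain vectors from an embedding model such as text-embedding-3-large. The last helper reproduces the catalog-fraction arithmetic (20 of 584 tools).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def shortlist(query_vec, tool_vecs, k=20):
    """Return the names of the k tools most similar to the query.

    tool_vecs maps a (hypothetical) tool name to its embedding vector.
    """
    ranked = sorted(
        tool_vecs,
        key=lambda name: cosine(query_vec, tool_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


def catalog_fraction(k, catalog_size):
    """Fraction of the catalog shown to the router after shortlisting."""
    return k / catalog_size


# Toy example with made-up tools and 2-d vectors.
tools = {
    "calendar.create": [1.0, 0.0],
    "mail.send": [0.0, 1.0],
    "calendar.update": [0.9, 0.1],
}
top2 = shortlist([1.0, 0.0], tools, k=2)
share = round(100 * catalog_fraction(20, 584), 1)  # percent of the full catalog
```

Only the shortlisted tools would then be placed in the router's prompt, which is why a plateau at small `k` keeps prompt cost low.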

## Appendix D Fixed\-Cohort Validation

To confirm degradation is not driven by harder queries entering the pool at larger scales, we track a fixed cohort of 731 implicit queries whose target tools are present at every scale point \(fold 0\)\. On this cohort, flat\-routing F1 drops from 58\.2% at 51 tools to 43\.3% at 584 tools, a 14\.9pp degradation on *identical queries* as the catalog grows\. Embedding shortlisting partially recovers this \(46\.4% at 584 tools, \+3\.1pp\), confirming both the degradation and the recovery are genuine\.
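
The cohort construction can be sketched as a set intersection: keep only queries whose target tools appear in the catalog at every scale point. The data layout below (a dict of catalogs keyed by tool count, queries as dicts with a `targets` list) is an assumption made for illustration, not the paper's format. The final lines recompute the two deltas from the reported F1 values.

```python
def fixed_cohort(queries, catalogs_by_scale):
    """Keep queries whose targets exist in every catalog.

    queries: list of dicts with a "targets" list (hypothetical schema).
    catalogs_by_scale: dict mapping a scale label to an iterable of tool names.
    """
    common = set.intersection(*(set(c) for c in catalogs_by_scale.values()))
    return [q for q in queries if set(q["targets"]) <= common]


# Toy catalogs: the smaller one is a subset of the larger one.
catalogs = {51: {"a", "b"}, 584: {"a", "b", "c"}}
queries = [
    {"id": 1, "targets": ["a"]},
    {"id": 2, "targets": ["c"]},       # target missing at the small scale
    {"id": 3, "targets": ["a", "b"]},
]
cohort_ids = [q["id"] for q in fixed_cohort(queries, catalogs)]

# Deltas on the reported fixed-cohort F1 values (percentage points).
degradation_pp = round(58.2 - 43.3, 1)
recovery_pp = round(46.4 - 43.3, 1)
```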

## Appendix E Pack\-Level Approaches

#### Platform tool search\.

OpenAI’s namespace\-based tool search exhibits a crossover effect: −6\.3pp at 10 agents, \+7–9pp from 30 agents onward\. At small scale, namespace selection adds unnecessary indirection\. At larger scale, namespace filtering reduces the option space\. The ceiling near 60 agents reflects within\-namespace tool *selection* degrading 14pp, nearly double the namespace retrieval degradation\.

#### Hierarchical LLM routing\.

Two\-stage routing \(LLM selects pack, then routes within pack\) achieves 47\.9% F1 at full scale, below all other shortlisting approaches\. The LLM selects only 1\.2 packs on average \(83% hit rate\) despite a recall\-oriented prompt, making the pack decision a source of unrecoverable error\.
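
To make the "unrecoverable error" point concrete: whenever the first stage omits the correct pack, the second stage cannot select the right tool, so the pack hit rate caps end-to-end recall. The sketch below computes the two first-stage statistics quoted above (hit rate and mean packs selected) on toy data. The function name and inputs are hypothetical, and it assumes one gold pack per query, which may not match the paper's setup.

```python
def pack_stage_stats(selected_packs, gold_packs):
    """Return (hit_rate, mean_packs_selected) for the pack-selection stage.

    selected_packs: list of sets of pack names chosen by the first-stage LLM.
    gold_packs: list with the single correct pack per query (an assumption).
    """
    n = len(selected_packs)
    hits = sum(1 for sel, gold in zip(selected_packs, gold_packs) if gold in sel)
    mean_selected = sum(len(sel) for sel in selected_packs) / n
    return hits / n, mean_selected


# Toy data: the third query's gold pack is never selected, so no
# second-stage routing can recover it.
selected = [{"a"}, {"b", "c"}, {"a"}, {"d"}]
gold = ["a", "c", "b", "d"]
hit_rate, mean_selected = pack_stage_stats(selected, gold)
```

Selecting more packs per query would raise the hit rate at the cost of a larger second-stage candidate set.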

## Appendix F Positional Bias in Routing

Embedding shortlisting returns candidates ranked by similarity, placing the correct tool near the top\. To isolate the filtering benefit from any ranking advantage, we re\-run all scale points with shuffled candidate order \(per\-query deterministic permutation\)\.
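
One way to obtain a per-query deterministic permutation is to seed a random generator from a stable hash of the query identifier, so that re-runs shuffle each query's candidates identically. This is an assumed mechanism for illustration; the paper does not specify how its permutation is derived, and `query_id` and the tool names are hypothetical.

```python
import hashlib
import random


def shuffled_candidates(query_id, candidates):
    """Return a copy of candidates in an order that depends only on query_id.

    A SHA-256 digest is used rather than the built-in hash(), which is
    salted per process for strings and so is not stable across runs.
    """
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    out = list(candidates)  # leave the caller's ranked list untouched
    rng.shuffle(out)
    return out


ranked = ["tool_a", "tool_b", "tool_c", "tool_d", "tool_e"]
first = shuffled_candidates("query-001", ranked)
second = shuffled_candidates("query-001", ranked)
```

Because the candidate set is unchanged and only the order varies, any F1 difference between ranked and shuffled runs can be attributed to position rather than filtering.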

Table 6: Positional bias decomposition \(implicit F1, averaged across all scale points\)\. Embedding ranking contributes ∼2pp; oracle ranking contributes ∼4pp because ground truth is always at rank 1\. All deltas are statistically significant \(paired bootstrap, $p<0.01$\)\.

The router favors candidates presented earlier in the tool list\. For embedding shortlisting, this accounts for ∼2pp of the total gain, stable across scale points\. For oracle shortlisting, the effect is larger \(∼4pp\) because ground truth is always at rank 1, meaning oracle upper bounds overstate the routing ceiling\.