# Switchcraft: AI Model Router for Agentic Tool Calling
Source: [https://arxiv.org/html/2605.07112](https://arxiv.org/html/2605.07112)
Sharad Agarwal (Microsoft Research, sagarwal@microsoft.com), Pooria Namyar (Microsoft Research, namyarpooria@microsoft.com), Alec Wolman (Microsoft Research, alecw@microsoft.com), Rahul Ambavat (Microsoft, raambava@microsoft.com), Ankur Gupta (Microsoft, angup@microsoft.com), Qizheng Zhang (Stanford, qizhengz@stanford.edu)

###### Abstract

Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets. Model routing can mitigate this, but existing routers are designed for chat completion rather than tool use. We present Switchcraft, the first (to the best of our knowledge) model router optimized for agentic tool calling. Switchcraft operates inline, selecting the lowest-cost model subject to correctness. We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget. Switchcraft achieves 82.9% accuracy—matching or exceeding the best individual model—while reducing inference cost by 84%, saving over $3,600 per million queries. We find that larger models do not consistently outperform smaller ones on tool-use tasks, and that nominally cheaper models can incur higher total cost due to token-intensive reasoning. Our work enables cost-aware agentic AI deployment without sacrificing correctness.

## 1 Introduction

Agentic AI systems – where LLMs invoke external tools and APIs to perform complex tasks – are emerging as powerful solutions across multiple domains [[1](https://arxiv.org/html/2605.07112#bib.bib1), [15](https://arxiv.org/html/2605.07112#bib.bib15)], but they incur substantial costs. In interviews with a dozen enterprise teams, we found that selecting models for tool-assisted queries remains challenging: teams default to large, widely used models, leading to significant overspending—a problem for customers who overpay and for service providers facing over-subscribed GPU infrastructure even when smaller models would suffice.

Model routing addresses this by dynamically selecting an appropriate LLM per query, routing simple requests to smaller models and reserving large models for harder ones. Prior work shows substantial savings; e.g., an AWS service reports ~43.9% cost reduction through intelligent routing [[9](https://arxiv.org/html/2605.07112#bib.bib9)]. However, existing routers [[22](https://arxiv.org/html/2605.07112#bib.bib22), [7](https://arxiv.org/html/2605.07112#bib.bib7), [17](https://arxiv.org/html/2605.07112#bib.bib17)] target chat completion and do not address agentic tool calling, where models must generate precise tool invocations across multiple steps—requirements that differ fundamentally from chat tasks and make existing routers ill-suited.

To support agentic tool calling, we curate a diverse set of benchmarks for agentic workloads and use them to fine-tune a specialized routing model. We aggregate five public function-calling benchmarks—Berkeley Function Calling Leaderboard (BFCL v3) [[24](https://arxiv.org/html/2605.07112#bib.bib24)], AWS ConFETTI [[3](https://arxiv.org/html/2605.07112#bib.bib3)], Salesforce xLAM-60K [[36](https://arxiv.org/html/2605.07112#bib.bib36)], Glaive Function Calling [[11](https://arxiv.org/html/2605.07112#bib.bib11)], and Hermes [[20](https://arxiv.org/html/2605.07112#bib.bib20)]—covering tool use, multi-turn interaction, and parallel API calls. We build a unified evaluation framework that normalizes these datasets, executes each query across candidate LLMs, and uses an abstract syntax tree (AST)-based checker to score tool invocations robustly. Using these signals, we fine-tune Switchcraft, a DistilBERT-based router (66M parameters) that takes an agent's query and context as input and predicts the most suitable model for execution.

Key Results. Switchcraft achieves 82.9% accuracy—matching or exceeding the best individual model (GPT-5.3-chat at 82.3%)—while reducing inference cost by 84%, saving over $3,600 per million queries. Relative to an oracle that always selects the cheapest correct model (89.4%), Switchcraft closes 37% of the accuracy gap. We further find that costlier models do not consistently outperform cheaper ones (GPT-5.4 trails GPT-5.3-chat at 81.3% vs. 82.3%), nominally cheaper models can incur higher cost due to verbose output, and a chat-fine-tuned router of the same architecture significantly underperforms Switchcraft (Appendix [O](https://arxiv.org/html/2605.07112#A15)).

To the best of our knowledge, this is the first system to address LLM model selection for tool-augmented, multi-turn tasks. We make three main contributions:

- We formulate model routing for agentic use-cases and identify key challenges.
- We build a unified evaluation framework for agentic routing, including multi-benchmark normalization, corrected ground-truth annotations, and an AST-based comparison framework for robust tool-calling evaluation.
- We develop Switchcraft, a DistilBERT-based model router that significantly improves cost versus quality trade-offs, achieving near-oracle accuracy while reducing inference cost by 84%.

## 2 Motivation

Available tools: get_symbol_by_name, add_to_watchlist, get_stock_info, place_order, get_order_details, send_message, …

Turn 1 – User: “Integrate stock for ‘Omega Industries’ into my watchlist effectively.”
Correct: get_symbol_by_name(name="Omega Industries") → {"symbol": "OMEG"}; add_to_watchlist(stock="OMEG")

Turn 3 – User: “Execute a transaction for 150 shares at the present market value for the stock we just added.”
Correct: get_stock_info(symbol="OMEG") → {"price": 457.23, ...}; place_order(order_type="Buy", symbol="OMEG", price=457.23, amount=150)

Wrong function (✗): place_order(order_type="Buy", symbol="OMEG", price=557.23, amount=150). Skipping get_stock_info and using a hallucinated or stale price causes the order to be placed at the wrong price.
Wrong parameter (✗): place_order(order_type="Sell", symbol="OMEG", price=457.23, amount=150). Selling instead of buying: a single wrong parameter value causes the *opposite* of the intended action.
Wrong value (✗): place_order(order_type="Buy", symbol="OMEG", price=457.23, amount=1500). An order-of-magnitude error in amount causes a 10× larger financial commitment.

Figure 1: Motivating example from BFCL v3 [[24](https://arxiv.org/html/2605.07112#bib.bib24)]. Agentic queries demand *correct functions* and *precise parameter values* at every step; small errors (wrong type, wrong value, wrong order) produce consequential failures.

Figure [1](https://arxiv.org/html/2605.07112#S2.F1) illustrates why routing for agentic tool calling differs from routing for chat completion. A user asks an AI agent to add “Omega Industries” to a watchlist and place a market order; fulfilling this request requires a *sequence* of tool invocations (resolve ticker, add to watchlist, fetch price, place order) where each step depends on the previous one. Unlike chat, where a paraphrase may be acceptable, agentic errors compound and can be irreversible: hallucinating a price puts the order at the wrong value; confusing "Buy" with "Sell" executes the *opposite* of the user’s intent; a single digit error in amount causes a ten-fold larger financial commitment. At the same time, not all variation is an error—independent parallel calls can be issued in any order, and string parameters often admit semantically equivalent forms (e.g., "Illinois" vs. "IL"); a correct evaluator must accept these (full discussion in Appendix [A](https://arxiv.org/html/2605.07112#A1)). These properties—sequential dependencies, strict parameter precision, and selective tolerance for variation—motivate a router specifically fine-tuned on agentic data with evaluation metrics that capture the unique correctness criteria of tool-calling.

## 3 Design

We explored a large space of architectures, input representations, scoring methods, and routing strategies (frozen-embedding MLPs, larger encoders, LLM-as-a-router, similarity routing, cost-weighted losses, BLEU and LLM-as-judge scoring); rejected alternatives are detailed in Appendix [B](https://arxiv.org/html/2605.07112#A2). Figure [2](https://arxiv.org/html/2605.07112#S3.F2) shows the final architecture: a fine-tuning pipeline that ingests function-calling benchmarks, runs every query against every candidate LLM, scores outputs via AST comparison, and distils the resulting preference data into a lightweight DistilBERT classifier; and an inference pipeline that runs the classifier plus a cost model to select the cheapest predicted-correct LLM. We build a router specialized for function-calling to avoid diluting it across heterogeneous workloads such as chat. Dispatching to the correct router is trivial: function-calling requests are identified by the presence of the `tools` parameter in the request body.
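As a concrete illustration of that dispatch check, the sketch below tests an OpenAI-style chat-completion request body for a non-empty `tools` array; the gateway wrapper and field names are our own assumptions, not part of Switchcraft.

```python
def is_function_calling_request(request_body: dict) -> bool:
    """Send requests that carry a non-empty OpenAI-style `tools` array to the
    agentic router; everything else falls through to a chat-completion router."""
    tools = request_body.get("tools")
    return isinstance(tools, list) and len(tools) > 0
```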

Figure 2: System architecture. Left: fine-tuning pipeline (ingest benchmarks, run inference across LLMs, score via AST, fine-tune router). Right: inference-time routing (DistilBERT classifier and cost model select the cheapest predicted-correct LLM).

### 3.1 Input representation

A key design choice is how to pack an agentic query—multi-turn conversation, tool definitions, and metadata—into DistilBERT’s 512-token window. Our packing operates in five steps (a code sketch follows the list):

1. Latest user turn (immediate intent, always included).
2. Tool signatures converted to compact func_name(param1, param2) form (descriptions, JSON-schema boilerplate, etc. stripped; capped at 100 subword tokens with [truncated]).
3. Earlier turns added greedily in reverse chronological order until the budget is exhausted, prioritizing recent context.
4. Metadata preamble: three numeric features (length, num_tools, num_turns) for complexity-aware routing without relying solely on text.
5. Concatenate and tokenize (512-token truncation, dynamic padding).
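The sketch below illustrates the five packing steps with a HuggingFace tokenizer. It is a minimal rendering of the procedure described above, not the paper's implementation: the message and tool field names, segment separators, and segment order are assumptions.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
MAX_LEN = 512          # DistilBERT context window
TOOL_TOKEN_CAP = 100   # cap on the packed tool-signature segment


def compact_signature(tool: dict) -> str:
    """JSON-schema tool definition -> compact `name(param1, param2)` form."""
    params = list(tool.get("parameters", {}).get("properties", {}).keys())
    return f"{tool['name']}({', '.join(params)})"


def pack_query(messages: list[dict], tools: list[dict]) -> dict:
    # Step 1: latest user turn (always included).
    user_turns = [m["content"] for m in messages if m["role"] == "user"]
    latest = user_turns[-1] if user_turns else ""

    # Step 2: compact tool signatures, capped at TOOL_TOKEN_CAP subword tokens.
    sig_text = " ; ".join(compact_signature(t) for t in tools)
    sig_ids = tokenizer.encode(sig_text, add_special_tokens=False)
    if len(sig_ids) > TOOL_TOKEN_CAP:
        sig_text = tokenizer.decode(sig_ids[:TOOL_TOKEN_CAP]) + " [truncated]"

    # Step 4: metadata preamble with three numeric complexity features.
    meta = f"length={len(latest)} num_tools={len(tools)} num_turns={len(messages)}"

    # Step 3: earlier turns, most recent first, until the token budget is spent.
    budget = MAX_LEN - len(tokenizer.encode(f"{meta} {latest} {sig_text}"))
    history: list[str] = []
    for turn in reversed(user_turns[:-1]):
        cost = len(tokenizer.encode(turn, add_special_tokens=False))
        if cost > budget:
            break
        history.insert(0, turn)
        budget -= cost

    # Step 5: concatenate and tokenize (truncation to 512, dynamic padding).
    packed = " ".join([meta, latest, sig_text] + history)
    return tokenizer(packed, truncation=True, max_length=MAX_LEN, padding=True)
```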

### 3.2 Scoring the tool calls

Executing each LLM-issued tool call is infeasible at our scale: queries span thousands of distinct, often private third-party APIs. Instead, we label each call statically by comparing its abstract syntax tree (AST)—function name, argument names, types, and values—to that of the ground-truth call.

The AST checker determines which (model, query) pairs the router sees as “correct”. However, designing a general scorer is surprisingly hard: (1) the same call admits many semantically equivalent forms (set- vs. list-valued arguments, omitted default values, alternate string encodings), (2) tool-call arguments can be arbitrarily nested objects that must be compared structurally rather than by direct equality, and (3) the same syntactic shape across benchmarks (e.g., a list of lists) can carry different meanings.

Table 1: AST checker comparison on GPT-5.3-chat results. *FN*: our checker accepts but BFCL rejects; *FP*: inverse. †Single-turn BFCL categories only (Simple, Multiple, Parallel, Parallel+Multiple, and their *Live* variants); multi-turn categories are excluded because BFCL’s checker does not run on per-turn evaluation records.

For these reasons, the existing AST checker from BFCL [[24](https://arxiv.org/html/2605.07112#bib.bib24)] handles its native benchmark well but breaks on every other one: scoring the same GPT-5.3-chat outputs yields much lower accuracies on the four non-BFCL datasets (Table [1](https://arxiv.org/html/2605.07112#S3.T1)). Through a systematic study of rejected calls across all five datasets, we identify recurring classes of bias in BFCL’s AST checker; we summarize each below, with examples in Appendix [D](https://arxiv.org/html/2605.07112#A4) (Table 9).

1. Array order sensitivity. BFCL compares arrays element-wise, rejecting set-valued parameters when the model emits them in a different order than the ground truth. We compare flat arrays as multisets and lists of dictionaries as sets.
2. No default-parameter awareness. BFCL treats every parameter listed in the ground truth as required. We parse defaults from the tool description and accept an omission when the documented default matches the ground truth.
3. Brittle string matching. Differences in case, whitespace, punctuation, and ISO-8601 timestamp formatting cause spurious mismatches. We canonicalize each string and fall back to DistilBERT cosine similarity (threshold 0.85) for multi-word strings.
4. No nested-structure handling. BFCL does not recognize the JSON-Schema "object" type and aborts the entry; even when fixed, it has no recursive comparator. We add the missing types and recursively validate each nested object against its sub-schema.
5. Ambiguous array semantics. A list-of-lists in the ground truth can mean either a list of *alternative* valid values or a true *nested* array. We use the tool-call schema to disambiguate.

With our AST checker, the same models show much more consistent accuracy across all five datasets (Table [1](https://arxiv.org/html/2605.07112#S3.T1)). We validate it via unit tests, manual review of random entries, and LLM-as-judge scoring.
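A minimal sketch of two of these normalizations (order-insensitive comparison of array arguments and string canonicalization), assuming tool calls are already parsed into name/argument dictionaries. The semantic-similarity fallback, default-parameter handling, and schema-aware disambiguation are omitted, and the field names are illustrative only.

```python
import re
import string


def canon_str(s: str) -> str:
    """Canonicalize case, whitespace, most punctuation, and the ISO-8601 UTC marker."""
    s = s.strip().lower().replace("+00:00", "z")   # unify the UTC suffix with 'Z'
    s = re.sub(r"\s+", " ", s)
    # Strip punctuation except '-' so dates like 1990-05-15 keep their shape.
    return s.translate(str.maketrans("", "", string.punctuation.replace("-", "")))


def canon_value(v):
    """Recursively normalize a parameter value for order-insensitive comparison."""
    if isinstance(v, str):
        return canon_str(v)
    if isinstance(v, dict):
        return tuple(sorted((k, canon_value(x)) for k, x in v.items()))
    if isinstance(v, list):
        # Flat arrays and lists of dicts are compared as multisets: sort the
        # canonical forms so element order no longer matters.
        return tuple(sorted(map(canon_value, v), key=repr))
    return v


def calls_match(model_call: dict, gt_call: dict) -> bool:
    """Compare a model tool call against the ground truth at the AST level."""
    if model_call["name"] != gt_call["name"]:
        return False
    m_args = {k: canon_value(v) for k, v in model_call["args"].items()}
    g_args = {k: canon_value(v) for k, v in gt_call["args"].items()}
    return m_args == g_args
```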

### 3.3 Routing logic

At inference time, the router selects a single LLM for each incoming query in two stages:

#### Stage 1: multi-label classification.

The DistilBERT classifier outputs a probability for each of the K candidate models, indicating the likelihood that the model will answer the query correctly. These probabilities are thresholded at 0.5 to produce a binary vector of predicted-correct models.

#### Stage 2: cost-aware selection.

Given the set of predicted-correct models, the router selects the one with the lowest profiled cost: the actual dollar cost computed from the input/output token counts observed when each candidate model answered training queries, multiplied by its per-million-token list prices (Table [3](https://arxiv.org/html/2605.07112#S4.T3); full formula in Appendix [C](https://arxiv.org/html/2605.07112#A3)). This naturally captures *chattiness*: a model that generates extensive reasoning or verbose output accumulates a higher profiled cost than a concise model at the same per-token rate. We analyze chattiness in detail in Section [4.8](https://arxiv.org/html/2605.07112#S4.SS8).

If no model is predicted correct (all probabilities below 0.5), the router falls back to the model with the highest probability (argmax). This ensures graceful degradation: even when the classifier is uncertain, it routes to its best guess rather than refusing to route.
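Putting the two stages together, a minimal routing sketch might look as follows, assuming `probs` holds the classifier's per-model correctness probabilities and `profiled_cost` the profiled per-query dollar cost of each candidate, both indexed in the same model order; this illustrates the selection logic only, not the production code.

```python
import numpy as np

THRESHOLD = 0.5


def route(probs: np.ndarray, profiled_cost: np.ndarray) -> int:
    """Two-stage routing: threshold the per-model correctness probabilities,
    then return the index of the cheapest predicted-correct model. Falls back
    to the argmax model when nothing clears the threshold."""
    predicted_correct = np.where(probs >= THRESHOLD)[0]
    if predicted_correct.size == 0:
        return int(np.argmax(probs))  # graceful degradation
    return int(predicted_correct[np.argmin(profiled_cost[predicted_correct])])


# Hypothetical 3-model pool: the cheapest model is predicted wrong, so the
# router picks the cheapest model above the 0.5 threshold (index 1).
print(route(np.array([0.30, 0.80, 0.95]), np.array([1.9e-4, 6.8e-4, 43.1e-4])))
```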

#### Oracle router.

Our oracle upper bound uses the same two-stage logic with perfect information: for all queries, including the test set, it knows which models answer correctly and selects the cheapest correct one. When no model is correct, the oracle defaults to the most expensive model, providing a tight upper bound on any router that does not change the underlying models’ answers.

## 4 Evaluation

We evaluate Switchcraft on five function-calling benchmarks (below), comparing it against individual LLMs, heuristic baselines, and an oracle upper bound.

### 4.1 Datasets

We combine five function-calling benchmarks spanning a broad range of tool-calling complexity, totaling 157,101 examples (122,267 after deduplication) across 14 categories (Table [2](https://arxiv.org/html/2605.07112#S4.T2)); diversity is essential to avoid overfitting to any single benchmark’s distribution.

Table 2: Datasets and splits used in our evaluation. *ST* = single-turn, *MT* = multi-turn, *par* = parallel calls.

The corpus spans synthetic and human-authored data, and single- and multi-turn conversations. We decompose multi-turn conversations into per-turn evaluation records, which is reflected in Table [2](https://arxiv.org/html/2605.07112#S4.T2). Several datasets required substantial cleaning—most notably ConFETTI, where we corrected 27% of entries (Appendix [E](https://arxiv.org/html/2605.07112#A5))—and format normalization to a common BFCL-style JSONL schema (Appendix [F](https://arxiv.org/html/2605.07112#A6)). All datasets pass through a unified pipeline: deduplication on (query, tools), per-dataset stratified 80/10/10 splits, and multi-label annotation by running each query through all candidate LLMs and recording correctness per our AST framework (Section [3.2](https://arxiv.org/html/2605.07112#S3.SS2)).
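As an illustration of that unified pipeline, the sketch below deduplicates on the (query, tools) pair and cuts an 80/10/10 split; per-category stratification and the multi-label annotation pass (running each query through all candidate LLMs) are omitted, and the example field names are assumptions.

```python
import hashlib
import json
import random


def dedup_key(example: dict) -> str:
    """Examples are deduplicated on the (query, tools) pair."""
    payload = json.dumps({"query": example["query"], "tools": example["tools"]},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def split_80_10_10(examples: list[dict], seed: int = 0):
    """Deduplicate, shuffle deterministically, and return (train, val, test)."""
    seen, unique = set(), []
    for ex in examples:
        key = dedup_key(ex)
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    random.Random(seed).shuffle(unique)
    n = len(unique)
    return (unique[:int(0.8 * n)],
            unique[int(0.8 * n):int(0.9 * n)],
            unique[int(0.9 * n):])
```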

### 4.2 Experimental setup

We route among eight LLMs spanning four model families (Table [3](https://arxiv.org/html/2605.07112#S4.T3)), accessed through API endpoints (default temperature; all values in USD). For each of the 14 dataset splits, we run every query against all eight models using the OpenAI-compatible function-calling API (tool definitions via the native `tools` parameter, no system prompt) and score outputs with our AST framework. Per-dataset stratified 80/10/10 splits yield a 12,267-example validation set and a 12,282-example held-out test set; we select the best seed on validation and report final numbers on test. We fine-tune DistilBERT-base-uncased [[26](https://arxiv.org/html/2605.07112#bib.bib26)] (66M parameters) as a multi-label classifier with eight output heads using the input representation in Section [3.1](https://arxiv.org/html/2605.07112#S3.SS1), with 20 random seeds (hyperparameters in Appendix [G](https://arxiv.org/html/2605.07112#A7); fine-tuning curves in Appendix [H](https://arxiv.org/html/2605.07112#A8)). We compare against (i) *single-model* routers (eight baselines); (ii) *heuristic* routers using a single feature (input token length, number of tool definitions, or conversation turns) with thresholds profiled on training data; and (iii) an *oracle* that selects the cheapest correct model per query (most expensive when none is correct).
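The multi-label setup can be reproduced in a few lines with HuggingFace Transformers: eight sigmoid output heads, one per candidate model, trained with a binary cross-entropy objective against the per-model correctness vector. This is a minimal sketch of the objective only; the actual hyperparameters are in Appendix G, and the example text and label vector below are made up.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_MODELS = 8  # one output head per candidate LLM

# DistilBERT with one sigmoid head per candidate model; `problem_type`
# selects BCEWithLogitsLoss for the multi-label objective.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=NUM_MODELS,
    problem_type="multi_label_classification",
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# One training example: the packed query text (Section 3.1) and a binary
# vector marking which candidate models answered this query correctly.
inputs = tokenizer("num_turns=3 ... get_stock_info(symbol) ...",
                   truncation=True, max_length=512, return_tensors="pt")
labels = torch.tensor([[1., 0., 1., 1., 0., 1., 0., 0.]])

loss = model(**inputs, labels=labels).loss
loss.backward()
```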

### 4.3 Main results: accuracy–cost trade-offs

Table [3](https://arxiv.org/html/2605.07112#S4.T3) and Figure [3](https://arxiv.org/html/2605.07112#S4.F3) present the headline results. The cost column reports the actual API-billed dollar cost per query; a token-level decomposition of these costs and an analysis of model chattiness are given in Section [4.8](https://arxiv.org/html/2605.07112#S4.SS8).

Table 3: Accuracy and average cost per query on the held-out test set (12,282 examples). In/Out columns are list prices. For our routers we report best-seed accuracy over 20 random seeds (± std); per-seed details in Appendix [I](https://arxiv.org/html/2605.07112#A9).

| Type | Entity | In ($/M) | Out ($/M) | Accuracy (%) | Avg Cost (10⁻⁴ $) |
|---|---|---|---|---|---|
| Single LLM | GPT-5.3-chat | 1.75 | 14.00 | 82.29 | 43.1 |
| | GPT-5.4 | 2.50 | 15.00 | 81.26 | 64.2 |
| | GPT-5.4-mini | 0.75 | 4.50 | 80.25 | 21.1 |
| | GPT-5-nano | 0.05 | 0.40 | 79.15 | 1.9 |
| | GPT-5.4-nano | 0.20 | 1.25 | 78.00 | 5.4 |
| | GPT-5-mini | 0.25 | 2.00 | 77.33 | 8.6 |
| | Qwen-3.5-9B | 0.05 | 0.15 | 72.40 | 2.1 |
| | Kimi-K2.5 | 0.60 | 3.00 | 60.88 | 8.2 |
| Heuristic Router | Num. Turns | — | — | 80.41 | 41.2 |
| | Length | — | — | 78.75 | 6.0 |
| | Num. Tools | — | — | 75.63 | 53.6 |
| Ours | ModernBERT (149M) | — | — | 83.02 (±0.38) | 6.1 |
| | DistilBERT (66M) | — | — | 82.94 (±0.41) | 6.8 |
| | DeBERTa-v3 (86M) | — | — | 82.89 (±0.41) | 6.1 |
| Upper Bound | Oracle | — | — | 89.39 | 9.6 |

![Refer to caption](https://arxiv.org/html/2605.07112v1/graphs/trade-offs-visualization.png)

Figure 3: Accuracy–cost Pareto plot on the held-out test set (12,282 examples). Switchcraft (red square, with seed-range error bars) lies on the Pareto frontier between the cheapest single models and the oracle upper bound.

#### Switchcraft occupies a region of the Pareto frontier no single model reaches.

Switchcraft achieves 82.94% accuracy at 6.8×10⁻⁴ $ per query—matching the best individual model (GPT-5.3-chat, 82.29%) while reducing cost by 84%. The two are within seed variance; the salient claim is that *no individual model in the pool simultaneously achieves ≥80% accuracy and cost ≤10×10⁻⁴ $*. GPT-5.3-chat reaches the accuracy at 6× the cost, while GPT-5-nano reaches the cost ceiling at 79.15% accuracy. Switchcraft uniquely occupies this region, saving approximately $3,630 per million queries over GPT-5.3-chat at matched accuracy. The threshold can be tuned to other Pareto points (Appendix [J](https://arxiv.org/html/2605.07112#A10)). ModernBERT and DeBERTa-v3 are accuracy-equivalent within seed variance (±0.41 and ±0.38 pp); we prefer DistilBERT for its smaller footprint.

#### Gap to the oracle.

The oracle’s 89.39% at 9.6×10⁻⁴ $ upper-bounds any router; Switchcraft closes (82.94 − 79.15)/(89.39 − 79.15) = 37% of the gap from the cheapest single model. A per-dataset breakdown (Appendix [F](https://arxiv.org/html/2605.07112#A6)) shows the advantage is largest on Glaive and xLAM-60K. We analyze the gap to the oracle in Section [4.7](https://arxiv.org/html/2605.07112#S4.SS7).

#### Heuristic routers are limited.

The number-of-turns heuristic performs best (80.41%), still 2.53 pp below Switchcraft; input length and number of tools score 78.75% and 75.63%—no single surface-level feature captures the complexity of agentic routing.

### 4.4 Key findings

Table [3](https://arxiv.org/html/2605.07112#S4.T3) reveals two counterintuitive inversions.

#### Finding 1: costlier is not always better.

GPT-5.3-chat (82.29%) outperforms the newer, more expensive GPT-5.4 (81.26%): GPT-5.4 over-elaborates, producing verbose chain-of-thought that occasionally corrupts the structured tool-call output. Similarly, GPT-5-nano (79.15%) outperforms GPT-5-mini (77.33%): the nano model follows tool-calling instructions more directly, while mini paraphrases arguments or adds explanations. Blindly routing to the newest or most expensive model—a common enterprise default—can therefore decrease both accuracy and cost-efficiency.

#### Finding 2: open-weight models lag behind on tool calling.

The two worst-performing models—Kimi-K2.5 (60.88%) and Qwen-3.5-9B (72.40%)—are both open-weight. Their dominant failure modes are: (i) *format violations* (markdown wrappers, preamble text, malformed JSON); (ii) *argument hallucination* (parameter values invented from training data rather than provided context); and (iii) *refusal to call tools*, returning a textual answer instead. These modes are largely absent from the proprietary instruction-tuned models, which were fine-tuned for structured output.

### 4.5 Router inference latency

A practical router must add minimal latency to LLM dispatch. Table [4](https://arxiv.org/html/2605.07112#S4.T4) shows single-query P99 latency on a commodity NVIDIA T4 GPU (~$0.35/hr spot): DistilBERT adds only 3–17 ms depending on sequence length—over an order of magnitude below typical LLM generation latency—and reaches 722 queries/sec peak throughput. ModernBERT is 2.7–5.6× slower with no meaningful accuracy gain. More details are in Appendix [L](https://arxiv.org/html/2605.07112#A12).
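A measurement like the one behind Table 4 can be sketched as follows; this is our own illustrative harness (warm-up run, per-query timing, P99 over repeats), assuming a loaded HuggingFace classifier and tokenizer, and is not the paper's benchmarking code.

```python
import time
import numpy as np
import torch


@torch.inference_mode()
def p99_latency_ms(model, tokenizer, text: str, runs: int = 200) -> float:
    """Single-query (batch = 1) forward-pass latency; P99 over repeated runs."""
    enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    enc = {k: v.to(model.device) for k, v in enc.items()}
    model(**enc)  # warm-up
    timings = []
    for _ in range(runs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        model(**enc)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        timings.append((time.perf_counter() - start) * 1e3)
    return float(np.percentile(timings, 99))
```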

Table 4: Single-query router latency (batch = 1) on NVIDIA T4, FP32.

### 4.6 Robustness and ablations

#### Seed stability and encoder choice.

Across 20 random seeds, DistilBERT router accuracy varies by only 1.86 pp (81.08%–82.94%, mean 81.89 ± 0.41 pp). Even the worst seed beats six of eight individual LLMs and all heuristic baselines. ModernBERT-base (149M, 8K context) and DeBERTa-v3-base (86M) achieve best-seed accuracy within 0.13 pp of DistilBERT (Table [3](https://arxiv.org/html/2605.07112#S4.T3)) with comparable seed stability (±0.38, ±0.41 pp). We prefer DistilBERT for its smaller footprint (Appendix [I](https://arxiv.org/html/2605.07112#A9)).

#### Token packing ablation.

Our input representation uses intelligent token packing (Section [3.1](https://arxiv.org/html/2605.07112#S3.SS1)) to fit decision-relevant content into the encoder’s 512-token window. Table [5](https://arxiv.org/html/2605.07112#S4.T5) isolates its contribution: replacing packing with naive right-truncation drops accuracy by 1.66 pp and increases routing cost by 21% (full details in Appendix [K](https://arxiv.org/html/2605.07112#A11)).

Table 5: Token packing ablation (DistilBERT, seed 4, test set).

### 4.7 Error analysis

We classify every validation query into four mutually exclusive *routing outcomes* (Table [6](https://arxiv.org/html/2605.07112#S4.T6)). On the validation set (12,267 examples), Switchcraft selects a correct model for 82.94% of queries; only 7.4% are avoidable mistakes (*wrong model*) where a different prediction would have succeeded; the remaining 9.7% are *irreducible* (no model in the pool answers correctly). The implied oracle gap on the validation set (7.4 pp) is slightly larger than the 6.45 pp gap reported on the test set in Table [3](https://arxiv.org/html/2605.07112#S4.T3); both are dominated by the same “wrong model” failure mode and the difference reflects split composition, not a metric inconsistency.

Table 6: Routing outcomes on the validation set (12,267 examples, best-seed DistilBERT router).

Among correctly routed queries, 50.7% go to a non-cheapest model, but the practical impact is small (median overhead 0.18×10⁻⁴ $, total $2.19 across the validation set); 91% of suboptimal cases route to GPT-5-nano where the oracle would have chosen Qwen. Two robust difficulty patterns emerge (Appendix [M](https://arxiv.org/html/2605.07112#A13)): fewer correct models in the pool drives error rates up (67.9% misrouted with 1 correct model vs. 0% with 8), and more tool definitions hurt (15.3% at 1 tool vs. 39.8% at 11–50 tools), likely because complex schemas exceed the 512-token context window. This points to an opportunity for future improvement via loss shaping, probabilistic correctness, and richer embeddings.

### 4.8 Chattiness: when cheaper models become more expensive

A recurring theme is that per-token pricing is a misleading proxy for per-query cost. We define *chattiness* as the ratio of actual per-query cost to an *expected cost* assuming each model consumes the cross-model average number of tokens (above 1.0× = more expensive than expected). Table [7](https://arxiv.org/html/2605.07112#S4.T7) summarizes per-model results; the full derivation is in Appendix [N](https://arxiv.org/html/2605.07112#A14). Kimi-K2.5 is the chattiest (1.31×; output dominates 69% of its cost). Qwen-3.5-9B generates the most tokens (228 avg) yet its chattiness is only 1.10× because input cost dominates (81%)—the large output surplus barely moves the ratio. GPT-5.3-chat is the most concise (0.78×, 47 avg tokens), making it the best accuracy–cost trade-off among single models despite its high list price. A router selecting on list price would systematically misjudge verbose budget models and concise mid-tier ones; using profiled per-query cost (Section [3.3](https://arxiv.org/html/2605.07112#S3.SS3)) avoids this trap.
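Read literally, that definition can be computed as below; this is our own paraphrase of the metric (Appendix N gives the exact derivation), and the argument names are illustrative.

```python
def chattiness(actual_cost_per_query: float,
               rate_in: float, rate_out: float,
               avg_in_tokens: float, avg_out_tokens: float) -> float:
    """Ratio of a model's observed per-query cost to the cost it would incur
    if it consumed the cross-model average token counts at its own
    per-million-token rates. Above 1.0: costlier than its list price suggests."""
    expected = (avg_in_tokens * rate_in + avg_out_tokens * rate_out) / 1e6
    return actual_cost_per_query / expected
```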

Table 7: Per-model chattiness on the validation set.

### 4.9 Comparison to chat router and other model pools

With an earlier eight-model basket (Appendix [O](https://arxiv.org/html/2605.07112#A15)), the same pattern holds: Switchcraft achieves 82.73% accuracy at 88% lower cost than the best individual model, while a chat-fine-tuned variant of the same architecture (trained on 64 public chat benchmarks) reaches only 77.47% (5.26 pp below Switchcraft)—routing decisions learned from chat completions do not transfer to agentic tool calling.

## 5 Limitations

#### Oracle gap.

Switchcraft’s 6.45 pp gap to the oracle is the primary improvement opportunity. Promising directions include probabilistic correctness modeling (Appendix [P](https://arxiv.org/html/2605.07112#A16)), asymmetric loss shaping, and richer input representations that capture more context within the token budget.

#### In-distribution evaluation and contamination.

Our test set is drawn from the same distribution as training (stratified 80/10/10 splits per benchmark)—a *known agentic workload mix* typical of enterprise settings with representative production logs available for fine-tuning. All public datasets predate the 2025-08-31 GPT-5.3/5.4 training cutoff, so we cannot rule out memorization inflating single-model accuracies. However, Switchcraft learns from *realized model behavior*, so its advantage holds regardless. Quantifying degradation under benchmark shift remains future work.

#### Generalization to new models.

Switchcraft is fine-tuned for a fixed pool of eight models and requires retraining when models are added or updated. We validate that the same architecture generalises across baskets (Appendix [O](https://arxiv.org/html/2605.07112#A15)). We evaluate cold-start approaches (MIRT: Appendix [Q](https://arxiv.org/html/2605.07112#A17)), but combining those with agentic specialization is future work.

Additional considerations—per-turn vs. end-to-end success, cost-model assumptions, prompt caching, reasoning effort—are discussed in Appendix [R](https://arxiv.org/html/2605.07112#A18).

## 6 Related work

Tool calling extends LLMs with access to external data and computation [[4](https://arxiv.org/html/2605.07112#bib.bib4)], and prior work has proposed methods that teach individual LLMs to invoke tools [[23](https://arxiv.org/html/2605.07112#bib.bib23), [27](https://arxiv.org/html/2605.07112#bib.bib27)]. Scoring those calls, however, remains the bottleneck. Prior work either executes each call [[36](https://arxiv.org/html/2605.07112#bib.bib36)] or matches its abstract syntax tree against a reference using BFCL’s checker [[24](https://arxiv.org/html/2605.07112#bib.bib24), [3](https://arxiv.org/html/2605.07112#bib.bib3)]. The former is infeasible at our scale, as it requires implementing all the APIs, and the latter, as we show in §[3.2](https://arxiv.org/html/2605.07112#S3.SS2), is significantly biased on every non-BFCL benchmark we test. Other work focuses on synthetic data generation for tool calls [[5](https://arxiv.org/html/2605.07112#bib.bib5)].

Model routing reduces inference cost by dispatching each query to the cheapest model in a pool that can answer it with reasonable quality. A growing body of work learns this routing policy from per-query quality signals, using neural networks [[7](https://arxiv.org/html/2605.07112#bib.bib7), [25](https://arxiv.org/html/2605.07112#bib.bib25), [6](https://arxiv.org/html/2605.07112#bib.bib6), [2](https://arxiv.org/html/2605.07112#bib.bib2)], k-nearest neighbors [[12](https://arxiv.org/html/2605.07112#bib.bib12), [28](https://arxiv.org/html/2605.07112#bib.bib28), [31](https://arxiv.org/html/2605.07112#bib.bib31)], matrix factorization [[22](https://arxiv.org/html/2605.07112#bib.bib22), [38](https://arxiv.org/html/2605.07112#bib.bib38)], graph neural networks [[10](https://arxiv.org/html/2605.07112#bib.bib10)], k-means clustering [[14](https://arxiv.org/html/2605.07112#bib.bib14)], item response theory [[30](https://arxiv.org/html/2605.07112#bib.bib30)], or constrained optimization [[18](https://arxiv.org/html/2605.07112#bib.bib18)]. Complementary lines target cost estimation directly [[29](https://arxiv.org/html/2605.07112#bib.bib29), [19](https://arxiv.org/html/2605.07112#bib.bib19)], enrich the action space with token-budget control [[34](https://arxiv.org/html/2605.07112#bib.bib34)] or best-of-n sampling [[8](https://arxiv.org/html/2605.07112#bib.bib8)], learn more expressive query representations [[33](https://arxiv.org/html/2605.07112#bib.bib33)], or improve explainability [[21](https://arxiv.org/html/2605.07112#bib.bib21)]. Two recent benchmarks also target model routing [[16](https://arxiv.org/html/2605.07112#bib.bib16), [17](https://arxiv.org/html/2605.07112#bib.bib17)]. None of these works, however, targets tool calling or treats correctness as a first-class objective. As we argue in §[2](https://arxiv.org/html/2605.07112#S2), assuming partial responses are acceptable does not work well in tool-calling workloads, where a single incorrect argument can have consequences.

We bridge these two threads with Switchcraft—to the best of our knowledge, the first model router explicitly designed and evaluated for agentic tool calling: fine-tuned on agentic benchmarks and optimized for strict, AST-level correctness. A separate line of work attacks agent inference cost through orthogonal mechanisms [[37](https://arxiv.org/html/2605.07112#bib.bib37), [32](https://arxiv.org/html/2605.07112#bib.bib32)] that compose with model routing rather than replacing it.

## 7 Conclusions

We presented Switchcraft, an agentic model router using a lightweight DistilBERT classifier (66M parameters) to route tool-calling queries among eight candidates, matching the best individual model’s accuracy (82.9% vs. 82.3%) at 84% lower cost ($3,600+ saved per million queries). Key enablers are agentic training data with an improved AST framework, a chattiness-aware cost model, and sub-20 ms inference. As available LLMs proliferate and their cost–capability trade-offs diversify, intelligent routing becomes critical for scalable AI deployment. Code generation and multi-step planning are natural future targets.

## 8 Acknowledgments

We are grateful to Vijay Aski, Rupeshkumar Mehta, Sethu Raman and Steve Sweetman for supporting and enabling deep collaboration between Microsoft Research and Microsoft Foundry on this effort\.

## References

- Abou Ali et al\. \[2025\]Mohamad Abou Ali, Fadi Dornaika, and Jinan Charafeddine\.Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions\.*Artificial Intelligence Review*, 59\(11\), 2025\.doi:10\.1007/s10462\-025\-11422\-4\.URL[https://link\.springer\.com/article/10\.1007/s10462\-025\-11422\-4](https://link.springer.com/article/10.1007/s10462-025-11422-4)\.
- Aggarwal et al\. \[2024\]Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, and Mausam \.Automix: Automatically mixing language models\.In*The Thirty\-eighth Annual Conference on Neural Information Processing Systems*, 2024\.
- Alkhouli et al\. \[2025\]Tamer Alkhouli, Katerina Margatina, James Gung, Raphael Shu, Claudia Zaghi, Monica Sunkara, and Yi Zhang\.CONFETTI: Conversational Function\-Calling Evaluation Through Turn\-Level Interactions\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 7993–8006, Vienna, Austria, jul 2025\. Association for Computational Linguistics\.doi:10\.18653/v1/2025\.acl\-long\.394\.URL[https://aclanthology\.org/2025\.acl\-long\.394/](https://aclanthology.org/2025.acl-long.394/)\.Creative Commons Attribution 4\.0 International License\.
- Attouche et al\. \[2024\]Lyes Attouche, Mohamed\-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and Stefanie Scherzinger\.Validation of modern json schema: Formalization and complexity\.*Proc\. ACM Program\. Lang\.*, 8\(POPL\), January 2024\.doi:10\.1145/3632891\.URL[https://doi\.org/10\.1145/3632891](https://doi.org/10.1145/3632891)\.
- Belavadi et al\. \[2025\]Vibha Belavadi, Tushar Vatsa, Dewang Sultania, Suhas Suresha, Ishita Verma, Cheng Chen, Tracy Holloway King, and Michael Friedrich\.Routenator: A router\-based multi\-modal architecture for generating synthetic training data for function calling llms, 2025\.URL[https://arxiv\.org/abs/2505\.10495](https://arxiv.org/abs/2505.10495)\.
- Chen et al\. \[2024\]Shuhao Chen, Weisen Jiang, Baijiong Lin, James Kwok, and Yu Zhang\.Routerdc: Query\-based router by dual contrastive learning for assembling large language models\.In*Advances in Neural Information Processing Systems*, volume 37, 2024\.doi:10\.52202/079017\-2120\.URL[https://proceedings\.neurips\.cc/paper\_files/paper/2024/file/7a641b8ec86162fc875fb9f6456a542f\-Paper\-Conference\.pdf](https://proceedings.neurips.cc/paper_files/paper/2024/file/7a641b8ec86162fc875fb9f6456a542f-Paper-Conference.pdf)\.
- Ding et al\. \[2024\]Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Rühle, Laks V\. S\. Lakshmanan, and Ahmed Hassan Awadallah\.Hybrid LLM: Cost\-efficient and quality\-aware query routing\.In*The Twelfth International Conference on Learning Representations*, 2024\.
- Ding et al\. \[2025\]Dujian Ding, Ankur Mallick, Shaokun Zhang, Chi Wang, Daniel Madrigal, Mirian Del Carmen Hipolito Garcia, Menglin Xia, Laks V\. S\. Lakshmanan, Qingyun Wu, and Victor Rühle\.BEST\-route: Adaptive LLM routing with test\-time optimal compute\.In*Forty\-second International Conference on Machine Learning*, 2025\.
- Feng et al\. \[2025a\]Aosong Feng, Balasubramaniam Srinivasan, Yun Zhou, Zhichao Xu, Kang Zhou, Sheng Guan, Yueyan Chen, Xian Wu, Ninad Kulkarni, Yi Zhang, Zhengyuan Shen, Dmitriy Bespalov, Soumya Smruti Mishra, Yifei Teng, Darren Yow\-Bang Wang, Haibo Ding, and Lin Lee Cheong\.IPR: Intelligent Prompt Routing with User\-Controlled Quality\-Cost Trade\-offs\.*arXiv preprint arXiv:2509\.06274*, 2025a\.URL[https://arxiv\.org/abs/2509\.06274](https://arxiv.org/abs/2509.06274)\.
- Feng et al\. \[2025b\]Tao Feng, Yanzhen Shen, and Jiaxuan You\.Graphrouter: A graph\-based router for llm selections, 2025b\.URL[https://arxiv\.org/abs/2410\.03834](https://arxiv.org/abs/2410.03834)\.
- Glaive AI \[2023\]Glaive AI\.Glaive Function Calling v2\.Dataset available from HuggingFace, 2023\.URL[https://huggingface\.co/datasets/glaiveai/glaive\-function\-calling\-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2)\.Apache 2\.0 License\.
- Hu et al\. \[2024\]Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay\.Routerbench: A benchmark for multi\-LLM routing system\.In*Agentic Markets Workshop at ICML 2024*, 2024\.
- Jimenez et al\. \[2024\]Carlos E\. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan\.SWE\-bench: Can Language Models Resolve Real\-World GitHub Issues?*arXiv preprint arXiv:2310\.06770*, 2024\.
- Jitkrittum et al\. \[2025\]Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Congchao Wang, Zifeng Wang, Alec Go, Chen\-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, and Sanjiv Kumar\.Universal Model Routing for Efficient LLM Inference\.*arXiv preprint arXiv:2502\.08773*, 2025\.
- Las\-Casas et al\. \[2024\]Pedro Las\-Casas, Alok Gautum Kumbhare, Rodrigo Fonseca, and Sharad Agarwal\.LLexus: An AI Agent System for Incident Management\.*ACM SIGOPS Operating Systems Review*, 58\(1\):23–36, 2024\.doi:10\.1145/3689051\.3689056\.
- Li et al\. \[2026\]Hao Li, Yiqun Zhang, Zhaoyan Guo, Chenxu Wang, Shengji Tang, Qiaosheng Zhang, Yang Chen, Biqing Qi, Peng Ye, Lei Bai, Zhen Wang, and Shuyue Hu\.LLMRouterBench: A Massive Benchmark and Unified Framework for LLM Routing\.*arXiv preprint arXiv:2601\.07206*, 2026\.
- Lu et al\. \[2025\]Yifan Lu, Rixin Liu, Jiayi Yuan, Xingqi Cui, Shenrun Zhang, Hongyi Liu, and Jiarong Xing\.RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers\.*arXiv preprint arXiv:2510\.00202*, 2025\.
- Mei et al\. \[2025\]Kai Mei, Wujiang Xu, Minghao Guo, Shuhang Lin, and Yongfeng Zhang\.OmniRouter: Budget and Performance Controllable Multi\-LLM Routing\.*arXiv preprint arXiv:2502\.20576*, 2025\.
- Nguyen et al\. \[2025\]Quang H\. Nguyen, Thinh Dao, Duy C\. Hoang, Juliette Decugis, Saurav Manchanda, Nitesh V\. Chawla, and Khoa D\. Doan\.Metallm: A high\-performant and cost\-efficient dynamic framework for wrapping llms, 2025\.URL[https://arxiv\.org/abs/2407\.10834](https://arxiv.org/abs/2407.10834)\.
- Nous Research \[2024\]Nous Research\.Hermes 3 Technical Report\.Technical report, Nous Research, 2024\.URL[https://nousresearch\.com/wp\-content/uploads/2024/08/Hermes\-3\-Technical\-Report\.pdf](https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf)\.Apache 2\.0 License\.
- Okamoto et al\. \[2026\]Mika Okamoto, Ansel Kaplan Erol, and Mark Riedl\.Explainable model routing for agentic workflows, 2026\.URL[https://arxiv\.org/abs/2604\.03527](https://arxiv.org/abs/2604.03527)\.
- Ong et al\. \[2025\]Isaac Ong, Amjad Almahairi, Vincent Wu, Wei\-Lin Chiang, Tianhao Wu, Joseph E\. Gonzalez, M Waleed Kadous, and Ion Stoica\.RouteLLM: Learning to route LLMs from preference data\.In*The Thirteenth International Conference on Learning Representations*, 2025\.
- Patil et al\. \[2024\]Shishir G\. Patil, Tianjun Zhang, Xin Wang, and Joseph E\. Gonzalez\.Gorilla: Large language model connected with massive apis\.In*Advances in Neural Information Processing Systems*, volume 37\. Curran Associates, Inc\., 2024\.doi:10\.52202/079017\-4020\.
- Patil et al\. \[2025\]Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng\-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E\. Gonzalez\.The berkeley function calling leaderboard \(BFCL\): From tool use to agentic evaluation of large language models\.In*Proceedings of the 42nd International Conference on Machine Learning*, pages 48371–48392, 2025\.Apache 2\.0 License\.
- Sakota et al\. \[2024\]Marija Sakota, Maxime Peyrard, and Robert West\.Fly\-swat or cannon? cost\-effective language model choice via meta\-modeling\.In*Proceedings of the 17th ACM International Conference on Web Search and Data Mining*, WSDM ’24, page 606–615, 2024\.URL[https://doi\.org/10\.1145/3616855\.3635825](https://doi.org/10.1145/3616855.3635825)\.
- Sanh et al\. \[2019\]Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf\.DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter\.*arXiv preprint arXiv:1910\.01108*, 2019\.URL[https://arxiv\.org/abs/1910\.01108](https://arxiv.org/abs/1910.01108)\.
- Schick et al\. \[2023\]Timo Schick, Jane Dwivedi\-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom\.Toolformer: Language models can teach themselves to use tools\.In*Advances in Neural Information Processing Systems*, volume 36, 2023\.URL[https://proceedings\.neurips\.cc/paper\_files/paper/2023/file/d842425e4bf79ba039352da0f658a906\-Paper\-Conference\.pdf](https://proceedings.neurips.cc/paper_files/paper/2023/file/d842425e4bf79ba039352da0f658a906-Paper-Conference.pdf)\.
- Shnitzer et al\. \[2023\]Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, and Mikhail Yurochkin\.Large language model routing with benchmark datasets, 2023\.URL[https://arxiv\.org/abs/2309\.15789](https://arxiv.org/abs/2309.15789)\.
- Somerstep et al\. \[2025\]Seamus Somerstep, Felipe Maia Polo, Allysson Flavio Melo de Oliveira, Prattyush Mangal, Mírian Silva, Onkar Bhardwaj, Mikhail Yurochkin, and Subha Maity\.Carrot: A cost aware rate optimal router, 2025\.URL[https://arxiv\.org/abs/2502\.03261](https://arxiv.org/abs/2502.03261)\.
- Song et al\. \[2025\]Wei Song, Zhenya Huang, Cheng Cheng, Weibo Gao, Bihan Xu, GuanHao Zhao, Fei Wang, and Runze Wu\.IRT\-Router: Effective and Interpretable Multi\-LLM Routing via Item Response Theory\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(ACL\)*, 2025\.
- Stripelis et al\. \[2024\]Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, and Chaoyang He\.Tensoropera router: A multi\-model router for efficient llm inference, 2024\.URL[https://arxiv\.org/abs/2408\.12320](https://arxiv.org/abs/2408.12320)\.
- vLLM Semantic Router Team \[2025\]vLLM Semantic Router Team\.vllm semantic router\.[https://github\.com/vllm\-project/semantic\-router](https://github.com/vllm-project/semantic-router), 2025\.
- Wang et al\. \[2025\]Chenxu Wang, Hao Li, Yiqun Zhang, Linyao Chen, Jianhao Chen, Ping Jian, Peng Ye, Qiaosheng Zhang, and Shuyue Hu\.Icl\-router: In\-context learned model representations for llm routing\.In*AAAI Conference on Artificial Intelligence*, 2025\.URL[https://api\.semanticscholar\.org/CorpusID:282057625](https://api.semanticscholar.org/CorpusID:282057625)\.
- Xue et al\. \[2026\]Jiaqi Xue, Qian Lou, Jiarong Xing, and Heng Huang\.R2\-router: A new paradigm for llm routing with reasoning\.*arXiv preprint arXiv:2602\.02823*, 2026\.
- Yao et al\. \[2024\]Shunyu Yao, Jian Pei, Yue Ma, and Howard Chen\.τ\\tau\-bench: A Benchmark for Tool\-Agent\-User Interaction in Real\-World Domains\.*arXiv preprint arXiv:2406\.12045*, 2024\.
- Zhang et al\. \[2024\]Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, Zhiwei Liu, Yihao Feng, Tulika Awalgaonkar, Rithesh Murthy, Eric Hu, Zeyuan Chen, Ran Xu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, and Caiming Xiong\.xLAM: A Family of Large Action Models to Empower AI Agent Systems\.*arXiv preprint arXiv:2409\.03215*, 2024\.URL[https://arxiv\.org/abs/2409\.03215](https://arxiv.org/abs/2409.03215)\.Creative Commons Attribution 4\.0 International License\.
- Zhang et al\. \[2025\]Qizheng Zhang, Michael Wornow, and Kunle Olukotun\.Cost\-Efficient Serving of LLM Agents via Test\-Time Plan Caching\.In*International Conference on Machine Learning Workshops*, 2025\.
- Zhuang et al\. \[2024\]Richard Zhuang, Tianhao Wu, Zhaojin Wen, Andrew Li, Jiantao Jiao, and Kannan Ramchandran\.Embedllm: Learning compact representations of large language models, 2024\.URL[https://arxiv\.org/abs/2410\.02223](https://arxiv.org/abs/2410.02223)\.

## Appendix A Acceptable variations in tool calls

Available tool: calculate_sales_tax(purchase_amount, city, state)

User: “Calculate the amount of sales tax to be added on a purchase amount of $30.45 in Chicago, Illinois, $52.33 in Sacramento, California and $11.23 in Portland, Oregon.”

Response A ✓ (three parallel calls):
1. calculate_sales_tax(purchase_amount=30.45, city="Chicago", state="Illinois")
2. calculate_sales_tax(purchase_amount=52.33, city="Sacramento", state="California")
3. calculate_sales_tax(purchase_amount=11.23, city="Portland", state="Oregon")

Response B ✓ (also correct—reordered calls with abbreviated parameter values):
1. calculate_sales_tax(purchase_amount=11.23, city="Portland", state="OR")
2. calculate_sales_tax(purchase_amount=30.45, city="CHI", state="IL")
3. calculate_sales_tax(purchase_amount=52.33, city="Sacramento", state="CA")

Both responses are correct. Because these three calls are *independent* (no data flows between them), any ordering is valid. Additionally, semantically equivalent parameter values—“Illinois” vs. “IL”, “Chicago” vs. “CHI”—are equally acceptable.

Figure 4: Acceptable variation in an agentic parallel tool-calling scenario (BFCL v3 [[24](https://arxiv.org/html/2605.07112#bib.bib24)]). The user’s request requires three independent calculate_sales_tax calls. Response A and Response B differ in (i) the *order* of the parallel calls and (ii) the surface form of string parameters (e.g., “Illinois” vs. “IL”), yet both are semantically correct. An agentic evaluation framework must recognise such benign variation rather than penalising it as error.

While agentic tool calls demand strict precision, not all variation constitutes an error. Consider the sales-tax scenario in Figure [4](https://arxiv.org/html/2605.07112#A1.F4), where the user asks an agent to compute sales tax for purchases in three cities. The agent must issue three independent calls to calculate_sales_tax, but because no call depends on the output of another, any permutation of the three calls is equally valid. Furthermore, string parameters admit semantically equivalent surface forms: "Illinois" and "IL" refer to the same state, as do "Chicago" and "CHI". A correct evaluation must treat both orderings and both spellings as acceptable, rather than penalizing responses that happen to differ from a single canonical reference. These flexibility requirements directly motivate the AST-checker design choices in Section [3.2](https://arxiv.org/html/2605.07112#S3.SS2).

## Appendix B Design space exploration

The architecture described in Section [3](https://arxiv.org/html/2605.07112#S3) emerged from systematic exploration of a substantially larger design space. Table [8](https://arxiv.org/html/2605.07112#A2.T8) summarizes the principal alternatives we evaluated and the reasons we did not retain them in the final system.

Table 8: Design alternatives explored during development. Each row is an approach we prototyped or carefully evaluated before discarding.

This exploration consumed the majority of our effort and informed several key decisions: (i) a fine-tuned encoder outperforms frozen embeddings; (ii) latency constraints eliminate larger backbones and LLM-based routers; (iii) AST-based scoring is essential for reliable labels in the function-calling domain; and (iv) separating correctness prediction from cost-aware selection (two-stage routing) dominates single-objective alternatives such as cost-weighted losses or α/β trade-off parameters.

## Appendix C Profiled cost model

For every candidate model $m$, we accumulate its actual dollar cost across all training queries:

$$C_{\text{profiled}}(m)=\sum_{q\in\mathcal{D}_{\text{train}}}\text{cost}(q,m)\quad\text{where}\quad\text{cost}(q,m)=\frac{n_{\text{in}}(q)\cdot r_{\text{in}}(m)+n_{\text{out}}(q,m)\cdot r_{\text{out}}(m)}{10^{6}}.$$

Here $n_{\text{in}}$ and $n_{\text{out}}$ are the actual input and output token counts observed when model $m$ answered query $q$; $r_{\text{in}}$ and $r_{\text{out}}$ are the per-million-token list prices from Table [3](https://arxiv.org/html/2605.07112#S4.T3). For multi-turn conversations, costs are summed across all turns. This profiled cost captures *chattiness*: a model that generates extensive reasoning tokens or verbose explanations accumulates a higher profiled cost than a concise model at the same per-token rate.
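The same accounting in code, as a minimal sketch: the log-record field names (`model`, `n_in`, `n_out`) and the `rates` dictionary layout are our own illustrative choices, not the paper's.

```python
def query_cost(n_in: int, n_out: int, rate_in: float, rate_out: float) -> float:
    """Dollar cost of one query: token counts times per-million-token list prices."""
    return (n_in * rate_in + n_out * rate_out) / 1e6


def profiled_cost(model: str, train_logs: list[dict], rates: dict) -> float:
    """Sum the observed cost of `model` over all training queries; multi-turn
    conversations contribute one log record per turn."""
    r_in, r_out = rates[model]["in"], rates[model]["out"]
    return sum(query_cost(rec["n_in"], rec["n_out"], r_in, r_out)
               for rec in train_logs if rec["model"] == model)
```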

## Appendix D AST checker examples

Table 9 presents representative cases where the BFCL AST checker produces false negatives, along with our corresponding fixes.

Table 9: Representative cases where the BFCL AST checker fails, with our corresponding fix. Each row is a false-negative entry selected from parsing the GPT-5.3-chat responses. *out* shows the model output, and *GT* is the ground truth.

| Limitation | Failing example | Why BFCL fails | Our resolution |
|---|---|---|---|
| Array order sensitivity | *out:* `search_books(keywords=["haunted house", "mystery"], genre="mystery")` *GT:* `keywords=["mystery", "haunted house"]` | BFCL compares the model's list to the ground-truth list with Python's `==` after exact-membership lookup, which is order-sensitive. The two lists contain the same two strings but in different order, so equality is false and the call is marked wrong. | Order-insensitive multiset comparison with per-element string normalization. |
| No default-parameter awareness | *out:* `top_players_by_matchmaking(limit=50)` *GT:* `limit=50, page=0` *schema:* `page` described as *"Default is 0"* | BFCL treats every parameter that appears in the ground truth as required. Because the model omitted `page`, BFCL reports it missing; it never reads the schema description to learn that the documented default (0) is exactly the expected ground-truth value. | Default value parsed from the schema description; an omission is accepted iff the documented default matches the GT. |
| Brittle string matching | *out:* `birthdate="1990-05-15T00:00:00"` *GT:* `"1990-05-15T00:00:00Z"`; *out:* `date="15 June", location="Office conference room"` *GT:* `date="15th June", location="office conference room"` | BFCL does only case-insensitive byte equality on strings. The trailing `Z` (UTC marker) makes the two ISO-8601 timestamps unequal even though they denote the identical instant; likewise, "15 June" vs. "15th June" differs only by an ordinal suffix and "Office conference room" differs only in capitalization. BFCL has no whitespace, punctuation, ISO-8601, or paraphrase normalization. | Canonicalize case, whitespace, punctuation, and ISO-8601 (`Z`, `+00:00`); a DistilBERT cosine ≥ 0.85 serves as a paraphrase fallback when the GT contains spaces. |
| No nested-structure comparison | *out:* `sync_salesforce_data(authentication_details={salesforce:{client_id, ...}, pega:{api_key, ...}})` *GT:* byte-identical nested object *schema:* `authentication_details:{type:"object", properties:{...}}` | BFCL maps each schema type string to a Python class through a hard-coded lookup table that has no entry for the JSON-Schema standard type `"object"` (and also omits `"number"`). The first time the checker meets such a parameter it raises `KeyError: 'object'` and aborts the entire entry, marking it wrong regardless of content. Even if the lookup succeeded, BFCL has no recursive comparator and could not have validated the nested sub-objects. | Add `"object"`: `dict` (and `"number"`: `int`) to the type table; recursively validate each nested key against the corresponding sub-schema. |
| Ambiguous array semantics | *out:* `find_common_elements(arrays=[[1,2,3,4,5], [2,3,4,6,7], [2,3,8,9,10]])` *GT:* byte-identical 2-D matrix *schema:* `arrays:{type:"array", items:{type:"array"}}` | BFCL ground truth wraps every value in a list, which makes a true list-of-lists indistinguishable in shape from a list of alternative flat answers. BFCL has no schema-aware branch: it always interprets the outer list as alternatives. Here the parameter's actual type is array of arrays, so BFCL silently re-reads the GT as *three alternative flat integer lists* and demands the model's output equal one of them. It can never match. | Consult `items.type`: alternatives only when the items are scalars; a true 2-D structure when the items are themselves `array`. |
| List-of-dict handling and ordering | *out:* `calculate_gpa(grades=[{course:"Math", grade:"B"}, {course:"Science", grade:"A"}, ...])` *GT:* same dicts reordered *schema:* `grades:{type:"array", items:{type:"object", ...}}` | BFCL has no entry for `"object"` in its type table, so any array-of-object parameter is dispatched into a code path that immediately raises `KeyError: 'object'` and aborts. Even if the crash were fixed, BFCL would still compare the two lists element-wise in order, so any reordering of the per-course dictionaries would also be rejected. | Each inner dict is normalized and the two lists are compared as multisets, so neither key order within a dict nor element order across the list affects the verdict. |

## Appendix E ConFETTI ground-truth corrections

The ConFETTI dataset\[[3](https://arxiv.org/html/2605.07112#bib.bib3)\]provides 506 multi\-turn conversational function\-calling entries that we adopted for use in our evaluation \(Section[4](https://arxiv.org/html/2605.07112#S4)\)\. When manually examining results from preliminary experiments, we observed that some ground\-truth labels contained errors—incorrect function calls or parameter values that do not match the conversation context\. Because our routing evaluation scores every model against these labels, even a small number of incorrect labels can unfairly penalize correct model behavior and distort accuracy comparisons\. We therefore conducted a systematic review and corrected the ground truth where necessary\.

### E.1 Correction methodology

We performed two review rounds using an LLM\-assisted human\-in\-the\-loop process\. In each round, every entry was submitted to a verifier model \(GPT\-5\) that received the full conversation history, the available function definitions, and the current ground\-truth label\. The model assessed whether the ground truth correctly represents the next function call given the conversational context and flagged entries it considered incorrect, along with a suggested correction and explanation\.

Entries flagged by the model were then presented to a human reviewer in an interactive terminal interface\. For each flagged entry the reviewer could*accept*the correction \(updating the ground\-truth file\),*decline*it \(keeping the original label\), or*skip*it so that the human can manually edit that entry, typically because the suggested correction was not quite correct or the input needed correction rather than the ground truth\. Entries the model deemed correct were automatically marked as reviewed and not surfaced for human inspection\. A second round re\-examined entries that were skipped and manually edited in round 1 and served as a consistency check\.

Across two rounds, a total of 69 ground-truth corrections (13.6% of the dataset) were made. Combined with the 86 input disambiguations described below, 137 of 506 entries (27%) were modified in total. We will open-source all corrections as a PR against the ConFETTI repository, and separately open-source the LLM-assisted correction tool used to produce them.

### E.2 Error categories

Table[10](https://arxiv.org/html/2605.07112#A5.T10)classifies the 69 corrections into six categories\. Each correction was assigned a single primary category; when multiple issues co\-occurred, we categorised by the most impactful error \(e\.g\., calling the wrong function takes precedence over a casing mismatch in one of its parameters\)\. These categories matter for function\-calling evaluation because scoring typically relies on exact or near\-exact matching of function names and parameter values, so even minor mismatches in the ground truth can cause correct model outputs to be marked as failures\.

Table 10:Categories of ground\-truth errors corrected in the ConFETTI dataset \(69 corrections out of 506 entries\)\.
### E.3 Input disambiguation

In addition to the 69 ground\-truth corrections, we modified the*conversation input*of 86 entries \(17% of the dataset\) to resolve temporal ambiguity\. Many ConFETTI conversations reference dates without specifying a year \(e\.g\., “fly on the 25th of December” or “check hotel availability for June”\)\. The ground\-truth function calls in these entries use specific dates with hardcoded years \(e\.g\.,2023\-12\-25\), yet the conversation text provides no basis for inferring which year was intended\. This creates an unfair evaluation setup: a model that produces the correct function call with a different year—or that reasonably asks for clarification—would be scored as incorrect\.

To resolve this, we appended the year \(or month and year where needed\) to the user utterance so that the conversation unambiguously specifies the date referenced by the ground\-truth label\. For example, “fly on the 25th of December” becomes “fly on the 25th of December 2023\.” Three additional entries received non\-temporal disambiguation \(e\.g\., adding a location qualifier when the ground truth assumed a specific region\)\. These input changes do not alter the ground\-truth labels; they ensure that the information needed to produce the expected function call is present in the conversation\.

In total, 137 of 506 entries \(27%\) received either an input disambiguation, a ground\-truth correction, or both \(18 entries required both\)\.

## Appendix F Dataset details and per-dataset breakdown

This appendix provides detailed descriptions and cleaning procedures for each dataset, followed by a per\-dataset accuracy and cost breakdown\.

### F.1 Dataset descriptions

#### BFCL v3\.

The Berkeley Function Calling Leaderboard v3\[[24](https://arxiv.org/html/2605.07112#bib.bib24)\]is the de facto standard for evaluating tool\-calling capabilities\. We use all single\-turn categories \(simple,multiple,parallel,parallel\_multiple\) and their “live” counterparts, which test against real\-world API signatures\. For the multi\-turn categories \(multi\_turn\_base,multi\_turn\_long\_context\), we decompose each multi\-turn conversation into per\-turn evaluation records using a custom script that replays ground\-truth function executions to build the conversation history progressively\. This yields 745 per\-turn examples per category, providing fine\-grained multi\-turn evaluation\.

#### ConFETTI\.

ConFETTI\[[3](https://arxiv.org/html/2605.07112#bib.bib3)\]provides 109 human\-simulated multi\-turn conversations spanning 86 unique APIs, totaling 506 turns after decomposition\. Unlike BFCL’s synthetic multi\-turn data, ConFETTI conversations are authored by human annotators who simulate realistic dialogue flows, including clarifications, corrections, and multi\-step task completions\. We apply a format normalization step that restructures conversation turns into the nested list format expected by our evaluation framework\. The ConFETTI dataset required substantial data cleaning: we disambiguated 86 conversation inputs \(17%\) where temporal references lacked a year, and corrected 69 ground\-truth annotations \(13\.6%\) where the expected function call was incorrect or incomplete—137 entries \(27%\) in total \(Appendix[E](https://arxiv.org/html/2605.07112#A5)\)\. We frame these corrections as*benchmark infrastructure*, not a methodological contribution: we will open\-source the corrected ConFETTI annotations, the LLM\-assisted correction tool used to produce them, and our improved AST checker, so that any third party can reproduce, audit, or reject individual edits without re\-deriving the corrections\.

#### Glaive Function Calling v2\.

The Glaive Function Calling v2 dataset [[11](https://arxiv.org/html/2605.07112#bib.bib11)] contributes the largest number of examples (83,814) and covers multi-turn function-calling conversations with diverse tool definitions. We convert the dataset to BFCL format and apply two cleaning steps: (i) replacing None placeholder values in ground-truth arguments with appropriate defaults (e.g., empty objects for functions like get_current_time that take no arguments), and (ii) removing empty-string optional parameters from ground-truth annotations that would cause spurious evaluation failures.
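As an illustration of these two cleaning steps, a minimal Python sketch follows; the record layout (a list of calls with name and arguments fields) and the function name are assumptions for exposition, not the pipeline's actual code.

```python
def clean_glaive_ground_truth(gt_calls: list[dict]) -> list[dict]:
    """Apply the two cleaning steps to a list of ground-truth calls:
    (i)  replace a None arguments placeholder with an empty argument object, and
    (ii) drop optional parameters whose ground-truth value is the empty string."""
    cleaned = []
    for call in gt_calls:
        args = call.get("arguments") or {}                 # (i): e.g. get_current_time takes no arguments
        args = {k: v for k, v in args.items() if v != ""}  # (ii): empty-string optionals cause spurious failures
        cleaned.append({"name": call["name"], "arguments": args})
    return cleaned
```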

#### xLAM\-60K\.

The Salesforce xLAM Function Calling 60K dataset\[[36](https://arxiv.org/html/2605.07112#bib.bib36)\]provides 60,000 single\-turn function\-calling examples\. Each example consists of a user query, a set of tool definitions, and ground\-truth function calls\. This dataset contributes scale and diversity of tool schemas, serving as the bulk of our single\-turn training data\.

#### Hermes Function Calling v1\.

The Hermes Function Calling v1 dataset\[[20](https://arxiv.org/html/2605.07112#bib.bib20)\]from Nous Research contributes 8,940 examples drawn from three subsets:func\_calling\_singleturn,func\_calling, andglaive\_func\_calling\. This provides a mix of single\-turn and multi\-turn tool\-calling scenarios, including both tool\-call prediction \(selecting the right function and arguments\) and tool\-response understanding \(interpreting the result of a prior tool call\)\.

### F.2 Per-dataset accuracy breakdown

Table[11](https://arxiv.org/html/2605.07112#A6.T11)breaks down accuracy by dataset group on the seed\-4 validation split \(12,267 examples\)\.

Table 11: Per-dataset-group accuracy (%) on the validation set. Dataset groups aggregate the sources in Table[2](https://arxiv.org/html/2605.07112#S4.T2): *Single* = BFCL single-turn (simple, multiple, parallel, parallel_multiple), *Live* = BFCL live variants, *MT* = BFCL multi-turn (base + long context), *ConF.* = ConFETTI. Sample sizes (n) shown below each column header. Models sorted by overall accuracy; best single-model result per column in bold.

#### Observations.

- •The router outperforms every single model on Glaive and Hermes (85.7% and 88.9%), the two largest dataset groups. On Glaive, it exceeds the best single model (GPT-5.3-chat, 84.0%) by 1.7 percentage points, confirming that routing can improve accuracy, not just reduce cost.
- •Multi-turn (MT) is the hardest category for all models. Even the best single model (GPT-5.4, 73.0%) falls far short of the oracle (90.5%), leaving a 17.5 pp gap. The router (61.5%) does not close this gap, matching the performance of lower-tier models. This suggests that multi-turn context is difficult for the router to capture within its 512-token window.
- •ConFETTI is challenging due to its small size (50 queries) and conversational nature. The oracle itself reaches only 74.0%, meaning 26% of ConFETTI queries defeat all eight models. The router's 44.0% is between the best (GPT-5.3-chat, 56.0%) and the median single model.
- •xLAM-60K shows the largest absolute model spread: Kimi-K2.5 (40.6%) vs. GPT-5.3-chat (83.1%), a 42.5 pp gap. The router achieves 80.6%, close to the best model, indicating effective routing on this large synthetic dataset.
- •BFCL-Single queries are relatively easy (82–94% across all models), so routing adds less value. The router achieves 90.0%, near the oracle's 99.0%.

Table[12](https://arxiv.org/html/2605.07112#A6.T12) shows the corresponding per-query cost breakdown. Costs are reported in milli-dollars (units of $10^{-3}$ USD).

Table 12: Per-dataset-group average cost per query on the validation set (12,267 examples), in units of $10^{-3}$ USD (milli-dollars). Multi-turn queries are 100–200× more expensive than single-turn due to extended conversation context. Cheapest single-model result per column in bold. Costs here are computed as the per-query average of API-billed dollar amounts (sum of per-query costs divided by the number of queries with valid token-count records); they may differ slightly from the cost-decomposition view in Table[28](https://arxiv.org/html/2605.07112#A14.T28), which computes $\bar{t}_{\text{in}}\cdot r_{\text{in}}+\bar{t}_{\text{out}}\cdot r_{\text{out}}$ from rounded average token counts and so does not preserve per-query variance.
#### Cost observations\.

- •Multi-turn queries dominate cost. MT queries cost 100–200× more than single-turn queries for the same model (e.g. GPT-5.4: $0.456 vs. $0.002 per query) due to extended conversation context. Although MT comprises only 1.2% of validation queries, it accounts for a disproportionate share of total spending.
- •Switchcraft saves 80–92% on most dataset groups compared to the best single model (GPT-5.3-chat), with savings ranging from 79% (xLAM-60K) to 92% (BFCL-Single). The exception is ConFETTI, where savings are only 30%—the router routes many ConFETTI queries to expensive models, reflecting the difficulty of these conversational queries.
- •Qwen-3.5-9B is the cheapest model across all groups, yet its accuracy is too low to be the default choice (73.7% overall vs. 83.6% for GPT-5.3-chat). On validation, Switchcraft achieves 82.9% accuracy at a cost ($6.8\times 10^{-4}$/query) only 3.2× that of Qwen, while closing most of the accuracy gap to GPT-5.3-chat.

## Appendix G Router fine-tuning configuration

This appendix details the hyperparameters and fine\-tuning settings used for all three encoder models evaluated in Section[4\.6](https://arxiv.org/html/2605.07112#S4.SS6)\. All models share the same fine\-tuning framework \(Hugging Face Transformers\), loss function, and data pipeline; they differ only in the base model, sequence length, batch size, and learning rate\.

### G.1 Training data

Training labels are generated from the evaluation results of all 8 target models on 14 dataset splits (Section[4](https://arxiv.org/html/2605.07112#S4)). For each query, a model is labeled as *correct* if its AST-match score is $\geq 0.75$ (the same threshold used by BFCL v3 scoring). This produces a multi-label binary target vector per query—multiple models may be correct for the same input.
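Concretely, the per-query target vector can be built as in the sketch below; the model identifiers and their ordering are illustrative placeholders, not the profiling pipeline's actual configuration.

```python
MODEL_POOL = ["gpt-5.4", "gpt-5.4-mini", "gpt-5.4-nano", "gpt-5.3-chat",
              "gpt-5-mini", "gpt-5-nano", "qwen-3.5-9b", "kimi-k2.5"]  # illustrative ordering

def correctness_labels(ast_scores: dict[str, float], threshold: float = 0.75) -> list[int]:
    """Multi-label binary target: 1 for every model whose AST-match score clears the threshold."""
    return [int(ast_scores.get(m, 0.0) >= threshold) for m in MODEL_POOL]

# Example: two models are labeled correct for the same query.
labels = correctness_labels({"gpt-5.3-chat": 1.0, "gpt-5-nano": 0.8, "qwen-3.5-9b": 0.5})
```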

#### Stratified 80/10/10 splits\.

Each dataset split \(the 14 rows of Table[2](https://arxiv.org/html/2605.07112#S4.T2)\) is partitioned*independently*into 80% train / 10% validation / 10% test, then the per\-dataset slices are concatenated and shuffled to form the global splits\. Stratifying by dataset guarantees that every benchmark and every category within BFCL \(Simple, Multiple, Parallel, Live\*, Multi\-turn, etc\.\) appears in the train, validation, and test sets in its native 80/10/10 ratio, so no benchmark is held out entirely\. Deduplication on the \(query, tools\) pair is performed*within each dataset*before splitting, preventing near\-duplicates from leaking across the train/validation/test boundary\. The final sizes are approximately 98,000 training, 12,267 validation, and 12,282 test examples\.

#### Seed\-dependent splits\.

The seed value controls *both* the random initialization of the classifier head *and* the per-dataset shuffle that produces the 80/10/10 split. Concretely, each seed s deterministically produces a distinct train/validation/test partition (train_data_*_s.csv, val_data_*_s.csv, test_data_*_s.csv) by seeding random.seed(s) before shuffling each dataset. The 80/10/10 ratio and per-dataset stratification are preserved across seeds—only the specific examples assigned to each split change. As a result, the ±0.41 pp seed variance reported in Section[4.6](https://arxiv.org/html/2605.07112#S4.SS6) reflects variation across both classifier initializations *and* train/validation/test partitions, providing a stronger generalization signal than a fixed-split protocol with only initialization-level seed variation.
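A minimal sketch of this seeded, per-dataset 80/10/10 split follows; variable names are illustrative, and the real pipeline additionally deduplicates on the (query, tools) pair within each dataset before splitting and writes the per-seed CSV files named above.

```python
import random

def stratified_splits(datasets: dict[str, list], seed: int):
    """Shuffle each dataset with the given seed, cut it 80/10/10, then concatenate the slices."""
    train, val, test = [], [], []
    for name, examples in datasets.items():
        random.seed(seed)                     # the seed drives the per-dataset shuffle
        examples = list(examples)
        random.shuffle(examples)
        n = len(examples)
        i, j = int(0.8 * n), int(0.9 * n)
        train += examples[:i]
        val += examples[i:j]
        test += examples[j:]
    for split in (train, val, test):          # mix datasets within each global split
        random.shuffle(split)
    return train, val, test
```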

To assess seed stability, each model is fine\-tuned independently with 20 random seeds \(see Appendix[I](https://arxiv.org/html/2605.07112#A9)\)\. The full list of seeds is given at the end of this appendix\.

### G.2 Input representation

Each training example is tokenized using the token packing strategy described in Section[3\.1](https://arxiv.org/html/2605.07112#S3.SS1)\. Table[13](https://arxiv.org/html/2605.07112#A7.T13)shows the per\-model token budgets\.

Table 13:Token packing budgets by model\.Within the budget, the packing algorithm prioritizes the latest turn, compact tool signatures, and earlier turns newest\-first, as detailed in Section[3\.1](https://arxiv.org/html/2605.07112#S3.SS1)\.

### G.3 Shared hyperparameters

Table[14](https://arxiv.org/html/2605.07112#A7.T14)lists settings common to all three models\.

Table 14:Shared training hyperparameters\.
### G.4 Per-model hyperparameters

Table[15](https://arxiv.org/html/2605.07112#A7.T15) lists settings that differ between models. The learning rate and batch size were selected via preliminary experiments; DistilBERT tolerates a higher learning rate due to its smaller depth, while DeBERTa-v3 and ModernBERT fine-tune stably at $2\times 10^{-5}$. ModernBERT uses a smaller batch size to accommodate its longer sequences within GPU memory.

Table 15:Per\-model training hyperparameters\.
### G.5 Classification head

All models use a single linear classification head mapping from hidden dimension \(768\) to the number of target classes \(8 models\)\. No intermediate layers or dropout beyond what is built into the pretrained encoder are added\. The loss is binary cross\-entropy with logits, computed independently for each class, enabling multi\-label prediction\.
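A minimal PyTorch sketch of this head follows; the encoder checkpoint name and the use of the first-token hidden state for pooling are assumptions, since the paper specifies only the 768-to-8 linear mapping and the loss.

```python
import torch.nn as nn
from transformers import AutoModel

class RouterClassifier(nn.Module):
    """Pretrained encoder plus one linear head; one logit per candidate LLM (multi-label)."""
    def __init__(self, num_models: int = 8, encoder_name: str = "distilbert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_models)   # 768 -> 8

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0])        # logits; train with nn.BCEWithLogitsLoss per class
```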

### G.6 Inference-time selection

At inference time, the model outputs 8 sigmoid probabilities (one per target model). The *lowest-cost correct* selection strategy is applied: among all classes whose probability exceeds the threshold (0.5), the model with the lowest profiled inference cost is selected. If no class exceeds the threshold, the class with the highest probability (argmax) is chosen as a fallback.
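In code, the selection rule is only a few lines; the sketch below assumes the eight sigmoid outputs and the profiled per-query costs are given as arrays in the same model order.

```python
import numpy as np

def select_model(probs: np.ndarray, costs: np.ndarray, threshold: float = 0.5) -> int:
    """Lowest-cost-correct selection with an argmax fallback."""
    candidates = np.flatnonzero(probs > threshold)   # models predicted correct
    if candidates.size == 0:
        return int(np.argmax(probs))                 # fallback: most confident model
    return int(candidates[np.argmin(costs[candidates])])
```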

### G.7 Compute environment

All fine-tuning was conducted on a node with 4× NVIDIA A100 80 GB PCIe GPUs. The parallelism strategy differs by model: DistilBERT uses PyTorch DataParallel across all 4 GPUs per seed (effective batch size 16×4=64; seeds run sequentially); DeBERTa-v3 uses DistributedDataParallel (DDP) via torchrun across all 4 GPUs per seed (effective batch size 32×4=128; seeds run sequentially); ModernBERT does not support multi-GPU parallelism, so seeds are distributed across GPUs (5 seeds per GPU run sequentially, with the 4 GPUs working in parallel; batch size 8 on a single GPU). Table[16](https://arxiv.org/html/2605.07112#A7.T16) lists the software stack.

Table 16: Software environment for router fine-tuning.

#### Wall-clock time.

DistilBERT fine-tunes in ~50 min per seed on 4 GPUs via DataParallel (17 hr total for 20 seeds, run sequentially). DeBERTa-v3 requires ~4.5 hr per seed using 4-GPU DDP. ModernBERT requires ~6 hr per seed on a single GPU (30 hr wall-clock with 5 seeds per GPU × 4 GPUs in parallel). Total compute across all 60 runs (3 models × 20 seeds): ~70 (DistilBERT, 4 GPUs × 17 hr) + ~360 (DeBERTa, 4 GPUs × 90 hr) + ~120 (ModernBERT) ≈ 550 A100 GPU-hours. No gradient accumulation is used; the per-device batch sizes in Table[15](https://arxiv.org/html/2605.07112#A7.T15) are multiplied by the number of GPUs to obtain the effective global batch size.

#### Seeds\.

The 20 seeds used across all models are:9999, 1, 2, 3, 4, 5, 6, 7, 8, 9, 997, 420, 1234, 2025, 2024, 666, 247, 42, 31415, 27182\.

## Appendix H Fine-tuning curves

Figure[5](https://arxiv.org/html/2605.07112#A8.F5) shows the fine-tuning and validation loss curves for all 20 DistilBERT seeds. Fine-tuning uses binary cross-entropy loss with AdamW (lr = $5\times 10^{-5}$, linear schedule with 10% warmup, weight decay 0.01) for up to 30 epochs, with early stopping (patience 3) on macro F1.

![Refer to caption](https://arxiv.org/html/2605.07112v1/graphs/training-curves.png)

Figure 5: DistilBERT fine-tuning curves across 20 seeds. Left: training and validation loss. Right: validation micro-F1. Faint lines show individual seeds; bold lines show the mean (over seeds still fine-tuning at each epoch). Early stopping triggers between epochs 9 and 14 depending on the seed.

Training loss decreases steadily from 0.43 (epoch 1) to ~0.10 (epoch 12). Validation loss reaches its minimum at epoch 5 (0.243 on average) and rises thereafter, indicating mild overfitting. Despite the rising validation loss, validation micro-F1 continues to improve slightly, plateauing around 0.939–0.940 from epoch 7 onward. This divergence between loss and F1 is expected for multi-label classification: the loss penalizes calibration (sigmoid probability), while F1 rewards ranking (threshold at 0.5). Early stopping monitors macro F1 and triggers between epochs 9 and 14 across seeds.

The narrow spread across all 20 seeds—standard deviation of F1 is only 0\.0015 at epoch 9—confirms that fine\-tuning is stable and the final checkpoint quality is not sensitive to random initialization \(see also Section[4\.6](https://arxiv.org/html/2605.07112#S4.SS6)\)\.

## Appendix I Router seed stability

Figure[6](https://arxiv.org/html/2605.07112#A9.F6)visualizes the distribution of Switchcraft’s accuracy across 20 random seeds \(DistilBERT encoder\) on the validation set\.

Figure 6: DistilBERT router accuracy across 20 random seeds. Each dot is one seed; the red bar marks the mean (81.89%). The total range between the best and worst seed is 1.86 percentage points, confirming stability.

Table[17](https://arxiv.org/html/2605.07112#A9.T17) reports the validation accuracy and average inference cost for all 20 seeds of each router encoder model, evaluated on the 12,267-example validation set.

Table 17: Validation accuracy (%) and average cost per query ($10^{-4}$ USD) for each of the 20 random seeds across DistilBERT, ModernBERT, and DeBERTa-v3 routers (1× binary AST score, 12,267 examples). The best seed per model is bolded.

All 20 DistilBERT seeds achieve accuracy between 81.08% and 82.94%, a range of only 1.86 percentage points, with a mean of 81.89% (±0.41). ModernBERT shows a similar pattern: its 20 seeds span 81.36%–83.02% (1.66 pp range, mean 81.91% ±0.38), indicating slightly tighter clustering. DeBERTa-v3 is comparable: 81.42%–82.89% (1.47 pp range, mean 81.90% ±0.41). For all three models, seed 4 is the top performer. Average inference cost is similarly stable across seeds for all architectures, varying between 6 and 9 ×$10^{-4}$ USD. This narrow spread confirms that Switchcraft's advantage over single-model baselines is not an artifact of seed selection: even the *worst* seed across all three models (81.08%, DistilBERT) outperforms six of the eight individual LLMs and all three heuristic routers reported in Table[3](https://arxiv.org/html/2605.07112#S4.T3).

## Appendix J Adaptive thresholding

### J.1 Motivation

Our default routing pipeline in Switchcraft applies a fixed sigmoid threshold (θ = 0.5) to convert multi-label logits into binary predictions, then selects the *cheapest* model among those predicted positive. When no class exceeds the threshold, the router falls back to the single highest-probability class (argmax). This cost-minimizing strategy is optimal when the goal is maximum savings at acceptable accuracy—the regime reported in Section[4.3](https://arxiv.org/html/2605.07112#S4.SS3).

However, production deployments may face different operating points: some workloads tolerate higher cost if it brings meaningful accuracy gains, while others demand the absolute lowest cost floor\. A natural question is whether the sigmoid threshold can be*tuned*to trade off between accuracy and cost—and whether alternative decision rules expose a broader Pareto frontier\.

### J.2 Strategies evaluated

We evaluate two families of thresholding strategies:

#### Constant threshold (θ).

The same approach as the default pipeline, but sweeping θ ∈ {0.5, 0.75}. A higher threshold shrinks the set of "positive" classes, forcing more predictions through the argmax fallback (which picks the highest-probability model regardless of cost). This trades cost for accuracy: the router becomes more willing to choose an expensive-but-confident model.

#### Max-probability drift (δ).

An alternative rule that ignores the binary threshold entirely. For each query, we identify all classes whose predicted probability is within δ of the maximum probability:

$$\mathcal{C}_{\delta}=\bigl\{\,c : p_{c}\geq \max_{j} p_{j}-\delta\,\bigr\}$$

and then select the cheapest class in $\mathcal{C}_{\delta}$. Small δ (e.g., 0.001) effectively acts as argmax (only the single top-scoring class qualifies), while large δ (e.g., 0.2) permits aggressive cost optimization by considering near-ties.

We sweep δ ∈ {0.001, 0.01, 0.1, 0.2}.
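The drift rule is a one-line change to the default selection from Appendix G.6; a minimal sketch follows, with the probability and cost arrays assumed to be indexed in the same model order.

```python
import numpy as np

def route_max_prob_drift(probs: np.ndarray, costs: np.ndarray, delta: float = 0.1) -> int:
    """Cheapest model whose predicted probability is within delta of the per-query maximum."""
    near_top = np.flatnonzero(probs >= probs.max() - delta)
    return int(near_top[np.argmin(costs[near_top])])
```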

### J.3 Results

We evaluate all six strategies across all 20 random seeds of the DistilBERT router and report test\-set accuracy and cost \(Table[18](https://arxiv.org/html/2605.07112#A10.T18), Figure[7](https://arxiv.org/html/2605.07112#A10.F7)\)\.

Table 18: Accuracy–cost trade-off under different thresholding strategies (mean ± std across 20 seeds, test set).

![Refer to caption](https://arxiv.org/html/2605.07112v1/graphs/adaptive-threshold.png)

Figure 7: Accuracy–cost Pareto frontier for different thresholding strategies. Each faint curve traces one seed's trade-off across six strategies; the bold curve connects means across 20 seeds with a ±1 std shaded band. The standard pipeline operating point (82.9%, $6.8\times 10^{-4}$) is shown as a star.

### J.4 Discussion

The results reveal a smooth accuracy–cost Pareto frontier spanning from 81.9% accuracy at $7.3\times 10^{-4}$/query (aggressive cost minimization) to 85.2% at $41.7\times 10^{-4}$/query (accuracy maximization):

- •Max-prob drift dominates constant thresholds. At comparable cost, drift-based strategies consistently achieve higher accuracy (e.g., δ = 0.1 achieves 83.5% at $11.7\times 10^{-4}$ vs. θ = 0.75 at 82.8% and $10.8\times 10^{-4}$).
- •Over 3 percentage points of accuracy are available by relaxing the cost budget; even the sweet spot (δ = 0.1 at $11.7\times 10^{-4}$) is a 3.7× cost reduction compared to the best single LLM ($43.1\times 10^{-4}$ for GPT-5.3-chat).
- •The sweet spot is δ = 0.1, which gains +1.6 pp accuracy over the default at only $4.4\times 10^{-4}$ additional cost per query ($440 per million queries).
- •Seed variance is low across all strategies (±0.3 pp), confirming that the trade-off is stable and not an artifact of seed selection.

These results suggest that deployments with moderate cost tolerance should consider δ ∈ [0.01, 0.1] for a favourable accuracy–cost balance, while the default θ = 0.5 remains optimal for strict cost minimization.

## Appendix K Ablation: token packing

Switchcraft’s input representation uses an intelligent*token packing*strategy \(Section[3\.1](https://arxiv.org/html/2605.07112#S3.SS1)\) to fit the most decision\-relevant information into the encoder’s context window\.

When using ModernBERT’s 8,192\-token context window, we allocate up to 1,600 tokens for tool definitions \(vs\. 100 for DistilBERT\)\.

To isolate the contribution of this design, we fine\-tune an ablation model that replaces the packing strategy with*naive right\-truncation*: the raw conversation text is fed directly to the tokenizer, which truncates at 512 tokens with no prioritization of recent turns, tool signatures, or metadata\. All other hyperparameters \(learning rate, epochs, class weighting, early stopping\) remain identical\.

#### Results\.

Table[5](https://arxiv.org/html/2605.07112#S4.T5)\(in Section[4\.6](https://arxiv.org/html/2605.07112#S4.SS6)\) reports the test\-set performance \(12,282 examples\) for a single seed\. Without token packing, accuracy drops by 1\.66 percentage points and average routing cost increases by 21%\. Switchcraft more frequently selects expensive models when it lacks the structured context that packing provides—particularly tool signatures and metadata that signal query complexity\.

#### Implications\.

The accuracy gap confirms that token packing is a meaningful contributor to router quality—not merely a convenience for handling long inputs. Because agent conversations often exceed 512 tokens (median length in our dataset is ~800 tokens), naive truncation discards the most recent user turn in roughly half of all examples, depriving Switchcraft of the single strongest routing signal. The cost increase further indicates that without structured packing, Switchcraft defaults to routing queries to larger, more expensive models as a hedge against uncertainty.

## Appendix L Router inference latency

A practical router must not only minimize its own prediction cost but also produce its prediction with minimal latency, since that latency adds to the time-to-first-token (TTFT) of the subsequent LLM call it dispatches. We benchmark both router encoder models—DistilBERT (66 M parameters) and ModernBERT (149 M parameters)—to demonstrate that DistilBERT is deployable on commodity inference hardware with low cost, low latency, and high throughput, and that the marginal accuracy gain from ModernBERT (+0.08 pp) comes at a significant inference cost.

#### Infrastructure choice\.

We benchmark on a single NVIDIA T4 GPU (16 GB, Turing architecture), one of the most widely deployed inference accelerators in public clouds (AWS g4dn, Azure NC_T4_v3, GCP n1-standard + T4). The T4 represents the *cheapest* GPU option (~$0.35/hr spot, based on AWS g4dn.xlarge spot pricing as of April 2026: [https://aws.amazon.com/ec2/spot/pricing/](https://aws.amazon.com/ec2/spot/pricing/)); strong performance here demonstrates minimal cost overhead with no specialised hardware required.

#### Methodology\.

We load each model's best-seed checkpoint and run inference on synthetic inputs at three sequence lengths: short (64 tokens), medium (200 tokens), and long (512 tokens—DistilBERT's maximum context). For each configuration we perform 50 warmup iterations followed by 200 timed iterations, reporting wall-clock latency percentiles (P50, P95, P99) and sustained throughput (queries/sec). All measurements use torch.cuda.synchronize() barriers for accurate timing. GPU utilisation is the mean SM (streaming multiprocessor) activity sampled at 10 ms intervals via NVML during each measurement window. We report FP32 precision with PyTorch 2.6 and CUDA 12.4.
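The timing loop can be reproduced with a short harness like the sketch below; model loading and input construction are omitted, and the warmup count, iteration count, and synchronization barriers mirror the protocol above.

```python
import time
import torch

@torch.inference_mode()
def benchmark(model, input_ids, attention_mask, warmup: int = 50, iters: int = 200):
    """Wall-clock latency percentiles and sustained throughput for one (model, sequence-length) pair."""
    for _ in range(warmup):
        model(input_ids=input_ids, attention_mask=attention_mask)
    torch.cuda.synchronize()
    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        model(input_ids=input_ids, attention_mask=attention_mask)
        torch.cuda.synchronize()               # wait for the GPU before stopping the clock
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    pct = lambda q: latencies[int(q * (iters - 1))]
    return {"p50": pct(0.50), "p95": pct(0.95), "p99": pct(0.99),
            "qps": iters / sum(latencies)}
```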

#### Results\.

Tables[19](https://arxiv.org/html/2605.07112#A12.T19)and[20](https://arxiv.org/html/2605.07112#A12.T20)present the full latency and throughput measurements for DistilBERT and ModernBERT respectively\.

Table 19: DistilBERT router inference (67 M params, 284 MB GPU memory) on NVIDIA T4, FP32.

Table 20: ModernBERT router inference (149 M params, 871 MB GPU memory) on NVIDIA T4, FP32.
#### Discussion\.

For single-query routing at intermediate prompt lengths (200 tokens), DistilBERT adds only 7.3 ms at P95—over an order of magnitude faster than the cheapest LLM in our pool (GPT-5-nano averages ~200 ms per call). ModernBERT, despite its marginal accuracy advantage (+0.08 pp), is 3.2× slower at the same operating point (23.1 ms P95). At maximum sequence length (512 tokens), DistilBERT remains under 17 ms P95 while ModernBERT requires 46 ms.

Peak throughput tells an even starker story: DistilBERT achieves 722 queries/sec versus ModernBERT's 243—a 3.0× advantage. ModernBERT also consumes 3.1× more GPU memory (871 MB vs. 284 MB). These differences matter in production: at DistilBERT's throughput, a single $0.35/hr T4 can serve over 2.5 million routing decisions per hour, making the per-query router cost effectively zero (~$1.4\times 10^{-7}$).

These results confirm that DistilBERT’s combination of near\-identical accuracy \(82\.94% vs\. 83\.02%\) and dramatically better inference efficiency makes it the clear production choice, particularly on low\-cost commodity hardware like the T4\.

## Appendix M Misrouted query breakdowns

This appendix provides detailed breakdowns of the 2,093 misrouted queries \(17\.1% of the 12,267\-example validation set\) from Switchcraft’s best\-seed configuration \(DistilBERT, seed 4\)\. A query is*misrouted*if the predicted model answers incorrectly \(*wrong model*: 902 cases\) or if no model in the pool answers correctly \(*no correct model*: 1,191 cases\)\.

### Misroute rate by number of tool definitions

Table 21: Misroute rate by number of tool definitions.

As shown in Table[21](https://arxiv.org/html/2605.07112#A13.T21), the misroute rate increases monotonically with the number of tool definitions in the query, rising from 15.3% for single-tool queries to 39.8% for queries with 11–50 tools. Queries with many tools tend to have longer function schemas that may exceed the router's 512-token context window, causing information loss.

### Misroute rate by number of conversation turns

Table 22: Misroute rate by number of conversation turns.

Table[22](https://arxiv.org/html/2605.07112#A13.T22) shows that the relationship with turn count is non-monotonic: moderate multi-turn conversations (3–8 turns) are actually slightly easier to route than single-turn queries (14–15% vs. 17.5%). Very long conversations (≥9 turns) spike sharply to 42.5%, though the sample size is small (153 queries).

### Misroute rate by text length

Table 23: Misroute rate by text length (in tokens).

Table[23](https://arxiv.org/html/2605.07112#A13.T23) shows that very short queries (<15 tokens) are the easiest to route (10.1%), likely because they are simple single-function calls. Mid-range and long queries are harder, with the longest bucket (68+ tokens) reaching 23.3%.

### Misroute rate by number of correct models

Table 24: Misroute rate by number of correct models in the pool.

Table[24](https://arxiv.org/html/2605.07112#A13.T24) reveals the strongest predictor of routing difficulty. When all 8 models answer correctly, the misroute rate is 0%—every prediction yields a correct answer. When only 1 model is correct, the router must identify that specific model among 8 candidates, and the error rate climbs to 67.9%. The 1,191 queries where no model is correct are misrouted by definition (100%) and account for 56.9% of all misroutes.

### Predicted vs\. Oracle model confusion

Among the 902*wrong\-model*errors \(where a correct model existed but the router chose one that failed\), Table[25](https://arxiv.org/html/2605.07112#A13.T25)shows the predicted and oracle model distributions:

Table 25: Predicted vs. oracle model distribution for the 902 wrong-model errors.

Switchcraft over-predicts GPT-5.3-chat (49.9% of wrong predictions vs. 4.1% of oracle selections) and GPT-5-nano (38.8% vs. 8.3%), while under-predicting Qwen (5.3% vs. 44.7%) and GPT-5.4-nano (1.1% vs. 17.6%). The top confusion flows are:

- •GPT-5-nano → Qwen (320 cases): the router predicts nano but it fails, while Qwen would have been the cheapest correct model.
- •GPT-5-nano → GPT-5.4-nano (126 cases): nano fails, but the slightly more expensive 5.4-nano succeeds.
- •GPT-5-nano → GPT-5-mini (63 cases): nano fails on queries that require the mini model's capability.

## Appendix N Per-model token usage and chattiness analysis

Token counts are extracted from each model's API response metadata. Every result record includes input_token_count, output_token_count, and (where applicable) reasoning_token_count. The output token count corresponds to the API's completion_tokens field, which *includes* reasoning tokens; the reasoning token count is the subset reported via completion_tokens_details.reasoning_tokens. Thus the visible (non-reasoning) output is output − reasoning tokens. For multi-turn conversations, token counts are stored as nested lists (one entry per turn); we sum across all turns to obtain the per-example total. Reasoning tokens are billed at the output token rate but are not returned in the response text.
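A minimal sketch of this accounting for a single result record follows; the field names come from the record schema above, while flattening nested per-turn lists by summation is our reading of the description rather than the pipeline's exact code.

```python
def per_example_tokens(record: dict) -> dict:
    """Sum (possibly nested) per-turn counts; visible output excludes billed reasoning tokens."""
    def flat_sum(v):
        if isinstance(v, list):
            return sum(flat_sum(x) for x in v)   # multi-turn records store one count per turn
        return v or 0
    inp = flat_sum(record.get("input_token_count", 0))
    out = flat_sum(record.get("output_token_count", 0))
    reas = flat_sum(record.get("reasoning_token_count", 0))
    return {"input": inp, "output": out, "reasoning": reas, "visible_output": out - reas}
```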

Table[26](https://arxiv.org/html/2605.07112#A14.T26)reports the average input, output, and reasoning token counts per query for each model on the 12,267\-example validation set, alongside per\-token pricing\.

Table 26: Per-model token consumption on the validation set (12,267 examples). Avg In = average input (prompt) tokens per query. Avg Out = average total output (completion) tokens per query, which *includes* reasoning tokens. Avg Reas = average reasoning tokens per query (a subset of Avg Out). % Reas = percentage of queries that consume any reasoning tokens. Pricing is in USD per million tokens. See the caveat on input token counts at the end of this section.

Several patterns emerge. First, reasoning token usage varies dramatically: GPT-5-nano uses reasoning tokens on 71% of queries (118 avg), while GPT-5.3-chat uses them on only 2%. This hidden cost substantially inflates GPT-5-nano's per-query expense beyond what its ultra-low per-token price ($0.05/M input) would suggest. Second, output verbosity and reasoning are distinct phenomena: Qwen-3.5-9B produces the most total output tokens (228 avg, all visible) but zero reasoning tokens, whereas GPT-5-nano's 170 avg output tokens include 118 reasoning tokens—leaving only 52 visible output tokens, the fewest of any model.

#### Profiled per\-query cost\.

The router's cost-based tie-breaking uses *profiled costs*: the actual dollar cost of each model computed from its observed token consumption and per-token pricing, rather than from list price alone. Table[27](https://arxiv.org/html/2605.07112#A14.T27) reports the average profiled cost per query for each model, computed over all 157,101 raw examples (before the pipeline's deduplication step that yields 122,267 records for splitting; see Section[4.1](https://arxiv.org/html/2605.07112#S4.SS1)). The same per-model cost will differ modestly across splits because the share of expensive multi-turn queries varies: e.g. GPT-5.3-chat averages $32.8\times 10^{-4}$ on the full raw dataset, $37.3\times 10^{-4}$ on validation (Table[28](https://arxiv.org/html/2605.07112#A14.T28)), and $43.1\times 10^{-4}$ on the test set (Table[3](https://arxiv.org/html/2605.07112#S4.T3)). These differences reflect split composition, not pricing changes.

Table 27: Profiled average inference cost per query ($10^{-4}$ USD), computed from observed token consumption across all 157,101 raw examples (pre-deduplication; see Section[4.1](https://arxiv.org/html/2605.07112#S4.SS1)). Models are sorted by ascending cost. This ordering determines which model the router selects when multiple candidates are predicted correct.

The profiled cost ordering matches the list-price ordering (Table[3](https://arxiv.org/html/2605.07112#S4.T3)) for all eight models. This means that, for our current model pool, a simpler price-based ranking would yield the same routing behavior. However, this equivalence is not guaranteed in general: a model with a low list price but high output verbosity or heavy reasoning token usage could be more expensive per query than a model with a higher list price but more concise responses. In earlier versions of Switchcraft where we considered a different basket of models, we did observe such cost-price inversions.

#### Cost decomposition\.

Table[28](https://arxiv.org/html/2605.07112#A14.T28)breaks the per\-query cost into its input and output components, revealing how much each contributes to the total\. For most models input cost accounts for 60–85% of total cost, meaning that output token differences—the driver of the chattiness metric \(Section[4\.8](https://arxiv.org/html/2605.07112#S4.SS8)\)—are compressed when viewed through the lens of total cost\.

Table 28: Per-query cost decomposition on the validation set (12,267 examples). Input Cost and Output Cost are the average per-query costs (avg_input_tokens × input_rate and avg_output_tokens × output_rate, respectively). Input % is the share of total cost attributable to input tokens. Models are sorted by ascending total cost.

The outlier is Kimi-K2.5, whose input share is only 31%—a combination of its high output rate ($3.00/M), a more efficient tokenizer, and early termination of failed multi-turn conversations (see caveat below). Because output dominates Kimi's cost, its chattiness of 1.31× (Section[4.8](https://arxiv.org/html/2605.07112#S4.SS8)) reflects the raw output ratio (1.53×) with relatively little compression. Conversely, Qwen-3.5-9B has the highest raw output ratio (1.90×) but a chattiness of only 1.10× because input accounts for 81% of its cost.

#### Caveat on input token counts\.

The Avg In column in Table[26](https://arxiv.org/html/2605.07112#A14.T26) and the input costs in Table[28](https://arxiv.org/html/2605.07112#A14.T28) should be interpreted with two caveats. First, different models use different tokenizers, so the same prompt text produces different input token counts—for instance, on single-turn queries all GPT models encode the prompt as ~213 tokens on average, while Kimi-K2.5's tokenizer produces ≈0.68× as many (144) and Qwen-3.5-9B's tokenizer produces ≈1.76× more. These differences reflect tokenizer design, not meaningful variation in model behavior. Second, on multi-turn conversations Kimi-K2.5 completes far fewer turns than other models—a median of 3 turns versus 21 for GPT models—resulting in more failures, fewer cumulative API calls and correspondingly lower total input token counts. Together, tokenizer efficiency and early conversation termination explain Kimi's unusually low average input count (414 tokens vs. ~2,000 for GPT models) and its low input cost share (31%). Despite these caveats, the profiled per-query costs in Tables[27](https://arxiv.org/html/2605.07112#A14.T27) and[28](https://arxiv.org/html/2605.07112#A14.T28) are computed from each API's own reported token counts and pricing, so they accurately reflect what a deployment would actually be billed.

#### Chattiness derivation\.

For each model $m$, we compute an *expected cost* assuming it generates the cross-model average number of output tokens (120, averaged over all eight models on the validation set) while keeping its actual input cost fixed:

$$\text{expected}(m)=\underbrace{\bar{t}_{\text{in}}^{(m)}\cdot r_{\text{in}}(m)}_{\text{actual input cost}}+\underbrace{\bar{t}_{\text{out}}\cdot r_{\text{out}}(m)}_{\text{avg-output cost}}$$

where $\bar{t}_{\text{in}}^{(m)}$ is $m$'s average input token count, $\bar{t}_{\text{out}}=120$ is the cross-model average output token count, and $r_{\text{in}}, r_{\text{out}}$ are the per-token rates. *Chattiness* is the ratio of actual to expected cost:

$$\text{chattiness}(m)=\frac{\text{actual cost per query}(m)}{\text{expected cost}(m)}$$

Figure[8](https://arxiv.org/html/2605.07112#A14.F8) visualizes this metric. Points above the $y=x$ diagonal cost more than expected (chattiness > 1); points below cost less (chattiness < 1).
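In code, the metric reduces to a few lines; the sketch below takes the per-model averages from Table 26 and per-token rates (list prices divided by 10^6) as inputs.

```python
def chattiness(avg_in: float, avg_out: float, rate_in: float, rate_out: float,
               cross_model_avg_out: float = 120.0) -> float:
    """Ratio of actual to expected per-query cost; the expected cost charges the cross-model
    average output length (120 tokens) at this model's output rate."""
    actual = avg_in * rate_in + avg_out * rate_out
    expected = avg_in * rate_in + cross_model_avg_out * rate_out
    return actual / expected
```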

[Figure 8 plot: Expected Cost per Query ($) on the x-axis vs. Actual Cost per Query ($) on the y-axis, one point per model, with the Expected = Actual diagonal.]

Figure 8: Expected vs. actual average cost per query on the validation set. *Expected cost* uses each model's actual input cost plus the cost of generating the cross-model average output (120 tokens) at that model's output rate. Points above the y = x diagonal cost more than expected—indicating above-average output verbosity; points below cost less—indicating concise output.
#### Verbose budget models\.

Kimi-K2.5 has the highest chattiness in our pool (≈1.31×): it generates 183 output tokens on average—well above the cross-model mean of 120—making its actual cost 31% higher than expected. Because output accounts for 69% of Kimi's total cost (Table[28](https://arxiv.org/html/2605.07112#A14.T28)), its verbose output translates almost directly into higher total cost. Qwen-3.5-9B is the most verbose model in raw token terms (228 avg), yet its chattiness is only ≈1.10×. The explanation is that 81% of its per-query cost is input (Table[28](https://arxiv.org/html/2605.07112#A14.T28)), so even a large output surplus barely moves the overall cost ratio. Despite sharing the cheapest input rate with GPT-5-nano ($0.05/M) and having an even cheaper output rate ($0.15/M vs. $0.40/M), Qwen costs 7% more per query ($1.8\times 10^{-4}$ vs. $1.7\times 10^{-4}$) due to its higher total token consumption—and achieves lower accuracy (72.40% vs. 79.15%).

#### Concise output vs\. verbose output\.

GPT-5.3-chat has the second-highest output rate in our pool ($14.00/M), yet it is the most concise model (chattiness ≈0.78×, 47 avg completion tokens, only 2% with reasoning). Its actual per-query cost ($37.3\times 10^{-4}$) is 22% below expected, making it the best accuracy–cost trade-off among single models. A router relying on list price alone would rank GPT-5.3-chat among the most expensive models; in practice its frugal output makes it one of the most cost-efficient. The GPT-5.4 family (GPT-5.4, GPT-5.4-mini, GPT-5.4-nano) also falls below the diagonal (chattiness 0.88–0.91×), generating fewer output tokens than the cross-model average. For these premium models, input cost dominates (83–85% of total; Table[28](https://arxiv.org/html/2605.07112#A14.T28)), which compresses chattiness toward 1 even though their output token ratios range from 0.52–0.64× the cross-model mean.

## Appendix O Reproduction with a different model basket

To demonstrate that our findings generalise beyond a single set of LLMs, we replicate the core evaluation with an*earlier model basket*comprising eight OpenAI models available in early 2025 \(Table[29](https://arxiv.org/html/2605.07112#A15.T29)\)\. This basket pre\-dates the GPT\-5\.4 family and the open\-weight models used in the main paper; it includes two reasoning models \(o4\-mini, GPT\-5\-chat\) and spans a wide cost range from GPT\-4\.1\-nano \($0\.10/M input\) to GPT\-5 \($1\.25/M input\)\.

Table 29: Earlier model basket: LLMs available for routing, listed with per-token pricing (USD per million tokens).

#### Experimental setup.

The evaluation protocol mirrors Section[4\.2](https://arxiv.org/html/2605.07112#S4.SS2)exactly: the same 14 datasets, the same 80/10/10 stratified splits, the same DistilBERT architecture and hyperparameters, and the same 20\-seed fine\-tuning procedure\. The only difference is the pool of candidate LLMs\. We report results on the same held\-out test set of 12,282 examples\.

#### Results\.

Table[30](https://arxiv.org/html/2605.07112#A15.T30)and Figure[9](https://arxiv.org/html/2605.07112#A15.F9)present the accuracy–cost trade\-offs\. The key findings from the main paper hold:

Table 30: Accuracy and average inference cost per query on the test set (12,282 examples) with the earlier model basket. Entities are grouped by type and sorted by accuracy.

![Refer to caption](https://arxiv.org/html/2605.07112v1/graphs/old-models-tradeoffs.png)

Figure 9: Accuracy–cost Pareto plot for the earlier model basket on the held-out test set (12,282 examples). Our agent-fine-tuned DistilBERT router (■) achieves 82.73% accuracy while the chat-fine-tuned router (◆) reaches only 77.47% despite similar cost, demonstrating the importance of agentic training data.
#### Finding 1: the learned router dominates the cost–accuracy frontier\.

The DistilBERT agent router achieves 82.73% accuracy at $8.7\times 10^{-4}$ per query—comparable to the best individual model (GPT-4.1 at 83.22%) while reducing cost by 88% ($73.6\rightarrow 8.7\times 10^{-4}$). This mirrors the main result in Section[4.3](https://arxiv.org/html/2605.07112#S4.SS3): a lightweight classifier can match frontier-model accuracy at a fraction of the cost.

#### Finding 2: a chat\-fine\-tuned router is insufficient for agentic workloads\.

This experiment additionally includes a *Chat Router*—a DistilBERT model fine-tuned primarily on a diverse set of public chat completion benchmarks (non-agentic workloads; see Table[31](https://arxiv.org/html/2605.07112#A15.T31) for the full list). Despite achieving low cost ($7.7\times 10^{-4}$), the chat router reaches only 77.47% accuracy—5.26 percentage points below our agent router (82.73%) and even below the cheapest heuristic baseline (Length at 78.95%).

This gap demonstrates that routing decisions learned from chat completions do not transfer effectively to agentic tool\-calling scenarios\. Agentic queries involve multi\-turn conversations, complex tool schemas, and interleaved reasoning that differ fundamentally from single\-turn chat\. The techniques presented in this paper—agentic evaluation data, multi\-label fine\-tuning, and cost\-aware inference—are essential for effective routing in this domain\.

#### Finding 3: heuristic performance varies with the model pool\.

With the earlier model basket, the Length heuristic performs best \(78\.95%\), while Num\. Turns drops to 62\.67%—a reversal from the main results where Num\. Turns leads \(80\.41%\)\. This suggests that heuristic effectiveness is tightly coupled to the specific models available and does not generalise, whereas the learned router consistently performs well regardless of the underlying model pool\.

Table 31:64 public datasets used to train the chat\-fine\-tuned router baseline\. HF = HuggingFace, GH = GitHub\.

## Appendix P Probabilistic correctness labels

### P.1 Motivation

The standard evaluation protocol in this paper treats each LLM response as either correct or incorrect \(a binary label\)\. In practice, however, LLMs are*stochastic*: the same model may produce a correct answer on some invocations and an incorrect one on others for the*same*query\. A model that answers correctly 95% of the time is fundamentally more reliable than one that answers correctly 55% of the time, yet both receive the same binary label of “correct” if evaluated on a single invocation\.

This appendix presents a preliminary investigation into probabilistic correctness labels: instead of evaluating each model once per query, we evaluate it 20 times and record the fraction of correct responses as a probability p ∈ [0, 1]. We then fine-tune Switchcraft using these soft labels and measure whether this richer supervision improves routing quality.

### P.2 Methodology

#### Multi\-invocation evaluation\.

Each of the eight candidate LLMs is called 20 times on every query. Each response is independently scored using the same AST comparison framework described in Section[4.2](https://arxiv.org/html/2605.07112#S4.SS2). The *correctness probability* for model $m$ on query $q$ is:

$$p(m,q)=\frac{\text{correct iterations}}{20}$$

This probability captures the reliability of a model on a given query, not just whether it *can* answer correctly.

#### Dataset coverage\.

Due to the 20× cost multiplier, we evaluated 12 of the 14 datasets (omitting Glaive and xLAM-60K), yielding 1,224 test examples. This is a smaller test set than the 12,282 examples used in the main paper, making results noisier but still directionally informative.

#### Soft\-label training\.

The data generation pipeline stores each model’s probability directly as the training label\. For example, a query where GPT\-4\.1 answered correctly on 17/20 invocations and GPT\-5\-nano on 20/20 receives the label vector:

{"gpt-4.1": 0.85, "gpt-5-nano": 1.0, ...}

The DistilBERT classifier is then fine-tuned with BCEWithLogitsLoss against these continuous targets. Each output head learns to predict the *probability of correctness* for the corresponding model, rather than a hard correct/incorrect label.
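A minimal, self-contained sketch of one such training step follows; the logit and target values are illustrative, and in the actual pipeline the logits come from the DistilBERT router head rather than random tensors.

```python
import torch
import torch.nn as nn

# BCEWithLogitsLoss accepts continuous targets in [0, 1], so per-model correctness
# probabilities can be used directly as soft labels.
logits = torch.randn(2, 8, requires_grad=True)            # router outputs: 2 queries x 8 models
targets = torch.tensor([[0.85, 1.00, 0.00, 0.30, 1.00, 0.55, 0.00, 1.00],
                        [1.00, 0.90, 0.20, 0.00, 1.00, 0.00, 0.65, 0.05]])
loss = nn.BCEWithLogitsLoss()(logits, targets)
loss.backward()
```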

#### Inference procedure\.

At inference time, the soft\-label fine\-tuned model still outputs a sigmoid probability for each candidate LLM\. The prediction procedure is identical to the binary\-label case:

1. Apply a threshold (θ = 0.5) to each sigmoid output to determine which models are predicted to be "reliable" for this query.
2. Among the models above threshold, select the cheapest one (cost-aware tie-breaking using profiled per-query costs).
3. If no model exceeds the threshold, fall back to the model with the highest predicted probability (argmax).

#### Probability distributions\.

Figure[10](https://arxiv.org/html/2605.07112#A16.F10)shows the distribution of correctness probabilities per model\. The distributions are strongly bimodal: the vast majority of model–query pairs have probability near 0 \(always incorrect\) or 1 \(always correct\), with only 10–22% falling in the intermediate range \(denoted “mid” in each subplot\)\. This bimodality explains why soft labels do not dramatically change the learning problem—most labels are effectively binary\. Figure[11](https://arxiv.org/html/2605.07112#A16.F11)shows the same analysis grouped by dataset; multi\-turn datasets \(ConFETTI, multi\-turn base/long\-context\) exhibit higher mid\-range fractions \(29–31%\), indicating greater stochasticity in model responses for complex conversational tasks\.

![Refer to caption](https://arxiv.org/html/2605.07112v1/graphs/correctness-probabilities-model.png)

Figure 10: Distribution of correctness probabilities across 20 invocations, grouped by model. Each histogram shows the fraction of queries at each probability level. "mid" denotes the fraction of queries with probability strictly between 0 and 1.

![Refer to caption](https://arxiv.org/html/2605.07112v1/graphs/correctness-probabilities-dataset.png)

Figure 11: Distribution of correctness probabilities across 20 invocations, grouped by dataset. Multi-turn and conversational datasets show higher fractions of intermediate probabilities, indicating greater response variability.

### P.3 Results

Table[32](https://arxiv.org/html/2605.07112#A16.T32)and Figure[12](https://arxiv.org/html/2605.07112#A16.F12)present the results\. The model basket is the same as Appendix[O](https://arxiv.org/html/2605.07112#A15)\(the earlier basket of eight OpenAI models\)\.

Table 32: Accuracy and average inference cost per query on the test set (1,224 examples) with probabilistic correctness labels. Entities are grouped by type and sorted by accuracy.

| Type | Entity | Accuracy (%) | Avg Cost ($10^{-4}$) |
|---|---|---|---|
| Single LLM | GPT-5-chat | 82.11 | 240 |
| | GPT-4.1 | 80.31 | 506 |
| | GPT-5 | 79.98 | 329 |
| | GPT-5-nano | 78.84 | 11 |
| | GPT-4.1-mini | 78.10 | 105 |
| | GPT-5-mini | 77.61 | 65 |
| | GPT-4.1-nano | 74.67 | 25 |
| | o4-mini | 73.69 | 195 |
| Heuristic Router | Num. Tool Calls | 80.96 | 430 |
| | Length | 77.94 | 84 |
| | Num. Turns | 74.35 | 338 |
| Chat Router | Chat-fine-tuned | 79.17 | 118 |
| Agent Router | DistilBERT (ours) | 82.28 | 94 |
| Upper Bound | Oracle | 91.01 | 205 |

![Refer to caption](https://arxiv.org/html/2605.07112v1/graphs/probabilistic-scoring-tradeoffs.png)

Figure 12: Accuracy–cost Pareto plot with probabilistic correctness labels (1,224 test examples). The agent router fine-tuned on soft probability labels (■, with seed-range error bars) achieves 82.28% accuracy.
### P.4 Discussion

#### Soft labels match binary\-label performance\.

The probabilistic router achieves 82.28% accuracy at $94\times 10^{-4}$ per query. Comparing to the binary-label router on the same model basket (Appendix[O](https://arxiv.org/html/2605.07112#A15)), which was evaluated on all 14 datasets (12,282 test examples), the binary router achieved 82.73% at $8.7\times 10^{-4}$. However, this is not a direct comparison: the probabilistic experiment covers only 12 datasets (1,224 test examples) due to the prohibitive cost of 20× evaluation. The accuracy difference (0.45 pp) is within noise for the smaller test set, and the higher per-query cost ($94\times 10^{-4}$ vs. $8.7\times 10^{-4}$) reflects the different dataset composition rather than the labeling strategy itself—the two omitted datasets (Glaive, xLAM-60K) contain many easy queries that drive down average cost in the full evaluation.

#### The prediction task is not substantially harder\.

Our initial hypothesis was that probabilistic labels would make the prediction task *harder* for the router: since a model might be correct only 60% of the time, the router must learn a finer-grained distinction than correct/incorrect. In practice, however, the accuracy distributions are strongly bimodal: most model–query pairs have probability near 0 or near 1, with relatively few in the intermediate range. The mean correctness probability across all 203,279 model–query evaluations is 0.80, and the median is 1.0, indicating that the majority of entries are deterministic.

#### Cost implications\.

The 20× evaluation cost is prohibitive at scale: evaluating 8 models × 12,282 queries × 20 iterations would require nearly 2 million API calls. We were only able to complete 12 of the 14 datasets within our budget, resulting in a smaller test set (1,224 examples). The reduced test set makes it difficult to draw definitive conclusions about whether soft labels improve routing quality.

#### Implications and future work\.

While this preliminary investigation does not demonstrate a clear benefit of probabilistic labels over binary labels, the approach remains promising for settings where:

- Model responses are highly stochastic (e.g., complex multi-step reasoning tasks where correctness varies significantly across runs).
- The cost of incorrect routing is high, and a confidence-calibrated router that can say "this model answers correctly only 60% of the time" would enable risk-aware fallback strategies.
- New evaluation data is expensive to acquire, and soft labels extract more information per evaluation run.

A larger\-scale investigation with the full dataset and a proper cost\-controlled comparison is an important direction for future work\.

## Appendix Q MIRT-DistilBERT baseline

To assess whether an alternative routing architecture that explicitly models per-LLM capabilities could outperform our multi-label classification approach, we implement and evaluate a router based on Multidimensional Item Response Theory (MIRT) [[30](https://arxiv.org/html/2605.07112#bib.bib30)]. MIRT originates in psychometrics, where it models the probability that a test-taker with latent ability $\boldsymbol{\theta}$ answers an item with discrimination $\boldsymbol{a}$ and difficulty $b$ correctly. Applied to LLM routing, each candidate model plays the role of a test-taker and each query plays the role of a test item.

### Q.1 Architecture

The MIRT router uses a two\-stage architecture:

#### Stage 1: embedding extraction \(frozen\)\.

Both query texts and LLM profile descriptions are encoded by a frozen DistilBERT model into 768\-dimensional mean\-pooled, L2\-normalised embeddings\. Each LLM is represented by a short natural\-language profile describing its release date, capabilities, and intended use case \(8 profiles in total\)\. Query embeddings are computed from the concatenation of the user message and tool signatures\.
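A minimal sketch of this frozen-encoder embedding step, using the Hugging Face transformers API with mean pooling and L2 normalisation, is shown below. The checkpoint name and the exact input formatting are assumptions chosen to match the description above, not the paper's released code.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint; the paper only specifies "a frozen DistilBERT model".
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
encoder.eval()  # frozen: no gradient updates

def embed(texts, max_length=512):
    """Return 768-dimensional mean-pooled, L2-normalised embeddings."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state        # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    return F.normalize(pooled, p=2, dim=1)                 # (B, 768)

# Query embeddings concatenate the user message with the tool signatures.
query_emb = embed(["<user message>\n<tool signatures>"])
```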

#### Stage 2: MIRT head \(fine\-tuned\)\.

A lightweight MIRT head projects query and LLM embeddings into a shared $K$-dimensional latent space (we use $K=25$) via three linear projections:

$$\boldsymbol{\theta} = W_{\theta}\,\mathbf{e}_{\mathrm{llm}} \qquad \text{(LLM ability)} \tag{1}$$

$$\boldsymbol{a} = \mathrm{softplus}\bigl(W_{a}\,\mathbf{e}_{\mathrm{query}}\bigr) \qquad \text{(query discrimination)} \tag{2}$$

$$b = W_{b}\,\mathbf{e}_{\mathrm{query}} \qquad \text{(query difficulty)} \tag{3}$$

The predicted probability that LLM $m$ answers query $q$ correctly follows the 2PL (two-parameter logistic) IRT response function:

$$P(m,q) = \sigma\!\Bigl(\sum_{k=1}^{K} a_{k}\,\theta_{k} - b\Bigr)$$

At inference time, the router scores all eight LLMs for a given query and selects the cheapest one whose predicted probability exceeds a threshold of 0.5, falling back to the argmax if none qualifies; this is the same cost-aware selection rule used by our multi-label classifier.
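The following PyTorch sketch implements Equations (1)–(3), the 2PL response function, and the cost-aware selection rule described above. Layer names and the routing helper are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MIRTHead(nn.Module):
    """MIRT head over frozen 768-d embeddings with a K=25 latent space."""
    def __init__(self, emb_dim=768, k=25):
        super().__init__()
        self.w_theta = nn.Linear(emb_dim, k, bias=False)  # Eq. (1): LLM ability
        self.w_a = nn.Linear(emb_dim, k, bias=False)      # Eq. (2): query discrimination
        self.w_b = nn.Linear(emb_dim, 1, bias=False)      # Eq. (3): query difficulty

    def forward(self, e_query, e_llm):
        theta = self.w_theta(e_llm)               # (M, K)
        a = F.softplus(self.w_a(e_query))         # (Q, K)
        b = self.w_b(e_query)                     # (Q, 1)
        logits = a @ theta.T - b                  # (Q, M): sum_k a_k * theta_k - b
        return torch.sigmoid(logits)              # P(m, q) for every candidate model

def route(probs, costs, threshold=0.5):
    """Cheapest model whose predicted correctness exceeds the threshold,
    falling back to the most confident model if none qualifies."""
    eligible = [m for m, p in enumerate(probs) if p >= threshold]
    if eligible:
        return min(eligible, key=lambda m: costs[m])
    return max(range(len(probs)), key=lambda m: probs[m])
```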

### Q.2 Training differences from our multi-label router

Table [33](https://arxiv.org/html/2605.07112#A17.T33) summarises the key differences between the MIRT router and our DistilBERT multi-label classifier.

Table 33: Architectural and training differences between our multi-label DistilBERT router and the MIRT-DistilBERT baseline.

The MIRT architecture has two potential advantages over fixed-head classification: (i) it can theoretically generalise to *new* LLMs at test time by computing an embedding from their profile text, without retraining; and (ii) its *trainable* parameter count is orders of magnitude smaller (~59K vs. 66M), since only the three projection matrices are learned while the encoder remains frozen. Note, however, that the frozen encoder is still required at inference time, so the total model size is comparable; the advantage is reduced fine-tuning cost, not a smaller deployment footprint.

### Q.3 Results

Table [34](https://arxiv.org/html/2605.07112#A17.T34) compares the MIRT router against our DistilBERT multi-label classifier on the same validation set (12,267 examples) used for seed selection, evaluated across 20 random seeds.

Table 34: Routing accuracy and cost comparison: our DistilBERT multi-label router vs. MIRT-DistilBERT (validation set, 12,267 examples, 20 seeds). Best-seed accuracy is reported with mean ± std across seeds in parentheses.

The MIRT router achieves a best-seed accuracy of 81.64% (mean 80.79%, std ±0.37 pp), which is 1.30 percentage points below Switchcraft's best seed (82.94%) and 1.10 pp below its mean (81.89%). The MIRT router does achieve a slightly lower average cost per query ($5.2\times 10^{-4}$ vs. $6.8\times 10^{-4}$), indicating that it routes more aggressively to cheaper models, but at the expense of accuracy.

### Q.4 Discussion

#### Why does MIRT underperform?

We identify two likely factors:

1. Frozen encoder. Our multi-label router fine-tunes all 66M DistilBERT parameters end-to-end, allowing the encoder to learn task-specific representations for agentic function-calling queries. The MIRT router uses frozen embeddings from a general-purpose pre-trained model, which may not capture the fine-grained distinctions (e.g., JSON structure validity, tool-schema compliance) that matter for routing.
2. Vanilla tokenization. The MIRT router uses simple text concatenation rather than our compressed token-packing strategy (Section [3.1](https://arxiv.org/html/2605.07112#S3.SS1)), which prioritises the most recent user turn and tool signatures within the 512-token budget (a rough sketch of this packing idea follows this list). The ablation in Appendix [K](https://arxiv.org/html/2605.07112#A11) shows that token packing alone contributes 1.66 pp of accuracy; this accounts for most of the observed gap.
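As referenced in item 2 above, the sketch below illustrates one way such budget-aware packing could work: tool signatures and the most recent user turn are encoded first, and earlier turns fill whatever room remains in the 512-token budget. This is an assumption about the general strategy for illustration only, not the exact algorithm of Section 3.1.

```python
def pack_tokens(tokenizer, tool_signatures, last_user_turn, earlier_turns, budget=512):
    """Budget-aware packing sketch: highest-priority pieces are encoded first,
    then earlier conversation turns (most recent first) fill the remaining budget.
    Illustrative assumption; the paper's compressed packing may differ in detail."""
    pieces = [tool_signatures, last_user_turn] + list(reversed(earlier_turns))
    token_ids = []
    for piece in pieces:
        remaining = budget - len(token_ids)
        if remaining <= 0:
            break
        token_ids.extend(tokenizer.encode(piece, add_special_tokens=False)[:remaining])
    return token_ids[:budget]
```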

#### MIRT’s advantage: extensibility\.

The MIRT architecture represents LLMs via their profile embeddings rather than fixed output heads, which in principle allows zero-shot routing to new models by simply providing a profile description. Our multi-label classifier requires retraining when the model pool changes (Section [5](https://arxiv.org/html/2605.07112#S5)). However, since MIRT's accuracy already trails Switchcraft by 1.3 pp even on the *known* model pool, its zero-shot performance on unseen models, which would lack fine-tuning signal entirely, is unlikely to be competitive in practice without further architectural improvements (e.g., unfreezing the encoder or adopting compressed tokenization).

#### Implications for router design\.

This comparison supports our design choice of end\-to\-end fine\-tuning with a simple multi\-label head over a more structured IRT formulation\. The expressiveness gained by fine\-tuning the full encoder outweighs the theoretical elegance of the IRT framework in our setting, where the model pool is fixed and retraining is inexpensive \(30 epochs on a single GPU\)\.

### Q.5 Deviations from the original IRT-Router

Table [35](https://arxiv.org/html/2605.07112#A17.T35) lists the differences between our MIRT-DistilBERT implementation and the original IRT-Router [[30](https://arxiv.org/html/2605.07112#bib.bib30)], along with the rationale for each deviation. All deviations are motivated by the goal of an apples-to-apples comparison with our multi-label router: we isolate the effect of the *routing architecture* (multi-label classification vs. IRT head) by holding the encoder, data, and evaluation protocol constant.

Table 35: Deviations from the original IRT-Router [[30](https://arxiv.org/html/2605.07112#bib.bib30)] and justification. Each change is made to enable a controlled comparison with our DistilBERT multi-label router on the same data and model pool.

In summary, our implementation faithfully reproduces the MIRT-Router's core architecture (embedding → linear projections → 2PL response function → BCE loss) while substituting the encoder, data domain, model pool, and routing rule to match our experimental setup. This design ensures that the 1.30 pp accuracy gap we observe reflects a genuine limitation of the frozen-encoder IRT formulation relative to end-to-end fine-tuning, rather than an artifact of mismatched evaluation conditions.

## Appendix R Additional limitations and design considerations

#### Per\-turn correctness vs\. end\-to\-end task success\.

Our evaluation scores each turn independently via AST matching, but does not measure end-to-end agent task completion under environment dynamics. Extending to trajectory-level success on execution environments such as $\tau$-bench [[35](https://arxiv.org/html/2605.07112#bib.bib35)] or SWE-bench [[13](https://arxiv.org/html/2605.07112#bib.bib13)] is future work.
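As a hedged sketch of what per-turn AST-style matching can look like (the field names and helpers here are assumptions for illustration, not our exact matcher): a predicted tool call is scored correct when its function name and argument values match the reference call.

```python
def call_matches(predicted, expected):
    """Compare a single predicted tool call against the reference call.

    predicted / expected: dicts such as
        {"name": "get_weather", "arguments": {"city": "Seattle", "unit": "C"}}
    A fuller matcher would also tolerate omitted optional parameters and benign
    type differences (e.g. "2" vs. 2); this sketch checks strict equality.
    """
    if predicted.get("name") != expected.get("name"):
        return False
    return predicted.get("arguments", {}) == expected.get("arguments", {})

def turn_correct(predicted_calls, expected_calls):
    """Score a turn as correct only if every expected call is matched in order."""
    return (len(predicted_calls) == len(expected_calls) and
            all(call_matches(p, e) for p, e in zip(predicted_calls, expected_calls)))
```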

#### Cost\-model assumptions\.

We use list\-price API rates for all models\. Self\-hosted deployments would substitute amortized GPU\-hour costs for open\-weight models, which can shift absolute numbers; the relative cost ordering is robust to such substitutions\.

#### Per\-turn routing and prompt caching\.

Switchcraft selects a model independently at each turn, so consecutive turns may be served by different models\. In stateless API deployments, the full conversation is re\-transmitted regardless, so switching models does not increase billed tokens\. However, switching forfeits prompt\-caching discounts \(typically 50% off repeated prefixes\) and increases time\-to\-first\-token\. A production deployment could constrain routing to per\-conversation granularity or incorporate caching discounts into cost\-aware selection\.
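One way a deployment could fold caching into the selection rule is to discount the cost estimate of whichever model served the previous turn. The sketch below is purely illustrative; the discount and cached-prefix fraction are assumed values, not measurements from this work.

```python
def effective_cost(base_cost, model, previous_model,
                   cache_discount=0.5, cached_prefix_fraction=0.8):
    """Adjust a model's per-query cost estimate for prompt caching.

    If the same model served the previous turn, the repeated prefix is assumed
    to be billed at a discount (e.g. 50% off on ~80% of the prompt tokens).
    Illustrative parameter values only.
    """
    if model == previous_model:
        return base_cost * (1.0 - cache_discount * cached_prefix_fraction)
    return base_cost
```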

#### Reasoning effort and cascading\.

Many recent LLMs expose a configurable reasoning effort (e.g., low/medium/high). A natural extension is for the router to select not only the model but also its reasoning level, each with its associated cost. Similarly, multi-model cascading (trying a cheap model first and falling back to an expensive one on failure) could further reduce average cost.
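A minimal sketch of such a cascade follows; `call_model` and `is_valid` are hypothetical helpers (an API call and, say, a tool-schema validity check), and the stop-on-first-valid policy is only one of several possible designs.

```python
def cascade(query, models, call_model, is_valid):
    """Try models from cheapest to most expensive, returning the first response
    that passes the validity check along with the total cost spent.

    models: non-empty list of {"name": ..., "cost": ...} entries.
    """
    response, total_cost = None, 0.0
    for model in sorted(models, key=lambda m: m["cost"]):
        response = call_model(model["name"], query)  # hypothetical API call
        total_cost += model["cost"]
        if is_valid(response):
            break  # the cheap model sufficed; stop escalating
    return response, total_cost
```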
