# Learning Agent Routing From Early Experience
Source: [https://arxiv.org/html/2605.07180](https://arxiv.org/html/2605.07180)
Yimin Wang∗2,4, Jiahao Qiu∗1, Xuan Qi3, Xinzhe Juan2,4, Jingzhe Shi3, Zelin Zhao6, Hongru Wang5, Shilong Liu†1, Mengdi Wang†1

1AI Lab, Princeton University; 2University of Michigan; 3Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University; 4Shanghai Jiao Tong University; 5University of Edinburgh; 6King's College London

∗Equal contribution. †Corresponding authors.
###### Abstract
LLM agents achieve strong performance on complex reasoning tasks but incur high latency and compute cost. In practice, many queries fall within the capability boundary of cutting-edge LLMs and do not require full agent execution, making effective routing between LLMs and agents a key challenge. We study the problem of routing queries between lightweight LLM inference and full agent execution under realistic cold-start settings. To address this, we propose BoundaryRouter, a training-free routing framework that uses early behavioral experience and rubric-guided reasoning to decide whether to answer a query with direct LLM inference or escalate to an agent. BoundaryRouter builds a compact experience memory by executing both systems on a shared seed set and retrieves similar cases at inference time to guide routing decisions. To evaluate this method, we introduce RouteBench, a benchmark covering in-domain, paraphrased, and out-of-domain routing settings. Experiments show that BoundaryRouter reduces inference time by 60.6% compared to the agent while improving performance by 28.6% over direct LLM inference, outperforming prompt-based and retrieval-only routing by an average of 37.9% and 8.2%, respectively.
Figure 1: Motivation and overview of routing. Direct LLM inference is fast and low-cost but can be unreliable on harder queries, while full agent execution is slower and more expensive. A router dispatches each query to the appropriate system, using the LLM for easy cases and escalating to the agent when needed to achieve a better accuracy–latency trade-off.

## 1 Introduction
Large language model agents have recently emerged as a powerful paradigm for solving tasks that require reasoning, planning, and interaction with external environments (Zhou et al., [2025](https://arxiv.org/html/2605.07180#bib.bib22); Hu et al., [2025](https://arxiv.org/html/2605.07180#bib.bib23); Zhang et al., [2025b](https://arxiv.org/html/2605.07180#bib.bib24); Qiu et al., [2025c](https://arxiv.org/html/2605.07180#bib.bib25); H2O.ai, [2025](https://arxiv.org/html/2605.07180#bib.bib26); Team, [2025](https://arxiv.org/html/2605.07180#bib.bib28); Qiu et al., [2025b](https://arxiv.org/html/2605.07180#bib.bib27)). By combining language understanding with tool use, retrieval, and long-term memory, these agents show strong adaptability across a wide range of domains, from code generation to specialized scientific and scholarly domains (Yang et al., [2024](https://arxiv.org/html/2605.07180#bib.bib29); Qiu et al., [2025a](https://arxiv.org/html/2605.07180#bib.bib55); Li et al., [2025](https://arxiv.org/html/2605.07180#bib.bib32); Wang et al., [2025](https://arxiv.org/html/2605.07180#bib.bib30); Qiu et al., [2025d](https://arxiv.org/html/2605.07180#bib.bib33); Ding et al., [2025b](https://arxiv.org/html/2605.07180#bib.bib31)). However, not every task requires the complex capabilities of agents, such as multi-step reasoning or long-context management (see Fig. [1](https://arxiv.org/html/2605.07180#S0.F1)). Contemporary LLMs, trained on web-scale corpora and in many cases coupled with web search tools (for example, GPT with online search in production APIs (OpenRouter, [2025](https://arxiv.org/html/2605.07180#bib.bib34))), can already solve a wide range of factual and well-structured queries with a single forward inference, at much lower computational and latency cost than a multi-step agent.
Hence, the central challenge now is to characterize the intelligence boundary of LLMs, enabling direct LLM inference within the boundary and escalating to an agent only for tasks that exceed it. LLM query routing offers a practical way to probe this boundary by dynamically dispatching queries to models of varying quality and cost. Yet, existing research primarily focuses on routing exclusively among LLMs or among agents (Zhang et al., [2025a](https://arxiv.org/html/2605.07180#bib.bib1); [d](https://arxiv.org/html/2605.07180#bib.bib3); Yue et al., [2025](https://arxiv.org/html/2605.07180#bib.bib2); Liu et al., [2025b](https://arxiv.org/html/2605.07180#bib.bib4)), leaving the hybrid routing problem between LLMs and agents largely unexplored.
Figure 2: Comparison of routers. Left: direct routing uses an LLM router to choose between direct LLM inference and full agent execution, but does not leverage experience. Right: training-based routing learns a router from labeled training data, enabling experience use but requiring supervision. Middle: BoundaryRouter (ours) is training-free yet experience-driven: it first builds an early experience memory by running both the LLM and the agent on a small shared seed set, then retrieves similar experiences at test time to guide routing decisions.

To address this issue, we propose BoundaryRouter, a training-free query routing method that efficiently combines direct LLM inference with agentic execution through early experience and structured reasoning. A critical constraint in real-world routing is the cold-start problem: we often lack prior performance data (ground truth) for incoming queries and therefore cannot train a supervised router. BoundaryRouter addresses this by utilizing early experience, a compact memory built by executing both the LLM and the agent on a shared seed set without knowing the ground truth, as shown in Figure [2](https://arxiv.org/html/2605.07180#S1.F2). Rather than serving as supervision or calibration data, this early experience acts as a lightweight behavioral reference that exposes systematic differences between the two systems.
To systematically study this routing problem, we construct RouteBench, a benchmark specifically designed to evaluate how well routers identify the decision boundary between LLMs and agents. Unlike conventional evaluation suites that assume a static distribution, RouteBench assesses routing generalization across three progressively challenging dimensions: standard in-domain tasks, linguistically perturbed queries for robustness, and out-of-domain scenarios. This design enables a rigorous assessment of how well routers balance performance and cost when facing both familiar and novel task distributions.
Finally, we evaluate BoundaryRouter on RouteBench: it reduces average inference time by 60.6% compared to the agent and achieves a 28.6% performance improvement over direct LLM inference, demonstrating a clearly better cost–performance trade-off. Using BoundaryRouter, we further evaluate 14 contemporary models on RouteBench. Among frontier models, GPT-5, Gemini-3-Pro-Preview, and Gemini-2.5-Pro achieve the strongest overall routing performance. Compared with simple prompt routing or retrieval-augmented generation (RAG) routing, BoundaryRouter improves routing quality by 37.9% and 8.2%, respectively, confirming the effectiveness of early experience and rubric-guided reasoning. Our findings highlight that in cold-start settings without routing labels or ground truth, early behavioral signals paired with rubric-constrained reasoning can enable reliable routing between LLMs and agents, offering a practical path toward scalable coordination in heterogeneous reasoning systems.
Figure 3: Overall routing performance and cost trade-offs on RouteBench. (a) Average RouteBenchScore across all evaluation sets for different models, sorted in descending order. (b) Overall routing effectiveness of BoundaryRouter: the bar chart reports the average routing score on RouteBench for basic prompt-based routing, retrieval-based (RAG) routing, and our routing method across three backbone models. Our method consistently achieves higher routing performance than both baselines, demonstrating its effectiveness as a general routing strategy.
## 2 Related Work
Routing in LLMs, LLM Agents, and Multi-Model Systems. Task routing has become increasingly central to scaling language model systems, especially as workloads grow in diversity and cost sensitivity. Early efforts established the foundation by evaluating routing performance across diverse benchmark datasets (Shnitzer et al., [2023](https://arxiv.org/html/2605.07180#bib.bib36)). A series of recent works explore how to dynamically dispatch queries across multiple models, agents, or tools to optimize utility through universal routing frameworks (Jitkrittum et al., [2025](https://arxiv.org/html/2605.07180#bib.bib35)). Some approaches, like Router-R1 (Zhang et al., [2025a](https://arxiv.org/html/2605.07180#bib.bib1)), integrate routing into multi-step inference, where models iteratively decide which component to consult based on intermediate reasoning signals, often using reinforcement learning to balance accuracy and cost. Others operate in multi-agent setups, coordinating agents with different roles or specializations via hierarchical planning, graph-based dispatching, or role-aware context filtering (Yue et al., [2025](https://arxiv.org/html/2605.07180#bib.bib2); Zhang et al., [2025d](https://arxiv.org/html/2605.07180#bib.bib3); Liu et al., [2025b](https://arxiv.org/html/2605.07180#bib.bib4)). Routing under resource constraints has also received attention, with methods selecting models adaptively based on utility-cost trade-offs (Panda et al., [2025](https://arxiv.org/html/2605.07180#bib.bib5)), lookahead mechanisms (Huang et al., [2025](https://arxiv.org/html/2605.07180#bib.bib6)), or test-time compute optimization (Ding et al., [2025a](https://arxiv.org/html/2605.07180#bib.bib37)). Furthermore, reward-guided ensembles have been proposed to route queries to the most capable expert model (Lu et al., [2024](https://arxiv.org/html/2605.07180#bib.bib38)).
In parallel, retrieval-augmented reasoning systems treat routing as a step-wise selection over knowledge bases or tools (Peng et al., [2025](https://arxiv.org/html/2605.07180#bib.bib7)). While these directions reflect growing interest in adaptive coordination, they typically focus on routing within homogeneous spaces (across models, across agents, or across tools), not between them. Our work fills this gap by studying routing between LLMs and agents, enabling task-level decisions that exploit their complementary strengths.
Learning from Early Experience and Self-Evolving Agents. To build more adaptive and autonomous systems, recent work has explored how agents can learn from their early history. Reflexion (Shinn et al., [2023](https://arxiv.org/html/2605.07180#bib.bib8)) and Voyager (Wang et al., [2023](https://arxiv.org/html/2605.07180#bib.bib9)) demonstrate that agents can reflect on failures, consolidate long-term memory, and develop reusable skills through language-based feedback and exploration. Beyond retrospective improvement, some methods adapt agents at test time by detecting errors or uncertainty and updating internal components accordingly (Acikgoz et al., [2025](https://arxiv.org/html/2605.07180#bib.bib10)), while others leverage early interaction data to bootstrap policies via future-consistent behavior modeling (Zhang et al., [2025c](https://arxiv.org/html/2605.07180#bib.bib11)). These ideas have also reshaped how agents are architected. Instead of relying on static pipelines, agents can evolve dynamically by refining their internal logic, memory, and prompting strategies over time (Wu et al., [2025](https://arxiv.org/html/2605.07180#bib.bib12)). A recent survey consolidates these directions into the emerging framework of self-evolving agents, which emphasizes the shift from static models to continually adapting, self-refining systems (Gao et al., [2025](https://arxiv.org/html/2605.07180#bib.bib13)). We extend experience-based learning from task execution to routing, enabling agents to improve delegation decisions across LLMs and agents, a dimension rarely addressed in prior work.
Reasoning-Enhanced Decision Making. Chain-of-Thought (CoT) prompting, introduced by Wei et al. ([2022](https://arxiv.org/html/2605.07180#bib.bib14)), enables large language models to reason through intermediate steps before producing final answers. Beyond accuracy gains, CoT supports decision-making within models and agents. In ReAct (Yao et al., [2022](https://arxiv.org/html/2605.07180#bib.bib15)), reasoning traces guide tool use and subroutine selection, while other work employs CoT for task decomposition and option evaluation (Zhou et al., [2022](https://arxiv.org/html/2605.07180#bib.bib16); Kojima et al., [2022](https://arxiv.org/html/2605.07180#bib.bib17)). Recent advances incorporate rubric-guided CoT, where reasoning is shaped by explicit evaluation criteria rather than implicit preferences. This approach improves consistency and alignment in tasks such as text generation, code evaluation, and geospatial planning by ensuring that reasoning respects domain-specific constraints (Pathak et al., [2025](https://arxiv.org/html/2605.07180#bib.bib18); Chen et al., [2025](https://arxiv.org/html/2605.07180#bib.bib19)). Building on this direction, we apply CoT to agent routing, helping models reason explicitly, under rubric guidance, about which agent or submodel is best suited for a given task. This reframes CoT as a structured coordination mechanism that enhances routing transparency and reliability.
Figure 4: Overview of our routing pipeline. A query from RouteBench is routed using early-experience retrieval-augmented generation (RAG) and a CoT-based routing module, which delegates the task to either an LLM or an agent. Performance is evaluated on RouteBench via routing accuracy and per-solver F1, aggregated into a RouteBenchScore.
## 3 Learning Agent Routing from Early Experience

### 3.1 Routing Module Overview
The routing module is an LLM-based decision system that routes each incoming query to either a lightweight LLM (low latency and token cost) or a full agent (higher cost but stronger tool-augmented reasoning). Concretely, the router is a *pluggable* routing LLM that conditions on (i) the input query and (ii) retrieved early-experience cases containing solver outputs and runtimes. Given these signals, the router follows rubric-guided reasoning (Box [4.2](https://arxiv.org/html/2605.07180#S4.SS2)) to trade off expected answer quality against latency and then outputs the routing decision. Formally, given an input query $x$, the router produces $\text{Route}(x) \in \{\text{LLM}, \text{Agent}\}$. Importantly, the routing module is *training-free*: it uses no gradient updates and requires no supervision on routing labels. This design makes the router easy to deploy and to adapt to new tasks or new agent implementations by updating only the early-experience memory and retrieval components.
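The routing interface described above can be sketched in a few lines of Python. This is a minimal, hypothetical sketch (the names `ExperienceCase`, `route`, and `llm_call` are illustrative, not the paper's implementation): the router is just a function that formats retrieved cases as evidence, queries a routing LLM, and maps its answer to one of the two routes.

```python
from dataclasses import dataclass
from typing import Literal

Route = Literal["LLM", "Agent"]

@dataclass
class ExperienceCase:
    """One retrieved early-experience record (hypothetical schema)."""
    question: str
    llm_output: str
    agent_output: str
    llm_latency: float
    agent_latency: float

def route(query: str, cases: list[ExperienceCase], llm_call) -> Route:
    """Training-free routing: the decision is delegated to a routing LLM
    conditioned on the query plus retrieved early-experience cases."""
    evidence = "\n".join(
        f"Q: {c.question}\nLLM ({c.llm_latency:.1f}s): {c.llm_output}\n"
        f"Agent ({c.agent_latency:.1f}s): {c.agent_output}"
        for c in cases
    )
    prompt = (
        "Decide whether the query below needs a full agent or a direct LLM answer.\n"
        f"Similar past cases:\n{evidence}\n\nQuery: {query}\n"
        "Answer with exactly 'LLM' or 'Agent'."
    )
    answer = llm_call(prompt).strip()
    return "Agent" if "Agent" in answer else "LLM"
```

Because `llm_call` is a plain callable, any routing LLM can be plugged in, which mirrors the pluggable-router design described above.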
### 3.2 Learning from Early Experience

A central difficulty in LLM–agent routing is *cold start*: when the router is deployed, we typically do not have ground truth for the incoming queries and therefore cannot obtain reliable routing labels. This makes standard supervised router training impractical. To address this, we introduce Learning from Early Experience, which provides the router with a compact memory of observable solver behavior; the pipeline overview is shown in Fig. [4](https://arxiv.org/html/2605.07180#S2.F4).
Constructing the early-experience memory. We first sample a small seed set of questions $\mathcal{D}_{\mathrm{seed}}$ and run both candidate solvers (a lightweight LLM and a full agent) on the same inputs. For each $x \in \mathcal{D}_{\mathrm{seed}}$, we record only deployment-time observable information:

- the question $x$,
- the LLM output $y^{\mathrm{LLM}}$ and latency $t^{\mathrm{LLM}}$,
- the agent output $y^{\mathrm{Agent}}$ and latency $t^{\mathrm{Agent}}$.

Crucially, we do not store gold answers, correctness labels, or rewards. The resulting memory

$$\mathcal{M} = \{(x_i,\, y_i^{\mathrm{LLM}},\, y_i^{\mathrm{Agent}},\, t_i^{\mathrm{LLM}},\, t_i^{\mathrm{Agent}})\}_{i=1}^{N}$$

captures systematic differences in the two systems' behavior (e.g., response and runtime) without requiring supervision.
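The memory construction step can be sketched as follows. This is a simplified illustration under the paper's stated constraint (only deployment-time observables are stored, no labels); `build_memory` and the solver callables are hypothetical names.

```python
import time

def build_memory(seed_questions, llm_solver, agent_solver):
    """Build the early-experience memory M: for each seed question, record
    only deployment-time observables (outputs and latencies), never gold
    answers, correctness labels, or rewards."""
    memory = []
    for x in seed_questions:
        t0 = time.perf_counter()
        y_llm = llm_solver(x)
        t_llm = time.perf_counter() - t0

        t0 = time.perf_counter()
        y_agent = agent_solver(x)
        t_agent = time.perf_counter() - t0

        memory.append({
            "question": x,
            "llm_output": y_llm, "llm_latency": t_llm,
            "agent_output": y_agent, "agent_latency": t_agent,
        })
    return memory
```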
Retrieval-augmented routing. At inference time, given a new query $x$, we retrieve the top-$K$ most similar records from $\mathcal{M}$ using a hybrid retriever (sparse lexical matching plus dense semantic similarity):

$$\text{Retrieve}(\mathcal{M}, x) = \{(x_k,\, y_k^{\mathrm{LLM}},\, y_k^{\mathrm{Agent}},\, t_k^{\mathrm{LLM}},\, t_k^{\mathrm{Agent}})\}_{k=1}^{K}.$$

These retrieved cases are provided to the routing LLM as evidence. The router compares the current query against the retrieved questions and inspects the two solvers' outputs and latencies to infer regularities, e.g., whether similar questions previously led the agent to produce slower, multi-step reasoning, or how similarity in phrasing or structure correlates with the relative efficiency and behavior of the two systems.
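A hybrid retriever of this shape can be sketched with cheap stand-ins for both signals. The scoring functions below are deliberately simplified assumptions, not the paper's retriever: token-overlap Jaccard plays the role of sparse lexical matching (e.g., BM25), and character-trigram cosine plays the role of dense embedding similarity; `alpha` is an assumed interpolation weight.

```python
import math
from collections import Counter

def lexical_score(q1: str, q2: str) -> float:
    """Sparse signal: token-overlap Jaccard, a simplified stand-in for BM25."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def dense_score(q1: str, q2: str) -> float:
    """Dense signal: cosine over character-trigram counts, a cheap stand-in
    for a neural embedding model."""
    def vec(s):
        s = s.lower()
        return Counter(s[i:i + 3] for i in range(len(s) - 2))
    v1, v2 = vec(q1), vec(q2)
    dot = sum(v1[k] * v2[k] for k in v1)
    n1 = math.sqrt(sum(v * v for v in v1.values()))
    n2 = math.sqrt(sum(v * v for v in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def retrieve(memory, query, k=3, alpha=0.5):
    """Hybrid retrieval: rank memory records by an interpolation of the
    sparse and dense similarities, then return the top-k cases."""
    scored = sorted(
        memory,
        key=lambda r: alpha * lexical_score(query, r["question"])
                      + (1 - alpha) * dense_score(query, r["question"]),
        reverse=True,
    )
    return scored[:k]
```

In a production router both scorers would be replaced by a real sparse index and an embedding model; only the interpolation-and-rank structure is the point here.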
### 3.3 Rubric-guided Chain-of-Thought Routing
In the cold-start setting, the router must make a decision without access to ground-truth answers or routing labels. A natural approach to routing is to allow the routing LLM to reason explicitly before selecting between the LLM and the agent. Chain-of-Thought (CoT) prompting has been shown to improve structured decision-making by encouraging step-by-step reasoning (Wei et al., [2022](https://arxiv.org/html/2605.07180#bib.bib14); Yao et al., [2022](https://arxiv.org/html/2605.07180#bib.bib15); Kojima et al., [2022](https://arxiv.org/html/2605.07180#bib.bib17)). However, in our setting, routing is not an open-ended reasoning task: decisions should follow explicit behavioral criteria, such as comparing answer characteristics and response times observed in early experience, as formalized in Box [4.2](https://arxiv.org/html/2605.07180#S4.SS2).
Direct, free-form CoT does not guarantee that the routing LLM will consistently attend to these criteria, especially under paraphrasing or distribution shift. To align the reasoning process with the rule-based nature of routing, we therefore adopt a rubric-guided CoT formulation that explicitly encodes the evaluation protocol into the prompt. As shown in Fig. [7](https://arxiv.org/html/2605.07180#A1.F7), the router is required to follow a fixed decision rubric that reflects the actual dimensions available in the early-experience memory, rather than relying on unconstrained reasoning.
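A rubric-guided prompt of this kind might look like the template below. The rubric text here is a hypothetical illustration of the structure (fixed steps over the dimensions stored in the early-experience memory); the paper's actual rubric is the one shown in its appendix figure.

```python
# Hypothetical rubric prompt, illustrating only the structure of
# rubric-guided CoT routing: fixed steps over similarity, answer
# quality signals, and latency, ending in a constrained decision.
RUBRIC_PROMPT = """\
You are a query router. Follow this rubric step by step:
1. Compare the query with each retrieved case: is it structurally similar?
2. For similar cases, did the LLM's answer look complete and well-formed,
   or did the agent's multi-step trace add clearly necessary work?
3. Compare latencies: is the agent's extra runtime likely to be justified?
4. Decide: answer 'LLM' if the query appears to fall inside the LLM's
   capability boundary, otherwise 'Agent'.

Retrieved cases:
{cases}

Query: {query}
Reason through steps 1-4, then output exactly one word: LLM or Agent."""

def build_routing_prompt(query: str, cases_text: str) -> str:
    """Fill the rubric template with the query and formatted evidence."""
    return RUBRIC_PROMPT.format(cases=cases_text, query=query)
```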
Figure 5: Overview of the RouteBench evaluation framework. Left: the single-instance routing process, where solver outputs determine the ground-truth route and the router's decision is evaluated via exact-match accuracy. Right: set-level evaluation, where the router's predictions over all questions yield per-solver F1 scores (LLM vs. agent), which are averaged to produce the final RouteBenchScore.
## 4 RouteBench: Benchmarking LLM Routing between LLM and Agent
Routing between heterogeneous reasoning systems requires a benchmark with diverse tasks, clear supervision, and controlled distribution shifts. To address this, we introduce RouteBench, a benchmark designed to evaluate how effectively a model assigns queries between a lightweight LLM and a full agent. RouteBench consists of a curated question pool drawn from GAIA and MMLU, paired solver outputs from both systems, and human-annotated routing labels. Each instance includes the question, solver predictions, latency, and the ground-truth routing decision. To assess routing generalization, RouteBench provides three evaluation sets covering in-domain, paraphrased, and out-of-domain conditions. Routing performance is evaluated at both the instance level and the set level, using routing accuracy, solver-specific F1, and the final RouteBenchScore (Fig. [5](https://arxiv.org/html/2605.07180#S3.F5)).
### 4.1 Benchmark Curation
To form the base question pool, we sample from two established sources. GAIA (Mialon et al., [2023](https://arxiv.org/html/2605.07180#bib.bib20)) provides open-ended reasoning tasks that reflect real-world problem solving. MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2605.07180#bib.bib21)) spans 57 academic subjects and provides structured knowledge questions. We randomly sample 30 GAIA questions and 57 MMLU questions (one per subject), producing a compact yet diverse collection that covers factual recall, symbolic reasoning, multi-step planning, and open-domain inference.
For each selected question, we collect solver predictions from two systems: an LLM and a full agent with tool-use and intermediate reasoning capabilities. All samples undergo manual review to ensure semantic clarity, correctness of solver traces, and consistency of formatting. This curated pool serves as the foundation for all evaluation splits introduced in later sections.
### 4.2 Benchmark Composition
Each RouteBench instance can be viewed as a 5-tuple

$$(x,\, y^{\mathrm{LLM}},\, y^{\mathrm{Agent}},\, y^{*},\, d),$$

where the elements correspond to the question, LLM prediction, agent prediction, ground-truth answer, and routing decision, respectively. The first four fields describe the task and solver behaviors. The final field, $d$, is the human-annotated label indicating which solver provides the preferred answer for that question.
The label $d$ follows a fixed deterministic rule:

Ground-truth Routing Rule
1. Correctness priority: if only one solver produces the correct answer, choose that solver.
2. Efficiency tie-break: if both solvers are correct, choose the one with the shorter response time.
3. Failure fallback: if both solvers are incorrect, choose the agent, which has a higher chance of recovery through multi-step reasoning.
This labeling scheme ensures that RouteBench measures the ability to infer the better solver selection, rather than answer-generation accuracy.
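Because the rule is deterministic, it fits in a few lines. This sketch follows the three rules as stated; the tie handling for equal response times (routing to the LLM) is our assumption, since the rule only specifies "shorter response time".

```python
def route_label(llm_correct: bool, agent_correct: bool,
                llm_time: float, agent_time: float) -> str:
    """Deterministic ground-truth routing rule: correctness priority,
    then efficiency tie-break, then failure fallback."""
    if llm_correct != agent_correct:       # rule 1: only one solver correct
        return "LLM" if llm_correct else "Agent"
    if llm_correct and agent_correct:      # rule 2: both correct -> faster one
        return "LLM" if llm_time <= agent_time else "Agent"   # assumed tie rule
    return "Agent"                         # rule 3: both wrong -> agent
```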
### 4.3 Evaluation Sets for Routing Generalization

From the curated question pool, we construct three evaluation sets designed to assess routing under different generalization conditions.
Base Set (In-domain). The Base Set contains 30 GAIA and 57 MMLU questions with solver predictions and routing labels. It reflects the in-domain distribution and also serves as the early-experience corpus for retrieval-augmented routing. All questions are evaluated in their original form.
Rephrase Set (Paraphrased In-domain). The Rephrase Set is created by paraphrasing each Base Set question using a controlled LLM-based rewriting process that preserves semantics while modifying surface form. This set evaluates routing stability under linguistic variation without changing task content.
Advanced Set (Out-of-domain). The Advanced Set consists of GAIA and MMLU questions disjoint from the Base and Rephrase Sets. Although drawn from the same benchmarks, these questions differ in topic and reasoning structure, providing a controlled out-of-domain evaluation.
### 4.4 Evaluation Metrics

RouteBench uses a small set of complementary metrics to evaluate routing and provide a single scalar score for model comparison. Instance-level accuracy and solver-level precision/recall/F1 (PRF) metrics serve as diagnostic measures, while RouteBenchScore is the primary metric for overall routing quality.
Instance-level routing accuracy. For each question, the routing model outputs a binary decision indicating which solver to use. We compute exact-match accuracy by comparing predictions against the human-annotated ground-truth routing labels defined in Section [4.2](https://arxiv.org/html/2605.07180#S4.SS2). This metric reflects overall decision correctness but does not provide detailed information about each solver's routing characteristics.
Solver-level PRF metrics. To characterize routing behavior for each solver, we compute precision, recall, and F1 for the LLM and the agent separately. Let class $A$ denote routing to the LLM and class $B$ denote routing to the agent. For each class, RouteBench provides the number of ground-truth assignments, $\text{tot}_A$ and $\text{tot}_B$, while a routing model yields true positives $\text{tp}_A$ and $\text{tp}_B$. Because routing is binary, false positives and false negatives follow by symmetry: $\text{fp}_A = \text{fn}_B$ and $\text{fp}_B = \text{fn}_A$. Precision, recall, and F1 are then computed independently for each solver within each evaluation set.
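The per-class computation is standard and can be sketched directly; the symmetry $\text{fp}_A = \text{fn}_B$ falls out automatically because every misrouted query is a false positive for one class and a false negative for the other.

```python
def prf(preds, labels, cls):
    """Precision, recall, and F1 for one routing class ('LLM' or 'Agent')
    from parallel lists of predicted and ground-truth routes."""
    tp = sum(p == cls and l == cls for p, l in zip(preds, labels))
    fp = sum(p == cls and l != cls for p, l in zip(preds, labels))
    fn = sum(p != cls and l == cls for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```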
Final RouteBench score. To obtain a single summary metric, we average the solver-level F1 scores across both solvers and task sources:

$$\text{RouteBenchScore} = \frac{1}{4} \sum_{s} \sum_{d} \text{F1}_{s}^{d}.$$

Here, $s$ ranges over the two solvers (LLM and agent), and $d$ ranges over the two task sources (GAIA and MMLU). This score reflects how consistently a routing model selects the correct solver across both open-domain and academic reasoning tasks, while weighting the two solvers and the two benchmarks equally.
Primary comparison metric. Although instance-level accuracy and solver-level PRF metrics are useful for diagnostic analysis, RouteBenchScore is the primary metric used for comparing routing models. All model-to-model comparisons in this paper are based on this score, as it provides a concise summary of overall routing effectiveness across evaluation settings.
## 5 Experiments

### 5.1 Experiment Setup
Models. To construct the early-experience database used for retrieval-augmented routing, we adopt GPT-4o (Hurst et al., [2024](https://arxiv.org/html/2605.07180#bib.bib51)) as the representative LLM. For the agent implementation, we use Claude Sonnet 4 for agent logic and GPT-4o as the tool-calling backbone within the SmolAgent Open DeepResearch framework (Roucher et al., [2025](https://arxiv.org/html/2605.07180#bib.bib52)). We evaluate routing performance across a broad set of contemporary large language models, covering both state-of-the-art systems and widely deployed alternatives. The evaluated models are GPT-5 (OpenAI, [2025b](https://arxiv.org/html/2605.07180#bib.bib39)), GPT-5.2 (OpenAI, [2025a](https://arxiv.org/html/2605.07180#bib.bib40)), GPT-5-nano (OpenAI, [2025b](https://arxiv.org/html/2605.07180#bib.bib39)), Gemini-3-Pro-Preview ([Google DeepMind](https://arxiv.org/html/2605.07180#bib.bib41)), Gemini-3-Flash-Preview (Doshi and Gemini Team, [2025](https://arxiv.org/html/2605.07180#bib.bib42)), Gemini-2.5-Pro (Comanici et al., [2025](https://arxiv.org/html/2605.07180#bib.bib43)), Gemini-2.5-Flash (Comanici et al., [2025](https://arxiv.org/html/2605.07180#bib.bib43)), Claude Sonnet 4 and Claude Sonnet 4.5 (Anthropic, [2025](https://arxiv.org/html/2605.07180#bib.bib45)), MiniMax-M2 (MiniMax AI, [2025](https://arxiv.org/html/2605.07180#bib.bib46)), Qwen3-32B (Yang et al., [2025](https://arxiv.org/html/2605.07180#bib.bib47)), Grok-4 (xAI, [2025](https://arxiv.org/html/2605.07180#bib.bib48)), Kimi-K2-Thinking (Moonshot AI, [2025](https://arxiv.org/html/2605.07180#bib.bib50)), and DeepSeek-v3.2 (Liu et al., [2025a](https://arxiv.org/html/2605.07180#bib.bib49)).
Evaluation Details. All experiments are conducted on RouteBench. For each question, the resulting routing decision is compared against the ground-truth routing decision $d$. Performance is computed using the official RouteBench scoring procedure described in Section [4.4](https://arxiv.org/html/2605.07180#S4.SS4). In all experiments, the same BoundaryRouter framework is used, with different LLMs adopted as the routing model. All API-based models are evaluated using the default parameters provided by OpenRouter.
Table 1: Routing performance and inference cost on RouteBench. Results are reported for LLM-only, Agent-only, and our routing method on the Base, Rephrase, and Advanced evaluation sets. For each set, we report routing accuracy (Acc.) and average inference time (Time, in seconds) on MMLU, GAIA, and their average (Avg). Our routing method consistently achieves higher accuracy than the LLM-only baseline while substantially reducing inference cost compared to the Agent-only baseline across all settings, striking a balance between performance and cost.
### 5\.2Main Results
Table [1](https://arxiv.org/html/2605.07180#S5.T1) summarizes the accuracy–latency trade-off on the three splits (Base, Rephrase, Advanced), reported separately on MMLU and GAIA and averaged across the two sources.
Performance of BoundaryRouter. Across all three evaluation sets, BoundaryRouter consistently achieves a favorable balance between accuracy and inference cost. Compared to the LLM-only run, BoundaryRouter substantially improves accuracy on both MMLU and GAIA, with an average relative improvement of 28.6%. Compared to Agent-only routing, our method substantially reduces inference time, with a 60.6% relative reduction, while retaining a large fraction of the agent's accuracy (only an 11.5% relative decrease). On the Base Set, the LLM-only baseline exhibits low GAIA accuracy (0.10) despite fast inference, whereas the agent achieves higher accuracy at a prohibitive average cost of 264.99 seconds. Our routing method bridges this gap, improving average accuracy to 0.713 while reducing inference time to 101.86 seconds. A similar trend is observed on the Rephrase Set. Crucially, on the out-of-domain Advanced Set, BoundaryRouter still performs well: it achieves a strong accuracy of 0.77 with less than half of the agent's inference time. This result indicates that the routing strategy generalizes beyond the in-domain distribution and remains effective under distribution shift. These results show that BoundaryRouter effectively routes easy questions to the LLM and leaves hard questions to the agent. This selective invocation allows the system to maintain strong performance while avoiding the high cost of invoking the agent for every query.
Comparison with LLM-only and Agent-only baselines. Overall, while the agent achieves 43.7% higher accuracy than the LLM baseline, it is nearly 60× slower in inference time. Neither baseline alone provides a satisfactory balance between accuracy and inference cost. The LLM baseline offers low latency but limited performance on GAIA, while the agent baseline improves performance at the cost of extremely high inference time. BoundaryRouter bridges this gap by combining the strengths of both approaches, achieving strong accuracy while maintaining a much lower inference cost.
Table 2: Model performance on RouteBench. Models are ranked from top to bottom by their overall average score. All rankings are computed using the full-precision underlying scores; values in the table are rounded to two decimal places for readability, except the overall average, which is rounded to three decimal places for easier comparison.
### 5.3 RouteBench Results
**Overall routing performance.** Table [2](https://arxiv.org/html/2605.07180#S5.T2) reports routing performance across 14 models on RouteBench. Models are ranked by their overall average score, and we observe a clear performance stratification across model families. GPT-5 achieves the strongest overall routing performance, with an average score of 0.75, followed closely by Gemini-3-Pro-Preview (0.734) and Gemini-2.5-Pro (0.726). These models remain stable even when questions are rewritten or shifted to new topics, suggesting that their routing patterns are less sensitive to changes in surface form or content. The top-ranked models also consistently exhibit strong solver discrimination on both MMLU and GAIA, indicating robust routing decisions across domains and distribution shifts. This finding aligns with these models' performance on other benchmarks, such as AIME25 ([https://huggingface.co/datasets/yentinglin/aime_2025](https://huggingface.co/datasets/yentinglin/aime_2025)), LiveCodeBench (Jain et al., [2024](https://arxiv.org/html/2605.07180#bib.bib54)), and Humanity's Last Exam (Phan et al., [2025](https://arxiv.org/html/2605.07180#bib.bib53)).
**Routing remains a surprisingly hard problem.** Even in the in-domain setting, where questions closely match the early experience and the task is binary, routing accuracy remains far from perfect across all models. The difficulty becomes more pronounced under distribution shift, but its presence already in the Base set indicates that effective routing is not a solved problem, even without paraphrasing or topic change.
**The Advanced set drives ranking separation.** While routing performance is similar across models on the Base and Rephrase sets, differences widen on the Advanced set: top models retain average scores around 0.61–0.62, mid-tier models fall to 0.54–0.56, and lower-ranked models drop below 0.50. This reshuffling suggests that out-of-domain routing ability, rather than in-domain decision matching, is the primary factor separating strong routing models from weak ones.
### 5.4 Ablation Study
Table 3: Comparison of the three routing variants across GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4 on the Base, Rephrase, and Advanced sets. The table shows that early-experience memory and structured reasoning together provide the strongest and most stable performance.

To understand the contribution of each component in our routing framework, Table [3](https://arxiv.org/html/2605.07180#S5.T3) compares three variants across GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4: basic Prompt Routing, RAG Routing, and our rubric-guided CoT routing with early-experience retrieval. The corresponding prompts for the two ablation baselines are provided in Appendix [A.4](https://arxiv.org/html/2605.07180#A1.SS4).
**Prompt Routing.** This variant removes early-experience memory entirely: it selects between the LLM and the agent using only the high-level capability profiles provided in the prompt (see Appendix [A.4](https://arxiv.org/html/2605.07180#A1.SS4)). Without early experience, routing decisions depend strongly on surface cues, leading to unstable behavior under paraphrasing and distribution shift. Even on the Base set, average scores remain low (e.g., 0.41 for GPT-5 and 0.55 for Claude-4-Sonnet), indicating that capability-aware reasoning alone is insufficient for reliable solver selection.
**RAG Routing.** This variant introduces early-experience memory but removes rubric-guided reasoning: retrieved behavioral examples are shown to the router, which must directly output a binary routing decision without reasoning (prompt in Appendix [A.4](https://arxiv.org/html/2605.07180#A1.SS4)). Early-experience retrieval substantially improves performance, by 27.5% on average and especially on the Base and Rephrase sets, suggesting that access to historical solver behavior provides a useful routing signal even without the gold answer.
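To make the retrieval-only variant concrete, here is a minimal, self-contained sketch of similarity-based routing over an experience memory. The memory entries, the word-overlap similarity, and the similarity-weighted vote are illustrative stand-ins; the paper's router retrieves with embeddings over real seed-set runs:

```python
# Toy "experience memory": seed questions paired with the solver that
# handled each one best. Entries are hypothetical examples, not paper data.
MEMORY = [
    ("What is the capital of France?", "LLM"),
    ("What is the sum of the first 10 primes?", "LLM"),
    ("Find the 2019 revenue in the linked PDF and convert it to EUR.", "Agent"),
    ("Cross-reference two websites to find the album release date.", "Agent"),
]

def similarity(a: str, b: str) -> float:
    """Jaccard word overlap; stands in for embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def rag_route(query: str, k: int = 3) -> str:
    """Route via similarity-weighted votes from the k nearest memory entries."""
    top = sorted(MEMORY, key=lambda m: similarity(query, m[0]), reverse=True)[:k]
    score = {"LLM": 0.0, "Agent": 0.0}
    for seed_question, label in top:
        score[label] += similarity(query, seed_question)
    # Tie-break toward the Agent, the stronger (if slower) solver.
    return "LLM" if score["LLM"] > score["Agent"] else "Agent"
```

Weighting votes by similarity (rather than counting neighbors equally) keeps distant, barely related memory entries from outvoting a single close match.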
**Rubric-Guided CoT with Early Experience (BoundaryRouter).** This variant combines early-experience memory with rubric-guided chain-of-thought reasoning and consistently achieves the best performance across all models and sets, improving over Prompt Routing by 37.9% and over RAG Routing by 8.2%. Notably, it yields the highest Advanced-set scores for all three LLMs while maintaining strong in-domain performance. This demonstrates that structured reasoning is essential for effectively interpreting retrieved experiences and making stable routing decisions under distribution shift.
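Both CoT variants are required to end with a fixed `FINAL ANSWER: YES/NO` line (see the prompts in Appendix A.3), so the routing decision can be extracted deterministically. A minimal parser might look like the sketch below; defaulting to the Agent when no verdict is found is our assumption of a safe fallback, not a documented behavior:

```python
import re

def parse_route(response: str) -> str:
    """Map the router's last 'FINAL ANSWER: YES/NO' line to a solver choice.

    YES -> escalate to the Agent; NO -> answer with the direct LLM.
    If no verdict is found, fall back to the Agent (assumed safe default).
    """
    verdicts = re.findall(r'FINAL ANSWER:\s*"?(YES|NO)"?', response, re.IGNORECASE)
    if not verdicts:
        return "Agent"
    return "Agent" if verdicts[-1].upper() == "YES" else "LLM"
```

Taking the last match guards against the model quoting the template earlier in its chain of thought before committing to a final verdict.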
## 6 Discussion and Conclusion
We study cold\-start routing between direct lightweight LLM inference and full agentic execution, where ground\-truth labels are unavailable at deployment time\. To address this, we propose BoundaryRouter, a training\-free router that learns from early experience, and introduce RouteBench, a benchmark for evaluating LLM–agent routing under in\-domain, paraphrased, and out\-of\-domain settings\. Experiments show that BoundaryRouter improves the accuracy–latency trade\-off over baselines and remains robust under paraphrasing and distribution shift\. These results suggest that early experience provides useful signals for routing decisions even without access to ground\-truth answers, and that structured reasoning helps maintain stable decisions when task distributions change\. While our current framework focuses on binary routing between an LLM and a single agent pipeline, future work may explore more complex routing scenarios involving multiple agents or heterogeneous tools\. Overall, our findings highlight routing as an important component for improving the efficiency and scalability of hybrid LLM–agent systems\.
## Reproducibility Statement
We provide sufficient details to enable full reproduction of RouteBench and the BoundaryRouter framework\. RouteBench is constructed from publicly available GAIA and MMLU benchmarks with explicitly defined sampling, annotation, and evaluation procedures\. Each instance includes solver outputs, latency, and deterministic routing labels based on a fixed decision rule\. BoundaryRouter is training\-free and relies on early\-experience memory, retrieval, and rubric\-guided routing, all of which are fully specified in the paper and appendix, including prompts and decision protocols\. All models are evaluated using the same splits and scoring procedure, including routing accuracy and RouteBenchScore\. We will release the RouteBench dataset, routing prompts, and evaluation code upon publication to ensure full reproducibility\.
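The exact fixed decision rule for the deterministic routing labels is specified in the paper's appendix rather than reproduced here; the sketch below shows one plausible instantiation, under the assumption that the label prefers the cheaper LLM whenever it already answers correctly. The `Instance` schema and its field names are illustrative, not the released format:

```python
from dataclasses import dataclass

@dataclass
class Instance:
    """One RouteBench-style instance: solver outcomes plus latency.

    Field names are illustrative assumptions, not the released schema.
    """
    question: str
    llm_correct: bool
    agent_correct: bool
    llm_latency: float
    agent_latency: float

def gold_route(inst: Instance) -> str:
    """Assumed fixed decision rule: prefer the cheap LLM whenever it
    already answers correctly; otherwise escalate to the Agent."""
    return "LLM" if inst.llm_correct else "Agent"

def routing_accuracy(predictions: list, instances: list) -> float:
    """Fraction of routing decisions matching the deterministic labels."""
    assert len(predictions) == len(instances)
    hits = sum(p == gold_route(i) for p, i in zip(predictions, instances))
    return hits / len(instances)
```

Because the labels are a deterministic function of the recorded solver outcomes, any re-run of the evaluation reproduces identical targets.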
## References
- Self-improving LLM agents at test-time. arXiv preprint arXiv:2510.07841.
- Anthropic (2025). Claude Sonnet 4.5. Anthropic model release, September 29, 2025. [Link](https://www.anthropic.com/claude/sonnet).
- Y. Chen, L. Li, Z. Ma, Q. Hu, Y. Zhu, M. Deng, and R. Yu (2025). Empowering LLM agents with geospatial awareness: toward grounded reasoning for wildfire response. arXiv preprint arXiv:2510.12061.
- G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. (2025). Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- D. Ding, A. Mallick, S. Zhang, C. Wang, D. Madrigal, M. D. C. H. Garcia, M. Xia, L. V. Lakshmanan, Q. Wu, and V. Rühle (2025a). BEST-Route: adaptive LLM routing with test-time optimal compute. arXiv preprint arXiv:2506.22716.
- K. Ding, J. Yu, J. Huang, Y. Yang, Q. Zhang, and H. Chen (2025b). SciToolAgent: a knowledge-graph-driven scientific agent for multitool integration. Nature Computational Science, pp. 1–11.
- T. Doshi and Gemini Team (2025). Gemini 3 Flash: frontier intelligence built for speed. Google product blog, December 17, 2025. [Link](https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/).
- Google DeepMind. Gemini 3 Pro. Web page, accessed 2026-01-28. [Link](https://deepmind.google/models/gemini/pro/).
- H2O.ai (2025). H2OGPT generative AI platform. Web application. [Link](https://h2ogpte.genai.h2o.ai/).
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021). Measuring massive multitask language understanding. In Proceedings of the International Conference on Learning Representations (ICLR).
- M. Hu, Y. Zhou, W. Fan, Y. Nie, B. Xia, T. Sun, Z. Ye, Z. Jin, Y. Li, Q. Chen, et al. (2025). OWL: optimized workforce learning for general multi-agent assistance in real-world task automation. arXiv preprint arXiv:2505.23885.
- C. Huang, T. Shi, Y. Zhu, R. Chen, and X. Quan (2025). Lookahead routing for large language models. arXiv preprint arXiv:2510.19506.
- A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. (2024). GPT-4o system card. arXiv preprint arXiv:2410.21276.
- N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica (2024). LiveCodeBench: holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
- W. Jitkrittum, H. Narasimhan, A. S. Rawat, J. Juneja, C. Wang, Z. Wang, A. Go, C. Lee, P. Shenoy, R. Panigrahy, et al. (2025). Universal model routing for efficient LLM inference. arXiv preprint arXiv:2502.08773.
- T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa (2022). Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems 35, pp. 22199–22213.
- J. Li, H. Le, Y. Zhou, C. Xiong, S. Savarese, and D. Sahoo (2025). CodeTree: agent-guided tree search for code generation with large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3711–3726.
- A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. (2025a). DeepSeek-V3.2: pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.
- J. Liu, Z. Kong, C. Yang, F. Yang, T. Li, P. Dong, J. Nanjekye, H. Tang, G. Yuan, W. Niu, et al. (2025b). RCR-Router: efficient role-aware context routing for multi-agent LLM systems with structured memory. arXiv preprint arXiv:2508.04903.
- K. Lu, H. Yuan, R. Lin, J. Lin, Z. Yuan, C. Zhou, and J. Zhou (2024). Routing to the expert: efficient reward-guided ensemble of large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, pp. 1964–1974. [Link](https://aclanthology.org/2024.naacl-long.109/).
- G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom (2023). GAIA: a benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations.
- MiniMax AI (2025). MiniMax-M2: a compact MoE model for coding and agentic workflows. Open-source model. [Link](https://huggingface.co/MiniMaxAI/MiniMax-M2).
- Moonshot AI (2025). Introducing Kimi K2 Thinking. Web page, accessed 2026-01-28. [Link](https://moonshotai.github.io/Kimi-K2/thinking.html).
- OpenAI (2025a). Introducing GPT-5.2. Product release blog post. [Link](https://openai.com/index/introducing-gpt-5-2/).
- OpenAI (2025b). Introducing GPT-5. [https://openai.com/index/introducing-gpt-5/](https://openai.com/index/introducing-gpt-5/), accessed 2026-01-27.
- OpenRouter (2025). Web search: add real-time web data to AI model responses. [https://openrouter.ai/docs/features/web-search](https://openrouter.ai/docs/features/web-search), accessed 2025-11-19.
- P. Panda, R. Magazine, C. Devaguptapu, S. Takemori, and V. Sharma (2025). Adaptive LLM routing under budget constraints. arXiv preprint arXiv:2508.21141.
- A. Pathak, R. Gandhi, V. Uttam, A. Ramamoorthy, P. Ghosh, A. R. Jindal, S. Verma, A. Mittal, A. Ased, C. Khatri, et al. (2025). Rubric is all you need: improving LLM-based code evaluation with question-specific rubrics. In Proceedings of the 2025 ACM Conference on International Computing Education Research V.1, pp. 181–195.
- C. Peng, Z. Xu, Z. Liu, Y. Li, Y. Yan, S. Wang, Z. Liu, Y. Gu, M. Yu, G. Yu, et al. (2025). Learning to route queries across knowledge bases for step-wise retrieval-augmented reasoning. arXiv preprint arXiv:2505.22095.
- L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. (2025). Humanity's Last Exam. arXiv preprint arXiv:2501.14249.
- J. Qiu, X. Juan, Y. Wang, L. Yang, X. Qi, T. Zhang, J. Guo, Y. Lu, Z. Yao, H. Wang, et al. (2025a). AgentDistill: training-free agent distillation with generalizable MCP boxes. arXiv preprint arXiv:2506.14728.
- J. Qiu, X. Qi, H. Wang, X. Juan, Y. Wang, Z. Zhao, J. Geng, J. Guo, P. Li, J. Shi, et al. (2025b). Alita-G: self-evolving generative agent for agent generation. arXiv preprint arXiv:2510.23601.
- J. Qiu, X. Qi, T. Zhang, X. Juan, J. Guo, Y. Lu, Y. Wang, Z. Yao, Q. Ren, X. Jiang, et al. (2025c). Alita: generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution. arXiv preprint arXiv:2505.20286.
- J. Qiu, F. Xiao, Y. Wang, Y. Mao, Y. Chen, X. Juan, S. Zhang, S. Wang, X. Qi, T. Zhang, et al. (2025d). On path to multimodal historical reasoning: HistBench and HistAgent. arXiv preprint arXiv:2505.20246.
- A. Roucher, A. V. del Moral, T. Wolf, L. von Werra, and E. Kaunismäki (2025). smolagents: a smol library to build great agentic systems. [https://github.com/huggingface/smolagents](https://github.com/huggingface/smolagents).
- N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023). Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, pp. 8634–8652.
- T. Shnitzer, A. Ou, M. Silva, K. Soule, Y. Sun, J. Solomon, N. Thompson, and M. Yurochkin (2023). Large language model routing with benchmark datasets. arXiv preprint arXiv:2309.15789.
- M. A. Team (2025). MiroFlow: a high-performance open-source research agent framework. [https://github.com/MiroMindAI/MiroFlow](https://github.com/MiroMindAI/MiroFlow).
- G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023). Voyager: an open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
- Z. Wang, Q. Jin, C. Wei, S. Tian, P. Lai, Q. Zhu, C. Day, C. Ross, R. Leaman, and Z. Lu (2025). GeneAgent: self-verification language agent for gene-set analysis using domain databases. Nature Methods, pp. 1–9.
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, pp. 24824–24837.
- R. Wu, X. Wang, J. Mei, P. Cai, D. Fu, C. Yang, L. Wen, X. Yang, Y. Shen, Y. Wang, et al. (2025). EvolveR: self-evolving LLM agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079.
- xAI (2025). Grok 4. xAI model announcement, July 9, 2025. [Link](https://x.ai/news/grok-4).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. R. Narasimhan, and O. Press (2024). SWE-agent: agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. [Link](https://arxiv.org/abs/2405.15793).
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao (2022). ReAct: synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.
- Y. Yue, G. Zhang, B. Liu, G. Wan, K. Wang, D. Cheng, and Y. Qi (2025). MasRouter: learning to route LLMs for multi-agent systems. arXiv preprint arXiv:2502.11133.
- H. Zhang, T. Feng, and J. You (2025a). Router-R1: teaching LLMs multi-round routing and aggregation via reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- H. Zhang, J. Lu, S. Jiang, C. Zhu, L. Xie, C. Zhong, H. Chen, Y. Zhu, Y. Du, Y. Gao, et al. (2025b). Co-Sight: enhancing LLM-based agents via conflict-aware meta-verification and trustworthy reasoning with structured facts. arXiv preprint arXiv:2510.21557.
- K. Zhang, X. Chen, B. Liu, T. Xue, Z. Liao, Z. Liu, X. Wang, Y. Ning, Z. Chen, X. Fu, et al. (2025c). Agent learning via early experience. arXiv preprint arXiv:2510.08558.
- Z. Zhang, K. Shi, Z. Yuan, Z. Wang, T. Ma, K. Murugesan, V. Galassi, C. Zhang, and Y. Ye (2025d). AgentRouter: a knowledge-graph-guided LLM router for collaborative multi-agent question answering. arXiv preprint arXiv:2510.05445.
- D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, et al. (2022). Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.
- H. Zhou, Y. Chen, S. Guo, X. Yan, K. H. Lee, Z. Wang, K. Y. Lee, G. Zhang, K. Shao, L. Yang, et al. (2025). AgentFly: fine-tuning LLM agents without fine-tuning LLMs. arXiv preprint arXiv:2508.16153.
## Appendix A
### A.1 Use of LLMs
LLMs were used solely to improve the clarity and readability of the manuscript\.
### A.2 Example
Figure 6:Example illustrating routing between direct LLM inference and agent execution\. For this factual multiple\-choice question, the LLM produces a correct answer quickly \(2\.1s\), while the agent follows a multi\-step search\-and\-reasoning process that is substantially slower \(123\.9s\) but arrives at the same conclusion\. This example highlights the accuracy–latency trade\-off that motivates routing, where many queries fall within the capability boundary of direct LLM inference and do not require full agent execution\.
### A.3 Method Details
**Regular CoT (Unstructured)**

```text
You are an intelligent routing system that determines which model should answer a question.

Original question: {original_question}

Retrieved similar question examples:
{retrieved examples}

Let's think step by step about which model should answer the original question based on the historical data above.
Report your thoughts, and must finish your answer with the following template:
FINAL ANSWER: [YOUR FINAL ANSWER].
YOUR FINAL ANSWER should be either "YES" (use Agent) or "NO" (use LLM).
```
**Rubric-guided CoT (Structured)**

```text
You are an intelligent routing system that determines which model should answer a question.

Original question: {original_question}

Retrieved similar question examples:
{retrieved examples}

Follow this reasoning process STRICTLY:
1. Analyze Context: Compare the new question with the retrieved examples. Identify similarities in topic, structure, and complexity.
2. Performance Comparison: For each example, note which model (LLM or Agent) produced the better answer, considering both content quality and response time.
3. Pattern Inference: Infer general patterns --- for example, does the Agent perform better on reasoning-heavy or multi-step questions, while LLM excels on direct factual queries?
4. Decision Reasoning: Decide which model should handle the new question and explain your reasoning.
5. Final Decision: Output EXACTLY one of the following on the last line:
FINAL ANSWER: YES  # use Agent
FINAL ANSWER: NO   # use LLM

Your final answer must be either "YES" or "NO".
```
Figure 7: Comparison between regular CoT and rubric-guided CoT prompts.
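For illustration, the rubric-guided prompt above might be assembled from retrieved experience entries roughly as follows. The entry field names and the abbreviated rubric placeholder are assumptions for the sketch, not the paper's exact implementation:

```python
def format_example(i: int, entry: dict) -> str:
    """Render one experience-memory entry; the field names are assumptions."""
    return (
        f"Example {i}: {entry['question']}\n"
        f"  LLM: correct={entry['llm_correct']}, time={entry['llm_time']:.1f}s\n"
        f"  Agent: correct={entry['agent_correct']}, time={entry['agent_time']:.1f}s"
    )

def build_routing_prompt(question: str, memory: list) -> str:
    """Fill the rubric-guided template with the query and retrieved examples."""
    examples = "\n".join(format_example(i + 1, e) for i, e in enumerate(memory))
    return (
        "You are an intelligent routing system that determines which model "
        "should answer a question.\n\n"
        f"Original question: {question}\n\n"
        f"Retrieved similar question examples:\n{examples}\n\n"
        "Follow this reasoning process STRICTLY: [rubric steps 1-5 as in Figure 7]\n"
        'Your final answer must be either "YES" or "NO".'
    )
```

Keeping per-example latency alongside correctness lets the router weigh the accuracy gain of escalation against its time cost, as the rubric's second step asks.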
### A.4 Prompts for Ablation Study
**Prompt Routing**

```text
You are an intelligent routing system that determines which model should answer a question.

Model A:
- Strengths: A fast, stable, general-purpose QA model that excels at natural language understanding, straightforward reasoning, and well-formatted outputs; ideal for simple to medium tasks.

Model B:
- Strengths: A self-evolving reasoning agent capable of complex multi-step planning, self-consistency checking, and structured problem solving; slower but stronger in deep reasoning tasks.

Choose the most suitable model based only on these capability profiles and the question below.

Question: {original_question}

Please answer only: YES (use Model B) or NO (use Model A).
```
**Routing Prompt**

```text
You are an intelligent routing system that determines which model should answer a question.

Original question: {original_question}

Retrieved similar question examples:
{chr(10).join(reference_examples) if reference_examples else "No similar questions found"}

Based on the similar questions and their historical performance, decide which model should answer the original question.

Please answer only: YES (use Model B) or NO (use Model A).
```