@Xudong07452910: Many people's default habit when using AI coding is: go straight to the strongest model. For the same task, should Sonnet or Opus do it? Most of the time this decision is made on a whim. So this paper Agent-as-a-Router raises a very practical question: if different models excel at different tasks…


Summary

This paper proposes the Agent-as-a-Router framework, which transforms model routing into a dynamic, iterative process. Based on task type and real-time execution feedback, it selects the most suitable LLM to improve coding performance and cost efficiency.


Cached at: 06/28/26, 08:03 AM

Many people’s default habit when using AI coding is: just use the strongest model. For the same task, should we use Sonnet or Opus? Most of the time, this decision is made on a whim. So this paper, Agent-as-a-Router, raises a very realistic question: if different models excel at different tasks, why should every coding task be assigned to the same model? Some models are better at fixing bugs, some at writing tests, some at refactoring, and some at long-range agentic programming. An ideal system should first determine the task type, then route it to the most suitable model.

An interesting point in the paper is that the authors believe the biggest problem with existing routers is “information deficit.” Many routers only statically decide which model to assign to a task once, but they don’t know how the model actually performs in real execution. So they propose Agent-as-a-Router, making model routing a Context → Action → Feedback → Context loop. Simply put: first select a model based on context, let the model execute the task, then use a verifier to check the result and write that experience back to memory. The next time a similar task appears, the router no longer relies solely on guessing but selects the model with actual execution feedback.

I find this direction very inspiring. Future agent systems may no longer be tied to a single model but instead act like a dispatcher: dynamically deciding which model to deploy based on task difficulty, cost, historical performance, and verification results.

https://arxiv.org/pdf/2606.22902


Agent-as-a-Router: Agentic Model Routing for Coding Tasks

Source: https://arxiv.org/html/2606.22902

Pengfei Zhou¹,² Zhiwei Tang²,³,⁴ Yixing Ma³ Jiasheng Tang²,³‡ Yizeng Han² Zhenglin Wan¹ Fanqing Meng¹ Wei Wang⁶ Bohan Zhuang⁵ Wangbo Zhao⁶‡ Yang You¹‡

¹ National University of Singapore
² DAMO Academy, Alibaba Group
³ Hupan Lab
⁴ University of California, Berkeley
⁵ Zhejiang University
⁶ The Hong Kong University of Science and Technology

Abstract

Real-world users typically have access to multiple Large Language Models (LLMs) from different providers. These LLMs often excel in distinct domains, yet none dominates all of them. Consequently, routing each task to the most suitable model becomes critical for both performance and cost. Existing routers treat this as a static, one-off classification problem. However, we identify the performance bottleneck for these routers as information deficit: simply augmenting a vanilla LLM router with performance statistics at the task-dimension level yields a 15.3% relative gain, surpassing a heuristic router built on the same dimension-level priors. Motivated by this finding, we propose Agent-as-a-Router, a framework that formalizes routing as a C-A-F loop (Context → Action → Feedback → Context). It closes the information gap by accumulating execution-grounded experience during deployment. We instantiate this framework as ACRouter, composed of an Orchestrator, a Verifier, and a Memory module, and introduce CodeRouterBench, an evaluation environment comprising ∼10K task instances with verified scores from 8 frontier LLMs, enabling regret-based router comparison on streaming tasks. Experiments show that ACRouter achieves the lowest cumulative regret on in-distribution tasks and generalizes to out-of-distribution agentic-programming tasks, demonstrating that our routing framework actively closes the information gap. Code and benchmarks are released at https://github.com/LanceZPF/agent-as-a-router.

1 Introduction

Modern coding agents such as Claude Code [anthropic2025claudecode] and Codex [openai2025codex] have had a significant impact on real-world software development by turning LLMs into interactive systems for coding, debugging, and repository-level programming. However, most of these agents tend to solve all tasks using the same Large Language Model (LLM) [yue2025masrouter]. While this design is reasonable from a provider-centric serving perspective, where providers prioritize in-house models and predictable serving costs [research2026composer], it overlooks the actual needs of users in user-centric scenarios, where the priority is task-level quality and cost-efficiency rather than provider-side predictability. In such scenarios, users can subscribe to multiple providers and run capable open-source models locally. Across our experiments with 8 frontier models on various coding dimensions (Fig. 4), the best model varies per task, and always picking the globally strongest model still lags behind the per-task oracle (which chooses the best model for each task). As manually selecting the best model for each task is infeasible at scale, a critical question emerges: which model should handle each incoming task? This motivates automatic model routing as a key mechanism for improving agent performance.

Existing routing methods typically frame this as a static classification problem, employing language models as the routing policy [ong2024routellm, liu2026adaptive, varshney2026llm]. However, our preliminary experiments reveal that a zero-shot LLM-as-a-Router, even when powered by a highly capable model like Claude Sonnet 4.6, still falls short of the per-task oracle by a wide margin (see Table⁠1). This substantial performance gap suggests that the fundamental bottleneck in model routing extends beyond pure reasoning capabilities. What is actually limiting these routers? Reasoning capability, or information access?

To find out, we run an ablation that varies only the information available to the LLM router (Table 1). With only the zero-shot prompt, the Vanilla router scores 41.41. Adding per-dimension performance statistics from a held-out probing set (+Perf stats) improves the score to 47.74, a +15.3% relative gain over Vanilla. It also exceeds the best heuristic (scoring 47.50), which encodes the same dimension-level statistics. Therefore, we find that the bottleneck for model routing is information deficit rather than reasoning failure (§3.1). To close this performance gap, a router must acquire and accumulate execution-grounded information during deployment. Static routers are structurally unable to do this since their information state is frozen. This motivates a different class of self-adaptive router, one that evolves over the task stream, verifies each decision, and conditions future decisions on accumulated expertise.
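As a quick check on the arithmetic, the relative gain and the margin over the heuristic can be recomputed from the AvgPerf values quoted in this paragraph (no other numbers are assumed):

```latex
\[
\frac{47.74 - 41.41}{41.41} = \frac{6.33}{41.41} \approx 0.153 \quad (+15.3\%),
\qquad
47.74 - 47.50 = 0.24 \ \text{points over the best heuristic.}
\]
```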

We propose the Agent-as-a-Router framework, which formalizes routing as a Context-Action-Feedback (C-A-F) loop in which each loop’s verified outcome enters the next loop’s context (Fig. 1). The router observes a Context (prior plus accumulated experience), selects an Action (which model to invoke), receives verification Feedback (score and cost-efficiency), and merges the feedback back into the Context for the next task. This loop relates to a contextual multi-armed bandit [li2010contextual], so cumulative regret (the running gap to the per-task oracle) becomes the natural streaming metric.

We instantiate the framework as ACRouter (Agentic Coding Router), comprising three core modules (Orchestrator, Verifier, and Memory) and backed by a comprehensive toolkit that includes diverse routing policies, sandboxed execution environments and validation tools. We evaluate ACRouter on our established CodeRouterBench, which contains ∼10K tasks across 9 in-distribution (ID) coding dimensions and an out-of-distribution (OOD) agentic-programming testbed with verified scores from 8 frontier LLMs to enable regret-based comparison on streaming tasks [openai_swebench_verified, feng2026longcli, chen2026swe, zhou2026featurebench]. ACRouter attains the lowest cumulative regret across all evaluated routers on ID streams and also generalizes well to OOD tasks, consistently outperforming other routers.


Figure 1: Comparison of three routing strategies. (1) Static, heuristic-based routers (router directly dispatches via a lookup table, e.g., DimensionBest). (2) Routers based on a static trained policy (router uses a learned policy model with no memory). (3) Our proposed Agent-as-a-Router (router with iterative self-evolving capabilities in the task stream).

Our contributions are threefold:

  1. Framework. We propose Agent-as-a-Router, formalizing model routing as a Context-Action-Feedback (C-A-F) loop, with cumulative regret as the natural streaming metric.
  2. Artifacts. We build ACRouter as a C-A-F instantiation, and present CodeRouterBench (∼10K tasks, 8 LLMs, execution-verified) for regret-based router evaluation.
  3. Findings. Information deficit rather than reasoning is the routing bottleneck (+15.3% when given per-dimension performance statistics); ACRouter attains the lowest cumulative regret on both in-distribution and OOD tasks, while lightweight static routers fail to generalize on OOD tasks.

2 Related Work

LLM Routing.

The problem of selecting among multiple LLMs for a given query has attracted growing attention [dong2024automix, chen2023frugalgpt, ding2024hybrid]. RouteLLM [ong2024routellm] formulated routing as a preference learning problem, training classifiers on human preference data to predict which of two models produces better responses. Meta-modeling approaches [vsakota2024fly, shnitzer2023largelanguagemodel] learn to predict model performance from task features. Most recently, LLMRouterBench [duwal2026llmrouterbench] evaluated routing across 21 general NLU datasets with 33 models. Our work differs from these by proposing Agent-as-a-Router and formalizing it as the C-A-F loop for adaptive routing. Moreover, we specifically benchmark routers in an agentic coding setting.

Coding Agent.

Coding agents have evolved from single-call code generators [austin2021mbpp, chen2021humaneval] into multi-stage harness-based frameworks that interleave planning, retrieval, code editing, execution, and self-debugging on repository-level tasks [jimenez2024swebench, yang2024sweagent, xia2024agentless, wang2024openhands]. Multi-agent variants further decompose these stages across specialized roles [hong2024metagpt, qian2024chatdev]. Production systems further integrate these features into deployed assistants [anthropic2025claudecode, openai2025codex]. However, existing frameworks typically rely on a fixed LLM backbone, rather than dynamically selecting the best model for each specific task. ACRouter addresses this limitation by actively routing each task to the most suitable model within a continuous stream. To support the evaluation of this framework, CodeRouterBench provides a standardized streaming environment to compare different routing methods using cumulative regret.

3 Agent-as-a-Router

3.1 Preliminary: The Performance Gap Diagnosis

| Ablation | Interpretation | AvgPerf% | Perf/$ |
|---|---|---|---|
| Oracle | Theoretical upper bound using the best model for each task | 57.00 | 8.20 |
| DimensionBest | Select the best model for each dimension by prior | 47.50 | 3.69 |
| Vanilla | Standard zero-shot LLM-as-a-router | 41.41 | 1.97 |
| +Dimension | + Task dimension description | 41.18 | 1.81 |
| +Perf stats | + Prior performance statistics from a probing set | 47.74 | 1.71 |

Table 1: Preliminary ablation study diagnosing the performance bottleneck in LLM-as-a-Router (Claude Sonnet 4.6 tested on 2,919 tasks). AvgPerf: average performance score. Perf/$: AvgPerf% per USD. Providing prior performance statistics significantly improves routing performance.

We first conduct a preliminary experiment to diagnose the performance bottleneck of LLM-as-a-Router (Table 1). The Vanilla baseline is a standard LLM router that selects a model from the candidate pool given only the zero-shot task prompt; all LLM-router variants use Claude Sonnet 4.6. +Dimension additionally reveals the task’s coding dimension, and +Perf stats further exposes per-dimension performance statistics collected on a separate probing set (7,080 tasks). We compare these variants against DimensionBest, which selects the best model for each dimension with full priors. When given the same statistics that DimensionBest encodes, the LLM router exceeds it (47.74 vs. 47.50 AvgPerf%) and improves over the Vanilla baseline by a relative 15.3% (from 41.41 to 47.74 AvgPerf%). This suggests that a major source of the performance gap between LLM-as-a-Router and the oracle upper bound is information deficit, rather than a lack of reasoning capability.
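To make the DimensionBest heuristic concrete, here is a minimal sketch of a lookup-table router built from probing-set statistics. The record layout, the function name `build_dimension_best`, and the toy model names and scores are assumptions for illustration only; the paper's actual implementation may differ.

```python
from collections import defaultdict

def build_dimension_best(probe_records):
    """Build a dimension -> model lookup table from probing results.

    probe_records: iterable of (dimension, model, score) tuples
    (a hypothetical layout, not the benchmark's real format).
    """
    totals = defaultdict(lambda: [0.0, 0])  # (dimension, model) -> [score sum, count]
    for dimension, model, score in probe_records:
        entry = totals[(dimension, model)]
        entry[0] += score
        entry[1] += 1

    best = {}  # dimension -> (model, mean score)
    for (dimension, model), (total, count) in totals.items():
        mean = total / count
        if dimension not in best or mean > best[dimension][1]:
            best[dimension] = (model, mean)
    return {dimension: model for dimension, (model, _) in best.items()}

# Toy probing data; model names and scores are made up.
probe = [
    ("bug_fix", "model_a", 0.8), ("bug_fix", "model_a", 0.6),
    ("bug_fix", "model_b", 0.5), ("bug_fix", "model_b", 0.5),
    ("test_gen", "model_a", 0.2), ("test_gen", "model_b", 0.9),
]
table = build_dimension_best(probe)
print(table)
```

A router of this kind is frozen once the table is built: it never sees how its choices perform on the live task stream.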

Two design insights follow from this diagnosis: (i) the router must acquire new execution-grounded information at each decision—that is, performance signals generated by actually running the selected model’s output in a sandbox rather than relying on static priors or model self-assessment (verification); and (ii) the router must accumulate it across the task stream so that future decisions can condition on past outcomes (memory). We formalize these insights through the C-A-F loop (§3.2) and instantiate it as ACRouter (§3.3).

3.2 The C-A-F Loop

Following the diagnosis in §3.1, we now formalize Agent-as-a-Router that operates over the task stream and updates its internal state from each loop’s verified outcome. Concretely, the router has access to an indexed model pool (\mathcal{M}=\{m_{1},\dots,m_{M}\}) with (M) models and processes a stream of (N) tasks (\mathcal{T}=(t_{1},\dots,t_{N})) one by one. After each routing decision, the verified outcome is fed back into the context for the next decision, yielding the Context-Action-Feedback (C-A-F) loop below.

The C-A-F loop.

At task (t_{i}), the router observes Context (c_{i}), selects Action (a_{i}\in[M]), and receives verification Feedback (f_{i}), which is memorized into (c_{i+1}):

[ c_{i}\;\xrightarrow{\;\text{Decide}\;}\;a_{i}\;\xrightarrow{\;\text{Execute}\;}\;f_{i}\;\xrightarrow{\;\text{Memorize}\;}\;c_{i+1}. \tag{1} ]

We refer to this as the C-A-F loop (Context, Action, Feedback), where each completed loop makes the next one more informed. The loop C→A→F→C repeats as the task stream advances.
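The loop can be sketched in a few lines of Python. This is only an illustration of the Decide, Execute, Memorize cycle: the epsilon-greedy policy stands in for the routing policy, `execute_and_verify` is a hypothetical callable standing in for sandboxed execution plus verification, and the memory layout is an assumption rather than ACRouter's actual design.

```python
import random

def mean_score(memory, dimension, model):
    """Mean verified score seen so far; untested pairs rank first."""
    total, count = memory.get((dimension, model), (0.0, 0))
    return total / count if count else float("inf")

def run_caf_loop(tasks, models, execute_and_verify, epsilon=0.1, seed=0):
    """Route a stream of (task_id, dimension) pairs, learning from feedback."""
    rng = random.Random(seed)
    memory = {}   # (dimension, model) -> [score sum, count]; part of the context
    actions = []
    for task_id, dimension in tasks:
        # Context -> Action (Decide): explore occasionally, else exploit memory.
        if rng.random() < epsilon:
            action = rng.choice(models)
        else:
            action = max(models, key=lambda m: mean_score(memory, dimension, m))
        # Action -> Feedback (Execute): run the chosen model and verify the result.
        score = execute_and_verify(task_id, action)
        # Feedback -> Context (Memorize): the outcome informs the next decision.
        entry = memory.setdefault((dimension, action), [0.0, 0])
        entry[0] += score
        entry[1] += 1
        actions.append(action)
    return actions, memory

def fake_verifier(task_id, model):
    """Toy stand-in for a verifier: only model_b ever succeeds."""
    return 1.0 if model == "model_b" else 0.0

tasks = [(i, "bug_fix") for i in range(5)]
actions, memory = run_caf_loop(tasks, ["model_a", "model_b"], fake_verifier, epsilon=0.0)
print(actions)
```

With exploration switched off (`epsilon=0.0`), the toy router tries each model once and then settles on `model_b`, which is the behavior a static router cannot reproduce because its information state never changes.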

Per-loop components.

Context (c_{i}=(p_{i},d_{i},H_{<i})) […] The per-task reward is a weighted sum of score and cost, with (\epsilon_{1}>0) to reward performance and (\epsilon_{2}<0) to penalize cost:

[ r_i(a_i) = \epsilon_1 s_i(a_i) + \epsilon_2 \kappa_i(a_i), \tag{3} ]

where (s_i(a_i)) and (\kappa_i(a_i)) denote the ground-truth score and cost of the selected model on task (t_i), respectively. The policy’s mean reward over the stream is:

[ V(\pi) = \frac{1}{N} \sum_{i=1}^{N} r_i(a_i) = \epsilon_1 \frac{1}{N} \sum_{i=1}^{N} s_i(a_i) + \epsilon_2 \frac{1}{N} \sum_{i=1}^{N} \kappa_i(a_i). \tag{4} ]
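Equations (3) and (4) translate directly into code. The sketch below uses illustrative weights (`eps1=1.0`, `eps2=-0.1`) and made-up scores and costs, none of which come from the paper; it also checks that the mean reward splits into a score term and a cost term as in Eq. (4).

```python
def reward(score, cost, eps1=1.0, eps2=-0.1):
    """Eq. (3): weighted sum of score (eps1 > 0) and cost (eps2 < 0)."""
    return eps1 * score + eps2 * cost

def mean_reward(scores, costs, eps1=1.0, eps2=-0.1):
    """Eq. (4): average per-task reward over the stream."""
    n = len(scores)
    return sum(reward(s, k, eps1, eps2) for s, k in zip(scores, costs)) / n

# Toy stream of three routed tasks: verified scores and costs (made up).
scores = [1.0, 0.0, 0.5]
costs = [2.0, 1.0, 1.0]

v = mean_reward(scores, costs)
# Right-hand side of Eq. (4): the same value, split into score and cost terms.
v_split = 1.0 * sum(scores) / len(scores) + (-0.1) * sum(costs) / len(costs)
print(v, v_split)
```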

Per-task oracle and cumulative regret.

To compare routers under identical conditions, we pre-construct a full outcome matrix (O \in \mathbb{R}^{N \times M \times 2}), where (O_{ij} = (s_{ij}, \kappa_{ij})) stores the ground-truth score and cost of model (m_j) on task (t_i). The induced reward matrix (R \in \mathbb{R}^{N \times M}) is:

[ R_{ij} = \epsilon_1 s_{ij} + \epsilon_2 \kappa_{ij} \quad \text{for } i \in [N],\, j \in [M]. \tag{5} ]

The per-task oracle independently selects the reward-maximizing model for each task with full prior knowledge of (R):

[ a_i^* = \operatorname*{arg\,max}_{j \in [M]} R_{ij}, \qquad r_i^* = \max_{j \in [M]} R_{ij}, \quad \forall i = 1,\dots,N, \tag{6} ]

so the oracle’s overall mean reward is:

[ V^* = \frac{1}{N} \sum_{i=1}^{N} r_i^* = \frac{1}{N} \sum_{i=1}^{N} \max_{j \in [M]} R_{ij}. \tag{7} ]

Note that this per-task oracle is generally not equal to a single-best-arm policy that commits to one global optimal model. Given a policy (\pi), we report cumulative regret:

[ \mathrm{CumReg}_N(\pi) = \sum_{i=1}^{N} \delta_i = N\bigl(V^* - V(\pi)\bigr), \tag{8} ]

where (\delta_i = r_i^* - r_i(a_i) \ge 0).
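The following sketch computes the reward matrix, the per-task oracle, and cumulative regret on a toy outcome matrix. The weights and the outcome values are invented for illustration. It also shows why the per-task oracle differs from the best single model, and checks the identity in Eq. (8).

```python
def reward_matrix(outcomes, eps1=1.0, eps2=-0.1):
    """Eq. (5): outcomes[i][j] = (score, cost) of model j on task i."""
    return [[eps1 * s + eps2 * k for (s, k) in row] for row in outcomes]

def cumulative_regret(R, actions):
    """Running sum of delta_i = r_i^* - r_i(a_i) along the stream."""
    total, curve = 0.0, []
    for row, a in zip(R, actions):
        total += max(row) - row[a]
        curve.append(total)
    return curve

# Toy outcome matrix: 3 tasks x 2 models, each cell (score, cost); values made up.
outcomes = [
    [(1.0, 2.0), (0.0, 1.0)],
    [(0.0, 2.0), (1.0, 1.0)],
    [(1.0, 2.0), (1.0, 1.0)],
]
R = reward_matrix(outcomes)
N = len(R)

v_star = sum(max(row) for row in R) / N                           # Eq. (7)
best_single = max(range(2), key=lambda j: sum(row[j] for row in R))
actions = [best_single] * N                                       # single-best-arm policy
v_pi = sum(row[a] for row, a in zip(R, actions)) / N

curve = cumulative_regret(R, actions)
print(curve[-1], N * (v_star - v_pi))
```

Even the best single model accumulates regret here, because the first task is better served by the other model.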
