The Capability Frontier: Benchmarks Miss 82% of Model Performance

arXiv cs.AI 06/26/26, 04:00 AM Papers
Summary
The paper introduces the Capability Frontier, a Pareto frontier over models that corrects for biases in single-model and single-run evaluations, showing that standard benchmarks miss up to 82% of model performance and that collective LLM capabilities are substantially underestimated.
arXiv:2606.26836v1 Announce Type: new Abstract: Existing benchmarks typically report accuracy for a single model on a single run. This systematically understates real-world LLM capabilities, particularly under heterogeneous data distributions: (i) different models get different questions correct according to their specializations, and (ii) given a budget, multiple generations can be sampled and selectively retained. To quantify this gap, we introduce the Capability Frontier: a Pareto frontier over a set of models that characterizes the best achievable performance at each cost level under optimal selection across models and generations (i.e., via an oracle). Our construction corrects for two opposing biases: underestimation from single-model evaluation and overestimation from taking maxima over noisy samples. We study 21 LLMs across 16 widely used benchmarks spanning coding, reasoning, medicine, factuality, instruction following, and agentic tasks, comparing Capability Frontier performance at matched cost to each benchmark's top-performing model. Correcting for single-model evaluation yields a 54% error rate reduction; additionally correcting for single runs yields an 82% improvement, with SOTA accuracy matched at 85% cost reduction. Complementing these empirical results, we use controlled probabilistic simulations to show that higher query topic entropy produces a near-monotonic increase in the performance gap between oracle routing and the best single model. Our findings suggest collective LLM capabilities are substantially underestimated, with implications for evaluation and deployment in data-heterogeneous, multi-domain settings.
Cached at: 06/26/26, 05:17 AM
# The Capability Frontier: Benchmarks Miss 82% of Model Performance
Source: [https://arxiv.org/html/2606.26836](https://arxiv.org/html/2606.26836)
Bradley Fowler¹, Ryan Smith¹, Daniel Thi Graviet¹, William Myers¹, Joshua Greaves¹, Narmeen Fatimah Oozeer¹, Antía García¹, Philip Quirke¹, Amirali Abdullah³,¹, Fazl Barez¹,², Shriyash Kaustubh Upadhyay¹

¹Martian, ²University of Oxford, ³ThoughtWorks

###### Abstract

Existing benchmarks typically report accuracy for a single model on a single run. This systematically understates real-world LLM capabilities, particularly under heterogeneous data distributions: (i) different models get different questions correct according to their specializations, and (ii) given a budget, multiple generations can be sampled and selectively retained. To quantify this gap, we introduce the Capability Frontier: a Pareto frontier over a set of models that characterizes the best achievable performance at each cost level under optimal selection across models and generations (i.e., via an oracle). Our construction corrects for two opposing biases: underestimation from single-model evaluation and overestimation from taking maxima over noisy samples. We study 21 LLMs across 16 widely used benchmarks spanning coding, reasoning, medicine, factuality, instruction following, and agentic tasks, comparing Capability Frontier performance at matched cost to each benchmark’s top-performing model. Correcting for single-model evaluation yields a 54% error rate reduction; additionally correcting for single runs yields an 82% improvement, with SOTA accuracy matched at 85% cost reduction. Complementing these empirical results, we use controlled probabilistic simulations to show that higher query topic entropy produces a near-monotonic increase in the performance gap between oracle routing and the best single model. Our findings suggest collective LLM capabilities are substantially underestimated, with implications for evaluation and deployment in data-heterogeneous, multi-domain settings.

## 1 Introduction

LLMs in the wild face messy and disparate workloads. Consider the example of medical question answering systems, where life-and-death queries form a polyphonic mixture spanning diverse subdomains of medical knowledge such as genomic variation and human structural physiology. Consistent with this heterogeneity, Singhal et al. ([2025](https://arxiv.org/html/2606.26836#bib.bib53)) show that models excel in different medical topics: GPT-4-base (Achiam et al., [2023](https://arxiv.org/html/2606.26836#bib.bib39)) outperforms Med-PaLM-2 (Singhal et al., [2025](https://arxiv.org/html/2606.26836#bib.bib53)) on MMLU Medical Genetics (97.0% vs. 92.0%) and Anatomy (85.2% vs. 84.4%), while Med-PaLM-2 excels on Professional Medicine (95.2% vs. 93.8%) and College Medicine (83.2% vs. 80.9%). An oracle selector with access to per-query ground-truth correctness could therefore outperform both models, yet this achievable performance remains unmeasured in standard evaluation.

Foundational work on LLM routing has begun to probe this gap. Shnitzer et al. ([2023](https://arxiv.org/html/2606.26836#bib.bib12)) showed that an oracle router can achieve approximately 20% performance gains by switching models per prompt. RouterBench (Hu et al., [2024](https://arxiv.org/html/2606.26836#bib.bib13)) quantified model complementarity, finding that secondary models provide unique correct answers on 10–30% of prompts. RouteLLM (Ong et al., [2025](https://arxiv.org/html/2606.26836#bib.bib15)) further demonstrated up to $2\times$ cost reductions by identifying prompts where cheaper models suffice. These studies estimate oracle performance from finite samples of generations per prompt, selecting the highest-performing model accordingly. Because oracle selection takes a maximum over noisy performance estimates, such procedures are positively biased, systematically overstating achievable gains. This effect is amplified under realistic generation budgets, where only limited samples ($G \leq 10$) are available.

![Refer to caption](https://arxiv.org/html/2606.26836v1/images/fig1.png)

Figure 1: The Capability Frontier: Dynamic per-prompt LLM selection substantially outperforms any fixed LLM across our 16 benchmarks. Sample datapoints from App. [B](https://arxiv.org/html/2606.26836#A2) are shown. For any given cost budget, substantial quality improvements can be realized relative to a single LLM. Conversely, for a fixed quality threshold, substantial cost savings can be realized through dynamic LLM selection.

To robustify Capability Frontier estimation and correct for these finite-sample effects, we develop debiasing methods that recover accurate frontier measurements and provide principled upper bounds on achievable performance. Our empirical analysis spans 21 LLMs across 16 benchmarks covering coding, reasoning, medicine, factuality, instruction following, and agentic tasks. The results reveal that standard single-model evaluation substantially understates achievable performance: at matched cost, the Capability Frontier achieves 54% average error reduction compared to each benchmark’s top model. When additionally accounting for multi-run selection (posthoc routing), error reduction reaches 82%. Conversely, SOTA accuracy can be matched at 85% lower cost on the frontier. These gaps are not merely theoretical: they represent performance that is achievable today with existing models and straightforward inference-time strategies.

Accurately measuring these gains requires care. The standard oracle computation selects the model with the highest sample mean for each prompt, then reports this mean as achievable performance. Because this takes a maximum over noisy estimates, it preferentially captures positive outliers. With limited generations per prompt ($G \leq 10$ due to cost), this bias is substantial: we find it inflates accuracy estimates by up to 8.7% and cost estimates by up to 88% (Sec. [6.3](https://arxiv.org/html/2606.26836#S6.SS3)). Our debiasing methods, namely extrapolation-based correction and probabilistic graphical modeling, enable accurate frontier estimation. Complementing these empirical findings, we construct controlled probabilistic simulations showing that oracle gains scale monotonically with workload diversity. To highlight our main contributions:

1. Capability Frontier. We introduce a rigorous framework for quantifying the gap between single-model benchmark evaluation and achievable system-level performance under optimal model and generation selection.
2. Debiasing methods. We show that finite-sample oracle estimators are positively biased and propose correction methods based on extrapolation and probabilistic modeling, with explicit assumptions and empirical validation.
3. Bias characterization. We formally analyze oracle bias, showing it decays as $O(G^{-\lambda})$ with generations per prompt, and empirically validate this scaling across benchmarks.
4. Empirical evaluation. Across 21 LLMs and 16 benchmarks, we quantify both achievable frontier gains and the magnitude of bias in naive oracle estimates. See Figure [1](https://arxiv.org/html/2606.26836#S1.F1).
5. Controlled simulations. When simulating synthetic workloads spanning low- to high-diversity regimes, we find that oracle gains grow monotonically with workload entropy. These results provide mechanistic grounding for oracle performance, demonstrating that achievable gains are fundamentally driven by data heterogeneity.

## 2 Related Work

The rapid proliferation of LLMs has increased research interest in LLM routing, the dynamic selection of models to balance quality, cost, and latency. Shnitzer et al. ([2023](https://arxiv.org/html/2606.26836#bib.bib12)) first formalized this problem using benchmark datasets, introducing the oracle router as a theoretical upper bound for performance gains. While they identified significant headroom beyond the “best-on-average” model, their oracle relied on biased sample means, a limitation our work addresses. Subsequent frameworks like RouterBench (Hu et al., [2024](https://arxiv.org/html/2606.26836#bib.bib13)) have standardized evaluation across routing methods, though they similarly utilize these biased estimates.

##### Universal and Zero\-Shot Routing

Recent methods seek to solve the “model lock-in” problem, where routers must be retrained whenever the model pool changes. UniRoute (Jitkrittum et al., [2025](https://arxiv.org/html/2606.26836#bib.bib6)) addresses this by representing LLMs as feature vectors based on anchor prompts, allowing for generalization to unseen models. Similarly, ZeroRouter (Yan et al., [2026](https://arxiv.org/html/2606.26836#bib.bib8)) utilizes a universal latent space to decouple query difficulty from specific model profiles, enabling zero-shot selection across evolving model ecosystems.

##### Theoretical Foundations

While the industry moves toward expert orchestration for safer and more capable systems (Quirke et al., [2025](https://arxiv.org/html/2606.26836#bib.bib19)), a gap remains between implementable routers and theoretical optimality. Our work builds upon the foundations of oracle routing (Shnitzer et al., [2023](https://arxiv.org/html/2606.26836#bib.bib12); Hu et al., [2024](https://arxiv.org/html/2606.26836#bib.bib13)), but departs from them by correcting for the “optimizer’s curse”, a statistical bias well-documented in economics (Andrews et al., [2024](https://arxiv.org/html/2606.26836#bib.bib27); Capen et al., [1971](https://arxiv.org/html/2606.26836#bib.bib28)) and decision analysis (Smith and Winkler, [2006](https://arxiv.org/html/2606.26836#bib.bib29)). By introducing debiased oracles, we provide a more rigorous framework for quantifying the true headroom available in the Capability Frontier.

See Appendix [D](https://arxiv.org/html/2606.26836#A4) for more routing methods, including training-free, cascade, and preference routing.

## 3 Problem Setting

Let $n \in [N]$ index dataset prompts, $l \in [L]$ index LLMs, and $g \in [G]$ index independent stochastic generations. For each prompt-model pair, we observe $G$ generations and evaluate each using metric $\phi_{nlg} \in \mathbb{R}$ (e.g., correctness, cost, latency).

The standard formulation for routing is a two-dimensional objective that maximizes quality whilst minimizing cost:

$$\phi_{nlg} = \left\{(\mathbf{Q}, -\mathbf{C})\right\}_{nlg} \tag{1}$$

$\mathbf{Q}$, $\mathbf{C}$, and $\mathbf{T}^{95}$ are all tensors with identical dimensionality that represent Quality, Cost, and P95 latency.

##### The routing problem\.

A router $\pi: \mathcal{X} \to [L]$ maps each prompt to a model. The goal is to find $\pi$ maximizing expected performance:

$$\max_{\pi} \frac{1}{N} \sum_{n} \mathbb{E}\left[\phi_{n,\pi(x_n),g}\right] \tag{2}$$

##### The oracle router\.

An oracle router has access to true expected performance $\mu_{nl} = \mathbb{E}[\phi_{nlg}]$ and selects optimally:

$$l^{*}(n) = \arg\max_{l} \mu_{nl} \tag{3}$$

The true oracle value is:

$$\mathcal{O}^{true} = \frac{1}{N} \sum_{n} \max_{l} \mu_{nl} \tag{4}$$

This is the fundamental upper bound for routing: the best achievable performance given perfect knowledge of each model’s expected performance on each prompt.

##### The estimation problem\.

We cannot observe $\mu_{nl}$ directly, only noisy realizations $\phi_{nlg}$. The standard approach estimates $\mu_{nl}$ with the sample mean $\bar{\phi}_{nl} = \frac{1}{G} \sum_{g} \phi_{nlg}$ and computes:

$$\mathcal{O}^{biased} = \frac{1}{N} \sum_{n} \max_{l} \bar{\phi}_{nl} \tag{5}$$

We show next that this estimator is positively biased: $\mathbb{E}[\mathcal{O}^{biased}] > \mathcal{O}^{true}$.
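The gap between Eq. 4 and Eq. 5 can be illustrated with a small Monte Carlo simulation. The sketch below is not the authors' code: the function name `simulate_oracles`, the toy pool of equally good models, and the Bernoulli metric are all assumptions chosen to make the selection effect visible.

```python
import random

def simulate_oracles(p, G, trials=500, seed=0):
    """p[n][l] is an assumed true success probability of model l on prompt n.

    Returns (true oracle of Eq. 4, Monte Carlo mean of the naive oracle of Eq. 5).
    """
    rng = random.Random(seed)
    N = len(p)
    # Eq. 4: average over prompts of the best true mean.
    o_true = sum(max(row) for row in p) / N
    total = 0.0
    for _ in range(trials):
        acc = 0.0
        for row in p:
            # Sample mean of G Bernoulli generations for each model.
            means = [sum(rng.random() < p_nl for _ in range(G)) / G for p_nl in row]
            # Eq. 5: maximum over noisy sample means.
            acc += max(means)
        total += acc / N
    return o_true, total / trials

# Toy pool (hypothetical): 20 prompts, 5 models, every model has p = 0.6.
p = [[0.6] * 5 for _ in range(20)]
o_true, o_biased = simulate_oracles(p, G=10)
print(f"true oracle = {o_true:.3f}, mean naive oracle = {o_biased:.3f}")
```

Because all models here are identical, any apparent routing gain reported by the naive estimator is pure selection noise.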

## 4 Oracle Bias and Debiasing Methods

### 4.1 Characterizing Oracle Bias

Why the biased oracle is biased. The bias arises because taking the maximum over sample means preferentially selects models whose samples exceeded their true means. The bias crops up in many fields, from economics (Andrews et al., [2024](https://arxiv.org/html/2606.26836#bib.bib27)) to management (Smith and Winkler, [2006](https://arxiv.org/html/2606.26836#bib.bib29)); however, it was first spotted in auctions by Capen et al. ([1971](https://arxiv.org/html/2606.26836#bib.bib28)). This paper formalizes the bias in LLM routing and presents new methods to remove it. We characterize it under two distributional assumptions.

### 4.2 Gaussian case

Assume $\phi_{nlg} \sim \mathcal{N}(\mu_{nl}, \sigma^{2}_{nl})$ independently. The sample mean satisfies $\bar{\phi}_{nl} \sim \mathcal{N}(\mu_{nl}, \sigma^{2}_{nl}/G)$.

To derive the bias in closed form, we make a simplifying assumption:

$$\mu_{nl} = \mu_{n}, \quad \sigma^{2}_{nl} = \sigma^{2}_{n} \quad \forall l \tag{6}$$

Remark. Assumption ([6](https://arxiv.org/html/2606.26836#S4.E6)) is used only to derive the functional form of bias decay, not to claim that $\mathcal{O}^{true} = \bar{\mu}$. Under heterogeneous means, the true oracle remains $\frac{1}{N} \sum_{n} \max_{l} \mu_{nl}$, which our debiasing methods estimate without requiring equal means.

Under ([6](https://arxiv.org/html/2606.26836#S4.E6)), the expected maximum of $L$ i.i.d. Gaussians with variance $\sigma^{2}_{n}/G$ is approximately:

$$\mathbb{E}\left[\max_{l} \bar{\phi}_{nl}\right] \approx \mu_{n} + \sigma_{n} \sqrt{\frac{2 \log L}{G}} \tag{7}$$

Averaging over prompts:

$$\mathcal{O}^{biased} \approx \underbrace{\bar{\mu}}_{\text{True Oracle}} + \underbrace{\bar{\sigma} \sqrt{\frac{2 \log L}{G}}}_{\text{Bias}} \tag{8}$$

where $\bar{\mu} = \frac{1}{N} \sum_{n} \mu_{n}$ and $\bar{\sigma} = \frac{1}{N} \sum_{n} \sigma_{n}$.

Key insight: The bias reduces as $O(G^{-0.5})$ and increases with $L$ (more models) and $\bar{\sigma}$ (higher variance). For $G=10$ and $L=21$, this bias is non-negligible.
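To get a feel for the size of the bias term in Eq. 8, the short sketch below evaluates it for the paper's setting of $L=21$ models at several generation budgets. The helper name `gaussian_oracle_bias` and the value $\bar{\sigma}=0.3$ are illustrative assumptions, not quantities reported in the paper, and the $\sqrt{2\log L}$ factor is itself an asymptotic approximation.

```python
import math

def gaussian_oracle_bias(sigma_bar, L, G):
    """Approximate bias term of Eq. 8: sigma_bar * sqrt(2 * log(L) / G)."""
    return sigma_bar * math.sqrt(2.0 * math.log(L) / G)

# sigma_bar = 0.3 is an assumed, illustrative per-generation standard deviation.
for G in (1, 10, 50, 100):
    print(G, round(gaussian_oracle_bias(0.3, L=21, G=G), 4))
```

Since the term scales as $G^{-0.5}$, quadrupling the number of generations only halves the approximate bias.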

#### 4.2.1 Bernoulli case

For binary metrics (correct/incorrect), let $\phi_{nlg} \sim \text{Bernoulli}(p_{nl})$. Under the simplifying assumption $p_{nl} = p_{n}$:

$$Y_{nl} = \sum_{g} \phi_{nlg} \sim \text{Binomial}(G, p_{n}) \tag{9}$$

$$\mathbb{E}\left[\max_{l} Y_{nl}\right] = \dfrac{1}{NG} \sum_{n,g} \left[1 - F(g; p_{n})^{L}\right] \tag{10}$$

where $F(g; p_{n})$ is the Binomial CDF.

There is no clean separation of true oracle and bias term; however, through empirical study we can determine the characteristics of the bias decay. We know that for large $G$, the oracle should tend towards $p_{n}$. Fig. [2](https://arxiv.org/html/2606.26836#S4.F2) shows how the bias decays in different scenarios. When $p=0$ or $p=1$, there is no variance in LLM performance per data point, and so the bias is zero for all $G$.
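For a single prompt, the Bernoulli-case quantity can be computed exactly rather than simulated. The sketch below is an illustration under the equal-probability assumption $p_{nl}=p_n$; it uses the tail-sum identity $\mathbb{E}[\max_l Y_l] = \sum_{g=0}^{G-1} [1 - F(g)^L]$, with the summation range over $g$ being our reading of Eq. 10. The function names are hypothetical.

```python
import math

def binom_cdf(g, G, p):
    """P(Y <= g) for Y ~ Binomial(G, p)."""
    return sum(math.comb(G, i) * p**i * (1 - p) ** (G - i) for i in range(g + 1))

def expected_max_mean(p, L, G):
    """Expected maximum sample mean over L i.i.d. models with G Bernoulli(p) generations each."""
    # Tail-sum identity: E[max Y] = sum_{g=0}^{G-1} P(max Y > g) = sum_g [1 - F(g)^L].
    return sum(1.0 - binom_cdf(g, G, p) ** L for g in range(G)) / G

def oracle_bias(p, L, G):
    """Naive-oracle bias for one prompt: expected max sample mean minus the true p."""
    return expected_max_mean(p, L, G) - p

for G in (1, 10, 50):
    print(G, round(oracle_bias(0.9, L=10, G=G), 4))
```

The bias vanishes at `p = 0.0` and `p = 1.0` and shrinks as `G` grows, matching the qualitative behaviour described for Fig. 2.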

Through a synthetic study (Appendix [A](https://arxiv.org/html/2606.26836#A1)), we found the bias decayed with $O(G^{-0.5})\ \forall\ L>1,\ p \in (0,1)$ in the limit of large $G$ for heterogeneous $(\mu_{nl}, \sigma_{nl})$ generations across models, consistent with the Gaussian analysis. For correlated generations between models, we found the exponent varied in the range $[0.25, 0.75]$ for sensible hyper-parameters. Both heterogeneous and correlated scenarios required roughly $G=50$ generations in order to fit Eqn. [11](https://arxiv.org/html/2606.26836#S4.E11) accurately.

Key insight: The bias reduces as $O(G^{-\lambda})$ where $\lambda \in [0.25, 0.75]$. At least $G \geq 50$ generations are needed for $O(G^{-\lambda})$ to be the dominant term in the bias decay.

![Refer to caption](https://arxiv.org/html/2606.26836v1/images/oracle_2.png)
(a) $L=2$ LLMs

![Refer to caption](https://arxiv.org/html/2606.26836v1/images/oracle_10.png)
(b) $L=10$ LLMs

Figure 2: Oracle bias reduces with generations. Oracle bias is greatest when each LLM is prompted once ($G=1$), and tends towards zero as $G$ grows. Oracle bias decays with $O(G^{-0.5})$ in the limit. Curves show LLM success rates, $p$. Always correct/incorrect LLMs ($p=0,1$) have no bias and their curves are horizontal.

### 4.3 Debiasing Methods

#### 4.3.1 Method 1: Extrapolation

Given that bias decays as $O(G^{-\lambda})$ where $\lambda \in [0.25, 0.75]$, we fit:

$$\mathcal{O}^{biased}(G) = \alpha + \beta G^{-\lambda} \tag{11}$$

and estimate $\mathcal{O}^{true} = \alpha$.

In practice, due to cost constraints we are not in the regime of $G \geq 50$, and Equation [11](https://arxiv.org/html/2606.26836#S4.E11) does not hold, as can be seen in Fig. [3](https://arxiv.org/html/2606.26836#S4.F3). As such, a smooth transition formulation can be used to better approximate the bias decay:

$$\mathcal{O}^{biased}(G) = \alpha + \beta \left[1 + \left(\frac{G - \gamma}{\delta}\right)^{2}\right]^{-\lambda/2} \tag{12}$$

![Refer to caption](https://arxiv.org/html/2606.26836v1/images/oracle_curve_start.png)

Figure 3: Bias decay deviates from $O(G^{-0.5})$ for small $G$. At $p=0.9$, the curve only follows the asymptotic form for $G>20$, motivating our smooth transition formulation. This curve is a closeup of $p=0.9$ from Fig. [2(b)](https://arxiv.org/html/2606.26836#S4.F2.sf2).

Limitations. With $G<10$, extrapolation carries risk. We validate by: (1) testing on synthetic data with known ground truth, and (2) comparing to PGM estimates.
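A minimal sketch of the extrapolation idea, assuming the simple power-law form of Eq. 11 rather than the smooth transition of Eq. 12 (which would need a nonlinear optimizer). For a fixed $\lambda$ the model is linear in $(\alpha, \beta)$, so the sketch grid-searches $\lambda$ over $[0.25, 0.75]$ and solves an ordinary least-squares line at each step. The function name, the grid resolution, and the synthetic curve are assumptions for illustration; this is not the authors' fitting procedure.

```python
def fit_power_law_decay(Gs, values, lambdas=None):
    """Fit values ~ alpha + beta * G**(-lam) (Eq. 11); return (alpha, beta, lam)."""
    if lambdas is None:
        lambdas = [0.25 + 0.01 * i for i in range(51)]  # grid over [0.25, 0.75]
    best = None
    n = len(Gs)
    for lam in lambdas:
        x = [g ** (-lam) for g in Gs]
        mx, my = sum(x) / n, sum(values) / n
        sxx = sum((xi - mx) ** 2 for xi in x)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, values))
        beta = sxy / sxx           # least-squares slope
        alpha = my - beta * mx     # least-squares intercept = debiased estimate
        sse = sum((alpha + beta * xi - yi) ** 2 for xi, yi in zip(x, values))
        if best is None or sse < best[0]:
            best = (sse, alpha, beta, lam)
    return best[1], best[2], best[3]

# Synthetic check with a known answer (hypothetical numbers):
# alpha = 0.80, beta = 0.15, lam = 0.5.
Gs = [50, 75, 100, 150, 200, 300]
curve = [0.80 + 0.15 * g ** -0.5 for g in Gs]
alpha, beta, lam = fit_power_law_decay(Gs, curve)
print(alpha, beta, lam)
```

In use, `values` would hold the naive oracle recomputed on subsamples of `G` generations; the intercept `alpha` then plays the role of the estimated true oracle.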

#### 4.3.2 Method 2: Probabilistic Graphical Model

We introduce a generative model (Koller and Friedman, [2009](https://arxiv.org/html/2606.26836#bib.bib52)) for observations $\phi_{nlg}$ (shown in Fig. [4](https://arxiv.org/html/2606.26836#S4.F4)) that allows direct estimation of true performance parameters. The intuition behind the model is: (1) every prompt has a difficulty $D$, (2) every prompt belongs to a topic $T$, e.g. coding, math, or some weighted combination of them, (3) every LLM has some aptitude $A$ on each topic. The observed performance of an LLM on a given prompt is a function of the prompt’s difficulty, the topic combination of the prompt, and the LLM’s aptitude on those topics.

Latent variables:

- $D_{n} \in [0,1]$: Task difficulty for prompt $n$
- $T_{n} \in \{1, \ldots, K\}$: Topic assignment for prompt $n$
- $A_{tl} \in [0,1]$: Aptitude of model $l$ on topic $t$

Generative process:

$$D_{n} \sim \text{Beta}(\alpha_{D}, \beta_{D}) \tag{13}$$

$$T_{n} \sim \text{Categorical}(\boldsymbol{\theta}) \quad \text{where } \boldsymbol{\theta} \sim \text{Dirichlet}(\boldsymbol{\alpha}) \tag{14}$$

$$A_{tl} \sim \text{Beta}(\alpha_{tl}, \beta_{tl}) \tag{15}$$

$$\phi_{nlg} \sim \text{Bernoulli}(\pi_{nl}) \tag{16}$$

Link function:

$$\pi_{nl} = f(D_{n}, A_{T_{n},l}) \tag{17}$$

A simple multiplicative form of $(1 - D_{n}) \cdot A_{T_{n},l}$ would capture the intuition that success requires both low difficulty and high model aptitude. However, we find the most accurate results were obtained using a feedforward neural network.

![Refer to caption](https://arxiv.org/html/2606.26836v1/x1.png)

Figure 4: Probabilistic graphical model (PGM). We model an LLM’s inherent accuracy, as indirectly observed over generations ($G$) for topics ($T$), as a function of the prompt difficulty ($D$) and the model aptitude ($A$). We depict this model here in plate notation, a standard way of writing the generative process for Bayesian models. $D_{n}$ induces correlation across models for each prompt.

Limitations. The PGM has ad-hoc structural choices which can influence results. Checking alignment against synthetic data with known ground truth across a variety of regimes de-risks these choices.

Inference. We use stochastic variational inference with factorized posterior $q(D_{n})\, q(T_{n}) \prod_{t,l} q(A_{tl})$. We set uniform priors ($\alpha_{D} = \beta_{D} = 1$, $\alpha_{tl} = \beta_{tl} = 1$, $\alpha_{t} = 1$) and run until convergence.

Computing the unbiased oracle:

$$\mathcal{O}^{true} = \frac{1}{N} \sum_{n} \max_{l} \hat{\pi}_{nl} \tag{18}$$

where $\hat{\pi}_{nl}$ are the inferred success probabilities.

Independence assumptions. Fig. [4](https://arxiv.org/html/2606.26836#S4.F4) assumes conditional independence across generations given latent variables. This is reasonable when temperature-based sampling dominates, but may underestimate correlations when models share training data and architecture.
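The generative process of Eqs. 13 to 16 can be sketched by forward sampling. The code below is an illustration only: it uses the simple multiplicative link $(1-D_n)\cdot A_{T_n,l}$ mentioned above instead of the feedforward network the paper reports as most accurate, it performs no variational inference, and the function name `sample_pgm` and all sizes are assumptions. With the sampled $\pi_{nl}$ known, Eq. 18 can be evaluated directly.

```python
import random

def sample_pgm(N, L, K, G, alpha=1.0, seed=0):
    """Forward-sample the PGM with uniform Beta(1, 1) priors and a multiplicative link."""
    rng = random.Random(seed)
    # theta ~ Dirichlet(alpha, ..., alpha), built from normalized Gamma draws.
    gam = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    theta = [x / sum(gam) for x in gam]
    # A[t][l] ~ Beta(1, 1): aptitude of model l on topic t (Eq. 15).
    A = [[rng.betavariate(1, 1) for _ in range(L)] for _ in range(K)]
    pi, phi = [], []
    for _ in range(N):
        D = rng.betavariate(1, 1)                     # difficulty, Eq. 13
        T = rng.choices(range(K), weights=theta)[0]   # topic, Eq. 14
        row = [(1 - D) * A[T][l] for l in range(L)]   # assumed multiplicative link, Eq. 17
        pi.append(row)
        # G Bernoulli generations per model, Eq. 16.
        phi.append([[int(rng.random() < p) for _ in range(G)] for p in row])
    return pi, phi

pi, phi = sample_pgm(N=200, L=4, K=3, G=10)
# Eq. 18 evaluated with the known (not inferred) success probabilities.
oracle_true = sum(max(row) for row in pi) / len(pi)
print(round(oracle_true, 3))
```

Synthetic data of this kind has a known ground-truth oracle, which is what makes it usable for validating a debiasing method.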

### 4.4 Capability Frontier for Multi-Objective Routing

Real routing decisions involve multiple objectives. We define the Capability Frontier as the Pareto-optimal surface achievable through routing.

For normalized quality $Q^{*}$ and cost $C^{*}$:

$$\phi(\alpha) = \alpha Q^{*}_{nlg} + (1 - \alpha)(-C^{*}_{nlg}) \tag{19}$$

$$Q^{*}_{nlg} = \frac{Q_{nlg} - \min \mathbf{Q}}{\max \mathbf{Q} - \min \mathbf{Q}} \tag{20}$$

$$C^{*}_{nlg} = \frac{C_{nlg} - \min \mathbf{C}}{\max \mathbf{C} - \min \mathbf{C}} \tag{21}$$

Sweeping $\alpha \in [0,1]$ traces the Capability Frontier. For debiasing, we:

1. Use $\phi(\alpha)$ to determine routing decisions
2. Apply debiasing separately to quality and cost

For cost (positive real values), we replace the Bernoulli likelihood with LogNormal in our PGM.
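The sweep over $\alpha$ in Eqs. 19 to 21 can be sketched as follows. This is a simplified illustration, not the paper's pipeline: it works on assumed per-prompt mean quality and cost (`q[n][l]`, `c[n][l]`) rather than individual generations, and it omits the debiasing step entirely. The function name and toy numbers are hypothetical.

```python
def capability_frontier(q, c, alphas):
    """Return (alpha, mean quality, mean cost) for an oracle router at each alpha."""
    flat_q = [v for row in q for v in row]
    flat_c = [v for row in c for v in row]
    q_lo, q_hi = min(flat_q), max(flat_q)
    c_lo, c_hi = min(flat_c), max(flat_c)

    def norm(v, lo, hi):
        # Min-max normalization as in Eqs. 20 and 21.
        return (v - lo) / (hi - lo) if hi > lo else 0.0

    points = []
    for a in alphas:
        tot_q = tot_c = 0.0
        for q_row, c_row in zip(q, c):
            # Eq. 19: scalarized objective per model; route to the argmax.
            scores = [a * norm(qv, q_lo, q_hi) - (1 - a) * norm(cv, c_lo, c_hi)
                      for qv, cv in zip(q_row, c_row)]
            best = max(range(len(scores)), key=scores.__getitem__)
            tot_q += q_row[best]
            tot_c += c_row[best]
        points.append((a, tot_q / len(q), tot_c / len(q)))
    return points

# Toy data: model 0 is cheap, model 1 is expensive; each wins on some prompts.
q = [[0.9, 0.95], [0.2, 0.9], [0.8, 0.3]]
c = [[1.0, 10.0], [1.0, 10.0], [1.0, 10.0]]
frontier = capability_frontier(q, c, [0.0, 0.5, 1.0])
for point in frontier:
    print(point)
```

At $\alpha=0$ the router always picks the cheapest model, and at $\alpha=1$ it picks the highest-quality model per prompt regardless of cost; intermediate values trade the two off.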

### 4.5 Posthoc Oracle

When a verifier is available at inference time, we can select among multiple generations after observing outputs. With $k$ generations per model and a perfect judge:

$$\mathcal{O}^{kshot}(k) = \frac{1}{N \binom{G}{k}} \sum_{n} \max_{l} \sum_{\substack{\mathcal{S} \subseteq [G] \\ |\mathcal{S}| = k}} \max_{j \in \mathcal{S}} \phi_{nlj} \tag{22}$$

Using the PGM:

$$\mathcal{O}^{kshot}(k) = \frac{1}{N} \sum_{n} \left[1 - \prod_{l} (1 - \pi_{nl})^{k}\right] \tag{23}$$

The formulation in Eqns. [22](https://arxiv.org/html/2606.26836#S4.E22) and [23](https://arxiv.org/html/2606.26836#S4.E23) is the most naive form of a posthoc router, where all LLMs are queried for every prompt. A tighter bound can be achieved using more efficient posthoc techniques, such as sequentially prompting LLMs with an early-return rule. This paper does not discuss those approaches, but we believe the gains can be attained at lower cost.

Critical caveats:

- Assumes a perfect judge (zero error)
- Assumes the judge is free (zero cost)
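Both posthoc expressions are straightforward to evaluate for binary metrics. In the sketch below, Eq. 22 is computed without enumerating subsets: for a model with $c$ correct generations out of $G$, the fraction of size-$k$ subsets containing at least one success is $1 - \binom{G-c}{k} / \binom{G}{k}$. The function names and toy inputs are assumptions, and the code inherits the perfect, free judge assumed above.

```python
import math

def kshot_empirical(phi, k):
    """Eq. 22 for binary phi[n][l][g]: average over k-subsets of the best generation."""
    total = 0.0
    for prompt in phi:
        best = 0.0
        for gens in prompt:
            G, c = len(gens), sum(gens)
            # Fraction of size-k subsets with at least one success
            # (math.comb returns 0 when k exceeds G - c).
            hit = 1.0 - math.comb(G - c, k) / math.comb(G, k)
            best = max(best, hit)
        total += best
    return total / len(phi)

def kshot_pgm(pi, k):
    """Eq. 23: chance that at least one of k draws from every model succeeds."""
    return sum(1.0 - math.prod((1.0 - p) ** k for p in row) for row in pi) / len(pi)

# Toy inputs (hypothetical): 2 prompts, 2 models, 4 generations each.
phi = [[[1, 0, 0, 0], [0, 0, 0, 0]],
       [[0, 0, 0, 0], [0, 0, 0, 0]]]
print(kshot_empirical(phi, 1), kshot_empirical(phi, 4))
print(kshot_pgm([[0.5, 0.5]], 1), kshot_pgm([[0.5, 0.5]], 2))
```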

## 5 Experimental Setup

##### Benchmarks\.

We evaluate on 16 benchmarks with verifiable correct answers, spanning:

- Coding: LiveCodeBench (Jain et al., [2024](https://arxiv.org/html/2606.26836#bib.bib30)), BigCodeBench (Zhuo and others, [2024](https://arxiv.org/html/2606.26836#bib.bib31)), HumanEval-X-Python, HumanEval-X-CPP, HumanEval-X-Javascript, HumanEval-Java, HumanEval-X-Go (Zheng and others, [2023](https://arxiv.org/html/2606.26836#bib.bib32)), MBPP (Austin et al., [2021](https://arxiv.org/html/2606.26836#bib.bib33)), LeetCode Hard (LeetCode, [2026](https://arxiv.org/html/2606.26836#bib.bib17))
- Reasoning: LiveBench-Reasoning (White et al., [2024](https://arxiv.org/html/2606.26836#bib.bib34)), GPQA Diamond (Rein and others, [2023](https://arxiv.org/html/2606.26836#bib.bib35))
- Instruction-following: LiveBench-IFEval (White et al., [2024](https://arxiv.org/html/2606.26836#bib.bib34))
- Medical: MedCalcBench (Khandekar et al., [2024](https://arxiv.org/html/2606.26836#bib.bib36))
- Factuality: TruthfulQA (Lin et al., [2022](https://arxiv.org/html/2606.26836#bib.bib37))
- Agentic: Terminal-Bench 2.0 (Institute and contributors, [2024](https://arxiv.org/html/2606.26836#bib.bib38)), LiveCodeBench (Jain et al., [2024](https://arxiv.org/html/2606.26836#bib.bib30))

These benchmarks have binary correctness metrics \(pass/fail for code, exact match for QA\), enabling clean oracle analysis\.

##### Models\.

We evaluate 21 LLMs spanning major providers:

- OpenAI: GPT-5-nano, GPT-5-mini, GPT-5.1 (OpenAI, [2025](https://arxiv.org/html/2606.26836#bib.bib41))
- Anthropic: Claude Haiku 4.5, Claude Sonnet 4.5 (Anthropic, [2025](https://arxiv.org/html/2606.26836#bib.bib43))
- Google: Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.5 Flash-Lite (Google Cloud, [2025](https://arxiv.org/html/2606.26836#bib.bib45))
- Meta: Llama 4 Scout, Llama 4 Maverick (Hugging Face, [2025](https://arxiv.org/html/2606.26836#bib.bib46))
- Mistral: Codestral 2508, Devstral Medium 2505, Devstral Small 2505, Mistral Small Instruct (Mistral AI, [2025](https://arxiv.org/html/2606.26836#bib.bib47))
- Qwen: Qwen3 Coder Plus, Qwen3 Coder Flash, Qwen 2.5 Max, Qwen 2.5 72B Instruct (Qwen Team and others, [2025](https://arxiv.org/html/2606.26836#bib.bib51))
- Moonshot: Kimi K2 (Moonshot AI, [2025](https://arxiv.org/html/2606.26836#bib.bib48))
- DeepSeek: DeepSeek R1 (DeepSeek-AI and others, [2025](https://arxiv.org/html/2606.26836#bib.bib49))
- Z.AI: GLM-4.6 (Z.ai, [2025](https://arxiv.org/html/2606.26836#bib.bib50))

##### Generation parameters\.

For all models we use the provider’s default hyper-parameters. Where benchmarks have a max tokens specified, we preserve that setting.

##### Metrics.

- Quality: Accuracy (fraction of prompts answered correctly). For coding benchmarks, we use execution-based verification.
- Cost: Total API cost in USD (input + output tokens × provider pricing).

##### Generations\.

Each prompt-model pair is evaluated with $G=10$ independent generations. This yields $N \times L \times G$ total observations per benchmark.

##### Cost measurement\.

Costs are computed using provider API pricing as of 01 Jan 2026.

##### Agentic Benchmarks\.

For agentic benchmarks, computing the true oracle is combinatorially hard since the optimal LLM may differ at each trajectory step. To simplify, we fix the LLM within each trajectory. This may understate routing benefits; true per-step routing could yield higher gains. We use the mini-SWE-agent (SWE-agent team, [2025](https://arxiv.org/html/2606.26836#bib.bib16)) with default parameters.

##### Synthetic oracle evaluation \(PGM study\)\.

In addition to real benchmarks, we run a controlled synthetic study to isolate how task heterogeneity drives oracle gains. Data are generated from the probabilistic graphical model defined in Figure [4](https://arxiv.org/html/2606.26836#S4.F4). We simulate multiple runs of $L=10$ LLMs across $T=30$ latent topics, with $N=1{,}000$ datapoints and $G=10{,}000$ generations per LLM per datapoint, with differing distributions of diversity. Full details can be found in Appendix [G](https://arxiv.org/html/2606.26836#A7).

Each datapoint is assigned a latent topic drawn from a Dirichlet distribution, whose concentration parameters are varied to sweep from high-entropy topic mixtures (near-uniform) to low-entropy regimes dominated by a single topic. Task difficulty is sampled per datapoint as $D \sim \text{Beta}(1,1)$, yielding a uniform difficulty distribution. Model aptitude is topic-specific, with each model-topic pair assigned an aptitude $A_{l,t} \sim \text{Beta}(5,5)$, inducing moderate specialization without extreme outliers.

For each instantiation of a simulation run, we compute the entropy and measure oracle performance as the accuracy of the best LLM selected per datapoint. We compare this to the accuracy of the globally best single LLM, reporting oracle uplift as their difference.
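The two quantities compared in this study, topic entropy and oracle uplift, can be sketched as below. The paper does not state the entropy base or exactly which distribution the entropy is taken over, so the natural-log Shannon entropy of the topic distribution used here is an assumption, as are the function names and toy inputs.

```python
import math

def topic_entropy(theta):
    """Shannon entropy (in nats) of a topic distribution; zero-probability topics are skipped."""
    return -sum(t * math.log(t) for t in theta if t > 0)

def oracle_uplift(pi):
    """Accuracy of the per-datapoint best model minus that of the globally best single model.

    pi[n][l] is the success probability of model l on datapoint n.
    """
    N, L = len(pi), len(pi[0])
    oracle = sum(max(row) for row in pi) / N
    best_single = max(sum(row[l] for row in pi) / N for l in range(L))
    return oracle - best_single

# Two specialists on two topics: high uplift. One dominant model: no uplift.
print(oracle_uplift([[1.0, 0.0], [0.0, 1.0]]))
print(oracle_uplift([[0.9, 0.1], [0.8, 0.2]]))
print(topic_entropy([0.5, 0.5]), topic_entropy([1.0, 0.0]))
```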

## 6 Results

Table 1: Combining LLMs boosts Quality. Benchmark-level breakdown comparing the SOTA LLM with $\mathcal{O}^{true}(\alpha=1)$. Error rate is reduced by 53.7% on average.

Table 2: Combining LLMs reduces cost. Cost breakdown comparing SOTA with $\mathcal{O}^{true}(\alpha=\alpha^{*})$. Total token cost is reduced by 85.2% on average.

### 6.1 Finding #1: LLM Routing Gives Substantial Gains

Using debiased oracles, we quantify achievable routing gains (Tables 1 and [2](https://arxiv.org/html/2606.26836#S6.T2)) by computing the Capability Frontier as described in Sec. [4.4](https://arxiv.org/html/2606.26836#S4.SS4) using Eq. [12](https://arxiv.org/html/2606.26836#S4.E12). We define the SOTA LLM as the model achieving the highest average quality. $\mathcal{O}^{true}(\alpha=1)$ corresponds to the most accurate achievable router, i.e., an oracle selecting the optimal model per query. $\mathcal{O}^{true}(\alpha=\alpha^{*})$ denotes the oracle evaluated at the value of $\alpha$ that matches SOTA quality, capturing the maximal cost savings attainable at equivalent performance.

Error rate reduction: Compared to the SOTA LLM, $\mathcal{O}^{true}(\alpha=1)$ achieves a 54% average error reduction.

Cost savings at SOTA quality: Compared to the SOTA LLM, $\mathcal{O}^{true}(\alpha=\alpha^{*})$ achieves 85% average cost savings.
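To make the two headline metrics concrete, the sketch below computes them for made-up numbers. It assumes that error reduction and cost savings are both measured relative to the SOTA LLM, which is the natural reading of the percentages above but is not spelled out in this section; the function names and input values are hypothetical.

```python
def error_rate_reduction(sota_accuracy, oracle_accuracy):
    """Relative drop in error rate when moving from SOTA to the oracle."""
    sota_error = 1.0 - sota_accuracy
    oracle_error = 1.0 - oracle_accuracy
    return (sota_error - oracle_error) / sota_error


def cost_savings(sota_cost, oracle_cost):
    """Relative drop in cost at matched quality."""
    return (sota_cost - oracle_cost) / sota_cost


# Hypothetical benchmark: SOTA at 80% accuracy, oracle at 90% accuracy.
reduction = error_rate_reduction(0.80, 0.90)
# Hypothetical costs: SOTA spends 100 units, the oracle 15 at equal quality.
savings = cost_savings(100.0, 15.0)
print(f"error reduction: {reduction:.0%}, cost savings: {savings:.0%}")
```

Note that a relative error reduction can be large even when the absolute accuracy gain is small, for example when the SOTA model is already close to saturating a benchmark.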

### 6.2 Finding #2: Posthoc Routing Increases Gains

By leveraging a free and perfect judge at inference time as described in Eq. [22](https://arxiv.org/html/2606.26836#S4.E22), the error rate can be reduced further (Appendix [E](https://arxiv.org/html/2606.26836#A5), Tab. [4](https://arxiv.org/html/2606.26836#A5.T4) & [5](https://arxiv.org/html/2606.26836#A5.T5)). The results quantify not only that gain but also how quickly it changes as the number of attempts, $k$, increases from $1$ to $10$.

As described in Sec. [4.5](https://arxiv.org/html/2606.26836#S4.SS5), this paper uses the most naive form of a posthoc router. We believe these gains can be attained at significantly lower cost.

$k=1$: 66% error reduction vs. SOTA LLM.

$k=10$: 82% error reduction vs. SOTA LLM.
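The effect of additional attempts under a perfect judge can be illustrated with a simple calculation. The sketch below is not Eq. 22: it assumes attempts are independent draws with a fixed per-attempt success probability, so that a query is solved whenever at least one of $k$ attempts is correct. The function name and the example probabilities are hypothetical.

```python
def best_of_k_accuracy(success_probs, k):
    """Expected accuracy when a perfect judge keeps any correct attempt.

    ASSUMPTION: the k attempts on a query are independent, so the chance
    that all of them fail is (1 - p) ** k.
    """
    solved = [1.0 - (1.0 - p) ** k for p in success_probs]
    return sum(solved) / len(solved)


# Hypothetical per-query success probabilities for one model.
probs = [0.9, 0.5, 0.1, 0.0]
for k in (1, 2, 5, 10):
    print(k, round(best_of_k_accuracy(probs, k), 3))
```

The gain saturates as $k$ grows: queries with moderate success probability are solved quickly, while queries the model can never solve (probability 0) stay unsolved regardless of $k$.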

### 6.3 Finding #3: Naive Oracles Overestimate Gains

We compare $\mathcal{O}^{biased}$ to $\mathcal{O}^{true}$ across benchmarks (App. [F](https://arxiv.org/html/2606.26836#A6), Tab. [6](https://arxiv.org/html/2606.26836#A6.T6), Fig. [12](https://arxiv.org/html/2606.26836#A6.F12)).

Quality bias: Average 1.2% overestimation. Cost bias: Average 37.5% overestimation.

The larger cost bias arises because cost distributions are more skewed, amplifying selection effects\.
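The selection effect behind this overestimation can be reproduced on toy data. The sketch below treats the naive oracle as the per-datapoint maximum over noisy sample means and the true oracle as the maximum over the underlying success probabilities; this is a simplified stand-in for $\mathcal{O}^{biased}$ and $\mathcal{O}^{true}$, not the paper's construction, and the function name and settings are hypothetical.

```python
import random


def oracle_quality(true_p, n_gen, seed=0):
    """Return (naive, true) oracle quality for a matrix of success probabilities.

    true_p[n][l] is the true success probability of model l on datapoint n.
    The naive oracle takes the max over sample means from n_gen generations.
    """
    rng = random.Random(seed)
    naive_total = 0.0
    true_total = 0.0
    for row in true_p:
        sample_means = [sum(rng.random() < p for _ in range(n_gen)) / n_gen
                        for p in row]
        naive_total += max(sample_means)  # max over noisy estimates
        true_total += max(row)            # max over the true probabilities
    n = len(true_p)
    return naive_total / n, true_total / n


# Toy setting: 5 interchangeable models, each correct half of the time.
true_p = [[0.5] * 5 for _ in range(2000)]
naive, true = oracle_quality(true_p, n_gen=10)
print(f"naive oracle: {naive:.3f}, true oracle: {true:.3f}")
```

This toy case exaggerates the effect because every model is an identical coin flip, so the maximum is driven entirely by noise; with real models that differ in aptitude the quality bias is far smaller, as the averages above indicate.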

### 6.4 Finding #4: Model Reliability Varies Substantially

With default hyperparameters, LLMs by design produce different responses when prompted multiple times with the same input. An obvious question is how consistently an LLM solves a given problem across these generations. Appendix [C](https://arxiv.org/html/2606.26836#A3), Tab. [3](https://arxiv.org/html/2606.26836#A3.T3) shows how different LLMs rank for reliability:

$$\text{reliability}(l)=2\times\frac{1}{N}\sum_{n}\left|\bar{\phi}_{nl}-0.5\right| \tag{24}$$

The most reliable LLM we tested was GPT-5-mini with a score of $90.2\%$ and the least reliable was GLM-4.6 at $76.3\%$. There is no significant correlation between Reliability and either Quality or Cost.
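The reliability score in Eq. 24 is straightforward to compute from per-generation outcomes. The sketch below assumes that $\bar{\phi}_{nl}$ is the fraction of generations in which model $l$ solves datapoint $n$; the function name and the input layout are hypothetical.

```python
def reliability(outcomes):
    """Reliability of one model, following Eq. 24.

    outcomes[n][g] is 1 if generation g solved datapoint n, else 0.
    ASSUMPTION: phi_bar for a datapoint is the mean outcome over generations.
    """
    total = 0.0
    for generations in outcomes:
        phi_bar = sum(generations) / len(generations)
        total += abs(phi_bar - 0.5)
    return 2.0 * total / len(outcomes)


# A model that always succeeds or always fails per datapoint is fully reliable.
print(reliability([[1, 1, 1, 1], [0, 0, 0, 0]]))
# A model that solves a datapoint in half of its generations scores zero.
print(reliability([[1, 0, 1, 0]]))
```

The score measures consistency, not correctness: a model that fails every generation of every datapoint is as reliable as one that always succeeds, which is consistent with reliability being uncorrelated with quality.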

### 6.5 Finding #5: Simulations Show Oracle Uplift Increases with Data Diversity

Figure [5](https://arxiv.org/html/2606.26836#S6.F5) shows oracle uplift as topic entropy varies in the synthetic PGM study. Uplift increases near-monotonically with entropy: it is minimal in single-topic regimes and largest under uniform mixtures. This may explain why oracle gains vary across benchmarks and settings; details are given in App. [G](https://arxiv.org/html/2606.26836#A7).

![Refer to caption](https://arxiv.org/html/2606.26836v1/images/entropy.png)
Figure 5: Synthetic PGM study measuring the performance gap between an oracle router and the best single LLM as task diversity increases. The x-axis shows the entropy of the topic distribution, interpolating from single-topic (low entropy) to highly mixed workloads (high entropy). The y-axis reports oracle minus best-single-model accuracy.

## 7 Limitations

##### Limited generations.

With $G=10$, extrapolation carries uncertainty. We mitigate this by (1) testing on synthetic data with known ground truth and (2) comparing to PGM estimates. However, a larger $G$ would improve estimates.

##### Agentic Benchmark Gains May Be Underestimated.

For agentic benchmarks, computing $\mathcal{O}^{true}$ is combinatorially hard since the optimal LLM may differ at each trajectory step. To simplify, we fix the LLM within each trajectory. This may understate routing benefits; true per-step routing could yield higher gains.

##### Perfect judge assumption\.

Posthoc oracles assume error\-free, cost\-free judges\. Real verifiers introduce errors and costs that reduce achievable gains\.

## 8 Conclusion

This work re-evaluates how the performance of large language models is measured. We show that standard benchmark evaluations, typically based on a single model and a single sampled output per prompt, do not capture the full range of performance that is already attainable with existing models and inference budgets. At the same time, we demonstrate that naive aggregation across models or runs can lead to overly optimistic estimates due to noise. To address both effects, we introduced the *Capability Frontier*, a quality-cost Pareto frontier that characterizes achievable performance while explicitly correcting for these opposing biases.

Empirically, across 21 LLMs and 16 benchmarks, the Capability Frontier substantially outperforms standard single\-model evaluations\. At matched cost, correcting for single\-model evaluation reduces error by 54% on average, while additionally accounting for single\-run variability yields an 82% reduction\. Conversely, at matched accuracy, frontier points often achieve performance comparable to the SOTA LLM at a fraction of the cost\. These results suggest that commonly reported benchmark scores can significantly understate achievable system\-level performance\. Our simulations suggest these gains scale with data heterogeneity: more diverse workloads induce greater model complementarity and larger frontier improvements\.

##### Implications\.

Our findings have several implications for the evaluation and use of LLMs:

- Evaluation methodology. Single-model, single-run benchmarks provide a limited view of model capability. Capability Frontier-based analysis offers a complementary perspective that accounts for model diversity and sampling effects, and can help contextualize results.
- System design. While the Capability Frontier itself is not a deployment strategy, it highlights regimes where simple routing or repeated sampling may be sufficient to achieve large gains, and where more sophisticated methods are necessary to approach the attainable limits.

##### Future work\.

Several extensions are natural. First, judge error and cost could be incorporated directly into posthoc frontier construction. Second, agentic evaluation could be extended beyond fixed-trajectory routing. Third, practical routing policies that approach frontier performance under realistic deployment constraints remain to be developed and evaluated. Fourth, the effect of system prompt selection and hyperparameter sampling on the frontier deserves study. Finally, empirically characterizing the link between data diversity and frontier gains remains an important direction.

## References

- J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat,et al\.\(2023\)Gpt\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[§1](https://arxiv.org/html/2606.26836#S1.p1.1)\.
- Agrawal and Gupta \(2025\)LLMRank: understanding llm strengths for model routing\.arXiv preprint arXiv:2510\.01234\.External Links:2510\.01234,[Link](https://arxiv.org/abs/2510.01234)Cited by:[Appendix D](https://arxiv.org/html/2606.26836#A4.SS0.SSS0.Px3.p1.1)\.
- I\. Andrews, T\. Kitagawa, and A\. McCloskey \(2024\)Inference on winners\.The Quarterly Journal of Economics139\(1\),pp\. 305–358\.Cited by:[§2](https://arxiv.org/html/2606.26836#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.26836#S4.SS1.p1.1)\.
- Anthropic \(2025\)What’s new in claude 4\.5\.Note:[https://platform\.claude\.com/docs/en/about\-claude/models/whats\-new\-claude\-4\-5](https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-5)Cited by:[2nd item](https://arxiv.org/html/2606.26836#S5.I2.i2.p1.1)\.
- J\. Austin, A\. Odena, M\. Nye, M\. Bosma, H\. Michalewski, D\. Dohan, E\. Jiang, C\. Cai, M\. Terry, Q\. Le, and C\. Sutton \(2021\)Program synthesis with large language models\.arXiv\.Note:arXiv:2108\.07732Cited by:[1st item](https://arxiv.org/html/2606.26836#S5.I1.i1.p1.1)\.
- E\. C\. Capen, R\. V\. Clapp, and W\. M\. Campbell \(1971\)Competitive bidding in high\-risk situations\.Journal of Petroleum Technology\.Cited by:[§2](https://arxiv.org/html/2606.26836#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.26836#S4.SS1.p1.1)\.
- L\. Chen, M\. Zaharia, and J\. Zou \(2023\)FrugalGPT: how to use large language models while reducing cost and improving performance\.External Links:2305\.05176,[Link](https://arxiv.org/abs/2305.05176)Cited by:[Appendix D](https://arxiv.org/html/2606.26836#A4.SS0.SSS0.Px2.p1.1)\.
- DeepSeek\-AIet al\.\(2025\)DeepSeek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.arXiv\.Note:arXiv:2501\.12948Cited by:[8th item](https://arxiv.org/html/2606.26836#S5.I2.i8.p1.1)\.
- J\. Dekoninck, M\. Baader, and M\. Vechev \(2025\)A unified approach to routing and cascading for llms\.External Links:2410\.10347,[Link](https://arxiv.org/abs/2410.10347)Cited by:[Appendix D](https://arxiv.org/html/2606.26836#A4.SS0.SSS0.Px2.p1.1)\.
- D\. Ding, A\. Mallick, S\. Zhang, C\. Wang, D\. Madrigal, M\. D\. C\. H\. Garcia, M\. Xia, L\. V\. S\. Lakshmanan, Q\. Wu, and V\. Rühle \(2025\)BEST\-route: adaptive llm routing with test\-time optimal compute\.External Links:2506\.22716,[Link](https://arxiv.org/abs/2506.22716)Cited by:[Appendix D](https://arxiv.org/html/2606.26836#A4.SS0.SSS0.Px3.p1.1)\.
- Google Cloud \(2025\)Gemini 2\.5 updates: flash/pro ga, sft, flash\-lite on vertex ai\.Note:[https://cloud\.google\.com/blog/products/ai\-machine\-learning/gemini\-2\-5\-flash\-lite\-flash\-pro\-ga\-vertex\-ai](https://cloud.google.com/blog/products/ai-machine-learning/gemini-2-5-flash-lite-flash-pro-ga-vertex-ai)Cited by:[3rd item](https://arxiv.org/html/2606.26836#S5.I2.i3.p1.1)\.
- N\. Guha, M\. F\. Chen, T\. Chow, I\. S\. Khare, and C\. Ré \(2024\)Smoothie: label free language model routing\.arXiv preprint arXiv:2412\.04692\.External Links:2412\.04692,[Link](https://arxiv.org/abs/2412.04692)Cited by:[Appendix D](https://arxiv.org/html/2606.26836#A4.SS0.SSS0.Px1.p1.1)\.
- Q\. J\. Hu, J\. Bieker, X\. Li, N\. Jiang, B\. Keigwin, G\. Ranganath, K\. Keutzer, and S\. K\. Upadhyay \(2024\)RouterBench: a benchmark for multi\-llm routing system\.External Links:2403\.12031,[Link](https://arxiv.org/abs/2403.12031)Cited by:[§1](https://arxiv.org/html/2606.26836#S1.p2.2),[§2](https://arxiv.org/html/2606.26836#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.26836#S2.p1.1)\.
- Hugging Face \(2025\)Welcome llama 4 maverick & scout on hugging face\.Note:[https://huggingface\.co/blog/llama4\-release](https://huggingface.co/blog/llama4-release)Cited by:[4th item](https://arxiv.org/html/2606.26836#S5.I2.i4.p1.1)\.
- L\. Institute and contributors \(2024\)Terminal\-bench\.Note:[https://github\.com/laude\-institute/terminal\-bench](https://github.com/laude-institute/terminal-bench)Cited by:[6th item](https://arxiv.org/html/2606.26836#S5.I1.i6.p1.1)\.
- N\. Jain, K\. Han, A\. Gu, W\. Li, F\. Yan, T\. Zhang, S\. Wang, A\. Solar\-Lezama, K\. Sen, and I\. Stoica \(2024\)LiveCodeBench: holistic and contamination free evaluation of large language models for code\.arXiv\.Note:arXiv:2403\.07974Cited by:[1st item](https://arxiv.org/html/2606.26836#S5.I1.i1.p1.1),[6th item](https://arxiv.org/html/2606.26836#S5.I1.i6.p1.1)\.
- W\. Jitkrittum, H\. Narasimhan, A\. S\. Rawat, J\. Juneja, C\. Wang, Z\. Wang, A\. Go, C\. Lee, P\. Shenoy, R\. Panigrahy, A\. K\. Menon, and S\. Kumar \(2025\)Universal model routing for efficient llm inference\.arXiv arXiv:2502\.08773\.External Links:2502\.08773,[Link](https://arxiv.org/abs/2502.08773)Cited by:[§2](https://arxiv.org/html/2606.26836#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Khandekar, Q\. Jin, G\. Xiong, S\. Dunn, S\. S\. Applebaum, Z\. Anwar, M\. Sarfo\-Gyamfi, C\. W\. Safranek, A\. A\. Anwar, A\. Zhang, A\. Gilson, M\. B\. Singer, A\. Dave, A\. Taylor, A\. Zhang, Q\. Chen, and Z\. Lu \(2024\)MedCalc\-bench: evaluating large language models for medical calculations\.arXiv\.Note:arXiv:2406\.12036Cited by:[4th item](https://arxiv.org/html/2606.26836#S5.I1.i4.p1.1)\.
- S\. Kolawole, D\. Dennis, A\. Talwalkar, and V\. Smith \(2025\)Agreement\-based cascading for efficient inference\.External Links:2407\.02348,[Link](https://arxiv.org/abs/2407.02348)Cited by:[Appendix D](https://arxiv.org/html/2606.26836#A4.SS0.SSS0.Px3.p1.1)\.
- D\. Koller and N\. Friedman \(2009\)Probabilistic graphical models: principles and techniques\.MIT Press,Cambridge, MA\.Cited by:[§4\.3\.2](https://arxiv.org/html/2606.26836#S4.SS3.SSS2.p1.4)\.
- LeetCode \(2026\)LeetCode\.Note:[https://leetcode\.com/](https://leetcode.com/)Accessed: 2026\-01\-19Cited by:[1st item](https://arxiv.org/html/2606.26836#S5.I1.i1.p1.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\)TruthfulQA: measuring how models mimic human falsehoods\.InProceedings of ACL 2022,Cited by:[5th item](https://arxiv.org/html/2606.26836#S5.I1.i5.p1.1)\.
- Mistral AI \(2025\)Announcing codestral 25\.08 and the complete mistral coding stack for enterprises\.Note:[https://mistral\.ai/news/codestral\-25\-08](https://mistral.ai/news/codestral-25-08)Cited by:[5th item](https://arxiv.org/html/2606.26836#S5.I2.i5.p1.1)\.
- Moonshot AI \(2025\)Kimi k2: open agentic intelligence\.Note:[https://moonshotai\.github\.io/Kimi\-K2/](https://moonshotai.github.io/Kimi-K2/)Cited by:[7th item](https://arxiv.org/html/2606.26836#S5.I2.i7.p1.1)\.
- I\. Ong, A\. Almahairi, V\. Wu, W\. Chiang, T\. Wu, J\. E\. Gonzalez, M\. W\. Kadous, and I\. Stoica \(2025\)RouteLLM: learning to route llms with preference data\.External Links:2406\.18665,[Link](https://arxiv.org/abs/2406.18665)Cited by:[Appendix D](https://arxiv.org/html/2606.26836#A4.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.26836#S1.p2.2)\.
- OpenAI \(2025\)Introducing gpt\-5\.1 for developers\.Note:[https://openai\.com/index/gpt\-5\-1\-for\-developers/](https://openai.com/index/gpt-5-1-for-developers/)Cited by:[1st item](https://arxiv.org/html/2606.26836#S5.I2.i1.p1.1)\.
- C\. Qian, Z\. Liu, S\. Kokane, A\. Prabhakar, J\. Qiu, H\. Chen, Z\. Liu, H\. Ji, W\. Yao, S\. Heinecke, S\. Savarese, C\. Xiong, and H\. Wang \(2025\)XRouter: training cost\-aware llms orchestration system via reinforcement learning\.External Links:2510\.08439,[Link](https://arxiv.org/abs/2510.08439)Cited by:[Appendix D](https://arxiv.org/html/2606.26836#A4.SS0.SSS0.Px2.p1.1)\.
- P\. Quirke, N\. Oozeer, C\. Bandi, A\. Abdullah, J\. Hoelscher\-Obermaier, J\. M\. Phillips, J\. Greaves, C\. Neo, M\. Lan, F\. Barez, and S\. Upadhyay \(2025\)Beyond monoliths: expert orchestration for more capable, democratic, and safe language models\.External Links:2506\.00051,[Link](https://arxiv.org/abs/2506.00051)Cited by:[§2](https://arxiv.org/html/2606.26836#S2.SS0.SSS0.Px2.p1.1)\.
- Qwen Teamet al\.\(2025\)Qwen3 technical report\.arXiv\.Note:arXiv:2505\.09388Cited by:[6th item](https://arxiv.org/html/2606.26836#S5.I2.i6.p1.1)\.
- D\. Reinet al\.\(2023\)GPQA: a graduate\-level google\-proof q&a benchmark\.arXiv\.Note:arXiv:2311\.12022Cited by:[2nd item](https://arxiv.org/html/2606.26836#S5.I1.i2.p1.1)\.
- R\. Shirkavand, S\. Gao, P\. Yu, and H\. Huang \(2025\)Cost\-aware contrastive routing for llms\.arXiv preprint arXiv:2508\.12491\.External Links:2508\.12491,[Link](https://arxiv.org/abs/2508.12491)Cited by:[Appendix D](https://arxiv.org/html/2606.26836#A4.SS0.SSS0.Px1.p1.1)\.
- T\. Shnitzer, A\. Ou, M\. Silva, K\. Soule, Y\. Sun, J\. Solomon, N\. Thompson, and M\. Yurochkin \(2023\)Large language model routing with benchmark datasets\.External Links:2309\.15789,[Link](https://arxiv.org/abs/2309.15789)Cited by:[§1](https://arxiv.org/html/2606.26836#S1.p2.2),[§2](https://arxiv.org/html/2606.26836#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.26836#S2.p1.1)\.
- K\. Singhal, T\. Tu, J\. Gottweis, R\. Sayres, E\. Wulczyn, M\. Amin, L\. Hou, K\. Clark, S\. R\. Pfohl, H\. Cole\-Lewis,et al\.\(2025\)Toward expert\-level medical question answering with large language models\.Nature Medicine31\(3\),pp\. 943–950\.Cited by:[§1](https://arxiv.org/html/2606.26836#S1.p1.1)\.
- J\. E\. Smith and R\. L\. Winkler \(2006\)The optimizer’s curse: skepticism and postdecision surprise in decision analysis\.Management Science52\(3\),pp\. 311–322\.Cited by:[§2](https://arxiv.org/html/2606.26836#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.26836#S4.SS1.p1.1)\.
- SWE\-agent team \(2025\)Mini\-swe\-agent: the 100 line ai agent that solves github issues or helps you in your command line\.Note:[https://github\.com/SWE\-agent/mini\-swe\-agent](https://github.com/SWE-agent/mini-swe-agent)Cited by:[§5](https://arxiv.org/html/2606.26836#S5.SS0.SSS0.Px6.p1.1)\.
- A\. Valkanas, S\. Pal, P\. Rumiantsev, Y\. Zhang, and M\. Coates \(2025\)C3PO: optimized large language model cascades with probabilistic cost constraints for reasoning\.External Links:2511\.07396,[Link](https://arxiv.org/abs/2511.07396)Cited by:[Appendix D](https://arxiv.org/html/2606.26836#A4.SS0.SSS0.Px2.p1.1)\.
- C\. White, S\. Dooley, M\. Roberts, A\. Pal, B\. Feuer, S\. Jain, R\. Shwartz\-Ziv, N\. Jain, K\. Saifullah, S\. Naidu, C\. Hegde, Y\. LeCun, T\. Goldstein, W\. Neiswanger, and M\. Goldblum \(2024\)LiveBench: a challenging, contamination\-free llm benchmark\.arXiv\.Note:arXiv:2406\.19314Cited by:[2nd item](https://arxiv.org/html/2606.26836#S5.I1.i2.p1.1),[3rd item](https://arxiv.org/html/2606.26836#S5.I1.i3.p1.1)\.
- F\. Wu and S\. Silwal \(2025\)Efficient training\-free online routing for high\-volume multi\-llm serving\.External Links:2509\.02718,[Link](https://arxiv.org/abs/2509.02718)Cited by:[Appendix D](https://arxiv.org/html/2606.26836#A4.SS0.SSS0.Px1.p1.1)\.
- C\. Yan, W\. Zhang, Z\. Ning, F\. Xu, Z\. Tao, L\. Zhang, B\. Yin, and Y\. Zhang \(2026\)Breaking model lock\-in: cost\-efficient zero\-shot llm routing via a universal latent space\.arXiv arXiv:2601\.06220\.External Links:2601\.06220,[Link](https://arxiv.org/abs/2601.06220)Cited by:[§2](https://arxiv.org/html/2606.26836#S2.SS0.SSS0.Px1.p1.1)\.
- Z\.ai \(2025\)GLM\-4\.6v: open source multimodal models with native tool use\.Note:[https://z\.ai/blog/glm\-4\.6v](https://z.ai/blog/glm-4.6v)Cited by:[9th item](https://arxiv.org/html/2606.26836#S5.I2.i9.p1.1)\.
- Q\. Zhenget al\.\(2023\)HumanEval\-x: a new benchmark for multilingual program synthesis\.Note:CodeGeeX benchmark release[https://github\.com/zai\-org/CodeGeeX](https://github.com/zai-org/CodeGeeX)Cited by:[1st item](https://arxiv.org/html/2606.26836#S5.I1.i1.p1.1)\.
- T\. Y\. Zhuoet al\.\(2024\)BigCodeBench: benchmarking code generation with diverse function calls and complex instructions\.arXiv\.Note:arXiv:2406\.15877Cited by:[1st item](https://arxiv.org/html/2606.26836#S5.I1.i1.p1.1)\.

## Appendix A Synthetic Study of Bias Decay

We studied how oracle bias decays for varying numbers of LLMs, correlations between them, and LLM success probabilities. To do this, we used the PGM defined in Sec. [4.3.2](https://arxiv.org/html/2606.26836#S4.SS3.SSS2) to generate binary tensors of synthetic data.

For zero LLM correlation, Fig. [2](https://arxiv.org/html/2606.26836#S4.F2) shows results of some of these experiments. As expected, we found that increasing either the number of LLMs, $L$, or the LLM success probability, $p$, increased the number of generations, $G$, needed before the bias decayed as $O(G^{-0.5})$. We verified empirically that this decay rate is always reached for large enough $G$.

For the regime $L \le 10$, $0.3 \le p \le 0.7$, we found that at least $G=50$ generations were needed before Eqn. [25](https://arxiv.org/html/2606.26836#A1.E25) would fit with $c=0.5\pm 0.1$:

$$y=ax^{-b}+c \tag{25}$$

When LLM correlation was increased, the overall bias decreased, as did the number of generations needed before the decay followed a predictable pattern. In the limit of large $G$, we found empirically that the exponent ranged from $0.25$ to $0.75$ for sensible hyperparameters. As before, approximately $G=50$ generations were needed for a consistent fit.
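The zero-correlation decay rate can be checked with a small Monte Carlo experiment. The sketch below is a simplification of the study above: it uses independent, identical models instead of the PGM, and it fits a pure power law in log-log space, which is a simpler form than Eq. 25. The function names and all settings ($L=5$, $p=0.5$, the chosen values of $G$) are hypothetical.

```python
import math
import random


def max_bias(n_models, p, n_gen, trials, rng):
    """Average overestimate when taking the best of n_models sample means."""
    total = 0.0
    for _ in range(trials):
        best = max(sum(rng.random() < p for _ in range(n_gen))
                   for _ in range(n_models))
        total += best / n_gen
    return total / trials - p


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


rng = random.Random(0)
generations = [25, 100, 400]
biases = [max_bias(5, 0.5, g, 1000, rng) for g in generations]
slope = loglog_slope(generations, biases)
print(f"estimated decay exponent: {-slope:.2f}")
```

With independent models the slope should come out close to $-0.5$, matching the $O(G^{-0.5})$ rate; reproducing the correlated case would require a model of inter-LLM correlation, which this sketch does not include.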

## Appendix B Benchmarks

Figs. [6](https://arxiv.org/html/2606.26836#A2.F6)-[11](https://arxiv.org/html/2606.26836#A2.F11) show the Capability Frontier for six selected benchmarks.

![Refer to caption](https://arxiv.org/html/2606.26836v1/benchmarks/bcb.png)
Figure 6: BigCodeBench

![Refer to caption](https://arxiv.org/html/2606.26836v1/benchmarks/leetcode.png)
Figure 7: LeetCode

![Refer to caption](https://arxiv.org/html/2606.26836v1/benchmarks/livebench-coding.png)
Figure 8: LiveBench-Coding

![Refer to caption](https://arxiv.org/html/2606.26836v1/benchmarks/livebench-reasoing.png)
Figure 9: LiveBench-Reasoning

![Refer to caption](https://arxiv.org/html/2606.26836v1/benchmarks/medcalcbench.png)
Figure 10: MedCalcBench

![Refer to caption](https://arxiv.org/html/2606.26836v1/benchmarks/terminal-bench.png)
Figure 11: Terminal-Bench 2.0 (agentic)

## Appendix C LLM Reliability Results

Tab. [3](https://arxiv.org/html/2606.26836#A3.T3) summarizes the reliability of LLMs. The scores shown are averaged across all 16 benchmarks.

Table 3: LLMs have variable reliability. Comparison of LLMs for reliability in solving a problem. Scores averaged across all benchmarks.

## Appendix D More routing methods

##### Training-Free and Online Methods

To support high-volume serving without extensive labeled data, several training-free approaches have emerged. Wu and Silwal ([2025](https://arxiv.org/html/2606.26836#bib.bib7)) proposed an online routing mechanism using approximate nearest neighbor search to estimate query features with theoretical performance guarantees. Building on this, CSCR (Shirkavand et al., [2025](https://arxiv.org/html/2606.26836#bib.bib10)) employs cost-aware contrastive learning to map prompts and models into a shared embedding space, facilitating low-latency routing sensitive to both cost and quality. For environments lacking ground-truth labels, Smoothie (Guha et al., [2024](https://arxiv.org/html/2606.26836#bib.bib11)) provides a label-free framework that leverages weak supervision and model ensembles to estimate query-specific quality.

##### Cascades and Multi-Objective Optimization

Routing is often implemented as a cascade, where simpler models are queried before deferring to more expensive ones. FrugalGPT (Chen et al., [2023](https://arxiv.org/html/2606.26836#bib.bib14)) pioneered this via learned cascades to reduce spend without degrading quality. More recently, C3PO (Valkanas et al., [2025](https://arxiv.org/html/2606.26836#bib.bib4)) achieved cost-controlled cascades using conformal prediction to provide provable coverage bounds, while Dekoninck et al. ([2025](https://arxiv.org/html/2606.26836#bib.bib2)) derived optimal stopping rules for sequential model invocation. Other multi-objective systems, such as xRouter (Qian et al., [2025](https://arxiv.org/html/2606.26836#bib.bib1)), utilize reinforcement learning with explicit monetary rewards to navigate the quality-cost trade-off surface.

##### Agreement-Based and Preference Routing

Alternative signals for routing include model agreement and human preferences. ABC (Kolawole et al., [2025](https://arxiv.org/html/2606.26836#bib.bib3)) uses agreement between models to make deferral decisions, while BEST-Route (Ding et al., [2025](https://arxiv.org/html/2606.26836#bib.bib5)) jointly optimizes model selection and best-of-n sampling. In the absence of objective correctness, RouteLLM (Ong et al., [2025](https://arxiv.org/html/2606.26836#bib.bib15)) trains routers on “Chatbot Arena” style preference data to maintain quality at significantly reduced costs. To improve transparency, LLMRank (Agrawal and Gupta, [2025](https://arxiv.org/html/2606.26836#bib.bib9)) analyzes specific reasoning patterns to provide a granular understanding of model utility beyond aggregate scores.

## Appendix E Posthoc Routing Gains

Tab. [4](https://arxiv.org/html/2606.26836#A5.T4) and Tab. [5](https://arxiv.org/html/2606.26836#A5.T5) show the error rate reductions for $k=1$-shot and $k=10$-shot posthoc routing. The average error rate reductions are 66.1% and 82.4%, respectively.

Table 4: Benchmark-level breakdown comparing SOTA LLM with $\mathcal{O}^{kshot}(k=1)$.

Table 5: Benchmark-level breakdown comparing SOTA LLM with $\mathcal{O}^{kshot}(k=10)$.

## Appendix F Quantifying the Bias

Fig. [12](https://arxiv.org/html/2606.26836#A6.F12) shows the bias quantification across 16 benchmarks.

[Figure 12 plot: one row per benchmark (LiveBench-Coding (agentic), Terminal-Bench 2.0 (agentic), GPQA Diamond, LiveBench-Reasoning, LiveBench-IFEval, TruthfulQA, MedCalcBench, MBPP, HumanEval-X (Go), HumanEval-X (Java), HumanEval-X (Javascript), HumanEval-X (CPP), HumanEval-X (Python), LeetCode, BigCodeBench, LiveBench-Coding); left axis: Quality Decrease (%); right axis: Cost Decrease (%).]

Figure 12: Bias quantification across benchmarks. The biased oracle $\mathcal{O}^{biased}$ overestimates both Quality (left, blue circles) and Cost (right, red triangles) relative to the true oracle $\mathcal{O}^{true}$. Quality bias is modest (under 9%, averaging $\approx 1.0$%), while cost bias varies substantially (1.5–68%, averaging $\approx 23.8$%). Dotted vertical lines indicate averages. See Tab. [6](https://arxiv.org/html/2606.26836#A6.T6) for detailed values.

Table 6: Naive oracles systematically overestimate. Comparison of biased and debiased oracle estimates. Cost bias is substantially larger than quality bias.

## Appendix G Topic Entropy Study

We ran a synthetic study measuring the effect of topic distribution entropy on oracle uplift, defined as:

$$\text{uplift}=\mathcal{O}^{true}-\max_{l}\frac{1}{N}\sum_{n}\bar{\phi}_{nl} \tag{26}$$

10 million observations (1000 datapoints, 10000 generations) were generated using the PGM defined in Sec. [4.3.2](https://arxiv.org/html/2606.26836#S4.SS3.SSS2). It was configured with:

1. 30 latent topics, $T$
2. 10 LLMs, $K$
3. Task difficulty distribution, $D_{n}\sim \text{Beta}(\alpha=1,\beta=1)$, i.e., uniform
4. LLM topic aptitude, $A_{tl}\sim \text{Beta}(\alpha=5,\beta=5)$

We linearly varied the topic distribution, $T_{n}\sim \text{Dirichlet}(\boldsymbol{\alpha})$, from uniform, $\alpha_{k}=\frac{1}{K}$ (high entropy), to a single topic, $\alpha_{1}=1,\ \alpha_{2:K}=0$ (low entropy). Topic entropy is given by:

$$H(\boldsymbol{\alpha})=-\sum_{k=1}^{K}\alpha_{k}\log\alpha_{k} \tag{27}$$
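The entropy sweep can be sketched as follows. This example treats $\boldsymbol{\alpha}$ as a probability vector over topics and reads "linearly varied" as a linear interpolation between the uniform vector and a one-hot vector; that reading, the function names, and the mixing weight `w` are assumptions of this sketch. Terms with $\alpha_k = 0$ are skipped, following the usual convention $0 \log 0 = 0$.

```python
import math


def topic_entropy(alpha):
    """Entropy of a topic distribution as in Eq. 27 (natural log)."""
    return -sum(a * math.log(a) for a in alpha if a > 0)


def interpolate(n_topics, w):
    """Blend uniform (w = 0) with a single-topic distribution (w = 1)."""
    uniform = 1.0 / n_topics
    return [(1.0 - w) * uniform + (w if i == 0 else 0.0)
            for i in range(n_topics)]


# Sweep from uniform to single-topic over 30 topics.
entropies = [topic_entropy(interpolate(30, i / 10)) for i in range(11)]
for i, h in enumerate(entropies):
    print(f"w={i / 10:.1f}  entropy={h:.3f}")
```

The sweep runs from the maximum entropy $\log 30$ at the uniform end down to zero at the single-topic end, which gives the x-axis range over which uplift is reported.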

## Appendix H LLM usage

We used large language models \(LLMs\) solely for light writing and editorial assistance\. Specifically, LLMs were used to suggest minor improvements to grammar, clarity, and flow in portions of the manuscript\. All technical contributions, empirical findings, and conclusions are the original work of the authors\. We reviewed and verified all LLM\-assisted edits to ensure accuracy and alignment with the intended meaning\.