How Inference Compute Shapes Frontier LLM Evaluation

arXiv cs.AI 06/17/26, 04:00 AM Papers
Summary
This paper systematically studies how inference-time compute (token budgets, context compaction, repeated submissions) affects frontier LLM performance on challenging benchmarks, demonstrating that scores are protocol-dependent and advocating for evaluations that report capability as a function of inference compute.
arXiv:2606.17930v1 Announce Type: new
Cached at: 06/17/26, 05:39 AM
# How Inference Compute Shapes Frontier LLM Evaluation
Source: [https://arxiv.org/html/2606.17930](https://arxiv.org/html/2606.17930)
###### Abstract

AI evaluations are shifting toward harder tasks that benefit from longer trajectories involving tool use and iterative problem solving\. As a result, performance is increasingly sensitive to the amount and allocation of compute available at test time \(“inference compute”\)\. Yet many evaluations still report performance at a single restrictive budget, meaning that low scores may reflect the evaluation setup rather than the model’s underlying capability\. To test this, we evaluate up to 12 frontier language models on seven challenging benchmarks spanning software engineering, mathematics, medicine, and cybersecurity\. We use a controlled setup combining three simple inference\-scaling interventions: larger token budgets, context compaction, and repeated submission attempts, guided either by the model itself or by minimal correctness feedback\. We find three main results\. First, larger token budgets substantially improve performance on benchmarks across multiple domains, including cybersecurity, FrontierMath, Humanity’s Last Exam, and TerminalBench\. Second, fixed\-budget evaluations can increasingly understate frontier capability as models advance\. Newer models reach higher performance at large budgets, where they unlock harder tasks and solve them more reliably\. Third, benchmarks differ in which inference\-scaling methods help most: repeated submission broadly improves performance, but the value of larger token budgets, external feedback, and parallel attempts varies by benchmark\. Overall, our results show that benchmark scores are protocol\-dependent\. We therefore argue that evaluations should report capability as a function of inference\-time compute, specify protocol choices explicitly, and compare model generations over a large shared compute range at matched budgets, especially in safety\- or policy\-relevant settings\.


## 1 Introduction

As frontier AI benchmarks saturate, evaluations are shifting toward harder, longer\-horizon tasks that benefit from extended trajectories, multi\-step planning, tool use, and interaction with complex environments \(Phan et al\., [2025](https://arxiv.org/html/2606.17930#biba.bib48); Glazer et al\., [2024](https://arxiv.org/html/2606.17930#biba.bib35); Folkerts et al\., [2026](https://arxiv.org/html/2606.17930#biba.bib51); Merrill et al\., [2026](https://arxiv.org/html/2606.17930#biba.bib9); Deng et al\., [2025](https://arxiv.org/html/2606.17930#biba.bib23); Kapoor et al\., [2026](https://arxiv.org/html/2606.17930#bib.bib6)\)\. Performance on these tasks increasingly depends on how much inference\-time compute evaluations allow \(Ord, [2025](https://arxiv.org/html/2606.17930#bib.bib7); Bengio et al\., [2025](https://arxiv.org/html/2606.17930#bib.bib8)\)\. Yet many evaluations still use modest token budgets, give models only one chance to submit a solution, and report a single protocol\-dependent score, the equivalent of evaluating a human expert under severe time pressure\. This risks understating model capability, since a failure may mean the model simply ran out of inference budget rather than that it could not solve the task\.

Here, we systematically study this scaling of performance with inference\-time compute \(hereafter, “inference scaling”\) within a unified experimental framework\. We evaluate six frontier language models spanning multiple generations, released between May 2025 and March 2026, on five challenging benchmarks across software engineering, mathematics, and medicine, and additionally draw on two closely related cybersecurity evaluations from the UK AI Security Institute spanning an overlapping set of 10 models \(UK AI Security Institute, [2026b](https://arxiv.org/html/2606.17930#biba.bib50); Folkerts et al\., [2026](https://arxiv.org/html/2606.17930#biba.bib51); UK AI Security Institute, [2026a](https://arxiv.org/html/2606.17930#biba.bib49)\)\. Across these settings, we apply a consistent inference\-scaling protocol: expanded total token budgets \(one to three orders of magnitude above published benchmark defaults\), context compaction, and unlimited submission attempts with or without minimal correctness feedback\. The techniques are intentionally straightforward so that we establish a lower bound on what simple, general, and reproducible inference\-scaling methods can *elicit*, rather than aiming to maximally elicit each model with benchmark\-specific scaffolding\.

![Refer to caption](https://arxiv.org/html/2606.17930v1/figures/fig1.png)

Figure 1: Inference scaling curves and plateau locations for frontier models tested on seven benchmarks\. \(A\) Cumulative performance as a function of total tokens used \(excluding LLM judge tokens\), with one curve per \(model, benchmark, condition\) on each benchmark: solid lines for oracle score feedback, dashed for no feedback; each condition includes 5 independent trajectories per task\. The two cyber benchmarks use previously collected data under a closely similar oracle\-scored protocol only \(5 trajectories per task\) and are shown with solid lines only\. The y\-axis reports cumulative mean score: continuous scores in $[0,1]$ for *HealthBench*, partial progress for *The Last Ones* as the proportion of 32 steps completed, and binary 0/1 task success otherwise\. The legend orders models by chronological release date within each provider\. Vertical dashed lines mark the typical published token budget for each benchmark \(two lines indicating a typical range\); vertical solid lines mark the token limits imposed in this evaluation\. \(B\) Token budget at which each inference\-scaling curve plateaus, bucketed relative to the typical published budget and the tested range: *Before typical*, *In typical* \(for benchmarks with a range of typical values\), *After typical* \(between the typical budget and our tested limit\), or *Beyond tested* \(no plateau within the tested budget\)\. Points represent models, filled for oracle score feedback and hollow for no feedback on the five main benchmarks; the two cyber benchmarks contribute oracle score feedback points only\. A horizontal line separates Anthropic \(top\) from OpenAI \(bottom\) models\.

We find three main patterns:

1. Inference scaling is substantial but highly benchmark\-dependent\. Some benchmarks, including FrontierMath, TerminalBench, and Humanity’s Last Exam \(HLE\), continue to improve well beyond typical published token budgets, while others show weaker marginal gains under our protocol\.
2. Newer model generations usually achieve higher performance at large budgets, where they unlock harder tasks and solve them more reliably\. These findings suggest that low\-budget evaluations may fail to track progress in the ability to convert additional inference\-time compute into performance, and therefore may fail to elicit capabilities that are visible only at larger budgets\. Fixed\-budget scores thus give an incomplete picture of the performance frontier reachable under broader inference\-time budgets, and this omission may grow as models advance\.
3. Gains from inference scaling do not arise from one universal intervention\. Benchmark scores depend partly on protocol choices about whether and how models can iterate on their solutions, and whether compute is allocated to a single deep trajectory \(“serial” scaling\) or spread across multiple shallower ones \(“parallel” scaling\)\. Repeated submissions materially improve performance on all benchmarks, and feedback on submission correctness matters most where it can guide continued search \(HLE and SWE\-Bench Pro\)\. Parallel scaling is strongest on stateless benchmarks \(HealthBench and HLE, which do not involve a persistent interactive environment\) and weakest on stateful ones\. Together, these results demonstrate that different tasks respond to distinct ways of allocating inference\-time compute, implying that elicitation is partly protocol\-dependent\.

Overall, these findings suggest that frontier capability cannot be fully characterized by a single benchmark score measured under a single inference\-time protocol\. Observed performance depends not only on the model, but also on how much inference\-time compute it is given and how that compute is allocated\. As a result, evaluations should \(i\) report capability as a function of inference\-time compute rather than as a single fixed\-budget number, \(ii\) treat protocol choices as part of the evaluation design and report them explicitly, and \(iii\) control for the compute range and protocol when comparing capability across model generations, particularly in safety\-critical or policy\-relevant contexts\.

## Background

Existing evidence suggests that additional inference\-time compute can improve performance on difficult evaluations, including cybersecurity \(Folkerts et al\., [2026](https://arxiv.org/html/2606.17930#biba.bib51); Meta Superintelligence Labs, [2026](https://arxiv.org/html/2606.17930#bib.bib11)\), software engineering \(Ma et al\., [2025](https://arxiv.org/html/2606.17930#bib.bib12); Ding and Zhang, [2026](https://arxiv.org/html/2606.17930#bib.bib13); Epoch AI and METR, [2026](https://arxiv.org/html/2606.17930#bib.bib14)\), mathematics \(Muennighoff et al\., [2025](https://arxiv.org/html/2606.17930#bib.bib15); Wu et al\., [2024](https://arxiv.org/html/2606.17930#bib.bib16)\), medicine \(Huang et al\., [2025](https://arxiv.org/html/2606.17930#bib.bib17); Byun et al\., [2026](https://arxiv.org/html/2606.17930#bib.bib18)\), and other interactive tasks \(Anthropic, [2026a](https://arxiv.org/html/2606.17930#biba.bib3); Wei et al\., [2025](https://arxiv.org/html/2606.17930#bib.bib20)\)\. Inference scaling can be characterised by a cumulative success curve, which shows performance as a function of tokens consumed rather than at one fixed budget\. Such curves reveal whether success continues to improve as more inference\-time compute is allocated, and differences between frontier models often emerge more clearly at high compute levels\. If performance is still rising across the tested range, the evaluation has measured only part of the achievable performance under that protocol rather than the full limit \(Folkerts et al\., [2026](https://arxiv.org/html/2606.17930#biba.bib51); Epoch AI and METR, [2026](https://arxiv.org/html/2606.17930#bib.bib14)\)\. This is especially important for cross\-generation comparisons, since fixed\-budget evaluations may miss improvements in how newer models use additional inference\-time compute \(UK AI Security Institute, [2026a](https://arxiv.org/html/2606.17930#biba.bib49)\)\.

Weak or absent inference scaling can mean either that additional inference compute genuinely does not help in that setting or that the apparent limit is induced by the evaluation protocol\. Longer trajectories can hurt performance \(Gema et al\., [2025](https://arxiv.org/html/2606.17930#bib.bib21); Laban et al\., [2025](https://arxiv.org/html/2606.17930#bib.bib22)\), and some domains may be intrinsically less responsive to additional inference compute \(Sprague et al\., [2024](https://arxiv.org/html/2606.17930#bib.bib23)\)\. But weak scaling can also result from tight budgets, turn caps, timeouts, poor context management, or limited opportunities for iterative refinement \(Jurkovic, [2026](https://arxiv.org/html/2606.17930#bib.bib24); Merrill et al\., [2026](https://arxiv.org/html/2606.17930#biba.bib9); Sun et al\., [2025](https://arxiv.org/html/2606.17930#bib.bib25)\)\. These protocol choices restrict both serial depth *within* a trajectory and parallel breadth *across* independent trajectories \(Wang et al\., [2023](https://arxiv.org/html/2606.17930#bib.bib26); Snell et al\., [2024](https://arxiv.org/html/2606.17930#bib.bib27)\)\. Evaluations with similar cumulative success curves may in fact permit very different opportunities for search, recovery, or refinement\. Moreover, apparent plateaus in these curves may reflect turn caps or timeouts rather than a true limit to productive inference\-time compute\. Whether an evaluation observes inference scaling or not therefore cannot be interpreted without considering which inference\-time opportunities the protocol permits\. Assessing what additional inference\-time compute enables models to achieve may require a more decomposed analysis of how inference\-time compute is allocated and what forms of search, interaction, or refinement the protocol actually enables \(Marchand et al\., [2026](https://arxiv.org/html/2606.17930#bib.bib28)\)\.

Two gaps limit what can be concluded from current evidence\. First, existing studies rarely characterise performance over a wide enough inference\-compute range\. This is costly but necessary to distinguish near\-saturation from continued improvement\. Both ends of the curve matter: the high\-compute tail is informative about what well\-resourced actors could achieve, while the low\-compute regime is informative about accessibility and less\-resourced misuse\. Second, evaluators seldom apply a sufficiently comparable setup across benchmarks or model generations within a single framework\. This makes it hard to separate differences due to models and tasks from differences due to scaffolding, tools, and protocol\. Comparability does not, however, guarantee full elicitation: even under a shared framework, weak scaling may still reflect genuine capability limits or a protocol that fails to elicit capabilities the model could exhibit under a different allocation of inference\-time compute\.

## 2 Methods

Our main experiment follows a fully crossed design with six frontier models \([Table 1](https://arxiv.org/html/2606.17930#S2.T1)\) evaluated on five non\-cyber benchmarks under two feedback conditions, using one shared ReAct\-style scaffold \(Yao et al\., [2023](https://arxiv.org/html/2606.17930#bib.bib29)\) implemented in Inspect AI \(UK AI Security Institute, [2024a](https://arxiv.org/html/2606.17930#bib.bib30)\)\. For each task in each \(benchmark, model, condition\) cell, we run 5 independent trajectories\. Per\-benchmark dataset loading, sandbox configuration, scoring details, and additional protocol specifics are reported in [Sections A\.1\.1](https://arxiv.org/html/2606.17930#A1.SS1.SSS1), [A\.1\.4](https://arxiv.org/html/2606.17930#A1.SS1.SSS4) and [A\.1\.3](https://arxiv.org/html/2606.17930#A1.SS1.SSS3)\.

We also include two previously collected cyber benchmarks \(UK AI Security Institute, [2026b](https://arxiv.org/html/2606.17930#biba.bib50); Folkerts et al\., [2026](https://arxiv.org/html/2606.17930#biba.bib51); UK AI Security Institute, [2026a](https://arxiv.org/html/2606.17930#biba.bib49)\) in the inference\-scaling analyses only\. These are not part of the fully crossed main design and cover a different set of models spanning a similar release\-date range \([Table 1](https://arxiv.org/html/2606.17930#S2.T1)\)\. They were run with 5 independent trajectories per task under a near\-identical high\-budget protocol described below in [Section 2\.2](https://arxiv.org/html/2606.17930#S2.SS2)\.

### 2\.1 Models

For the main benchmark suite, we evaluate six frontier models spanning three generations \(from May 2025 to March 2026\) drawn from two model families \([Table 1](https://arxiv.org/html/2606.17930#S2.T1)\)\. Within each family, three successive releases \(Opus 4 → 4\.5 → 4\.6 and GPT\-5 → 5\.2 → 5\.4\) enable controlled cross\-generation comparisons while holding scaffold, prompts, and inference\-time compute budgets fixed\. All models run with high reasoning effort \(`xhigh` for Anthropic, `high` for OpenAI\) and a reasoning\-token budget of 16,000 per generation call, the budget at which Humanity’s Last Exam accuracy is reported to peak \(Center for AI Safety et al\., [2026](https://arxiv.org/html/2606.17930#bib.bib31)\)\. This reasoning\-token budget is separate from the per\-trajectory total budget and governs each individual model call\.

Data for the two cyber benchmarks span a different set of models: 10 for the capture\-the\-flag \(CTF\) suite \(April 2025 to April 2026\) and 5 for The Last Ones \(September 2025 to April 2026\)\. Both sets include Mythos Preview, a frontier model checkpoint that Anthropic provided to the UK AI Security Institute \(AISI\) for evaluation \(Anthropic, [2026a](https://arxiv.org/html/2606.17930#biba.bib3)\)\. AISI tested two such checkpoints; the one we evaluate is the newer of the two, postdating the checkpoint in AISI’s initial Mythos Preview evaluation \(UK AI Security Institute, [2026b](https://arxiv.org/html/2606.17930#biba.bib50)\) and matching the more recent reporting in which this newer checkpoint was the first model to solve both AISI cyber ranges end\-to\-end \(UK AI Security Institute, [2026c](https://arxiv.org/html/2606.17930#bib.bib32)\)\.

| Model | Release | TB | SBP | FM | HB | HLE | CTF | TLO |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| o3 | Apr 2025 | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
| Opus 4 | May 2025 | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| GPT\-5 | Aug 2025 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
| Sonnet 4\.5 | Sep 2025 | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ |
| Opus 4\.5 | Nov 2025 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
| GPT\-5\.1 Codex | Nov 2025 | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ |
| GPT\-5\.2 | Dec 2025 | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| GPT\-5\.2 Codex | Dec 2025 | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
| Opus 4\.6 | Feb 2026 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| GPT\-5\.3 Codex | Feb 2026 | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
| GPT\-5\.4 | Mar 2026 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Mythos Preview | Apr 2026 | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ |

Table 1: Model coverage across the seven benchmarks\. Checkmarks indicate the \(model, benchmark\) cells we evaluated\. Models are ordered by release date\. The five main benchmarks \(TB: TerminalBench; SBP: SWE\-Bench Pro; FM: FrontierMath; HB: HealthBench; HLE: Humanity’s Last Exam\) form the main benchmark suite and enter all downstream analyses\. The two cyber benchmarks \(CTF: Capture the Flag; TLO: The Last Ones\) enter the inference\-scaling analyses only\.

### 2\.2 Inference\-scaling techniques

We apply three deliberately simple inference\-scaling techniques uniformly across every \(benchmark, model, condition\) cell:

1. Expanded total token budgets of 5M–30M tokens per trajectory \([Table 2](https://arxiv.org/html/2606.17930#S2.T2)\), sized to task complexity and one to three orders of magnitude above typical benchmark defaults\.
2. Context compaction, which replaces earlier turns with a model\-generated summary to enable serial scaling beyond the nominal context\-window size\. We adopt Inspect AI’s default summary\-based strategy \(UK AI Security Institute, [2024a](https://arxiv.org/html/2606.17930#bib.bib30)\), triggered when the running context exceeds 130k tokens \(about 65% of the smallest 200k context window across models\)\.
3. Iterative resubmission with a hard cap of 999 submissions per trajectory, allowing the model either to refine its previous answer or to try a substantially different approach between submissions\. To reduce unproductive loops of near\-identical submissions, we also employ a lightweight repetition guard that terminates a trajectory when a separate LLM judge finds three or more consecutive submissions semantically equivalent \([Section A\.1\.3](https://arxiv.org/html/2606.17930#A1.SS1.SSS3)\)\.

These techniques all operate within a single trajectory and target *serial* inference scaling; we study *parallel* inference scaling separately in [Section 3\.3\.2](https://arxiv.org/html/2606.17930#S3.SS3.SSS2), by reanalysing the same trajectories under different fixed\-total\-budget allocations\.
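As a rough illustration of how the three serial techniques interact inside one trajectory, the sketch below encodes the compaction trigger and the stopping rules described above. It is not the Inspect AI implementation: `Trajectory`, `should_compact`, `repetition_guard_fires`, and `stop_reason` are hypothetical names, and the `equivalent` callable is a stand-in for the separate LLM judge. Only the numeric constants (130k-token trigger, 999-submission cap, three consecutive equivalent submissions) come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

COMPACTION_THRESHOLD = 130_000  # running-context tokens that trigger compaction
MAX_SUBMISSIONS = 999           # hard cap on submissions per trajectory
REPEAT_LIMIT = 3                # consecutive equivalent submissions that end a run


@dataclass
class Trajectory:
    """Hypothetical bookkeeping for one trajectory."""
    token_budget: int                 # per-trajectory total budget, e.g. 5M to 30M
    tokens_used: int = 0              # input + output + reasoning tokens so far
    context_tokens: int = 0           # size of the current running context
    submissions: List[str] = field(default_factory=list)


def should_compact(traj: Trajectory) -> bool:
    """Compaction is triggered once the running context exceeds the threshold."""
    return traj.context_tokens > COMPACTION_THRESHOLD


def repetition_guard_fires(
    submissions: List[str], equivalent: Callable[[str, str], bool]
) -> bool:
    """True if the last REPEAT_LIMIT submissions are pairwise-consecutively equivalent."""
    if len(submissions) < REPEAT_LIMIT:
        return False
    recent = submissions[-REPEAT_LIMIT:]
    return all(equivalent(a, b) for a, b in zip(recent, recent[1:]))


def stop_reason(
    traj: Trajectory, equivalent: Callable[[str, str], bool]
) -> Optional[str]:
    """Return why the trajectory should stop, or None if it may continue."""
    if traj.tokens_used >= traj.token_budget:
        return "budget_exhausted"
    if len(traj.submissions) >= MAX_SUBMISSIONS:
        return "submission_cap"
    if repetition_guard_fires(traj.submissions, equivalent):
        return "repetition_guard"
    return None
```

In this sketch a trivial string comparison can play the role of the judge for testing; in the paper's setup that role is filled by an LLM judging semantic equivalence.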

### 2\.3 Feedback conditions

Whether a model can productively use a large token budget depends strongly on the feedback available during the trajectory \(Balachandran et al\., [2025](https://arxiv.org/html/2606.17930#bib.bib33)\)\. We therefore run two conditions for each \(model, benchmark\) pair:

1. No feedback\. The model receives only an ambiguous acknowledgement after each submission \(*“Your answer has been saved”*\) with no indication of correctness, and the trajectory continues until the budget is exhausted or the repetition guard fires\. This isolates the model’s own ability to judge when it has solved a task\.
2. Oracle score feedback\. The model is told whether each submission is correct \(or, for HealthBench, given a partial\-credit score\)\. The trajectory terminates on the first fully correct submission, as the model is thereafter aware that it has solved the task\.

Both conditions share an adaptive continuation prompt that invites the agent either to refine its previous answer or to try a substantially different approach\. The full prompt text and benchmark\-specific feedback variants are reproduced in [Section A\.1\.4](https://arxiv.org/html/2606.17930#A1.SS1.SSS4)\. Each \(benchmark, model, task\) cell contains 5 trajectories per condition \(10 in total\)\.
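The difference between the two conditions can be summarised as a small decision rule. The sketch below is an illustration only: `feedback_after_submission` and the condition labels are hypothetical, and every message other than "Your answer has been saved" is placeholder wording (the actual prompts are in Section A.1.4). It also assumes that a score of 1.0 counts as "fully correct".

```python
from typing import Tuple


def feedback_after_submission(
    condition: str, score: float, continuous: bool = False
) -> Tuple[str, bool]:
    """Return (message shown to the model, whether the trajectory terminates).

    condition  -- "no_feedback" or "oracle" (hypothetical labels)
    score      -- graded score of the submission in [0, 1]
    continuous -- True for a partial-credit benchmark such as HealthBench
    """
    if condition == "no_feedback":
        # No correctness signal; the run continues until the budget is
        # exhausted or the repetition guard fires.
        return "Your answer has been saved", False
    if condition == "oracle":
        if score >= 1.0:
            # First fully correct submission ends the trajectory.
            return "Correct.", True  # placeholder wording
        if continuous:
            return f"Score: {score:.2f}", False  # placeholder wording
        return "Incorrect.", False  # placeholder wording
    raise ValueError(f"unknown condition: {condition!r}")
```

Under this rule, only the oracle condition can end a trajectory early, which is one reason the two conditions can consume very different numbers of tokens on the same task.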

Unless otherwise stated, token budgets are all\-inclusive for the target model and cover input, output, and reasoning tokens; LLM judge tokens are excluded from these budgets and from the token counts shown in scaling plots, but we report when judges are used\. A 90\-minute timeout is imposed on each individual model generation call\. Any trajectories hitting this timeout were re\-run, so no trajectories in the final dataset are affected\.

The cyber benchmarks were run under a closely matching oracle\-feedback protocol, where models were told if they had successfully completed the task or not, without any partial\-progress signals\.

### 2\.4 Benchmarks and scoring

We select five benchmarks that are both challenging for current frontier models and diverse in domain and task structure \([Table 2](https://arxiv.org/html/2606.17930#S2.T2)\)\. Three are “stateful” benchmarks requiring environment state persistence across multi\-turn tool use: TerminalBench 2\.0 \(Merrill et al\., [2026](https://arxiv.org/html/2606.17930#biba.bib9)\), SWE\-Bench Pro \(Deng et al\., [2025](https://arxiv.org/html/2606.17930#biba.bib23)\), and FrontierMath \(Glazer et al\., [2024](https://arxiv.org/html/2606.17930#biba.bib35)\)\. Two are “stateless” knowledge\-and\-reasoning benchmarks: HealthBench \(Hard\) \(Arora et al\., [2025](https://arxiv.org/html/2606.17930#biba.bib39)\) and Humanity’s Last Exam \(HLE\) \(Phan et al\., [2025](https://arxiv.org/html/2606.17930#biba.bib48)\) with multiple\-choice questions filtered out\. Together they measure terminal\-based task execution, real\-world software engineering, expert\-level mathematics, clinical reasoning, and cross\-disciplinary expert knowledge\. We additionally reuse data from two cyber benchmarks: Cyber CTFs \(71 capture\-the\-flag tasks\) and The Last Ones \(a single long\-horizon cyber range scored by 32 milestones toward a final objective\)\. Because these data were not collected under the controlled framework of the main experiment, we use them only for the inference\-scaling curves and exclude them from analyses of submission behaviour and feedback\. Per\-benchmark loading, grading, sandbox configuration, and the cyber data collection protocol are given in [Section A\.1\.1](https://arxiv.org/html/2606.17930#A1.SS1.SSS1)\.

The stateful benchmarks are given `bash` and `python` tools, with a custom `submit_answer` tool for FrontierMath’s code\-based submissions\. The stateless benchmarks are tool\-less and submit via a generic `submit` tool\. Scoring is by programmatic verification for TerminalBench \(bundled unit tests\), SWE\-Bench Pro \(bundled test harness\), FrontierMath \(sandboxed code execution against per\-problem reference implementations\), Cyber CTFs \(flag capture\), and The Last Ones \(milestone progress\), while HLE and HealthBench are graded by a separate LLM judge \(GPT\-4o\-mini at temperature 0\)\. All benchmarks use binary scores, except for HealthBench, which is scored continuously against per\-conversation physician\-designed rubrics, and The Last Ones, which is scored by milestone progress\. For each benchmark, we use a canonical task set of up to 100 tasks fixed deterministically across all runs\. Together with our expanded\-budget protocol, this subsampling means absolute scores on the non\-cyber benchmarks are not directly comparable to published benchmark results\.

### 2\.5 Typical\-budget reference points

To quantify the uplift from typical published budgets to our expanded budgets \([Table 3](https://arxiv.org/html/2606.17930#S3.T3)\), we define a benchmark\-specific reference for what counts as a typical stopping point\. Published evaluations express limits in different units \(output tokens, turn caps, or wall\-clock time\), which we convert to approximate per\-trajectory total\-token budgets on a common scale\. For HLE, HealthBench, and FrontierMath we use publicly reported token limits \(64k, 16k, and 1M total tokens respectively\)\. For TerminalBench and SWE\-Bench Pro, which do not report token budgets directly, we estimate token\-equivalent stopping points from our own trajectories by measuring cumulative tokens consumed at the point at which the published wall\-clock or turn limit would have applied\. These converted values are used only as descriptive reference points, to mark typical published budgets in [Figure 1](https://arxiv.org/html/2606.17930#S1.F1)A and to quantify uplift in [Table 3](https://arxiv.org/html/2606.17930#S3.T3)\. Full sources, conversions, and resulting estimates are given in [Section A\.1\.2](https://arxiv.org/html/2606.17930#A1.SS1.SSS2)\.
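The conversion for TerminalBench and SWE-Bench Pro can be pictured as replaying a logged trajectory and reading off the cumulative token count at the moment a published turn or wall-clock limit would have cut it off. The sketch below is one possible way to do this and is not taken from the paper: the `Step` record layout, the function names, and in particular the use of a median across trajectories are all assumptions (the actual procedure is described in Section A.1.2).

```python
from statistics import median
from typing import Optional, Sequence, Tuple

# Hypothetical log record: (turn_index, elapsed_seconds, cumulative_tokens)
Step = Tuple[int, float, int]


def tokens_at_limit(
    steps: Sequence[Step],
    max_turns: Optional[int] = None,
    max_seconds: Optional[float] = None,
) -> int:
    """Cumulative tokens at the last logged step that still fits within the limit."""
    tokens = 0
    for turn, elapsed, cumulative in steps:
        if max_turns is not None and turn > max_turns:
            break
        if max_seconds is not None and elapsed > max_seconds:
            break
        tokens = cumulative
    return tokens


def typical_budget_estimate(
    trajectories: Sequence[Sequence[Step]],
    max_turns: Optional[int] = None,
    max_seconds: Optional[float] = None,
) -> float:
    """Aggregate per-trajectory stopping points (median is an assumed choice)."""
    points = [tokens_at_limit(s, max_turns, max_seconds) for s in trajectories]
    return median(points)
```

For example, a 250-turn cap would be passed as `max_turns=250` and a 3.3-hour limit as `max_seconds=3.3 * 3600`. Trajectories that finish before the limit simply contribute their final token count.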

Table 2: Benchmark characteristics and specifications\. Token cap refers to the per\-trajectory total token budget for the target model \(input, output, and reasoning tokens\), excluding LLM judge tokens\.

## 3 Results

We organise the results around our three main findings:

1. [Section 3\.1](https://arxiv.org/html/2606.17930#S3.SS1), inference scaling curves: Scaling with larger token budgets is substantial but varies sharply across benchmarks, and we characterise where the cumulative curves plateau\.
2. [Section 3\.2](https://arxiv.org/html/2606.17930#S3.SS2), mechanistic decomposition: The cross\-generational gains visible in these curves come mainly from greater task reach and reliability rather than improved efficiency, which we establish through a task\-level decomposition\.
3. [Section 3\.3](https://arxiv.org/html/2606.17930#S3.SS3), feedback and iteration: These gains do not arise from any single intervention: benchmarks respond differently to repeated submission with feedback and to serial versus parallel allocation of a fixed compute budget\.

Throughout the results that follow, weak scaling on a given benchmark should be read as a property of our protocol — it does not rule out larger gains under different scaffolds, tools, or compute allocations\.

### 3\.1 Inference scaling is substantial but varies across benchmarks

To examine performance as a function of total tokens consumed \(excluding LLM judge tokens\), we compute inference scaling curves as follows\. For trajectory $i$, let $s_i$ denote its final credited score \(1 if any submission is correct and 0 otherwise for the binary benchmarks, and the maximum submission score for HealthBench\), and let $\kappa_i$ be the total token count at which trajectory $i$ first attains $s_i$\. The trajectory inference curve $S_i(t)$ is zero until $s_i$ is first attained and equal to $s_i$ thereafter:

$$S_i(t) := s_i \, \mathbf{1}\!\left(\kappa_i \leq t\right), \tag{1}$$

with $S_i(t) \equiv 0$ when $s_i = 0$\. The aggregate inference curve is the mean of $S_i(t)$ over a specified set of $N$ trajectories,

$$S_{\mathrm{agg}}(t) := N^{-1} \sum_{i} S_i(t). \tag{2}$$

[Figure 1](https://arxiv.org/html/2606.17930#S1.F1)A plots the resulting inference scaling curves, with separate curves for the no feedback and oracle score feedback conditions\. The quantitative summaries in this subsection pool trajectories across both feedback conditions and across tasks, characterising overall performance under our full evaluation protocol rather than any single feedback setting\.
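To make the two curve definitions concrete, the following minimal sketch evaluates the per-trajectory step function and its mean at a few budgets. The representation of a trajectory as a `(s_i, kappa_i)` pair, the names `trajectory_curve` and `aggregate_curve`, and the example numbers are illustrative assumptions, not data from the paper.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical record: (final credited score s_i, tokens kappa_i at which it
# was first attained). kappa_i is None for trajectories that never score.
ScoredTrajectory = Tuple[float, Optional[int]]


def trajectory_curve(s_i: float, kappa_i: Optional[int], t: int) -> float:
    """S_i(t): zero until the final score is first attained, s_i afterwards."""
    if s_i == 0 or kappa_i is None:
        return 0.0
    return s_i if kappa_i <= t else 0.0


def aggregate_curve(trajectories: Sequence[ScoredTrajectory], t: int) -> float:
    """S_agg(t): mean of S_i(t) over the given set of trajectories."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    total = sum(trajectory_curve(s, k, t) for s, k in trajectories)
    return total / len(trajectories)


# Five made-up binary trajectories for one (benchmark, model) cell.
example = [
    (1.0, 200_000),
    (0.0, None),
    (1.0, 3_000_000),
    (1.0, 50_000),
    (0.0, None),
]

for budget in (100_000, 1_000_000, 5_000_000):
    print(budget, aggregate_curve(example, budget))
```

On this toy data the aggregate curve rises from 0.2 at 100k tokens to 0.4 at 1M and 0.6 at 5M. The uplift between a typical budget and an expanded budget is then the difference between two evaluations of `aggregate_curve`.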

#### 3\.1\.1 Typical budgets can miss substantial additional gains

##### Software engineering benchmarks show little headroom beyond typical budgets\.

For the two software engineering benchmarks, which are already commonly evaluated at relatively large budgets, extending further yields only limited improvement \([Table 3](https://arxiv.org/html/2606.17930#S3.T3)\)\. SWE\-Bench Pro is typically evaluated at up to about 16M total tokens \(equivalent to 250 turns\)\. Nearly doubling this budget to our 30M cap adds only $+0.3 \pm 0.3$ percentage points across models\. TerminalBench is typically evaluated at up to about 7\.3M total tokens \(3\.3\-hour limit\), and extending to 10M adds only $+1.3 \pm 1.0$ percentage points\. This pattern indicates that, for these benchmarks, the typical published budget ranges are already fairly large and that extending them further buys only modest additional performance at substantial per\-run compute cost \(an extra 14M tokens for SWE\-Bench Pro and about 2\.7M for TerminalBench\)\. This should be interpreted as limited responsiveness to the specific serial scaling interventions tested here, not as evidence that alternative allocations of inference compute would necessarily yield similarly small gains\.

##### FrontierMath and HLE show substantial headroom beyond typical budgets\.

In contrast, some of the benchmarks with smaller typical budgets see a larger effect from additional compute\. FrontierMath’s commonly used budget of 1M total tokens \(Burnham, [2025](https://arxiv.org/html/2606.17930#biba.bib27)\) is itself an order\-of\-magnitude increase over the earlier 100k\-token scaffold \(Glazer et al\., [2024](https://arxiv.org/html/2606.17930#biba.bib35)\), and extending from 1M to 10M adds a further $+11.7 \pm 11.0$ percentage points on average\. HLE also continues to improve well beyond its typical 64k\-output\-token budget, gaining $+9.3 \pm 12.0$ points up to our 5M cap\. This uplift is larger under oracle feedback \($+12.4 \pm 14.6$ points\) than under no feedback \($+6.1 \pm 8.9$\), consistent with the stronger submission\-feedback effects on HLE reported in [Section 3\.3\.1](https://arxiv.org/html/2606.17930#S3.SS3.SSS1)\.

##### HealthBench shows minimal headroom despite a small typical budget\.

The exception is HealthBench, which shows little change over the same comparison: only $+0.3 \pm 0.4$ points between a typical 16k-output-token budget and our much larger 10M cap. Under our scaffold and grading set-up, this benchmark therefore shows limited sensitivity to the serial inference-scaling interventions tested here within the observed range. This does not rule out larger gains under different scaffolds, stopping rules, or width–depth allocations.

Table 3: Performance gain from each benchmark’s typical published budget to the evaluated budget used in this study, shown separately for pooled trajectories, oracle score feedback trajectories, and no feedback trajectories. Gains are averaged across models. Where the published budget spans a range (TerminalBench, SWE-Bench Pro), we report both the lower and upper bounds and compute the gain from the upper bound. Values are reported in percentage points, except for HealthBench, where the native benchmark point scale is used.

Taken together, these comparisons show that some typical published budgets already capture most of the gains visible under our protocol, whereas others leave substantial performance unrealised.

#### 3.1.2 Evidence of plateauing differs across benchmarks

The cumulative curves in [Figure 1](https://arxiv.org/html/2606.17930#S1.F1)A suggest that some benchmark–model–condition curves begin to plateau within the tested range, while others continue to improve at the largest budget we evaluate. Throughout, we use “diminishing returns” descriptively to mean a local plateau or weak remaining growth at the right edge of the tested range, on a log-$x$ scale. To summarise this pattern, we estimate the local rate of improvement at the cap as

$$g_{\mathrm{cap}} \;=\; \frac{y(\mathrm{cap}) - y(\mathrm{cap}/2)}{\log_{10}(2)}, \tag{3}$$

expressed in percentage points per $10\times$ increase in total tokens. As a descriptive convention, “weak remaining growth” means $g_{\mathrm{cap}} < 1$ percentage point per $10\times$ increase. Under this definition, we observe three broad patterns ([Figure 1](https://arxiv.org/html/2606.17930#S1.F1)B).
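As a concrete illustration of Equation (3), the following minimal Python sketch computes the local growth rate from two points on a cumulative curve and applies the 1-point threshold. The function names and the example scores are hypothetical and are not taken from the reported results.

```python
import math


def local_growth_at_cap(score_at_cap: float, score_at_half_cap: float) -> float:
    """Local rate of improvement at the cap, in percentage points per 10x tokens.

    Both scores are in percentage points. The two budgets differ by a factor
    of 2, which is log10(2) decades on the log-token axis.
    """
    return (score_at_cap - score_at_half_cap) / math.log10(2)


def has_weak_remaining_growth(g_cap: float, threshold: float = 1.0) -> bool:
    """Descriptive convention from the text: g_cap below 1 point per 10x."""
    return g_cap < threshold


# Hypothetical scores (percentage points) at cap/2 and at the cap.
g_cap = local_growth_at_cap(score_at_cap=62.0, score_at_half_cap=61.4)
print(round(g_cap, 2), has_weak_remaining_growth(g_cap))
```

Under this convention, a gain of 0.6 points over the final doubling corresponds to roughly 2 points per $10\times$ increase, so the 1-point threshold is crossed at about 0.3 points per doubling.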

##### TerminalBench and cyber benchmarks show continued growth across all or most tested models\.

Three benchmarks continue to rise for most tested models within the observed range: TerminalBench \(all six models above threshold at our 10M cap\), Cyber CTFs \(all ten models\), and The Last Ones \(three of five models\)\. TerminalBench and the Cyber CTFs are the clearest cases where the tested range still appears insufficient to reveal compute\-saturated performance\. On The Last Ones, the oldest and newest models — Sonnet 4\.5 and Mythos Preview — plateau within the tested range, likely for opposite reasons: Sonnet 4\.5 appears to hit a capability floor, whereas Mythos Preview plateaus at high \(but not maximal\) average milestone completion\.

##### FrontierMath, HLE, and SWE\-Bench Pro show mixed evidence of plateauing\.

Three benchmarks show mixed signs of plateauing, with different patterns across models. FrontierMath shows the clearest generational pattern, where the two oldest models (Opus 4, GPT-5) plateau earlier, while later models generally continue improving to larger budgets. This split is somewhat weaker under the no feedback condition ([Figure 1](https://arxiv.org/html/2606.17930#S1.F1)B). HLE shows the clearest provider-level split, with the GPT family plateauing earlier than the Opus family. Under no feedback, all three GPT-family models plateau before the typical budget, whereas the Opus models plateau only after the typical budget or beyond the tested range. Oracle score feedback shifts plateau locations later for newer models in both families: GPT-5.4 moves past the typical budget, and two Opus models (Opus 4 and Opus 4.6) remain beyond the tested range. SWE-Bench Pro has the least structured pattern: GPT-5 and GPT-5.2 are the only OpenAI models above threshold in both feedback conditions (though visual inspection suggests their curves likely flatten shortly beyond the tested range), while Anthropic’s Opus 4.5 is above threshold only under no feedback and Opus 4.6 only under oracle feedback ([Figure 1](https://arxiv.org/html/2606.17930#S1.F1)A).

##### HealthBench shows near diminishing returns across all tested models\.

HealthBench is the only benchmark on which all six tested models plateau within the typical budget of 16k output tokens. Under our scaffold and grading set-up, this benchmark therefore appears close to diminishing returns well within a typical evaluation budget.

Overall, these findings indicate that diminishing returns from additional inference\-time compute vary substantially across benchmarks, even under a controlled, shared inference\-scaling protocol\. Whether these inference scaling curves plateau within the tested range can also depend on the model family, model generation, and feedback condition\. Additional curve\-level summaries reported in[SectionA\.2\.1](https://arxiv.org/html/2606.17930#A1.SS2.SSS1)show the same broad pattern: newer models achieve higher performance, begin to succeed on tasks at lower token budgets, and — except on HealthBench — gain performance more rapidly after onset\.

### 3.2 Generational gains are driven mainly by reach and reliability rather than efficiency

The aggregate curve-level patterns above do not reveal whether later generations solve *new* tasks, solve the *same* tasks with fewer tokens, or simply solve already-solvable tasks more consistently across repeated trajectories. To separate these possibilities, we decompose improvement at the task level into three components:

1. *Reach*: the proportion of tasks a model reliably *unlocks*, meaning that the task is solved at some point within at least two independent trajectories (ruling out lucky one-off successes).
2. *Efficiency*: how many output tokens are needed to solve unlocked tasks.
3. *Reliability*: the proportion of trajectories per unlocked task in which a solution is found.

The analysis covers the five main benchmarks plus Cyber CTFs; The Last Ones is excluded because it contains only a single long\-horizon task\. Task difficulty is defined as the task’s median solve rate across models \(higher = easier\)\. For the binary benchmarks, a trajectory solves a task if at least one submission receives a score of 1; for HealthBench, submission scores are first binarised using a global median split at 0\.38 \(the median score achieved across all trajectories\)\. Each component is quantified per benchmark by regressing its outcome — the unlock indicator for reach \(a linear probability model\), log tokens\-to\-solve for efficiency, and solve rate for reliability — on model generation, task difficulty, and their interaction\. Full model specifications are in[SectionA\.3\.1](https://arxiv.org/html/2606.17930#A1.SS3.SSS1)\.
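Schematically, the fixed-effects part of each per-benchmark regression described above can be written as follows. This is only a sketch: the indices $m$ (model) and $t$ (task), the symbols $g_m$ (model generation) and $d_t$ (task difficulty), the intercept $\beta_0$, the difficulty coefficient $\beta_{\mathrm{diff}}$, and the error term are notation assumed here for illustration, and the complete specifications are those given in Section A.3.1.

```latex
y_{m,t} \;=\; \beta_0
  \;+\; \beta_{\mathrm{gen}}\, g_m
  \;+\; \beta_{\mathrm{diff}}\, d_t
  \;+\; \beta_{\mathrm{int}}\, g_m\, d_t
  \;+\; \varepsilon_{m,t}
```

Here $y_{m,t}$ stands for the unlock indicator (reach), log tokens-to-solve (efficiency), or solve rate (reliability), depending on the component being analysed.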

Efficiency and reliability are evaluated only on tasks that a given model unlocks, rather than on a single task set shared by all models\. These analyses therefore describe performance conditional on reach: they capture how models behave on tasks they can solve at least once\. Because newer models often unlock harder tasks than earlier models, and harder tasks may require more tokens, comparisons of efficiency across generations may understate true gains in token efficiency\. To address this, we conduct a sensitivity analysis restricted to a balanced panel of tasks unlocked by all models and find the qualitative pattern is unchanged, indicating that this compositional effect does not drive the main result \([SectionA\.3\.3](https://arxiv.org/html/2606.17930#A1.SS3.SSS3)\)\.
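The three components can be illustrated with a minimal Python sketch for a single model. It assumes a hypothetical data layout in which each task maps to one entry per trajectory (output tokens used up to the first solve, or `None` if unsolved), and it summarises efficiency by the per-task median tokens-to-solve, which is a simplification of the regression on log tokens-to-solve used in the analysis.

```python
from statistics import median
from typing import Dict, List, Optional, Tuple

# Hypothetical input for one model: for each task, the output tokens used up to
# the first solve in each independent trajectory, or None if never solved.
Trajectories = Dict[str, List[Optional[int]]]


def decompose(
    trajectories: Trajectories, min_solves: int = 2
) -> Tuple[float, Dict[str, float], Dict[str, float]]:
    """Return (reach, per-task reliability, per-task median tokens-to-solve)."""
    # A task counts as unlocked only if at least `min_solves` trajectories solve it.
    unlocked = {
        task: runs
        for task, runs in trajectories.items()
        if sum(r is not None for r in runs) >= min_solves
    }
    reach = len(unlocked) / len(trajectories)
    # Reliability and efficiency are defined on unlocked tasks only.
    reliability = {
        task: sum(r is not None for r in runs) / len(runs)
        for task, runs in unlocked.items()
    }
    tokens_to_solve = {
        task: median(r for r in runs if r is not None)
        for task, runs in unlocked.items()
    }
    return reach, reliability, tokens_to_solve


example = {
    "task_a": [1200, None, 900, 1500, None],
    "task_b": [None, None, 40000, None, None],  # one lucky solve: not unlocked
    "task_c": [None, None, None, None, None],
}
reach, reliability, tokens_to_solve = decompose(example)
```

In this toy example only `task_a` is unlocked, so reliability and tokens-to-solve are reported for that task alone, which mirrors the conditioning on reach described above.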

![Refer to caption](https://arxiv.org/html/2606.17930v1/figures/fig2.png)

Figure 2: Task-level decomposition of generational gains. (A) Reach increases as newer models unlock harder tasks. Per-task scatter. $x$-axis: release date of the earliest model to unlock the task (solving it in at least 2 of 5 attempts). $y$-axis: per-task difficulty (1 minus the cross-model median solve rate of task $t$; higher = harder). The shaded region beneath the frontier is filled with a difficulty-scaled colour gradient, green for easy and red for hard, reflecting the space of task difficulties unlocked as of each release date. The dotted step line traces the running maximum task difficulty unlocked over time, that is, the hardest task any model has solved as of each release date. (B) Efficiency gains are benchmark-dependent. Output tokens per solve over model release date, restricted to unlocked (model, task) cells. Lines are simple-slope predictions from a mixed-effects regression fit on continuous task difficulty (see [Section A.3.1](https://arxiv.org/html/2606.17930#A1.SS3.SSS1)), evaluated at the within-benchmark median split used for visualisation (easiest half in green, hardest half in red); they are therefore not direct fits to the plotted binned medians. Shaded bands: SEM. Dots: observed (model, bin) medians of tokens-to-solve over unlocked tasks, with error bars at the 25th and 75th percentiles. (C) Reliability gains are more widespread. Per-benchmark heat map of within-model mean solve rate over unlocked tasks. Rows: five within-benchmark task-difficulty-quantile bins (hardest at top, easiest at bottom). Columns: models in release order. Cells where the model has no unlocked tasks in that difficulty bin are rendered in grey.

Table 4: Summary of task-level generational effects. For each component, $\beta_{\mathrm{gen}}$ is the main effect of model generation and $\beta_{\mathrm{int}}$ its interaction with task difficulty, estimated per benchmark; full specifications are given in [Section A.3.1](https://arxiv.org/html/2606.17930#A1.SS3.SSS1). Reach coefficients come from a linear probability model for whether a model unlocks a task, so positive $\beta_{\mathrm{gen}}$ indicates that later models unlock more tasks. For efficiency, negative $\beta_{\mathrm{gen}}$ indicates that later models use fewer output tokens to solve unlocked tasks. For reliability, positive $\beta_{\mathrm{gen}}$ indicates that later models solve unlocked tasks more consistently across trajectories. Higher difficulty values denote easier tasks, so negative $\beta_{\mathrm{int}}$ in the reach and reliability analyses indicates gains concentrated on harder tasks. ${}^{*}p<0.05$, ${}^{**}p<0.01$, ${}^{***}p<0.001$.

#### 3.2.1 Reach increases with model generation, often more so on harder tasks

For each benchmark, later model generations unlock a larger proportion of tasks\. This is indicated by a positive, significant main effect of model generation on the probability of unlocking a task on all six benchmarks \([Table4](https://arxiv.org/html/2606.17930#S3.T4)\), with the largest effect on FrontierMath and Cyber CTFs\. At the task level, the tasks first unlocked by newer models also tend to be harder \([Figure2](https://arxiv.org/html/2606.17930#S3.F2)A\)\. The generation\-by\-difficulty interaction is negative — indicating that reach gains are concentrated on harder tasks — on five of six benchmarks, and significantly so on Cyber CTFs, FrontierMath, and TerminalBench\. HealthBench is the exception, showing no such concentration\.

#### 3.2.2 Efficiency gains are uneven across benchmarks and conditional on reach

Newer models use fewer output tokens to solve unlocked tasks on four of six benchmarks, indicated by a negative, significant main effect of model generation that is largest on Cyber CTFs and HealthBench and smaller but still significant on TerminalBench and SWE\-Bench Pro \([Table4](https://arxiv.org/html/2606.17930#S3.T4);[Figure2](https://arxiv.org/html/2606.17930#S3.F2)B\)\. FrontierMath and HLE show no significant generation effect on efficiency\. Collectively, these results indicate that token\-efficiency improvements are present but uneven across benchmarks\.

Because efficiency is estimated on unlocked *(model, task)* cells, it should be interpreted as conditional on reach rather than on a fixed common task set. Re-fitting on only those tasks unlocked by all models gives similar generation effects on Cyber CTFs, TerminalBench, SWE-Bench Pro, and HealthBench, but less stable generation-by-difficulty interactions ([Section A.3.1](https://arxiv.org/html/2606.17930#A1.SS3.SSS1)). Generational improvement therefore appears to expand the set of solvable tasks more consistently than it reduces the token cost of solving tasks already within reach.

#### 3.2.3 Reliability gains are more widespread

Reliability improves more consistently than efficiency, with newer models solving unlocked tasks more often across repeated trajectories on all benchmarks except HealthBench ([Table 4](https://arxiv.org/html/2606.17930#S3.T4); [Figure 2](https://arxiv.org/html/2606.17930#S3.F2)C). On Cyber CTFs, FrontierMath, and TerminalBench these gains are concentrated more strongly on harder tasks. HLE shows gains that are more uniform across difficulty, while SWE-Bench Pro shows a weak negative interaction in the same direction but without clear difficulty-specific concentration. HealthBench is the main exception, with no clear overall reliability gain. There, a significant interaction between model generation and task difficulty indicates that what little change in reliability occurs across generations is concentrated on easier rather than harder tasks.

In sum, these results suggest that newer generations improve not only reach but also reliability on reachable tasks — though whether these gains concentrate on harder tasks depends on the benchmark\.

### 3.3 Protocol choices shape inference-scaling gains

The results above show that inference\-scaling gains vary across benchmarks and are driven mainly by greater reach and reliability\. We therefore examine two protocol dimensions that may shape those gains within our setup: repeated submission under different feedback conditions \([Section3\.3\.1](https://arxiv.org/html/2606.17930#S3.SS3.SSS1)\), and whether a fixed total budget is concentrated in one deep trajectory or spread across several shallower ones \([Section3\.3\.2](https://arxiv.org/html/2606.17930#S3.SS3.SSS2)\)\.

![Refer to caption](https://arxiv.org/html/2606.17930v1/figures/fig3.png)

Figure 3: Serial scaling under no feedback versus oracle score feedback. (A) Cumulative solve rate versus submission index $k$, pooled with equal weight across models within each benchmark–condition cell. Bands: $\pm 1$ SEM across models. Grey dashed lines: no feedback; navy solid lines: oracle score feedback. (B) Iteration uplift $\Delta$ against model release date. Points are models; lines are per-condition Theil–Sen fits ([Section A.4.2](https://arxiv.org/html/2606.17930#A1.SS4.SSS2)). Per-facet Kendall $\tau$ between release date and $\Delta$ is annotated at the top of each panel (grey: no feedback; navy: oracle score feedback).

#### 3.3.1 Repeated submission helps most when feedback enables continued search

On longer\-horizon tasks, the return to additional serial inference compute may depend on whether models are allowed to resubmit answers within a trajectory and on what feedback they receive after doing so\. To test this, we compare the two conditions defined in Methods: no feedback \(ambiguous acknowledgement\) and oracle score feedback \(simple correctness signal\)\. The two cyber benchmarks are excluded because they were collected under only one feedback condition\. For comparability across benchmarks, these analyses use binary trajectory outcomes; for HealthBench, which is natively continuously scored, we take the highest\-scored submission within each trajectory and binarise using a median split \(median score = 0\.38\)\.
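The binarisation step for continuously scored trajectories can be sketched in a few lines of Python. The function name and the example scores are hypothetical, and the sketch assumes that a trajectory counts as solved only when its best submission scores strictly above the threshold; how ties at the median are handled is not specified in the text.

```python
from typing import List


def binarise_trajectories(
    submission_scores: List[List[float]], threshold: float = 0.38
) -> List[int]:
    """Map each trajectory to 1 if its best submission exceeds the threshold.

    `submission_scores[i]` holds the continuous scores of every submission in
    trajectory i. The strict inequality is an assumption of this sketch.
    """
    return [int(max(scores) > threshold) for scores in submission_scores]


# Three hypothetical trajectories with one or more scored submissions each.
outcomes = binarise_trajectories([[0.20, 0.35], [0.30, 0.41, 0.39], [0.38]])
print(outcomes)
```

Taking the maximum over submissions means that a later, lower-scoring submission cannot undo an earlier success, which matches the cumulative view of performance used in this section.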

##### Repeated submissions improve performance on all benchmarks\.

Allowing repeated submissions improves cumulative performance on all five evaluated benchmarks ([Figure 3](https://arxiv.org/html/2606.17930#S3.F3)A; [Table 5](https://arxiv.org/html/2606.17930#S3.T5)). Averaged over conditions, cumulative performance rises by $+6.4$ points on FrontierMath, $+10.4$ on TerminalBench, $+14.2$ on HealthBench, $+14.4$ on SWE-Bench Pro, and $+17.1$ on HLE from the first submission to the highest cumulative level reached, corresponding to uplift multipliers from $1.11\times$ on FrontierMath to $1.70\times$ on HLE.
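To make these two summaries concrete, the sketch below computes an iteration gain and an uplift multiplier from a cumulative curve. It assumes that the multiplier is the ratio of the highest cumulative level to the first-submission level; the function name and the example curve are hypothetical.

```python
from typing import Sequence, Tuple


def iteration_summary(cumulative: Sequence[float]) -> Tuple[float, float]:
    """Return (iteration gain in points, uplift multiplier).

    `cumulative[k - 1]` is the cumulative solve rate, in percent, after k
    submissions. The multiplier definition (best / first) is an assumption.
    """
    first, best = cumulative[0], max(cumulative)
    gain = best - first
    multiplier = best / first if first > 0 else float("inf")
    return gain, multiplier


# Hypothetical cumulative curve over four submissions.
gain, multiplier = iteration_summary([50.0, 56.0, 59.0, 60.0])
print(gain, multiplier)
```

A multiplier defined this way depends on the first-submission level: the same 10-point gain gives a multiplier of 1.2 from a base of 50 but 2.0 from a base of 10, so gains and multipliers are best read together.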

Table 5: Descriptive summaries of repeated-submission gains. Iteration gain is the increase in cumulative performance, in percentage points, from the first submission to the highest observed cumulative level. Uplift multipliers are reported for the overall, no-feedback, and oracle-score-feedback settings. $k_{90}$ is the number of submissions required to realise 90% of this gain.

##### Oracle feedback increases iteration gains most where it supports continued search.

A regression analysis comparing no-feedback and oracle-feedback trajectories within tasks ([Section A.4.3](https://arxiv.org/html/2606.17930#A1.SS4.SSS3)) shows that oracle feedback significantly improves eventual success on HLE, with iteration adding $+25.1$ points under oracle feedback versus $+9.1$ under no feedback ($2.76\times$). Oracle feedback also significantly improves eventual success on SWE-Bench Pro, though with a smaller descriptive uplift ($+16.0$ vs. $+12.7$; $1.26\times$). By contrast, FrontierMath, TerminalBench, and HealthBench show no significant benchmark-level success effect of oracle feedback, despite modest descriptive uplift under oracle feedback on FrontierMath ($+7.5$ vs. $+5.3$; $1.42\times$), TerminalBench ($+12.4$ vs. $+8.4$; $1.48\times$), and HealthBench ($+15.7$ vs. $+12.7$; $1.24\times$). On HealthBench, this repeated-submission gain coexists with the weak inference scaling over total tokens consumed reported in [Section 3.1](https://arxiv.org/html/2606.17930#S3.SS1), suggesting that, under our protocol, iterative refinement can help even when longer token trajectories do not.

##### Some benchmarks benefit from only a few extra attempts, while others support longer serial search\.

The shape of these gains differs across benchmarks\. FrontierMath and HealthBench realise 90% of their total iteration gain within three submissions in both conditions, and SWE\-Bench Pro within three to four, suggesting that repeated submission is useful but shallow in these settings\. Consistent with this, most unsolved trajectories on these benchmarks end after the model begins repeating semantically similar answers rather than exhausting the token budget \(92–100% of unsolved trajectories across these three benchmarks;[Table6](https://arxiv.org/html/2606.17930#S3.T6)\), indicating that models typically converge after a small number of distinct attempts rather than sustaining long productive search\. By contrast, HLE continues improving for 12–13 submissions, indicating a longer search process composed largely of many short answer attempts\. TerminalBench also benefits from more extended search and shows the strongest dependence on feedback, reaching 90% of its total iteration gain by submission 5 under oracle feedback but requiring up to 14 submissions under no feedback\. This matches its much higher rate of budget exhaustion \(34% under no feedback and 39% of non\-correct trajectories under oracle feedback;[Table6](https://arxiv.org/html/2606.17930#S3.T6)\), consistent with longer, more interaction\-heavy attempts in which correctness feedback helps models decide sooner when to stop, revise, or switch approaches\. Consistent with this interpretation, among trajectories that eventually succeed, oracle feedback does not reliably reduce submissions to the first correct answer, and on HLE and SWE\-Bench Pro successful oracle\-feedback trajectories use slightly more submissions on average \([SectionA\.4\.3](https://arxiv.org/html/2606.17930#A1.SS4.SSS3)\), suggesting that feedback can enable productive continued search rather than merely accelerating convergence\.
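The "90% of total iteration gain" summary used above can be sketched as follows. The function name and both example curves are hypothetical; the sketch takes the first submission index at which the cumulative curve reaches the first-submission level plus 90% of the total gain.

```python
from typing import Sequence


def submissions_to_fraction(cumulative: Sequence[float], fraction: float = 0.9) -> int:
    """Smallest submission index k (1-based) realising `fraction` of the gain.

    `cumulative[k - 1]` is the cumulative solve rate after k submissions; the
    total gain is measured from the first submission to the highest level.
    """
    first, best = cumulative[0], max(cumulative)
    target = first + fraction * (best - first)
    for k, value in enumerate(cumulative, start=1):
        if value >= target:
            return k
    return len(cumulative)  # not reached in practice, since max(cumulative) >= target


# Hypothetical shallow curve: most of the gain arrives within three submissions.
shallow_k90 = submissions_to_fraction([50.0, 56.0, 59.5, 60.0])
# Hypothetical extended curve: gains keep accumulating over more submissions.
extended_k90 = submissions_to_fraction([20.0, 24.0, 27.0, 29.0, 30.5, 31.5, 32.0])
print(shallow_k90, extended_k90)
```

Two curves with similar total gains can therefore differ markedly in how many submissions are needed to realise them, which is the distinction drawn above between shallow and extended serial search.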

Table 6: Trajectory termination outcomes by benchmark and feedback condition. Percentages are over trajectories within each (benchmark, condition) cell. Under oracle score feedback, correct trajectories terminate immediately upon a fully correct submission, so the balance between repeated similar answers and budget exhaustion is most comparable across benchmarks within, rather than across, feedback conditions. \*Under oracle feedback, HealthBench counts a trajectory “Correct” only when a perfect continuous score of 1.0 is achieved, which never occurred in this evaluation.

##### Self-guided iteration often matters less for newer models.

We also asked whether the value of repeated submission changes across model generations. For each (benchmark, model, condition) cell, we measure iteration uplift as the increase in cumulative success from the first submission to the highest level reached thereafter, and relate it to model release date using Kendall rank correlations ($\tau$; [Figure 3](https://arxiv.org/html/2606.17930#S3.F3)B; Appendix [Table 11](https://arxiv.org/html/2606.17930#A1.T11)). These correlations are descriptive only, as each (benchmark, condition) cell contains just six models. With that caveat, the broad pattern is that newer models tend to find repeated submission less helpful on HLE and TerminalBench, consistent with newer models solving more tasks on their first submission and leaving less room for improvement. On HealthBench, newer models appear to benefit somewhat more (positive $\tau$ in both conditions). SWE-Bench Pro shows essentially no generational effect ($\tau = +0.07$ in both conditions), and arguably neither does FrontierMath, where the two feedback conditions point in opposite directions and estimates rest on only twelve tasks.
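For readers unfamiliar with the statistic, a minimal pure-Python sketch of a Kendall rank correlation is given below. It implements the tau-a variant, which makes no correction for ties; whether the reported values use this variant is not stated here, and the release ordering and uplift values in the example are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Six hypothetical models in release order, with hypothetical iteration uplifts.
release_order = [1, 2, 3, 4, 5, 6]
uplift = [10.0, 12.0, 8.0, 9.0, 5.0, 6.0]
tau = kendall_tau_a(release_order, uplift)
print(round(tau, 2))
```

With six models there are only 15 model pairs, so in the absence of ties a single reordered pair shifts $\tau$ by $2/15 \approx 0.13$. This coarse granularity is one reason to treat the correlations as descriptive.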

![Refer to caption](https://arxiv.org/html/2606.17930v1/figures/fig4.png)

Figure 4: Parallel scaling: width vs depth, and uplift across generations. (A) Per benchmark, $x$-axis: total token budget $B$ on a log scale; $y$-axis: $\text{pass@}k$ at total budget $B$ when the budget is split into $k$ trajectories of $B/k$ tokens each. Lines show $k=1$ (light grey), 2, 5, and 10 (navy). Bands: $\pm 1$ SEM across models. (B) Per benchmark, $x$-axis: model release date; $y$-axis: $\Delta_{\text{parallel}} = \text{pass@}10(B^{\star}) - \text{pass@}1(B^{\star})$. One marker per model. The line is a Theil–Sen trend fit ([Section A.5.1](https://arxiv.org/html/2606.17930#A1.SS5.SSS1)); the inset reports Kendall $\tau$.

#### 3.3.2 Serial versus parallel allocation of fixed total compute

The interventions so far all allocate more compute to a single trajectory, that is, *serial* scaling. The same total compute can instead be split across multiple independent trajectories. This matters for evaluation because fixed-budget single-trajectory protocols may understate capability either by starving individual runs of depth or by failing to capture gains from independent restarts. We compare pass@$k$ at fixed total budget $B$, splitting $B$ across $k \in \{1, 2, 5, 10\}$ independent trajectories; the exact estimator, pooling details, and our choice of pass@$k$ over consensus-based aggregation are reported in [Section A.5.1](https://arxiv.org/html/2606.17930#A1.SS5.SSS1).
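The width–depth comparison can be illustrated with a short Python sketch. It is not the estimator of Section A.5.1: it assumes the widely used unbiased pass@k estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ sampled trajectories of which $c$ succeed, and it assumes that a trajectory counts as solved at per-trajectory budget $B/k$ if its first correct submission occurs within $B/k$ tokens. The function names and token counts are hypothetical.

```python
from math import comb
from typing import Optional, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples with c successes (requires n >= k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_fixed_budget(
    tokens_to_solve: Sequence[Optional[int]], total_budget: int, k: int
) -> float:
    """pass@k for one task when `total_budget` is split evenly over k trajectories.

    `tokens_to_solve[i]` is the token count at which trajectory i first solves
    the task, or None if it never does.
    """
    per_trajectory = total_budget / k
    n = len(tokens_to_solve)
    c = sum(t is not None and t <= per_trajectory for t in tokens_to_solve)
    return pass_at_k(n, c, k)


# Ten hypothetical trajectories on one task, evaluated at a 10M-token budget.
runs = [200_000, None, 1_500_000, None, 4_000_000,
        None, None, 800_000, None, 6_000_000]
for k in (1, 2, 5, 10):
    print(k, round(pass_at_k_fixed_budget(runs, 10_000_000, k), 3))
```

Increasing $k$ has two opposing effects in this sketch: each trajectory gets a smaller budget, so fewer individual runs succeed, but more independent attempts are pooled. Which effect dominates depends on how many tokens a typical successful run needs.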

Table 7: Parallel-sampling gains at each benchmark’s token cap. pass@1 and pass@10 are averaged across models. $\Delta_{\text{parallel}}$ is pass@10 minus pass@1 at the benchmark’s token cap. $\tau_{\text{gen}}$ is the Kendall rank correlation between per-model $\Delta_{\text{parallel}}$ and release date.

##### Benchmarks differ in how much they benefit from parallel scaling.

At each benchmark’s token cap, redistributing the total token budget from a single deep trajectory to ten shallower trajectories improves average performance on all five benchmarks, but by different amounts: FrontierMath $0.625 \rightarrow 0.653$ ($\Delta_{\text{parallel}} = +0.028$), HLE $0.412 \rightarrow 0.595$ ($+0.183$), HealthBench $0.501 \rightarrow 0.784$ ($+0.283$), SWE-Bench Pro $0.673 \rightarrow 0.701$ ($+0.028$), and TerminalBench $0.657 \rightarrow 0.720$ ($+0.063$) ([Table 7](https://arxiv.org/html/2606.17930#S3.T7); [Figure 4](https://arxiv.org/html/2606.17930#S3.F4)A). Parallel sampling therefore matters substantially for HLE and HealthBench, but much less for FrontierMath, TerminalBench, and SWE-Bench Pro. We separately verified that the larger gains on the LLM-graded benchmarks (HLE and HealthBench) are not driven by judge unreliability ([Section A.5.2](https://arxiv.org/html/2606.17930#A1.SS5.SSS2)).

##### Benefit of parallel sampling depends on budget\.

To assess when width becomes preferable to depth, we identify the total budget $B$ at which pass@10 exceeds pass@1 ([Figure 4](https://arxiv.org/html/2606.17930#S3.F4)A). This crossover occurs very early on HealthBench (from roughly 23k total tokens) and HLE (from roughly 177k), but only near the cap on FrontierMath (9M–10M), TerminalBench (5M–10M), and SWE-Bench Pro (23M–30M). These patterns indicate that parallel sampling helps only once each branch receives enough compute to make progress within each trajectory, and that benchmarks differ substantially in how much depth each independent attempt requires before extra width becomes useful.
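One simple way to locate such a crossover on a discrete budget grid is sketched below. It assumes the crossover is taken as the first grid budget at which pass@10 strictly exceeds pass@1; the function name and all values are hypothetical, and the procedure used for the reported crossovers may differ.

```python
from typing import Optional, Sequence


def crossover_budget(
    budgets: Sequence[float],
    pass_at_1: Sequence[float],
    pass_at_10: Sequence[float],
) -> Optional[float]:
    """First budget (in grid order) at which pass@10 exceeds pass@1, else None."""
    for budget, p1, p10 in zip(budgets, pass_at_1, pass_at_10):
        if p10 > p1:
            return budget
    return None


# Hypothetical curves on a log-spaced grid of total token budgets.
budgets = [1e4, 1e5, 1e6, 1e7]
p1 = [0.10, 0.30, 0.50, 0.60]
p10 = [0.05, 0.25, 0.55, 0.70]
print(crossover_budget(budgets, p1, p10))
```

On a coarse grid the result is only an upper bound on the true crossover, which is consistent with reporting crossovers as approximate values or ranges.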

##### Newer models generally benefit less from parallel sampling\.

On all five benchmarks, the gain from parallel sampling decreases with model release date ([Table 7](https://arxiv.org/html/2606.17930#S3.T7); [Figure 4](https://arxiv.org/html/2606.17930#S3.F4)B): TerminalBench ($\tau = -0.73$), FrontierMath ($-0.60$), HealthBench ($-0.47$), HLE ($-0.47$), and most weakly SWE-Bench Pro ($\tau = -0.33$). For the newest models, parallel sampling sometimes does not help at all: on FrontierMath, GPT-5.2, Opus 4.6, and GPT-5.4 all achieve lower pass@10 than pass@1 at the cap, and Opus 4.6 shows the same pattern on TerminalBench. Newer models therefore often extract more value from a single long trajectory, reducing the marginal benefit of independent restarts.

## 4 Discussion

### 4.1 Summary

Across seven challenging benchmarks, frontier-model performance improves with additional inference-time compute, but the magnitude and shape of those gains differ substantially across settings. Under our protocol, typical published budgets leave meaningful additional performance unrealised on FrontierMath, HLE, and the cyber evaluations, but yield much smaller gains on HealthBench and the two software engineering benchmarks. Cross-generational improvements at larger budgets come mainly from greater reach and reliability rather than from consistently better token efficiency. Measured performance is also sensitive to protocol choices: repeated submission improves performance on all five benchmarks in the main suite, oracle feedback helps most where it enables continued search, and the value of parallel sampling depends strongly on task structure and budget regime. Taken together, these results show that benchmark scores reflect not only the model being evaluated, but also the inference-time budget and protocol used to elicit its performance.

### 4.2 Interpretation

A central question is why responsiveness to the inference\-scaling interventions tested here differs so much across benchmarks\. These differences should be interpreted as properties of the aggregate performance curves observed under our protocol, not as evidence that any benchmark is intrinsically compute\-saturated\. Apparent flattening may still conceal headroom under larger budgets, alternative scaffolds, or different allocations of inference\-time compute, and benchmark\-level curves can mask substantial heterogeneity across tasks\.

One plausible interpretation is that the key distinction is not domain alone, but whether a benchmark gives models opportunities to use additional compute productively through extended search, revision, tool use, or self-verification (Brown et al., [2024](https://arxiv.org/html/2606.17930#bib.bib36); Snell et al., [2024](https://arxiv.org/html/2606.17930#bib.bib27); Balachandran et al., [2025](https://arxiv.org/html/2606.17930#bib.bib33)). This view is consistent with the stronger scaling we observe on FrontierMath and the cyber tasks, the more modest but continuing gains on TerminalBench, the smaller gains on SWE-Bench Pro, and the weak scaling on HealthBench under our scaffold. HLE appears intermediate: although stateless, it still benefits considerably from repeated submissions, especially when correctness feedback is available. More broadly, this perspective may help explain why recent work has found strong inference scaling in some settings (Folkerts et al., [2026](https://arxiv.org/html/2606.17930#biba.bib51); Muennighoff et al., [2025](https://arxiv.org/html/2606.17930#bib.bib15); Wu et al., [2024](https://arxiv.org/html/2606.17930#bib.bib16); Ma et al., [2025](https://arxiv.org/html/2606.17930#bib.bib12); Epoch AI and METR, [2026](https://arxiv.org/html/2606.17930#bib.bib14)) and weak or absent scaling in others (Gema et al., [2025](https://arxiv.org/html/2606.17930#bib.bib21); Jurkovic, [2026](https://arxiv.org/html/2606.17930#bib.bib24); Merrill et al., [2026](https://arxiv.org/html/2606.17930#biba.bib9)).

Our cross-generational results suggest that recent model progress has improved the ability to use inference-time compute productively on tasks that permit it. Later models usually achieve higher performance at large budgets and often begin succeeding earlier in the token allowance, but these gains come mainly from increased reach and reliability rather than from large improvements in token efficiency. This pattern is consistent with recent time-horizon analyses attributing frontier-model progress primarily to greater reliability and the ability to tackle longer or harder tasks (Kwa et al., [2025](https://arxiv.org/html/2606.17930#bib.bib37); METR, [2026](https://arxiv.org/html/2606.17930#bib.bib38)). It also implies that fixed-budget evaluations may become increasingly uneven in what they capture, understating progress most on tasks that benefit from extended search, verification, or repeated attempts.

The submission-protocol and parallel-sampling results reinforce the same broader conclusion. Repeated submissions improved performance on all five main benchmarks, oracle feedback helped most where it supported continued search, and parallel sampling helped most when tasks admitted multiple viable independent attempts. This is broadly consistent with prior work on width–depth trade-offs, external verification, and repeated search in LLM evaluation and elicitation (Brown et al., [2024](https://arxiv.org/html/2606.17930#bib.bib36); Snell et al., [2024](https://arxiv.org/html/2606.17930#bib.bib27); Huang et al., [2024](https://arxiv.org/html/2606.17930#bib.bib39); Stechly et al., [2025](https://arxiv.org/html/2606.17930#bib.bib40); Balachandran et al., [2025](https://arxiv.org/html/2606.17930#bib.bib33)). More generally, inference-time compute, feedback conditions, submission rules, and width–depth allocation should be treated as part of the measurement definition rather than as incidental implementation details.

### 4.3 Recommendations

##### Report capability over a range of inference\-time compute, not just at a single budget

A single fixed\-budget score can either approximate high\-budget performance reasonably well or omit substantial additional gains, depending on the benchmark\. Reporting performance as a function of inference\-time compute helps distinguish near\-saturated evaluations from budget\-censored lower bounds, and clarifies when additional compute continues to buy measurable performance\. This is especially important when evaluations are used to characterise what well\-resourced actors could achieve, rather than only what is accessible under restrictive default settings\. Where possible, evaluators should also distinguish whether gains come from solving new tasks, solving the same tasks more reliably, or solving already\-reachable tasks with less compute, since aggregate scaling curves can obscure these differences\.
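As a minimal sketch of what reporting over a compute range can look like, the snippet below turns per\-trajectory records into a success\-rate curve over token budgets\. It assumes each trajectory is summarised by the cumulative token count at its first correct submission, or `None` if it never succeeded; this record layout, the function name `success_curve`, and the toy numbers are illustrative assumptions rather than the paper's analysis code\.

```python
from typing import Optional, Sequence


def success_curve(
    first_success_tokens: Sequence[Optional[int]],
    budgets: Sequence[int],
) -> list[float]:
    """Fraction of trajectories solved within each token budget.

    first_success_tokens holds, per trajectory, the cumulative tokens used
    at the first correct submission, or None if it never succeeded
    (hypothetical record layout).
    """
    n = len(first_success_tokens)
    if n == 0:
        raise ValueError("need at least one trajectory")
    curve = []
    for budget in budgets:
        solved = sum(
            1 for t in first_success_tokens if t is not None and t <= budget
        )
        curve.append(solved / n)
    return curve


# Toy data (invented): four trajectories, one of which never succeeds.
tokens = [80_000, 450_000, 2_500_000, None]
budgets = [100_000, 1_000_000, 10_000_000]
curve = success_curve(tokens, budgets)
```

In this toy case a score reported only at the smallest budget would hide the fact that the success rate is still rising at the largest one, which is the budget\-censored situation described above\.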

##### Treat protocol choices as part of the evaluation design and clearly document them

In this study, repeated submissions, feedback after submission, and the allocation of compute across serial depth or parallel width all meaningfully changed measured performance on at least some benchmarks\. These choices should therefore be reported explicitly and justified relative to the goals of the evaluation, the deployment setting, or the actor scenario the evaluation is meant to capture\. Oracle correctness feedback is most ecologically valid where external verifiers already exist, such as code execution, formal proofs, or grounded tool outputs, and less so for open\-ended expert reasoning tasks\. Likewise, some benchmarks benefited much more from parallel sampling than others, implying that the same total compute can elicit meaningfully different levels of performance depending on how it is allocated\.
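One lightweight way to document these choices is to publish a small protocol record next to each score\. The sketch below is an assumption about how such a record could look; the class `EvalProtocol`, its field names, and the helper `allocations` are hypothetical and are not part of the paper's tooling\. The helper also shows how one total token budget can be split into different width and depth shapes\.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical record of protocol choices to report alongside a score."""

    token_budget_per_trajectory: int  # serial depth
    parallel_trajectories: int        # parallel width
    max_submissions: int
    feedback: str                     # e.g. "none" or "oracle"
    context_compaction: bool

    @property
    def total_tokens(self) -> int:
        return self.token_budget_per_trajectory * self.parallel_trajectories


def allocations(total_tokens: int, widths: list[int]) -> list[EvalProtocol]:
    """Split one total budget into several width/depth shapes."""
    return [
        EvalProtocol(
            total_tokens // w,
            w,
            max_submissions=1,
            feedback="none",
            context_compaction=True,
        )
        for w in widths
    ]


# Same total compute, three different allocations (toy numbers).
shapes = allocations(8_000_000, [1, 4, 8])
record = asdict(shapes[1])  # serialisable form for a results table
```

Each entry in `shapes` spends the same total compute, so any difference in measured performance between them would be attributable to allocation rather than budget\.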

##### For trend tracking, compare models at matched budgets within a large, shared compute range

Newer models often reach higher performance only at larger budgets, and they do not always use additional inference\-time compute in the same way as earlier generations\. In our task\-level analyses, generational gains came mainly from greater reach and reliability rather than from consistently improved token efficiency\. Evaluations used for trend tracking, long\-horizon performance estimation, or safety thresholding should therefore compare models under the same protocol and over the same broad compute range, or explicitly account for differences in those choices\. Otherwise, fixed\-budget comparisons may become increasingly misleading as models improve, especially on tasks where performance depends strongly on extended search, verification, or repeated attempts\.
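A matched\-budget comparison can be sketched as follows, assuming each model's results are stored as a mapping from token budget to score\. The function name `matched_budget_gaps` and all numbers are invented for illustration; only budgets at which both models were actually run are compared\.

```python
def matched_budget_gaps(
    old: dict[int, float], new: dict[int, float]
) -> dict[int, float]:
    """Score difference (new - old) at every budget both models were run at."""
    shared = sorted(set(old) & set(new))
    if not shared:
        raise ValueError("models share no budget; comparison is not matched")
    return {b: new[b] - old[b] for b in shared}


# Toy curves (invented): the newer model's advantage grows with budget.
old_model = {100_000: 0.20, 1_000_000: 0.25, 10_000_000: 0.26}
new_model = {1_000_000: 0.30, 10_000_000: 0.55, 100_000_000: 0.70}
gaps = matched_budget_gaps(old_model, new_model)
```

Here the gap at the smaller shared budget is much narrower than at the larger one, so a comparison at a single low budget would understate the generational difference\.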

### 4\.4 Limitations

First, we study a single ReAct\-style scaffold, even though bespoke scaffolds and elicitation effort can substantially shift measured capability on the same benchmark and model \(METR, [2024](https://arxiv.org/html/2606.17930#bib.bib41); UK AI Security Institute, [2024b](https://arxiv.org/html/2606.17930#bib.bib42); Jurkovic, [2026](https://arxiv.org/html/2606.17930#bib.bib24); Sun et al\., [2025](https://arxiv.org/html/2606.17930#bib.bib25)\)\. Some benchmark differences we observe may therefore be scaffold\-dependent rather than task\-intrinsic, and the weak scaling on HealthBench in particular may reflect genuine knowledge\-boundedness, judge saturation, scaffold mismatch, or some combination \(Ravichandran et al\., [2025](https://arxiv.org/html/2606.17930#bib.bib43)\)\.

Second, our protocol is deliberately simple: a general combination of larger token budgets, context compaction, and iterative resubmission, without more elaborate strategies such as adaptive branching, verifier\-guided search, or hybrid serial–parallel allocation within and across trajectories\. Such methods may materially change absolute performance, possibly differently across domains, so our results should be interpreted as a conservative lower bound on what broadly reproducible inference scaling can achieve rather than an upper bound under more optimised elicitation\.

Third, FrontierMath contains only twelve tasks in our subset, reducing power for the task\-level analyses and making benchmark\-level patterns sensitive to individual items; newer models may additionally have training overlap with the public set \(Burnham, [2025](https://arxiv.org/html/2606.17930#biba.bib27)\)\.

Fourth, the repetition guard relies on an LLM judge, whose reliability at detecting semantic equivalence may vary across benchmarks, making the guard’s effective strictness non\-uniform across domains\.


## References

- Phan et al\. \[2025\] Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, et al\. Humanity’s last exam\. 2025\. arXiv:2501\.14249\.
- Glazer et al\. \[2024\] Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean\-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, Shreepranav Varma Enugandla, and Mark Wildon\. FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI\. 2024\.
- Folkerts et al\. \[2026\] Linus Folkerts, Will Payne, Simon Inman, Philippos Giavridis, Joe Skinner, Sam Deverett, James Aung, Ekin Zorer, Michael Schmatz, Mahmoud Ghanem, John Wilkinson, Alan Steer, Vy Hong, and Jessica Wang\. Measuring AI agents’ progress on multi\-step cyber attack scenarios\. 2026\. AI Security Institute and Irregular\.
- Merrill et al\. \[2026\] Mike A\. Merrill, Alexander G\. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E\. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, Anurag Kashyap, Jan\-Lucas Uslu, Jeffrey Li, Jianbo Wu, Minghao Yan, Song Bian, Vedang Sharma, Ke Sun, Steven Dillmann, Akshay Anand, Andrew Lanpouthakoun, Bardia Koopah, Changran Hu, Etash Guha, Gabriel H\. S\. Dreiman, Jiacheng Zhu, Karl Krauth, Li Zhong, Niklas Muennighoff, Robert Amanfu, Shangyin Tan, Shreyas Pimpalgaonkar, Tushar Aggarwal, Xiangning Lin, Xin Lan, Xuandong Zhao, Yiqing Liang, Yuanli Wang, Zilong Wang, Changzhi Zhou, David Heineman, Hange Liu, Harsh Trivedi, John Yang, Junhong Lin, Manish Shetty, Michael Yang, Nabil Omi, Negin Raoof, Shanda Li, Terry Yue Zhuo, Wuwei Lin, Yiwei Dai, Yuxin Wang, Wenhao Chai, Shang Zhou, Dariush Wahdany, Ziyu She, Jiaming Hu, Zhikang Dong, Yuxuan Zhu, Sasha Cui, Ahson Saiyed, Arinbjörn Kolbeinsson, Jesse Hu, Christopher Michael Rytting, Ryan Marten, Yixin Wang, Alex Dimakis, Andy Konwinski, and Ludwig Schmidt\. Terminal\-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces\. 2026\.
- Deng et al\. \[2025\] Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler\. SWE\-Bench Pro: Can AI agents solve long\-horizon software engineering tasks? 2025\.
- Kapoor et al\. \[2026\] Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J\.J\. Allaire, Rishi Bommasani, Harry Coppock, Magda Dubois, Gillian K\. Hadfield, Andrew B\. Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, Cozmin Ududec, and Arvind Narayanan\. Open\-world evaluations for measuring frontier AI capabilities\. 2026\.
- Ord \[2025\] Toby Ord\. Inference scaling and AI governance\. Technical report, Centre for the Governance of AI \(GovAI\), October 2025\. URL [https://www\.governance\.ai/research\-paper/inference\-scaling\-and\-ai\-governance](https://www.governance.ai/research-paper/inference-scaling-and-ai-governance)\.
- Bengio et al\. \[2025\] Yoshua Bengio, Stephen Clare, Carina Prunkl, Shalaleh Rismani, Maksym Andriushchenko, Ben Bucknall, Philip Fox, Tiancheng Hu, Cameron Jones, Sam Manning, Nestor Maslej, Vasilios Mavroudis, Conor McGlynn, Malcolm Murray, Charlotte Stix, Lucia Velasco, Nicole Wheeler, Daniel Privitera, Sören Mindermann, Daron Acemoglu, Thomas G\. Dietterich, Fredrik Heintz, Geoffrey Hinton, Nick Jennings, Susan Leavy, Teresa Ludermir, Vidushi Marda, Helen Margetts, John McDermid, Jane Munga, Arvind Narayanan, Alondra Nelson, Clara Neppel, Gopal Ramchurn, Stuart Russell, Marietje Schaake, Bernhard Schölkopf, Alvaro Soto, Lee Tiedrich, Gaël Varoquaux, Andrew Yao, Ya\-Qin Zhang, Leandro Aguirre, Olubunmi Ajala, Fahad Albalawi, Noora AlMalek, Christian Busch, André Carvalho, Jonathan Collas, Amandeep Gill, Ahmet Hatip, Juha Heikkilä, Chris Johnson, Gill Jolly, Ziv Katzir, Mary Kerema, Hiroaki Kitano, Antonio Krüger, Aoife McLysaght, Oleksii Molchanovskyi, Andrea Monti, Kyoung Mu Lee, Mona Nemer, Nuria Oliver, Raquel Pezoa, Audrey Plonk, José Portillo, Balaraman Ravindran, Hammam Riza, Crystal Rugege, Haroon Sheikh, Denise Wong, Yi Zeng, and Liming Zhu\. International AI safety report 2025: First key update: Capabilities and risk implications\. Technical Report DSIT 2025/033, Department for Science, Innovation and Technology \(DSIT\), UK, 2025\.
- UK AI Security Institute \[2026a\] UK AI Security Institute\. Claude mythos preview cyber capabilities\. [https://www\.aisi\.gov\.uk/blog/our\-evaluation\-of\-claude\-mythos\-previews\-cyber\-capabilities](https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities), 2026a\. Accessed 2026\-04\-22\.
- UK AI Security Institute \[2026b\] UK AI Security Institute\. Evidence for inference scaling in AI cyber tasks\. [https://www\.aisi\.gov\.uk/blog/evidence\-for\-inference\-scaling\-in\-ai\-cyber\-tasks\-increased\-evaluation\-budgets\-reveal\-higher\-success\-rates](https://www.aisi.gov.uk/blog/evidence-for-inference-scaling-in-ai-cyber-tasks-increased-evaluation-budgets-reveal-higher-success-rates), 2026b\. Accessed 2026\-04\-22\.
- Meta Superintelligence Labs \[2026\] Meta Superintelligence Labs\. Muse spark safety & preparedness report\. Technical report, Meta, 2026\. URL [https://ai\.meta\.com/static\-resource/muse\-spark\-safety\-and\-preparedness\-report/](https://ai.meta.com/static-resource/muse-spark-safety-and-preparedness-report/)\.
- Ma et al\. \[2025\] Yingwei Ma, Yongbin Li, Yihong Dong, Xue Jiang, Rongyu Cao, Jue Chen, Fei Huang, and Binhua Li\. Thinking longer, not larger: Enhancing software engineering agents via scaling test\-time compute\. 2025\.
- Ding and Zhang \[2026\] Yifeng Ding and Lingming Zhang\. SWE\-Replay: Efficient test\-time scaling for software engineering agents\. 2026\.
- Epoch AI and METR \[2026\] Epoch AI and METR\. Evidence that AI can already do some weeks\-long coding tasks\. Epoch AI blog, 2026\. URL [https://epoch\.ai/blog/mirrorcode\-preliminary\-results](https://epoch.ai/blog/mirrorcode-preliminary-results)\.
- Muennighoff et al\. \[2025\] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei\-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto\. s1: Simple test\-time scaling\. In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, 2025\.
- Wu et al\. \[2024\] Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang\. Inference scaling laws: An empirical analysis of compute\-optimal inference for problem\-solving with language models\. 2024\.
- Huang et al\. \[2025\] Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, and Yuyin Zhou\. m1: Unleash the potential of test\-time scaling for medical reasoning with large language models\. 2025\.
- Byun et al\. \[2026\] Ji Young Byun, Young\-Jin Park, Navid Azizan, and Rama Chellappa\. Enhancing fine\-tuning\-free clinical reasoning via test\-time scaling\. 2026\.
- Anthropic \[2026\] Anthropic\. Claude mythos preview release\. [https://www\.anthropic\.com/glasswing](https://www.anthropic.com/glasswing), 2026\. Accessed 2026\-04\-22\.
- Wei et al\. \[2025\] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese\. BrowseComp: A simple yet challenging benchmark for browsing agents\. 2025\.
- Gema et al\. \[2025\] Aryo Pradipta Gema, Alexander Hägele, Runjin Chen, Andy Arditi, Jacob Goldman\-Wetzler, Kit Fraser\-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, Pasquale Minervini, Yanda Chen, Joe Benton, and Ethan Perez\. Inverse scaling in test\-time compute\. *Transactions on Machine Learning Research \(TMLR\)*, 2025\.
- Laban et al\. \[2025\] Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville\. LLMs get lost in multi\-turn conversation\. 2025\.
- Sprague et al\. \[2024\] Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett\. To CoT or not to CoT? chain\-of\-thought helps mainly on math and symbolic reasoning\. 2024\.
- Jurkovic \[2026\] Nikola Jurkovic\. Measuring time horizon using Claude Code and Codex\. METR blog, February 2026\. URL [https://metr\.org/notes/2026\-02\-13\-measuring\-time\-horizon\-using\-claude\-code\-and\-codex/](https://metr.org/notes/2026-02-13-measuring-time-horizon-using-claude-code-and-codex/)\.
- Sun et al\. \[2025\] Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, and Jiecao Chen\. Scaling long\-horizon LLM agent via context\-folding\. 2025\.
- Wang et al\. \[2023\] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou\. Self\-consistency improves chain of thought reasoning in language models\. In *International Conference on Learning Representations*, 2023\.
- Snell et al\. \[2024\] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar\. Scaling LLM test\-time compute optimally can be more effective than scaling model parameters\. *arXiv preprint arXiv:2408\.03314*, 2024\.
- Marchand et al\. \[2026\] Rahul Marchand, Art O Cathain, Jerome Wynne, Philippos Maximos Giavridis, Sam Deverett, John Wilkinson, Jason Gwartz, and Harry Coppock\. Quantifying frontier LLM capabilities for container sandbox escape\. 2026\.
- Yao et al\. \[2023\] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao\. ReAct: Synergizing reasoning and acting in language models\. In *International Conference on Learning Representations \(ICLR\)*, 2023\.
- UK AI Security Institute \[2024a\] UK AI Security Institute\. Inspect: An open\-source framework for large language model evaluations\. Software package, 2024a\. URL [https://inspect\.aisi\.org\.uk/](https://inspect.aisi.org.uk/)\.
- Center for AI Safety et al\. \[2026\] Center for AI Safety, Scale AI, and HLE Contributors Consortium\. A benchmark of expert\-level academic questions to assess AI capabilities\. *Nature*, 649:1139, 2026\. doi:10\.1038/s41586\-025\-09962\-4\.
- UK AI Security Institute \[2026c\] UK AI Security Institute\. How fast is autonomous AI cyber capability advancing? [https://www\.aisi\.gov\.uk/blog/how\-fast\-is\-autonomous\-ai\-cyber\-capability\-advancing](https://www.aisi.gov.uk/blog/how-fast-is-autonomous-ai-cyber-capability-advancing), 2026c\. Accessed 2026\-05\-15\.
- Balachandran et al\. \[2025\] Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, and Safoora Yousefi\. Inference\-time scaling for complex tasks: Where we stand and what lies ahead\. 2025\. Microsoft Research, MSR\-TR\-2025\-16\.
- Arora et al\. \[2025\] Rahul K\. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal\. HealthBench: Evaluating large language models towards improved human health\. 2025\. OpenAI\.
- Burnham \[2025\] Greg Burnham\. Less than 70% of FrontierMath is within reach for today’s models\. Epoch AI, Gradient Updates, October 2025\. URL [https://epoch\.ai/gradient\-updates/less\-than\-70\-percent\-of\-frontiermath\-is\-within\-reach\-for\-todays\-models](https://epoch.ai/gradient-updates/less-than-70-percent-of-frontiermath-is-within-reach-for-todays-models)\.
- Brown et al\. \[2024\] Bradley C\. A\. Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V\. Le, Christopher Ré, and Azalia Mirhoseini\. Large language monkeys: Scaling inference compute with repeated sampling\. 2024\.
- Kwa et al\. \[2025\] Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M\. Ziegler, Elizabeth Barnes, and Lawrence Chan\. Measuring AI ability to complete long tasks\. METR blog, March 2025\. URL [https://metr\.org/blog/2025\-03\-19\-measuring\-ai\-ability\-to\-complete\-long\-tasks/](https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/)\.
- METR \[2026\] METR\. Time horizon 1\.1\. METR blog, January 2026\. URL [https://metr\.org/blog/2026\-1\-29\-time\-horizon\-1\-1/](https://metr.org/blog/2026-1-29-time-horizon-1-1/)\.
- Huang et al\. \[2024\] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou\. Large language models cannot self\-correct reasoning yet\. In *International Conference on Learning Representations \(ICLR\)*, 2024\.
- Stechly et al\. \[2025\] Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati\. On the self\-verification limitations of large language models on reasoning and planning tasks\. In *International Conference on Learning Representations \(ICLR\)*, 2025\.
- METR \[2024\] METR\. Guidelines for capability elicitation\. METR Autonomy Evaluation Resources, March 2024\. URL [https://evaluations\.metr\.org/elicitation\-protocol/](https://evaluations.metr.org/elicitation-protocol/)\.
- UK AI Security Institute \[2024b\] UK AI Security Institute\. Early lessons from evaluating frontier AI systems\. AISI blog, October 2024b\. URL [https://www\.aisi\.gov\.uk/blog/early\-lessons\-from\-evaluating\-frontier\-ai\-systems](https://www.aisi.gov.uk/blog/early-lessons-from-evaluating-frontier-ai-systems)\.
- Ravichandran et al\. \[2025\] Sandhanakrishnan Ravichandran, Shivesh Kumar, Rogerio Corga Da Silva, Miguel Romano, Reinhard Berkels, Michiel van der Heijden, Olivier Fail, and Valentine Emmanuel Gnanapragasam\. OpenAI’s HealthBench in action: Evaluating an LLM\-based medical assistant on realistic clinical queries\. 2025\. Synduct; evaluates DR\. INFO agentic RAG system on HealthBench Hard\.
- Scale AI \[2026\] Scale AI\. SEAL leaderboards: SWE\-Bench Pro public\. Scale Labs leaderboard, 2026\. URL [https://labs\.scale\.com/leaderboard/swe\_bench\_pro\_public](https://labs.scale.com/leaderboard/swe_bench_pro_public)\. Accessed April 2026\.
- Chen et al\. \[2021\] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert\-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N\. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba\. Evaluating large language models trained on code\. *arXiv preprint arXiv:2107\.03374*, 2021\.

## Appendix A

### A\.1 Experimental protocol

#### A\.1\.1 Benchmark details

This section gives per\-benchmark details on dataset loading, filtering, and grading\. All benchmarks are loaded via Inspect AI\-compatible wrappers, and all LLM\-graded scoring uses GPT\-4o\-mini at temperature 0\. For stateful benchmarks, tool calls that exceed 180 seconds are terminated and return an error to the model, which may then continue to the next turn\.

##### TerminalBench\.

We use the Terminal\-Bench 2\.0 release \[Merrill et al\., [2026](https://arxiv.org/html/2606.17930#biba.bib9)\] accessed via the inspect\_harbor registry, restricted to the 82 tasks, out of 89, whose scorer reliably emits a score event\. Each task ships with its own pre\-built Docker image, with CPU, memory, and GPU specifications defined per task, and verification is performed by the task’s bundled test harness with binary pass/fail output\. Tasks are executed in a Kubernetes\-backed sandbox, one pod per trajectory, using inspect\-k8s\-sandbox, with per\-task images pulled from a private ECR registry\.

##### SWE\-Bench Pro\.

We use SWE\-Bench Pro \[Deng et al\., [2025](https://arxiv.org/html/2606.17930#biba.bib23)\] via the inspect\_harbor registry\. The registry ships two prompt variants per task, 1,462 samples across 731 unique instances\. For comparability with published baselines, we keep a single variant per task, the SWE\-bench\-format variant whose input begins with <uploaded\_files\>, yielding 731 unique problems\. A canonical subset of the first 100 problems is drawn deterministically and fixed across all runs\. Verification uses each instance’s bundled test harness against the candidate patch with binary pass/fail output\. Per\-task Docker images are pre\-built and pushed to a private ECR registry, and trajectories run in Kubernetes\-backed sandboxes\.
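The variant filtering and subset selection described above can be sketched as follows\. This is an illustration under assumptions: the sample fields `instance_id` and `input` are hypothetical names, and "first 100" is taken to mean the first 100 instances in registry order, which the text does not spell out\.

```python
def canonical_subset(samples: list[dict], n: int = 100) -> list[dict]:
    """Keep the <uploaded_files> variant of each instance, then the first n.

    Assumes each sample is a dict with hypothetical keys "instance_id"
    and "input", and that list order is the registry's order.
    """
    kept: dict[str, dict] = {}
    for s in samples:
        if s["input"].startswith("<uploaded_files>"):
            # setdefault keeps only the first matching variant per instance
            kept.setdefault(s["instance_id"], s)
    return list(kept.values())[:n]


# Toy registry: three instances, two prompt variants each.
toy = []
for i in range(3):
    toy.append({"instance_id": f"task-{i}", "input": "Fix the bug in ..."})
    toy.append({"instance_id": f"task-{i}", "input": "<uploaded_files> ..."})

subset = canonical_subset(toy, n=2)
```

Because the selection depends only on input order, repeated calls return the same subset, which matches the requirement that the subset be fixed across runs\.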

##### FrontierMath\.

We use the public dataset for FrontierMath \[Glazer et al\., [2024](https://arxiv.org/html/2606.17930#biba.bib35)\], loaded from a local JSONL file\. Each record contains the question text, the expected return type, and a per\-problem verification code block\. Solutions are submitted as Python functions: an answer\(\) function that takes no arguments and returns a value of the declared type, with no comments or print output\. The scorer runs the submitted answer\(\) followed by the problem’s verify\(a\) in a sandboxed Python environment and returns CORRECT or INCORRECT based on the verifier’s exit code\. The answer\(\) function has a 30\-second execution timeout, and the verification code has a 120\-second timeout\. The available Python libraries are sympy, numpy, scipy, mpmath, gmpy2, pyadic, galois, and networkx\. The full pinned version set is distributed with the sample via the system prompt\. The sandbox is a self\-managed Docker compose stack, one container per trajectory\.
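To illustrate the submission format and the exit\-code grading, here is a rough sketch, not the scorer used in the study\. The verifier shown is a made\-up stand\-in for the per\-problem verification code, the convention that a truthy `verify(a)` maps to exit code 0 is an assumption, the two separate timeouts are collapsed into one combined limit for brevity, and a plain child interpreter provides no real sandboxing\.

```python
import subprocess
import sys

# Example submission in the described format (no comments, no print output).
SUBMISSION = """
def answer():
    return 2 ** 10
"""

# Hypothetical verifier; real ones ship with each FrontierMath record.
VERIFIER = """
def verify(a):
    return a == 1024
"""


def score(submission: str, verifier: str, timeout_s: float = 150.0) -> str:
    """Run answer() then verify(a) in a child interpreter and grade by exit code."""
    program = (
        submission + "\n" + verifier + "\n"
        "import sys\n"
        "sys.exit(0 if verify(answer()) else 1)\n"
    )
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,  # simplification: one limit instead of 30 s + 120 s
        )
    except subprocess.TimeoutExpired:
        return "INCORRECT"
    # Any non-zero exit, including an uncaught exception, counts as incorrect.
    return "CORRECT" if result.returncode == 0 else "INCORRECT"


grade = score(SUBMISSION, VERIFIER)
```

Grading by exit code means the scorer never has to parse model output, which keeps the verdict binary and robust to formatting differences\.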

##### HealthBench\.

We use HealthBench \[Arora et al\., [2025](https://arxiv.org/html/2606.17930#biba.bib39)\] via its inspect\_evals wrapper, restricted to the subset of hard samples, “HealthBench Hard”\. Each sample’s reference is a physician\-designed rubric, a list of criteria, each with an associated integer point value, positive for desirable behaviours and negative for penalised ones\. Scoring proceeds by evaluating each criterion independently with an LLM judge, GPT\-4o\-mini at temperature 0, which returns a boolean judgment\. The sample’s continuous score is the sum of the points of met criteria divided by the sum of positive points\. This generally lies in \[0,1\] but can fall below zero when negative\-weighted criteria are triggered\. No tools or sandbox are required\.
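The per\-sample score described above can be written out directly\. The sketch below assumes the judge's output has already been reduced to \(points, met\) pairs; the function name `rubric_score` and the example rubrics are invented for illustration\.

```python
def rubric_score(criteria: list[tuple[int, bool]]) -> float:
    """Sum of points for met criteria divided by the sum of positive points.

    criteria holds (points, met) pairs; points may be negative for
    penalised behaviours.
    """
    positive_total = sum(points for points, _ in criteria if points > 0)
    if positive_total == 0:
        raise ValueError("rubric has no positive-point criteria")
    earned = sum(points for points, met in criteria if met)
    return earned / positive_total


# Toy rubric: two desirable criteria met, one missed, one penalty triggered.
mixed = rubric_score([(5, True), (3, False), (2, True), (-4, True)])

# Toy rubric where only a penalty is triggered, giving a negative score.
penalised = rubric_score([(5, False), (-4, True)])
```

In the first rubric the response earns 5 \+ 2 − 4 = 3 of 10 positive points; in the second it earns −4 of 5, which shows how a score can fall below zero\.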

##### HLE\.

We use Humanity’s Last Exam \[Phan et al\., [2025](https://arxiv.org/html/2606.17930#biba.bib48)\] via its inspect\_evals wrapper\. We filter the dataset to exact\-match questions only, identified by the presence of the phrase “Answer:” or “Exact Answer:” in the sample’s system message, excluding multiple\-choice questions that would otherwise allow an iterative agent to guess its way to the correct answer\. Grading is performed by an LLM judge, GPT\-4o\-mini at temperature 0, using HLE’s provided judge prompt\. The judge is directed to focus on whether the response is semantically equivalent to the reference answer and to emit a final verdict in the format GRADE: C \(correct\) or GRADE: I \(incorrect\)\. Where available, a “Confidence” tag reported by the model is also recorded as metadata\. No tools or sandbox are required\.
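Extracting the judge's verdict from free\-form output can be sketched as below\. This is an assumed parser, not the one used by the HLE wrapper; the regular expression and the choice to take the last verdict when several appear are illustrative decisions\.

```python
import re
from typing import Optional

# Matches "GRADE: C" or "GRADE: I" as a standalone letter.
_GRADE = re.compile(r"GRADE:\s*([CI])\b")


def parse_grade(judge_output: str) -> Optional[bool]:
    """Return True for GRADE: C, False for GRADE: I, None if no verdict found."""
    matches = _GRADE.findall(judge_output)
    if not matches:
        return None
    return matches[-1] == "C"  # use the final verdict if several appear


correct = parse_grade("The answers are equivalent.\nGRADE: C")
incorrect = parse_grade("Reference is 12, response is 13.\nGRADE: I")
missing = parse_grade("The response looks plausible.")
```

Returning `None` for a missing verdict, rather than treating it as incorrect, lets the caller decide whether to re\-query the judge or count the sample as a failure\.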

In addition to the five main benchmarks that we ran directly for this study, we include previously collected results on two cybersecurity benchmarks, Cyber CTFs and The Last Ones \[UK AI Security Institute, [2026b](https://arxiv.org/html/2606.17930#biba.bib50), Folkerts et al\., [2026](https://arxiv.org/html/2606.17930#biba.bib51), UK AI Security Institute, [2026a](https://arxiv.org/html/2606.17930#biba.bib49)\]\. These data were not re\-run under the fully crossed design of the present paper\. We include them only in the inference\-scaling analyses because their collection protocol is closely comparable on the dimensions relevant here, namely large inference budgets, repeated agent interaction in externally verifiable environments, context compaction, and multiple independent trajectories per task\.

##### Cyber CTFs\.

Cyber CTFs is a suite of 71 isolated capture\-the\-flag tasks spanning common offensive\-security skill areas such as web exploitation, cryptography, and reverse engineering \[UK AI Security Institute, [2026b](https://arxiv.org/html/2606.17930#biba.bib50), [a](https://arxiv.org/html/2606.17930#biba.bib49)\]\. Each task is scored by binary flag capture\. In the reused data, models were evaluated with five independent trajectories per task under a high\-budget, externally verifiable protocol with a 50M total\-token cap, and only the final submission per trajectory is retained for scoring\.

##### The Last Ones\.

The Last Ones is a single long\-horizon cyber range task representing a simulated corporate network attack \[Folkerts et al\., [2026](https://arxiv.org/html/2606.17930#biba.bib51)\]\. The objective is to exfiltrate sensitive data from a protected internal database by chaining together reconnaissance, exploitation, credential theft, lateral movement, reverse engineering, and subsequent access\-establishment steps across a multi\-host environment\. The attack chain contains 32 steps grouped into 9 milestones, and progress is scored continuously by milestones toward the final objective\. In the reused data, each in\-trajectory milestone reached is treated as a submission event for inference\-scaling purposes\. Runs use five independent trajectories and large total token budgets, up to 100M, with context compaction used to support long trajectories\. The original evaluation employed a standard ReAct\-style agent with access to Kali Linux tooling and additional cyber\-specific tooling in a virtualised environment \[Folkerts et al\., [2026](https://arxiv.org/html/2606.17930#biba.bib51)\]\.

#### A\.1\.2 Typical published budget estimates

To compare our expanded inference\-time budgets against standard evaluation practice, we define a benchmark\-specific “typical” budget for each benchmark\. Because published evaluations use different stopping rules, we convert all reported limits into approximate per\-trajectory total\-token budgets\.

For three benchmarks, we retrieved specific per\-trajectory token limits from public sources\. These are 16–64k output tokens for HLE \[Phan et al\., [2025](https://arxiv.org/html/2606.17930#biba.bib48)\], 16k output tokens for HealthBench \[Arora et al\., [2025](https://arxiv.org/html/2606.17930#biba.bib39)\], and 1M total tokens for FrontierMath as reported by Epoch AI \[Burnham, [2025](https://arxiv.org/html/2606.17930#biba.bib27)\]\. The latter is a tenfold increase over their earlier 100k\-token scaffold, while the original preprint does not specify a fixed per\-trajectory budget \[Glazer et al\., [2024](https://arxiv.org/html/2606.17930#biba.bib35)\]\.

TerminalBench and SWE\-Bench Pro leaderboards do not specify token limits\. TerminalBench uses per\-task wall\-clock limits of 10 minutes to 3\.3 hours, with a median of 15 minutes \[Merrill et al\., [2026](https://arxiv.org/html/2606.17930#biba.bib9)\]\. SWE\-Bench Pro uses turn caps of 50–250 \[Deng et al\., [2025](https://arxiv.org/html/2606.17930#biba.bib23), Scale AI, [2026a](https://arxiv.org/html/2606.17930#biba.bib11)\]\. To convert these to token equivalents, we use our own collected trajectories\. For each trajectory, we record the cumulative total tokens consumed by the point at which the leaderboard’s wall\-clock or turn limit would have applied\. We average these totals within each model over tasks and trajectories and take the median across models\. This yields 1M–7\.3M tokens for TerminalBench and 473k–16M tokens for SWE\-Bench Pro\. These estimated token limits are displayed in [Figure 1](https://arxiv.org/html/2606.17930#S1.F1)A to illustrate inference\-scaled uplift\. A full per\-benchmark, per\-source tabulation of reported scores and stopping limits is given in Table 8\. The final typical budget estimates per benchmark are reported in [Table 3](https://arxiv.org/html/2606.17930#S3.T3)\.
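The conversion procedure described above can be sketched for the turn\-cap case as follows\. The trajectory layout \(ordered pairs of turn index and cumulative total tokens\), the function names, and the toy numbers are assumptions made for illustration; a wall\-clock version would filter on elapsed time instead of turn index\.

```python
from statistics import mean, median

# Each trajectory: ordered (turn_index, cumulative_total_tokens) pairs.
Trajectory = list[tuple[int, int]]


def tokens_at_turn_cap(traj: Trajectory, turn_cap: int) -> int:
    """Cumulative tokens consumed by the last turn that fits within the cap."""
    within = [tokens for turn, tokens in traj if turn <= turn_cap]
    return within[-1] if within else 0


def typical_budget(by_model: dict[str, list[Trajectory]], turn_cap: int) -> float:
    """Mean within each model over its trajectories, then median across models."""
    per_model = [
        mean(tokens_at_turn_cap(t, turn_cap) for t in trajs)
        for trajs in by_model.values()
    ]
    return median(per_model)


# Toy trajectories for three hypothetical models.
runs = {
    "model_a": [
        [(1, 10_000), (2, 30_000), (3, 70_000)],
        [(1, 20_000), (2, 50_000)],
    ],
    "model_b": [[(1, 5_000), (2, 15_000), (3, 40_000)]],
    "model_c": [[(1, 40_000), (2, 90_000), (3, 200_000)]],
}
estimate = typical_budget(runs, turn_cap=2)
```

Taking the median across models, rather than the mean, keeps one unusually verbose model from dominating the estimate\.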

Table 8: Reported evaluation setups in prior work for each benchmark studied in this paper\. One row per \(benchmark, source citation\), aggregating across 89 distinct harness setups documented in 61 citations; where a single citation documents multiple setups, each parameter value is shown as the range across those setups \(e\.g\. “10M–100M”\)\. Sources are restricted to benchmark preprints, lab system / model cards, and prominent third\-party leaderboards\. Across the 61 citations, stopping limits are undocumented at the following rates: token budget 42/61 \(69%\), turn cap 57/61 \(93%\), wall\-clock cap 47/61 \(77%\), submissions per trajectory 28/61 \(46%\), cost cap 57/61 \(93%\), trajectories per task 34/61 \(56%\)\. An em\-dash \(—\) indicates the citation did not disclose that field in any setup\.

| Benchmark | Citation | Source title | Source type | Date | Token limit | Turn cap | Time limit | Submissions / traj\. | Cost cap | Independent traj\. / task |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TerminalBench 2\.0 | \[DeepMind, [2026](https://arxiv.org/html/2606.17930#biba.bib1)\] | Google DeepMind Gemini 3\.1 Pro model card | system card | 2026\-05? | — | — | — | — | — | — |
| TerminalBench 2\.0 | \[tbench\.ai, [2026](https://arxiv.org/html/2606.17930#biba.bib2)\] | tbench\.ai leaderboard submission rules | leaderboard | 2026\-05? | — | — | per\-task \(benchmark default; not modifiable\) | 1 | — | 5 |
| TerminalBench 2\.0 | \[Anthropic, [2026a](https://arxiv.org/html/2606.17930#biba.bib3)\] | Claude Mythos Preview release \(Glasswing\) | blog post | 2026\-04? | 1M per task | — | per\-task \(benchmark default; up to 4h on TB 2\.1\) | 1 | — | 5 |
| TerminalBench 2\.0 | \[Anthropic, [2026b](https://arxiv.org/html/2606.17930#biba.bib4)\] | Claude Opus 4\.7 announcement | blog post | 2026\-04? | — | — | per\-task \(benchmark default\) | 1 | — | 5 |
| TerminalBench 2\.0 | \[Anthropic, [2026c](https://arxiv.org/html/2606.17930#biba.bib5)\] | Claude Sonnet 4\.6 system card, section 2\.3 | system card | 2026\-04? | — | — | per\-task \(benchmark default\) | 1 | — | 5 |
| TerminalBench 2\.0 | \[Anthropic, [2026d](https://arxiv.org/html/2606.17930#biba.bib6)\] | Claude Opus 4\.6 announcement | blog post | 2026\-03? | — | — | per\-task \(benchmark default\) | 1 | — | 5\-15 |
| TerminalBench 2\.0 | \[OpenAI, [2026a](https://arxiv.org/html/2606.17930#biba.bib7)\] | OpenAI GPT\-5\.4 release | blog post | 2026\-03? | — | — | — | — | — | — |
| TerminalBench 2\.0 | \[OpenAI, [2026b](https://arxiv.org/html/2606.17930#biba.bib8)\] | OpenAI GPT\-5\.3 Codex release \(“Simple Codex” harness\) | blog post | 2026\-02? | — | — | — | — | — | — |
| TerminalBench 2\.0 | \[Merrill et al\., [2026](https://arxiv.org/html/2606.17930#biba.bib9)\] | Terminal\-Bench 2\.0 paper, Appendices A & F \(Terminus\-2 harness\) | preprint | 2026\-01 | no cap | no hard cap | per\-task \(default ~6 min, max 2h\) | 1 | — | 5 |
| SWE\-Bench Pro | \[LLM\-Stats, [2026](https://arxiv.org/html/2606.17930#biba.bib10)\] | LLM\-Stats SWE\-Bench Pro leaderboard | leaderboard | 2026\-05 | — | — | — | — | — | — |
| SWE\-Bench Pro \(public\) | \[Scale AI, [2026a](https://arxiv.org/html/2606.17930#biba.bib11)\] | Scale SEAL leaderboard | leaderboard | 2026\-05 | — | 50–250 | — | 1 | — | 1 |
| SWE\-Bench Pro \(public\) | \[Team, [2026a](https://arxiv.org/html/2606.17930#biba.bib12)\] | Alibaba Qwen 3\.6 Max Preview launch | blog post | 2026\-04 | — | — | — | — | — | — |
| SWE\-Bench Pro \(public\) | \[Team, [2026b](https://arxiv.org/html/2606.17930#biba.bib13)\] | Alibaba Qwen 3\.6 Plus report | blog post | 2026\-04 | — | — | — | — | — | — |
| SWE\-Bench Pro \(public\) | \[Anthropic, [2026a](https://arxiv.org/html/2606.17930#biba.bib3)\] | Anthropic Claude Mythos Preview \(Project Glasswing\) | blog post | 2026\-04 | — | — | — | — | — | — |
| SWE\-Bench Pro \(public\) | \[Anthropic, [2026b](https://arxiv.org/html/2606.17930#biba.bib4)\] | Anthropic Claude Opus 4\.7 release | blog post | 2026\-04 | — | — | — | — | — | — |
| SWE\-Bench Pro \(public\) | \[Augment, [2026](https://arxiv.org/html/2606.17930#biba.bib14)\] | Augment Auggie blog \(Opus 4\.5 on Auggie\) | blog post | 2026\-04 | — | — | — | — | — | — |
| SWE\-Bench Pro \(public\) | \[Blitzy, [2026](https://arxiv.org/html/2606.17930#biba.bib15)\] | Blitzy SWE\-Bench Pro audit | blog post | 2026\-04 | — | — | — | 1 | — | — |
| SWE\-Bench Pro \(public\) | \[Cursor, [2026](https://arxiv.org/html/2606.17930#biba.bib16)\] | Cursor Composer 2 release | blog post | 2026\-04 | — | — | — | — | — | — |
| SWE\-Bench Pro \(public\) | \[MiniMax, [2026](https://arxiv.org/html/2606.17930#biba.bib17)\] | MiniMax M2\.7 launch | blog post | 2026\-04 | — | — | — | — | — | — |
| SWE\-Bench Pro \(public\) | \[Moonshot AI, [2026](https://arxiv.org/html/2606.17930#biba.bib18)\] | Moonshot Kimi K2\.6 tech blog | blog post | 2026\-04 | — | — | — | — | — | — |
| SWE\-Bench Pro \(public\) | \[Labs, [2026](https://arxiv.org/html/2606.17930#biba.bib19)\] | Morph WarpGrep v2 self\-reported | blog post | 2026\-04 | — | — | — | — | — | — |
| SWE\-Bench Pro \(public\) | \[Zhipu AI, [2026](https://arxiv.org/html/2606.17930#biba.bib20)\] | Zhipu GLM\-5\.1 release \(Z\.ai\) | blog post | 2026\-04 | 32k output | — | — | — | — | — |
| SWE\-Bench Pro \(public\) | \[DeepMind, [2026](https://arxiv.org/html/2606.17930#biba.bib1)\] | Google DeepMind Gemini 3\.1 Pro model card | system card | 2026\-03 | — | — | — | 1 | — | 1 |
| SWE\-Bench Pro \(public\) | \[OpenAI, [2026a](https://arxiv.org/html/2606.17930#biba.bib7)\] | OpenAI GPT\-5\.4 release | blog post | 2026\-03 | — | — | — | — | — | — |
| SWE\-Bench Pro \(public\) | \[Anthropic, [2026d](https://arxiv.org/html/2606.17930#biba.bib6)\] | Anthropic Claude Opus 4\.6 release | blog post | 2026\-02 | — | — | — | — | — | — |
| SWE\-Bench Pro \(public\) | \[OpenAI, [2026b](https://arxiv.org/html/2606.17930#biba.bib8)\] | OpenAI GPT\-5\.3\-Codex release | blog post | 2026\-02 | — | — | — | — | — | — |
| SWE\-Bench Pro | \[OpenAI, [2026c](https://arxiv.org/html/2606.17930#biba.bib21)\] | OpenAI GPT\-5\.2 release | blog post | 2026\-01 | — | — | — | — | — | — |
| SWE\-Bench Pro \(public\) | \[Windsurf, [2025](https://arxiv.org/html/2606.17930#biba.bib22)\] | Windsurf/Cognition SWE\-1\.5 launch | blog post | 2025\-10 | — | — | — | — | — | — |
| SWE\-Bench Pro \(public, commercial\) | \[Deng et al\., [2025](https://arxiv.org/html/2606.17930#biba.bib23)\] | SWE\-Bench Pro paper | preprint | 2025\-09 | — | 50 | — | 1 | $2 | 1 |
| FrontierMath \(tier 4\) | \[Epoch AI, [2026](https://arxiv.org/html/2606.17930#biba.bib24)\] | Epoch AI substack \(GPT\-5\.2 Pro Tier 4 record\) | blog post | 2026\-01 | — | — | — | — | — | 1 |
| FrontierMath \(tiers 1\-3\) | \[OpenAI, [2025a](https://arxiv.org/html/2606.17930#biba.bib25)\] | OpenAI GPT\-5\.2 announcement \(science & math\) | blog post | 2025\-12 | — | — | — | 1 | — | 1 |
| FrontierMath \(full\) | \[Anthropic, [2025](https://arxiv.org/html/2606.17930#biba.bib26)\] | Anthropic Claude Opus 4\.5 announcement | blog post | 2025\-11 | — | — | — | — | — | — |
| FrontierMath \(tiers 1\-3\) | \[Burnham, [2025](https://arxiv.org/html/2606.17930#biba.bib27)\] | Epoch AI 1M\-token scaffold sweep \(Burnham\) | blog post | 2025\-11 | 1M | — | 30s/tool call | 1 | — | 32 |
| FrontierMath \(tiers 1\-3\) | \[Epoch AI, [2025a](https://arxiv.org/html/2606.17930#biba.bib28)\] | Epoch AI X post \(Gemini 3 Pro\) | blog post | 2025\-11 | 1M | — | 30s/tool call | 1 | — | 1 |
| FrontierMath \(tiers 1\-3\) | \[Epoch AI, [2025b](https://arxiv.org/html/2606.17930#biba.bib29)\] | Epoch AI benchmarks hub \(GPT\-5\.1 era\) | leaderboard | 2025\-11 | 1M | — | 30s/tool call | 1 | — | 1 |
| FrontierMath \(tiers 1\-3\) | \[Epoch AI, [2025c](https://arxiv.org/html/2606.17930#biba.bib30)\] | Epoch AI X post \(Opus 4\.5\) | blog post | 2025\-11 | 1M | — | 30s/tool call | 1 | — | 1 |
| FrontierMath \(tiers 1\-3\) | \[Epoch AI, [2025d](https://arxiv.org/html/2606.17930#biba.bib31)\] | Epoch AI Grok 4 math eval blog | blog post | 2025\-07 | 100k | — | 30s/tool call | 1 | — | 1 |
| FrontierMath \(tier 4\) | \[Epoch AI, [2025e](https://arxiv.org/html/2606.17930#biba.bib32)\] | Epoch AI Tier 4 benchmark page | leaderboard | 2025\-07 | 1M | — | 30s/tool call | 1 | — | 1 |
| FrontierMath \(tiers 1\-3\) | \[Epoch AI, [2025f](https://arxiv.org/html/2606.17930#biba.bib33)\] | Epoch AI scaffold v1\.0 \(o4\-mini eval\) | blog post | 2025\-04 | 100k | — | 30s/tool call | 1 | — | 1 |
| FrontierMath \(tiers 1\-3\) | \[Epoch AI, [2025g](https://arxiv.org/html/2606.17930#biba.bib34)\] | Epoch AI scaffold v1\.0 \(o3\-mini era\) | blog post | 2025\-03 | 100k | — | 30s/tool call | 1 | — | 1 |
| FrontierMath \(full\) | \[Glazer et al\., [2024](https://arxiv.org/html/2606.17930#biba.bib35)\] | FrontierMath paper \(Glazer et al\.\) | preprint | 2024\-11 | 10k conversation | — | — | 1 | — | 1 |
| HealthBench \(full\) | \[OpenAI, [2026d](https://arxiv.org/html/2606.17930#biba.bib36)\] | GPT\-5\.4 Thinking system card | system card | 2026\-03 | — | — | — | 1 | — | — |
| HealthBench \(full, Hard\) | \[OpenAI, [2025b](https://arxiv.org/html/2606.17930#biba.bib37)\] | GPT\-5\.2 system card, Section 3\.6 | system card | 2025\-12 | — | — | — | 1 | — | — |
| HealthBench \(full, Hard, Consensus\) | \[OpenAI, [2025c](https://arxiv.org/html/2606.17930#biba.bib38)\] | GPT\-5 system card | system card | 2025\-08 | — | — | — | 1 | — | — |
| HealthBench \(full, Hard, Consensus\) | \[Arora et 
al\.,[2025](https://arxiv.org/html/2606.17930#biba.bib39)\]HealthBench paperpreprint2025\-054096 output, 16k output, 9k output——1——HLE \(text\-only\)\[Scale AI,[2025](https://arxiv.org/html/2606.17930#biba.bib40)\]Scale SEAL HLE text\-only leaderboard methodologyleaderboard2026\-05———1——HLE \(full\)\[Scale AI,[2026b](https://arxiv.org/html/2606.17930#biba.bib41)\]Scale SEAL HLE leaderboard methodologyleaderboard2026\-05———1——HLE \(full\)\[DeepMind,[2025a](https://arxiv.org/html/2606.17930#biba.bib42)\]Google Gemini 3 launchblog post2025\-11———1——HLE \(full\)\[DeepMind,[2025b](https://arxiv.org/html/2606.17930#biba.bib43)\]Google Gemini 3 Deep Think blogblog post2025\-11———1——HLE \(full\)\[Moonshot AI,[2025](https://arxiv.org/html/2606.17930#biba.bib44)\]Kimi K2 Thinkingsystem card2025\-1196k thinking, 48k per step120—1—8HLE \(full\)\[OpenAI,[2025d](https://arxiv.org/html/2606.17930#biba.bib45)\]OpenAI GPT\-5 system cardsystem card2025\-08———1——HLE \(text\-only\)\[xAI,[2025](https://arxiv.org/html/2606.17930#biba.bib46)\]xAI Grok 4blog post2025\-07———1—4HLE \(full\)\[OpenAI,[2025e](https://arxiv.org/html/2606.17930#biba.bib47)\]OpenAI Deep Research launchblog post2025\-02———1——HLE \(full, text\-only\)\[Phan et al\.,[2025](https://arxiv.org/html/2606.17930#biba.bib48)\]HLEpreprint2025\-018192 output——1——Cyber CTFs \(apprentice tier, practitioner/expert tier\)\[UK AI Security Institute,[2026a](https://arxiv.org/html/2606.17930#biba.bib49)\]AISI inference scaling blogblog post2026\-04?2\.5M–50M———$6010Cyber CTFs \(expert\-level, apprentice/non\-expert\)\[UK AI Security Institute,[2026b](https://arxiv.org/html/2606.17930#biba.bib50)\]AISI Mythos Preview cyber capabilitiesblog post2026\-03?2\.5M–50M————5–10Cyber CTFs \(71 challenges\)\[Folkerts et al\.,[2026](https://arxiv.org/html/2606.17930#biba.bib51)\]Folkerts et al\. 
2026, Appendix D \(AISI CTF suite\)preprint2026\-03—————10Cyber CTFs \(47 challenges\)\[UK AI Security Institute,[2024a](https://arxiv.org/html/2606.17930#biba.bib52)\]AISI pre\-deployment eval of OpenAI o1blog post2024\-12——————Cyber CTFs \(47 challenges\)\[UK AI Security Institute,[2024b](https://arxiv.org/html/2606.17930#biba.bib53)\]AISI pre\-deployment eval of Claude 3\.5 Sonnet \(upgraded\)blog post2024\-10——————AISI The Last Ones\[Folkerts et al\.,[2026](https://arxiv.org/html/2606.17930#biba.bib51)\]Folkerts et al\. 2026, Table 1preprint2026\-0310M–100M———~$805–10\\c@NAT@ctr

## Eval\-setup references

- DeepMind \[2026\] Google DeepMind. Gemini 3.1 pro model card. [https://deepmind.google/models/model-cards/gemini-3-1-pro/](https://deepmind.google/models/model-cards/gemini-3-1-pro/), 2026. Accessed 2026-04-22.
- tbench.ai \[2026\] tbench.ai. tbench.ai leaderboard. [https://www.tbench.ai/leaderboard/terminal-bench/2.0](https://www.tbench.ai/leaderboard/terminal-bench/2.0), 2026. Accessed 2026-04-22.
- Anthropic \[2026a\] Anthropic. Claude mythos preview release. [https://www.anthropic.com/glasswing](https://www.anthropic.com/glasswing), 2026a. Accessed 2026-04-22.
- Anthropic \[2026b\] Anthropic. Claude opus 4.7 release. [https://www.anthropic.com/news/claude-opus-4-7](https://www.anthropic.com/news/claude-opus-4-7), 2026b. Accessed 2026-04-22.
- Anthropic \[2026c\] Anthropic. Claude sonnet 4.6 system card. [https://www-cdn.anthropic.com/bbd8ef16d70b7a1665f14f306ee88b53f686aa75.pdf](https://www-cdn.anthropic.com/bbd8ef16d70b7a1665f14f306ee88b53f686aa75.pdf), 2026c. Accessed 2026-04-22.
- Anthropic \[2026d\] Anthropic. Claude opus 4.6 release. [https://www.anthropic.com/news/claude-opus-4-6](https://www.anthropic.com/news/claude-opus-4-6), 2026d. Accessed 2026-04-22.
- OpenAI \[2026a\] OpenAI. GPT-5.4 release (self-reported). [https://openai.com/index/introducing-gpt-5-4/](https://openai.com/index/introducing-gpt-5-4/), 2026a. Accessed 2026-04-22.
- OpenAI \[2026b\] OpenAI. GPT-5.3 codex release (self-reported). [https://openai.com/index/introducing-gpt-5-3-codex/](https://openai.com/index/introducing-gpt-5-3-codex/), 2026b. Accessed 2026-04-22.
- Merrill et al. \[2026\] Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, Anurag Kashyap, Jan-Lucas Uslu, Jeffrey Li, Jianbo Wu, Minghao Yan, Song Bian, Vedang Sharma, Ke Sun, Steven Dillmann, Akshay Anand, Andrew Lanpouthakoun, Bardia Koopah, Changran Hu, Etash Guha, Gabriel H. S. Dreiman, Jiacheng Zhu, Karl Krauth, Li Zhong, Niklas Muennighoff, Robert Amanfu, Shangyin Tan, Shreyas Pimpalgaonkar, Tushar Aggarwal, Xiangning Lin, Xin Lan, Xuandong Zhao, Yiqing Liang, Yuanli Wang, Zilong Wang, Changzhi Zhou, David Heineman, Hange Liu, Harsh Trivedi, John Yang, Junhong Lin, Manish Shetty, Michael Yang, Nabil Omi, Negin Raoof, Shanda Li, Terry Yue Zhuo, Wuwei Lin, Yiwei Dai, Yuxin Wang, Wenhao Chai, Shang Zhou, Dariush Wahdany, Ziyu She, Jiaming Hu, Zhikang Dong, Yuxuan Zhu, Sasha Cui, Ahson Saiyed, Arinbjörn Kolbeinsson, Jesse Hu, Christopher Michael Rytting, Ryan Marten, Yixin Wang, Alex Dimakis, Andy Konwinski, and Ludwig Schmidt. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. 2026.
- LLM-Stats \[2026\] LLM-Stats. SWE-Bench Pro leaderboard. [https://llm-stats.com/benchmarks/swe-bench-pro](https://llm-stats.com/benchmarks/swe-bench-pro), 2026. Accessed 2026-04-22.
- Scale AI \[2026a\] Scale AI. SEAL leaderboards: SWE-Bench Pro public. Scale Labs leaderboard, 2026a. URL [https://labs.scale.com/leaderboard/swe_bench_pro_public](https://labs.scale.com/leaderboard/swe_bench_pro_public). Accessed April 2026.
- Team \[2026a\] Alibaba Qwen Team. Qwen 3.6 max preview launch. [https://qwen.alibaba.com/](https://qwen.alibaba.com/), 2026a. Accessed 2026-04-22.
- Team \[2026b\] Alibaba Qwen Team. Qwen 3.6 plus report. [https://www.alibabacloud.com/blog/qwen3-6-plus-towards-real-world-agents_603005](https://www.alibabacloud.com/blog/qwen3-6-plus-towards-real-world-agents_603005), 2026b. Accessed 2026-04-22.
- Augment \[2026\] Augment. Auggie tops SWE-Bench Pro (self-reported). [https://www.augmentcode.com/blog/auggie-tops-swe-bench-pro](https://www.augmentcode.com/blog/auggie-tops-swe-bench-pro), 2026. Accessed 2026-04-22.
- Blitzy \[2026\] Blitzy. Record on SWE-Bench Pro (audit via quesma). [https://blitzy.com/blog/blitzy-scores-a-record-66-5-on-swe-bench-pro](https://blitzy.com/blog/blitzy-scores-a-record-66-5-on-swe-bench-pro), 2026. Accessed 2026-04-22.
- Cursor \[2026\] Cursor. Composer 2 release. [https://cursor.com/blog/composer-2](https://cursor.com/blog/composer-2), 2026. Accessed 2026-04-22.
- MiniMax \[2026\] MiniMax. M2.7 launch (self-reported). [https://www.minimax.io/news/minimax-m27-en](https://www.minimax.io/news/minimax-m27-en), 2026. Accessed 2026-04-22.
- Moonshot AI \[2026\] Moonshot AI. Kimi k2.6 tech blog. [https://www.kimi.com/blog/kimi-k2-6](https://www.kimi.com/blog/kimi-k2-6), 2026. Accessed 2026-04-22.
- Labs \[2026\] Morph Labs. Warpgrep v2 self-report. [https://www.morphllm.com/swe-bench-pro](https://www.morphllm.com/swe-bench-pro), 2026. Accessed 2026-04-22.
- Zhipu AI \[2026\] Zhipu AI. GLM-5.1 release. [https://docs.z.ai/guides/llm/glm-5](https://docs.z.ai/guides/llm/glm-5), 2026. Accessed 2026-04-22.
- OpenAI \[2026c\] OpenAI. GPT-5.2 release (self-reported). [https://openai.com/index/introducing-gpt-5-2/](https://openai.com/index/introducing-gpt-5-2/), 2026c. Accessed 2026-04-22.
- Windsurf \[2025\] Windsurf. SWE-1.5 launch. [https://windsurf.com/blog/swe-1-5](https://windsurf.com/blog/swe-1-5), 2025. Accessed 2026-04-22.
- Deng et al. \[2025\] Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-Bench Pro: Can AI agents solve long-horizon software engineering tasks? 2025.
- Epoch AI \[2026\] Epoch AI. New record on FrontierMath tier 4 (GPT-5.2 pro). [https://epochai.substack.com/p/new-record-on-frontiermath-tier-4](https://epochai.substack.com/p/new-record-on-frontiermath-tier-4), 2026. Accessed 2026-04-22.
- OpenAI \[2025a\] OpenAI. GPT-5.2 for science and math. [https://openai.com/index/gpt-5-2-for-science-and-math/](https://openai.com/index/gpt-5-2-for-science-and-math/), 2025a. Accessed 2026-04-22.
- Anthropic \[2025\] Anthropic. Claude opus 4.5 announcement. [https://www.anthropic.com/news/claude-opus-4-5](https://www.anthropic.com/news/claude-opus-4-5), 2025. Accessed 2026-04-22.
- Burnham \[2025\] Greg Burnham. Less than 70% of FrontierMath is within reach for today’s models. Epoch AI, Gradient Updates, October 2025. URL [https://epoch.ai/gradient-updates/less-than-70-percent-of-frontiermath-is-within-reach-for-todays-models](https://epoch.ai/gradient-updates/less-than-70-percent-of-frontiermath-is-within-reach-for-todays-models).
- Epoch AI \[2025a\] Epoch AI. X post: Gemini 3 on FrontierMath. [https://x.com/EpochAIResearch/status/1991945942174761050](https://x.com/EpochAIResearch/status/1991945942174761050), 2025a. Accessed 2026-04-22.
- Epoch AI \[2025b\] Epoch AI. FrontierMath benchmarks page (GPT-5.1 era). [https://epoch.ai/benchmarks/frontiermath](https://epoch.ai/benchmarks/frontiermath), 2025b. Accessed 2026-04-22.
- Epoch AI \[2025c\] Epoch AI. X post: Opus 4.5 on FrontierMath. [https://x.com/EpochAIResearch/status/1993431031765250119](https://x.com/EpochAIResearch/status/1993431031765250119), 2025c. Accessed 2026-04-22.
- Epoch AI \[2025d\] Epoch AI. Grok 4 math evaluation. [https://epoch.ai/blog/grok-4-math](https://epoch.ai/blog/grok-4-math), 2025d. Accessed 2026-04-22.
- Epoch AI \[2025e\] Epoch AI. FrontierMath tier 4 benchmark page. [https://epoch.ai/benchmarks/frontiermath-tier-4](https://epoch.ai/benchmarks/frontiermath-tier-4), 2025e. Accessed 2026-04-22.
- Epoch AI \[2025f\] Epoch AI. X post: o4-mini FrontierMath evaluation. [https://x.com/EpochAIResearch/status/1913379478778134941](https://x.com/EpochAIResearch/status/1913379478778134941), 2025f. Accessed 2026-04-22.
- Epoch AI \[2025g\] Epoch AI. X post: o3-mini FrontierMath evaluation. [https://x.com/EpochAIResearch/status/1898149924556226908](https://x.com/EpochAIResearch/status/1898149924556226908), 2025g. Accessed 2026-04-22.
- Glazer et al. \[2024\] Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, Shreepranav Varma Enugandla, and Mark Wildon. FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI. 2024.
- OpenAI \[2026d\] OpenAI. GPT-5.4 thinking system card. [https://deploymentsafety.openai.com/gpt-5-4-thinking](https://deploymentsafety.openai.com/gpt-5-4-thinking), 2026d. Accessed 2026-04-22.
- OpenAI \[2025b\] OpenAI. GPT-5.2 system card. [https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf](https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf), 2025b. Accessed 2026-04-22.
- OpenAI \[2025c\] OpenAI. GPT-5 system card. [https://cdn.openai.com/gpt-5-system-card.pdf](https://cdn.openai.com/gpt-5-system-card.pdf), 2025c. Accessed 2026-04-22.
- Arora et al. \[2025\] Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. HealthBench: Evaluating large language models towards improved human health. 2025. OpenAI.
- Scale AI \[2025\] Scale AI. SEAL HLE text-only leaderboard. [https://scale.com/leaderboard/humanitys_last_exam_text_only](https://scale.com/leaderboard/humanitys_last_exam_text_only), 2025. Accessed 2026-04-22.
- Scale AI \[2026b\] Scale AI. SEAL HLE leaderboard. [https://labs.scale.com/leaderboard/humanitys_last_exam](https://labs.scale.com/leaderboard/humanitys_last_exam), 2026b. Accessed 2026-04-22.
- DeepMind \[2025a\] Google DeepMind. Gemini 3 launch. [https://blog.google/products-and-platforms/products/gemini/gemini-3/](https://blog.google/products-and-platforms/products/gemini/gemini-3/), 2025a. Accessed 2026-04-22.
- DeepMind \[2025b\] Google DeepMind. Gemini 3 deep think. [https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/](https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/), 2025b. Accessed 2026-04-22.
- Moonshot AI \[2025\] Moonshot AI. Kimi k2 thinking release. [https://moonshotai.github.io/Kimi-K2/thinking.html](https://moonshotai.github.io/Kimi-K2/thinking.html), 2025. Accessed 2026-04-22.
- OpenAI \[2025d\] OpenAI. GPT-5 announcement. [https://openai.com/index/introducing-gpt-5/](https://openai.com/index/introducing-gpt-5/), 2025d. Accessed 2026-04-22.
- xAI \[2025\] xAI. Grok 4 launch (grok 4 and grok 4 heavy). [https://x.ai/news/grok-4](https://x.ai/news/grok-4), 2025. Accessed 2026-04-22.
- OpenAI \[2025e\] OpenAI. Deep research announcement. [https://openai.com/index/introducing-deep-research/](https://openai.com/index/introducing-deep-research/), 2025e. Accessed 2026-04-22.
- Phan et al. \[2025\] Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, et al. Humanity’s last exam. 2025. arXiv:2501.14249.
- UK AI Security Institute \[2026a\] UK AI Security Institute. Evidence for inference scaling in AI cyber tasks. [https://www.aisi.gov.uk/blog/evidence-for-inference-scaling-in-ai-cyber-tasks-increased-evaluation-budgets-reveal-higher-success-rates](https://www.aisi.gov.uk/blog/evidence-for-inference-scaling-in-ai-cyber-tasks-increased-evaluation-budgets-reveal-higher-success-rates), 2026a. Accessed 2026-04-22.
- UK AI Security Institute \[2026b\] UK AI Security Institute. Claude mythos preview cyber capabilities. [https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities](https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities), 2026b. Accessed 2026-04-22.
- Folkerts et al. \[2026\] Linus Folkerts, Will Payne, Simon Inman, Philippos Giavridis, Joe Skinner, Sam Deverett, James Aung, Ekin Zorer, Michael Schmatz, Mahmoud Ghanem, John Wilkinson, Alan Steer, Vy Hong, and Jessica Wang. Measuring AI agents’ progress on multi-step cyber attack scenarios. 2026. AI Security Institute and Irregular.
- UK AI Security Institute \[2024a\] UK AI Security Institute. Pre-deployment evaluation of openai o1. [https://www.aisi.gov.uk/blog/pre-deployment-evaluation-of-openais-o1-model](https://www.aisi.gov.uk/blog/pre-deployment-evaluation-of-openais-o1-model), 2024a. Accessed 2026-04-22.
- UK AI Security Institute \[2024b\] UK AI Security Institute. Pre-deployment evaluation of upgraded claude 3.5 sonnet. [https://www.aisi.gov.uk/blog/pre-deployment-evaluation-of-anthropics-upgraded-claude-3-5-sonnet](https://www.aisi.gov.uk/blog/pre-deployment-evaluation-of-anthropics-upgraded-claude-3-5-sonnet), 2024b. Accessed 2026-04-22.

#### A.1.3 Repetition guard

The repetition guard terminates a trajectory early when the agent appears stuck in an unproductive submission loop. Whenever the agent issues a submission, the guard identifies the maximal suffix of consecutive agent turns that contain *only* submit tool calls, with no intervening `bash` or `python` invocations. If this suffix contains three or more submissions, the guard sends the full set of submissions to an LLM judge, GPT-4o-mini at temperature 0, and asks whether they are semantically equivalent. On a *SAME* verdict, the trajectory is marked as stopped by the repetition guard and terminated. On *DIFFERENT*, execution continues normally. For FrontierMath specifically, interleaved non-submit tool calls are permitted during the backward scan. In other words, the guard considers all submissions in the trajectory rather than only the current submit-only suffix. This avoids penalising long sequences of exploratory computation between submissions. The guard is applied live during evaluation for most runs and retroactively, with a cached judge, during analysis for any runs that were missed.
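The backward scan described above can be sketched as follows. This is a minimal illustration, not the authors' harness: the `Turn` record, the function names, and the `judge` callable (standing in for the GPT-4o-mini call) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Turn:
    """One agent turn (hypothetical record): tool names called, plus any submitted answer."""
    tool_calls: Sequence[str]
    submission: Optional[str] = None


def trailing_submissions(turns: Sequence[Turn], allow_interleaved: bool = False) -> List[str]:
    """Scan backward from the latest turn and collect submissions.

    Default: stop at the first turn that is not submit-only, which yields the
    maximal submit-only suffix. With allow_interleaved=True (the FrontierMath
    variant), non-submit turns are skipped, so every submission is collected.
    """
    collected: List[str] = []
    for turn in reversed(turns):
        submit_only = bool(turn.tool_calls) and all(c == "submit" for c in turn.tool_calls)
        if turn.submission is not None and (submit_only or allow_interleaved):
            collected.append(turn.submission)
        elif not allow_interleaved:
            break
    collected.reverse()
    return collected


def should_stop(turns: Sequence[Turn],
                judge: Callable[[List[str]], str],
                min_submissions: int = 3,
                allow_interleaved: bool = False) -> bool:
    """Return True when the guard would terminate the trajectory."""
    submissions = trailing_submissions(turns, allow_interleaved)
    if len(submissions) < min_submissions:
        return False
    # The judge is assumed to return the string "SAME" or "DIFFERENT".
    return judge(submissions) == "SAME"


def exact_match_judge(submissions: List[str]) -> str:
    """Toy stand-in for the LLM judge: exact string equality only."""
    return "SAME" if len(set(submissions)) == 1 else "DIFFERENT"
```

A real judge would assess semantic equivalence rather than string equality; the stub only makes the control flow testable.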

#### A.1.4 Experimental conditions: prompts and feedback

Both conditions share a single adaptive *continuation* prompt, appended after every non-terminal submission to invite the agent either to refine its previous answer or to try a substantially different approach:

> *“Consider whether your previous approach is likely correct or whether a different approach would be more promising\. Then either refine your previous answer or try a substantially different approach, and submit when ready\.”*

The per\-submission feedback strings are defined in the main text\. The following protocol details supplement those definitions:

- No feedback. The acknowledgement is identical on correct and incorrect submissions, giving the agent no implicit signal of correctness.
- Oracle score feedback (binary benchmarks: TerminalBench, SWE-Bench Pro, FrontierMath, HLE). On a correct submission no feedback message is sent, and the trajectory simply ends.
- Oracle score feedback (continuous benchmark: HealthBench). The full feedback string is *“Your submission was scored. You achieved XX out of YY possible points.”*, where XX is the net points achieved on the most recent submission and YY is the sum of positive-weighted points for the sample. Trajectories terminate only when the agent achieves every positively weighted criterion. Partial-credit submissions trigger the feedback message followed by the adaptive continuation.
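The HealthBench feedback rule can be illustrated with a short sketch. The `Criterion` record and function name are hypothetical, and the sketch assumes that "net points" means the sum of weights over all met criteria (including negatively weighted ones); whether a message is still sent on the terminating submission is not stated in the text, so the function simply returns both values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Criterion:
    """Hypothetical rubric item: signed point weight and whether the submission met it."""
    points: float
    met: bool


def healthbench_feedback(criteria: List[Criterion]) -> Tuple[str, bool]:
    """Return (feedback message, terminate flag) for one scored submission."""
    # XX: net points, assumed to be the signed sum over met criteria.
    achieved = sum(c.points for c in criteria if c.met)
    # YY: sum of positive-weighted points for the sample.
    possible = sum(c.points for c in criteria if c.points > 0)
    # Terminate only when every positively weighted criterion is achieved.
    terminate = all(c.met for c in criteria if c.points > 0)
    message = (
        f"Your submission was scored. You achieved {achieved:g} "
        f"out of {possible:g} possible points."
    )
    return message, terminate


message, terminate = healthbench_feedback(
    [Criterion(5, True), Criterion(3, False), Criterion(-2, True)]
)
print(message)    # Your submission was scored. You achieved 3 out of 8 possible points.
print(terminate)  # False
```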

### A.2 B. Inference-scaling curves: supplementary analyses

#### A.2.1 Curve-level cross-generational summaries

To summarise cross-generational patterns at the curve level, we extract three descriptive quantities from each pooled benchmark–model curve:

- Success at the maximum tested budget is the highest performance reached within our observed range, that is, the curve’s right-edge value in [Figure 1](https://arxiv.org/html/2606.17930#S1.F1)A.
- Onset point is the minimum compute level at which a model first begins to solve tasks, measured as the first total-token budget at which cumulative success exceeds 5%.
- Post-onset slope measures how strongly performance continues to improve after onset. We compute it as the average increase in cumulative success per tenfold increase in total tokens from onset to the point where the curve begins to flatten, as defined in [Section 3.1.2](https://arxiv.org/html/2606.17930#S3.SS1.SSS2). If a curve does not flatten within the tested range, we use the maximum tested budget instead.
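As a rough sketch of the onset and slope definitions above: the function names are hypothetical, the curve is assumed to be given as parallel lists of increasing budgets and cumulative success, and the plateau location (defined in Section 3.1.2, not reproduced here) is passed in by the caller rather than computed.

```python
import math
from typing import Optional, Sequence


def onset_index(success: Sequence[float], threshold: float = 0.05) -> Optional[int]:
    """Index of the first budget at which cumulative success exceeds the threshold."""
    for i, value in enumerate(success):
        if value > threshold:
            return i
    return None


def post_onset_slope(budgets: Sequence[float],
                     success: Sequence[float],
                     plateau_index: Optional[int] = None,
                     threshold: float = 0.05) -> Optional[float]:
    """Average gain in cumulative success per tenfold increase in total tokens.

    Measured from onset to `plateau_index`; if no plateau is supplied, the
    maximum tested budget (last point) is used instead.
    """
    start = onset_index(success, threshold)
    if start is None:
        return None  # the model never crosses the onset threshold
    end = len(budgets) - 1 if plateau_index is None else plateau_index
    if end <= start:
        return None  # no post-onset range to measure
    decades = math.log10(budgets[end]) - math.log10(budgets[start])
    return (success[end] - success[start]) / decades


# Illustrative curve only (not data from the paper).
budgets = [1e4, 1e5, 1e6, 1e7]
success = [0.00, 0.10, 0.30, 0.50]
print(budgets[onset_index(success)])       # onset budget: 100000.0
print(post_onset_slope(budgets, success))  # roughly 0.2 per tenfold increase
```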

##### Success at the maximum tested budget\.

Across benchmarks, the clearest and most consistent pattern is that later-release models reach higher performance at the maximum tested budget ([Table 9](https://arxiv.org/html/2606.17930#A1.T9)). The Kendall rank correlation between model release date and success at the cap is strongly positive on Cyber CTFs ($\tau = 0.944$), TerminalBench ($\tau = 0.867$), FrontierMath ($\tau = 0.867$), and HLE ($\tau = 0.733$), and remains positive though weaker on SWE-Bench Pro ($\tau = 0.414$). HealthBench is the main exception, showing little relationship between release date and success at the cap ($\tau = 0.067$). On most benchmarks in our study, newer models therefore reach a higher protocol-conditional performance frontier when given large inference-time budgets.
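For readers unfamiliar with the statistic, the following sketch computes Kendall's $\tau$ in its simplest form (tau-a, which ignores ties). The text does not state which variant or tie handling the authors use, so this is an assumption, and the input values are invented for illustration.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Six hypothetical models ordered by release date, with one out-of-order pair.
release_rank = [1, 2, 3, 4, 5, 6]
success_at_cap = [0.10, 0.25, 0.20, 0.40, 0.55, 0.70]
print(kendall_tau_a(release_rank, success_at_cap))  # 13/15, about 0.867
```

With six models there are 15 pairs, so a single discordant pair gives $(14 - 1)/15 \approx 0.867$; the example only shows the scale of the statistic and does not claim to reproduce any reported value.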

##### Onset\.

Later-release models also often begin succeeding earlier in the budget range. On every benchmark, onset is negatively correlated with release date, with $\tau$ ranging from $-0.333$ to $-0.467$. Newer models often need fewer total tokens before they start to solve tasks at all.

##### Post\-onset slope\.

Whether newer models also use *additional* compute better after that point is more benchmark-dependent. Post-onset slope increases clearly with model recency on Cyber CTFs ($\tau = 0.944$), FrontierMath ($\tau = 0.600$), and TerminalBench ($\tau = 0.600$), and remains weak on HLE ($\tau = 0.200$), SWE-Bench Pro ($\tau = 0.200$), and HealthBench ($\tau = -0.067$). The extent to which newer models convert extra inference-time compute into further gains therefore depends on the task setting.

Table 9: Kendall rank correlations between model release date and three descriptive summaries of each pooled inference-scaling curve. Success at cap is cumulative success at the maximum tested budget. Onset point is the first total-token budget at which cumulative success exceeds 5%. Post-onset slope is the average gain in cumulative success per tenfold increase in tokens between onset and plateau onset, or the tested cap if no plateau is observed.

### A.3 C. Task-level decomposition: regression methods and sensitivity analyses

This appendix gives the regression specifications and a sensitivity check behind the task-level decomposition of generational gains reported in [Section 3.2](https://arxiv.org/html/2606.17930#S3.SS2). The decomposition splits each (model, task) outcome into three components (reach, efficiency, and reliability) and estimates each by regressing the relevant outcome on model generation, task difficulty, and their interaction; the reach and reliability regressions use the leave-one-model-out difficulty measure. We first define the per-cell outcome measures and predictors, then give each component’s regression specification and the coefficient-interpretation conventions, and finally report a balanced-panel sensitivity check for the efficiency analysis ([Section A.3.3](https://arxiv.org/html/2606.17930#A1.SS3.SSS3)).

#### A.3.1 Outcome measures and predictors

Let $m$ index models and $t$ index tasks.

##### Per\-cell outcome measures\.

For each \(benchmark, model, task\) cell, we aggregate over all of that model’s independent trajectories to construct three quantities\.

- Solve rate $s_{m,t}$, the fraction of trajectories containing at least one solving submission. For the five binary benchmarks, a submission is counted as solving if it receives score 1. For HealthBench, which is continuously scored, submissions are first binarised at 0.38, the global median of pooled per-trajectory maximum scores across all HealthBench trajectories.
- Median tokens-to-solve $\kappa_{m,t}$, the median output token count among solving trajectories only. Within each (benchmark, model) cell, solving trajectories with $|z| > 3$ on $\log_{10}(\text{tokens-to-solve})$ are dropped before taking the median to reduce the influence of extreme right-tail outliers.
- Unlock indicator $u_{m,t}$, an indicator equal to 1 if at least two of the independent trajectories solve the task and 0 otherwise: $u_{m,t} = \mathbb{1}\{n_{\mathrm{solves}} \geq 2\}$. This threshold is intended to distinguish tasks a model can solve reproducibly from those passed only on a single lucky trajectory.
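The three per-cell quantities can be sketched for a single (benchmark, model) pair as below. The input layout and function name are hypothetical, and the choice of population standard deviation for the $z$-scores is an assumption, since the text does not specify it.

```python
import math
import statistics
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cell_outcomes(trajectories: Iterable[Tuple[str, bool, int]]) -> Dict[str, dict]:
    """Aggregate trajectories for ONE (benchmark, model) into per-task outcomes.

    Each trajectory is (task_id, solved, output_tokens).
    """
    by_task = defaultdict(list)
    for task, solved, tokens in trajectories:
        by_task[task].append((solved, tokens))

    # z-scores are computed on log10(tokens-to-solve) over all solving
    # trajectories of this model (assumption: population standard deviation).
    logs = [math.log10(tok) for runs in by_task.values() for ok, tok in runs if ok]
    mean = statistics.fmean(logs) if logs else 0.0
    spread = statistics.pstdev(logs) if len(logs) > 1 else 0.0

    outcomes = {}
    for task, runs in by_task.items():
        solving = [tok for ok, tok in runs if ok]
        kept = [tok for tok in solving
                if spread == 0.0 or abs((math.log10(tok) - mean) / spread) <= 3]
        outcomes[task] = {
            "solve_rate": len(solving) / len(runs),                      # s_{m,t}
            "median_tokens": statistics.median(kept) if kept else None,  # kappa_{m,t}
            "unlocked": int(len(solving) >= 2),                          # u_{m,t}
        }
    return outcomes


# Illustrative trajectories only.
example = [("t1", True, 1000), ("t1", True, 3000), ("t1", False, 5000),
           ("t2", True, 2000), ("t2", False, 100), ("t2", False, 100)]
print(cell_outcomes(example))
```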

##### Task difficulty\.

We define task difficulty using the cross\-model pass rate

$$\pi_{t} = \mathrm{median}_{m}\, s_{m,t},$$

computed over all models evaluated on task $t$, so that higher values denote easier tasks. For the reach and reliability regressions, we use a leave-one-model-out variant

$$\pi_{-m,t} = \mathrm{median}_{m' \neq m}\, s_{m',t},$$

which excludes the focal model’s own solve rate from the predictor. These difficulty measures lie in $[0,1]$.
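A minimal sketch of the two difficulty measures, assuming solve rates are held in a nested dictionary keyed by model and then task (a hypothetical layout chosen for the example):

```python
import statistics
from typing import Dict, Tuple


def task_difficulty(solve_rate: Dict[str, Dict[str, float]]):
    """Return (pi, pi_loo) from solve_rate[model][task] = s_{m,t}.

    pi[task] is the median solve rate across all models evaluated on the task;
    pi_loo[(model, task)] is the same median with the focal model left out.
    """
    tasks = {task for rates in solve_rate.values() for task in rates}
    pi: Dict[str, float] = {}
    pi_loo: Dict[Tuple[str, str], float] = {}
    for task in tasks:
        per_model = {m: rates[task] for m, rates in solve_rate.items() if task in rates}
        pi[task] = statistics.median(per_model.values())
        for model in per_model:
            others = [s for other, s in per_model.items() if other != model]
            if others:  # undefined when only one model was evaluated on the task
                pi_loo[(model, task)] = statistics.median(others)
    return pi, pi_loo


# Illustrative solve rates for three hypothetical models on one task.
pi, pi_loo = task_difficulty({"A": {"t": 0.0}, "B": {"t": 0.5}, "C": {"t": 1.0}})
print(pi["t"])            # 0.5
print(pi_loo[("A", "t")])  # 0.75
```

Leaving out the focal model matters most when few models are evaluated: here model A's own failure would otherwise pull the predictor toward the outcome it is meant to explain.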

##### Generation coding\.

For the regression analyses, generation is defined from model release date. Within each benchmark, release dates are min–max normalised to $[0,1]$ and then mean-centred:

$$g_{m} = \mathrm{center}\!\left(\mathrm{minmax}(\text{release date}_{m})\right).$$

Difficulty predictors are likewise mean-centred within benchmark before fitting, so that main effects are interpretable at average benchmark difficulty. The reach regression uses these same centred predictors.
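The generation coding amounts to two small steps, sketched below. The function name is hypothetical, and the sketch assumes the mean is taken over models within a benchmark (the text does not say whether centring is over models or over regression rows).

```python
from datetime import date
from typing import Dict


def generation_codes(release_dates: Dict[str, date]) -> Dict[str, float]:
    """Min-max normalise release dates to [0, 1], then mean-centre across models."""
    ordinals = {model: d.toordinal() for model, d in release_dates.items()}
    earliest, latest = min(ordinals.values()), max(ordinals.values())
    if earliest == latest:
        raise ValueError("need at least two distinct release dates")
    scaled = {m: (o - earliest) / (latest - earliest) for m, o in ordinals.items()}
    mean = sum(scaled.values()) / len(scaled)
    return {m: value - mean for m, value in scaled.items()}


# Hypothetical models and dates, for illustration only.
codes = generation_codes({
    "model_old": date(2025, 1, 1),
    "model_mid": date(2025, 1, 6),
    "model_new": date(2025, 1, 11),
})
print(codes)  # {'model_old': -0.5, 'model_mid': 0.0, 'model_new': 0.5}
```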

#### A.3.2 Reach, efficiency, and reliability regressions

##### Reach analysis.

For each benchmark, we fit a linear probability model regressing $u_{m,t}$ on model generation, leave-one-model-out task difficulty, and their interaction, by ordinary least squares (OLS) with cluster-robust standard errors at the task level. In Wilkinson notation,

`unlocked ~ generation * task_pass_rate`,

or equivalently

$$u_{m,t} = \alpha + \beta_{\mathrm{gen}}\, g_{m} + \beta_{\mathrm{pass}}\, \pi_{-m,t} + \beta_{\mathrm{int}}\, g_{m}\, \pi_{-m,t} + \varepsilon_{m,t},$$

fit with `statsmodels.formula.api.ols`. As in the reliability analysis, the leave-one-model-out difficulty predictor $\pi_{-m,t}$ excludes the focal model’s own solves, which determine $u_{m,t}$. [Figure 2](https://arxiv.org/html/2606.17930#S3.F2)A complements this with the per-benchmark difficulty frontier: the running maximum task difficulty unlocked as a function of the first-unlock model’s release date.
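One way such a fit might look in `statsmodels` is sketched below, on synthetic data. The column names follow the Wilkinson formula above; the data-generating process, the helper name `fit_reach`, and the use of `pd.factorize` to build integer cluster labels are assumptions made for the example, not details from the paper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def fit_reach(cells: pd.DataFrame):
    """OLS linear probability model with standard errors clustered by task.

    `cells` needs one row per (model, task) with columns
    unlocked (0/1), generation, task_pass_rate, and task.
    """
    task_codes = pd.factorize(cells["task"])[0]  # integer cluster labels
    model = smf.ols("unlocked ~ generation * task_pass_rate", data=cells)
    return model.fit(cov_type="cluster", cov_kwds={"groups": task_codes})


# Synthetic data for illustration only (not from the paper).
rng = np.random.default_rng(0)
generations = np.linspace(-0.5, 0.5, 6)  # six hypothetical models, centred
rows = []
for task in range(50):
    ease = rng.uniform(-0.5, 0.5)        # centred difficulty predictor
    for g in generations:
        p_unlock = float(np.clip(0.5 + 0.4 * g + 0.5 * ease, 0.0, 1.0))
        rows.append({
            "task": f"task_{task}",
            "generation": float(g),
            "task_pass_rate": ease,
            "unlocked": int(rng.random() < p_unlock),
        })
cells = pd.DataFrame(rows)

reach = fit_reach(cells)
print(reach.params)  # Intercept, generation, task_pass_rate, interaction
print(reach.bse)     # cluster-robust standard errors
```

Because both predictors are mean-centred, the `generation` coefficient reads as the generational effect at average task difficulty, and the interaction term indicates whether that effect differs between easier and harder tasks.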

##### Efficiency analysis\.

Efficiency asks whether later generations solve already\-unlocked tasks with fewer tokens\. This analysis is therefore restricted to unlocked \(model, task\) cells, whereum,t=1u\_\{m,t\}=1andκm,t\\kappa\_\{m,t\}is defined\. For each benchmark, we regress log median tokens\-to\-solve on model generation, task difficulty, and their interaction, with a task\-level random intercept\. In Wilkinson notation,111In Wilkinson notation,a \* bexpands toa \+ b \+ a:b, that is, both main effects plus their interaction\. The trailing\(1 \| task\)denotes a per\-task random intercept\.

log\_tokens∼generation∗task\_pass\_rate\+\(1∣task\),\{\\hbox\{\\pagecolor\{inlinecodebg\}\{\\color\[rgb\]\{0\.70703125,0\.13671875,0\.09375\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.70703125,0\.13671875,0\.09375\}log\\\_tokens\}\}\}\\sim\{\\hbox\{\\pagecolor\{inlinecodebg\}\{\\color\[rgb\]\{0\.70703125,0\.13671875,0\.09375\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.70703125,0\.13671875,0\.09375\}generation\}\}\}\*\{\\hbox\{\\pagecolor\{inlinecodebg\}\{\\color\[rgb\]\{0\.70703125,0\.13671875,0\.09375\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.70703125,0\.13671875,0\.09375\}task\\\_pass\\\_rate\}\}\}\\;\+\\;\(1\\mid\{\\hbox\{\\pagecolor\{inlinecodebg\}\{\\color\[rgb\]\{0\.70703125,0\.13671875,0\.09375\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.70703125,0\.13671875,0\.09375\}task\}\}\}\),or equivalently

$$\log_{10}\kappa_{m,t}=\alpha+\beta_{\mathrm{gen}}g_{m}+\beta_{\mathrm{pass}}\pi_{t}+\beta_{\mathrm{int}}\,g_{m}\pi_{t}+u_{t}+\varepsilon_{m,t},\qquad u_{t}\sim\mathcal{N}(0,\sigma_{u}^{2}),$$

fit with `statsmodels.formula.api.mixedlm` grouped on task. The random intercept absorbs baseline differences in token cost across tasks, so the fixed effects capture within-task generational differences in tokens-to-solve.

##### Reliability analysis\.

Reliability asks whether later generations solve already-unlocked tasks more consistently across repeated attempts. We therefore analyse the same unlocked (model, task) cells, with solve rate $s_{m,t}$ as the outcome. For each benchmark, we regress per-task solve rate on model generation, leave-one-model-out task difficulty, and their interaction by OLS with cluster-robust standard errors at the task level. In Wilkinson notation,

`solve_rate ~ generation * task_pass_rate`,

or equivalently

$$s_{m,t}=\alpha+\beta_{\mathrm{gen}}g_{m}+\beta_{\mathrm{pass}}\pi_{-m,t}+\beta_{\mathrm{int}}\,g_{m}\pi_{-m,t}+\varepsilon_{m,t}.$$

The leave-one-model-out difficulty predictor $\pi_{-m,t}$ ensures that the focal model's own solve rate does not appear on both sides of the regression.

##### Coefficient interpretation\.

Because higher $\pi_{t}$ and $\pi_{-m,t}$ denote easier tasks, the interaction term $\beta_{\mathrm{int}}$ is interpreted as follows. In the efficiency regression, $\beta_{\mathrm{int}}<0$ indicates that newer models' token savings are larger on harder tasks, while $\beta_{\mathrm{int}}>0$ indicates larger savings on easier tasks. In the reliability regression, $\beta_{\mathrm{int}}<0$ indicates that solve-rate gains are concentrated on harder tasks, while $\beta_{\mathrm{int}}>0$ indicates gains concentrated on easier tasks. The reach regression follows the same convention: $\beta_{\mathrm{int}}<0$ indicates that the generational increase in unlock probability is larger on harder tasks, while $\beta_{\mathrm{int}}>0$ indicates a larger increase on easier tasks.

#### A.3.3 Balanced-panel sensitivity for the efficiency analysis

The efficiency analysis in the main text uses all unlocked *(model, task)* cells. A row enters if that model unlocks that task and has a defined median tokens-to-solve value. This design estimates token use *conditional on reach*, but it does not require that all models solve the same tasks. A potential concern is therefore compositional. Newer models unlock harder tasks, and those tasks may intrinsically require more tokens, which could attenuate measured efficiency gains relative to an analysis restricted to a common solved-task set.

To assess this, we repeated the efficiency regression on a balanced panel within each benchmark, restricting to tasks unlocked by all models evaluated on that benchmark. Coverage in the balanced panel varied substantially across benchmarks: 28/71 tasks for Cyber CTFs (39.4%), 50/82 for TerminalBench (61.0%), 71/100 for SWE-Bench Pro (71.0%), 22/100 for HealthBench (22.0%), and 32/100 for HLE (32.0%). FrontierMath has only 4/12 tasks unlocked by all models, below our five-task minimum for a refit, so it admits no balanced-panel estimate. On Cyber CTFs, every task unlocked by all models is solved at the maximum cross-model rate, so task difficulty is constant on the balanced set and the refit identifies only the generation main effect ($\beta_{\mathrm{gen}}$), not the generation-by-difficulty interaction.
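The balanced-panel restriction can be sketched as a set intersection with a minimum-size check. This is an illustrative sketch only: the function name, the data layout, and the example task IDs are hypothetical, and only the five-task minimum comes from the text.

```python
def balanced_tasks(unlocked, models, min_tasks=5):
    """Return the sorted tasks unlocked by every model in `models`.

    `unlocked` maps model name -> set of task IDs that model unlocks.
    Returns None when fewer than `min_tasks` tasks are shared, mirroring
    the five-task minimum for a refit described above.
    """
    task_sets = [set(unlocked.get(model, ())) for model in models]
    common = set.intersection(*task_sets) if task_sets else set()
    return sorted(common) if len(common) >= min_tasks else None


# Hypothetical unlock sets, not data from the paper.
unlocked = {
    "older": {"t1", "t2", "t3", "t4", "t5", "t6"},
    "newer": {"t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"},
    "weak": {"t1", "t2", "t3", "t4"},
}
panel = balanced_tasks(unlocked, ["older", "newer"])
too_small = balanced_tasks(unlocked, ["older", "newer", "weak"])
```

Here `panel` holds the six shared tasks, while `too_small` is `None` because only four tasks are unlocked by all three models.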

Across benchmarks, the main effect of model generation on efficiency ($\beta_{\mathrm{gen}}$) was qualitatively robust on Cyber CTFs, TerminalBench, SWE-Bench Pro, and HealthBench: later generations continued to use fewer tokens on the shared solved-task set, with effect sizes attenuated, strengthened, or essentially unchanged across benchmarks but never reversed ([Table 10](https://arxiv.org/html/2606.17930#A1.T10)); HLE remained non-significant in both panels. The generation-by-difficulty interaction term ($\beta_{\mathrm{int}}$) was substantially less stable: it changed sign on TerminalBench, strengthened on HLE, and moved toward zero on SWE-Bench Pro (on Cyber CTFs the balanced task set has no difficulty variation, so the interaction is not identified). Fine-grained conclusions about whether efficiency gains concentrate on easier or harder tasks are therefore sensitive to panel composition.

These refits therefore support the benchmark\-level conclusion that later generations often use fewer tokens on tasks they can solve, while indicating that the main\-text efficiency estimates should be read as conditional\-on\-reach effects rather than pure within\-task efficiency improvements on a fixed common task set\. The interaction estimates by task difficulty appear less robust and are not a major basis for our conclusions\.

Table 10: Balanced-panel sensitivity check for the efficiency analysis. Main-panel regressions use all unlocked *(model, task)* cells. Balanced-panel regressions restrict to tasks unlocked by all models within a benchmark. Estimates are coefficients on model generation ($\beta_{\mathrm{gen}}$) and the generation-by-difficulty interaction ($\beta_{\mathrm{int}}$) from the efficiency regression described in the main text.

### A.4 D. Serial scaling: supplementary methods and results

#### A.4.1 Cumulative success over submissions

For a trajectory with submissions indexed by $k=1,2,\dots$, let $C_{i}(k)$ indicate whether at least one of the first $k$ submissions is correct:

$$C_{i}(k):=\mathbf{1}\!\left(\exists j\leq k:\text{submission }j\text{ is correct}\right).\tag{4}$$

The cumulative success curve for a set of trajectories is the mean of $C_{i}(k)$ over the trajectories in that set. In [Figure 3](https://arxiv.org/html/2606.17930#S3.F3)A, these curves are first averaged within model and then pooled across models with equal weight within each benchmark–condition cell.

We summarise repeated\-submission gains using three descriptive quantities:

- Iteration gain: cumulative success at the highest observed submission index minus cumulative success at the first submission
- Uplift multiplier: cumulative success at the highest observed submission index divided by cumulative success at the first submission
- $k_{90}$: the smallest submission index at which the cumulative success curve reaches 90% of its total iteration gain
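The cumulative success curve and the three summary quantities can be sketched as follows. This is a minimal illustration under stated assumptions: each trajectory is represented as a list of booleans (one per submission), a trajectory with fewer than $k$ submissions is treated as making no further attempts, and the function names and example data are hypothetical.

```python
def cumulative_success(trajectories, max_k):
    """Mean of C_i(k) over trajectories, for k = 1..max_k."""
    curve = []
    for k in range(1, max_k + 1):
        # C_i(k) = 1 if any of the first k submissions is correct.
        hits = [any(traj[:k]) for traj in trajectories]
        curve.append(sum(hits) / len(hits))
    return curve


def summarise(curve):
    """Return (iteration gain, uplift multiplier, k90) for a curve."""
    first, last = curve[0], curve[-1]
    gain = last - first
    uplift = last / first if first > 0 else float("inf")
    target = first + 0.9 * gain
    # Small tolerance guards against floating-point rounding at the threshold.
    k90 = next(k for k, s in enumerate(curve, start=1) if s >= target - 1e-12)
    return gain, uplift, k90


# Hypothetical trajectories: True marks a correct submission.
trajectories = [
    [True],
    [False, True],
    [False, False, False, True],
    [False, False, False, False],
]
curve = cumulative_success(trajectories, max_k=4)  # [0.25, 0.5, 0.5, 0.75]
gain, uplift, k90 = summarise(curve)               # 0.5, 3.0, 4
```

In this toy case the curve only reaches 90% of its total gain at the fourth submission, so $k_{90}=4$ even though half of the gain arrives by the second submission.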

For plotting, the submission axis in [Figure 3](https://arxiv.org/html/2606.17930#S3.F3)A is truncated at the smallest submission index for which the per-step gain falls below 0.1 percentage points. This index is computed separately for each condition within a benchmark, the maximum across conditions is taken, and a two-step tail is added for visual context.

#### A.4.2 Iteration uplift and model generation

For each benchmark–model–condition cell, we define *iteration uplift* as

$$\Delta=S_{\max}-S_{1},\tag{5}$$

where $S_{1}$ is cumulative success at the first submission and $S_{\max}$ is the highest cumulative success level reached over the observed submission range. We then compute Kendall rank correlations (Kendall's $\tau$) between $\Delta$ and model release date separately for each benchmark and condition. The per-condition trend lines in [Figure 3](https://arxiv.org/html/2606.17930#S3.F3)B are Theil–Sen (median-slope) fits, whose slope sign agrees with $\tau$ by construction. These correlations are descriptive only, since each benchmark–condition cell contains six models.
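For readers unfamiliar with these two statistics, a minimal pure-Python sketch follows. It assumes the tau-a variant of Kendall's $\tau$ (identical to tau-b when there are no ties; the text does not say which variant was used), encodes release dates as ordinal positions, and uses hypothetical uplift values for six models.

```python
from itertools import combinations
from statistics import median


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)  # +1 concordant, -1 discordant
    return score / len(pairs)


def theil_sen_slope(x, y):
    """Median of all pairwise slopes (pairs with equal x are skipped)."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    return median(slopes)


# Hypothetical values: release order of six models and their iteration uplift.
release_order = [0, 1, 2, 3, 4, 5]
uplift = [0.10, 0.12, 0.11, 0.20, 0.18, 0.25]

tau = kendall_tau(release_order, uplift)      # 13 concordant, 2 discordant of 15
slope = theil_sen_slope(release_order, uplift)
```

With only six points there are 15 pairs, so $\tau$ moves in coarse steps of $2/15$, which is one reason such correlations are best read descriptively.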

Table 11: Kendall's $\tau$ rank correlations between model release date and iteration uplift, computed separately for each benchmark and feedback condition. Iteration uplift is the increase in cumulative success from the first submission to the highest observed cumulative level. These correlations are descriptive only, since each benchmark–condition cell contains six models.

#### A.4.3 Regression models for eventual success and submissions-to-success

For each benchmark separately, we fit regressions with task fixed effects and cluster-robust standard errors at the task level. The feedback indicator $O_{i}$ takes value 1 under oracle score feedback and 0 under no feedback. Model generation $G_{i}$ is encoded as a normalised release-rank variable within benchmark, increasing from the oldest to the newest model.

##### Model A: eventual success\.

We model whether trajectory $i$ on task $s(i)$ ever produces a correct submission:

$$\Pr(Y_{i}=1)=\mathrm{logit}^{-1}\!\left(\alpha_{s(i)}+\beta_{O}O_{i}+\beta_{G}G_{i}+\beta_{G\times O}(G_{i}O_{i})\right),\tag{6}$$

where $Y_{i}$ is an indicator for eventual success and $\alpha_{s(i)}$ is a task fixed effect.

Table 12: Model A: fixed-effects logistic regressions of eventual success. The response is whether a trajectory ever produces a correct submission. Predictors are oracle feedback ($O$), normalised model generation ($G$), and their interaction.

##### Model B: submissions to first correct answer.

Conditioning on trajectories that eventually succeed, we model the submission index of the first correct answer:

$$\mathbb{E}[K_{i}]=g^{-1}\!\left(\alpha_{s(i)}+\beta_{O}O_{i}+\beta_{G}G_{i}+\beta_{G\times O}(G_{i}O_{i})\right),\tag{7}$$

where $K_{i}$ is the submission index of the first correct answer, $\alpha_{s(i)}$ is a task fixed effect, and $g^{-1}$ is the inverse link function. We use a negative-binomial model for HLE and Poisson models for the remaining benchmarks, based on Pearson dispersion.

Before fitting Model B, we remove successful trajectories whose total number of submissions is more than 3 standard deviations from the within\-benchmark mean, excluding 145 of 13,169 successful trajectories and leaving 13,024 for analysis\.
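The outlier rule can be sketched as a simple filter on submission counts within one benchmark. This is an illustrative sketch: the text does not say whether the population or sample standard deviation was used, so the population form is assumed here, and the function name and counts are hypothetical.

```python
from statistics import mean, pstdev


def drop_outliers(submission_counts, n_sd=3.0):
    """Keep counts within `n_sd` standard deviations of the mean.

    Assumption: population standard deviation (pstdev); the paper may
    have used the sample form instead.
    """
    mu = mean(submission_counts)
    sd = pstdev(submission_counts)
    return [c for c in submission_counts if abs(c - mu) <= n_sd * sd]


# Hypothetical within-benchmark submission counts for successful trajectories.
counts = [2] * 19 + [50]
kept = drop_outliers(counts)
```

In this toy example the single 50-submission trajectory lies more than three standard deviations above the mean of 4.4 and is removed, leaving 19 trajectories.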

Table 13:Model B: regressions of submission index at first correct answer, conditional on eventual success\. Negative coefficients indicate fewer submissions to the first correct answer\.

### A.5 E. Parallel scaling: supplementary methods and validation

#### A.5.1 Pass@$k$ estimator and pooling

For total budget $B$, we estimate pass@$k$ by splitting $B$ across $k$ independent trajectories of $B/k$ tokens each and counting a task as solved if any trajectory's first correct submission falls within budget. pass@$k$ is computed task-wise using the unbiased estimator of Chen et al. [[2021](https://arxiv.org/html/2606.17930#bib.bib45)]. For the main analysis, we pool self-guided and oracle trajectories because the first submission is pre-feedback, yielding up to ten trajectories per (benchmark, model, task). [Figure 4](https://arxiv.org/html/2606.17930#S3.F4) reports benchmark-level means across models with equal weighting. We use pass@$k$ rather than consensus-based aggregation because our aim is to measure the value of independent restarts under matched total compute, and several benchmarks involve stateful trajectories or structured outputs for which a benchmark-agnostic consensus rule is not well defined.
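The unbiased pass@$k$ estimator of Chen et al. has the closed form $1-\binom{n-c}{k}/\binom{n}{k}$ for $n$ sampled trajectories of which $c$ succeed. The sketch below applies it to a single task; treating $c$ as the number of trajectories whose first correct submission falls within $B/k$ tokens is our reading of the text, and the function name and example numbers are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one task: 1 - C(n - c, k) / C(n, k).

    n: trajectories sampled for the task
    c: trajectories counted as successful within the per-trajectory budget
    k: number of parallel attempts being scored (1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0
    # whenever every size-k subset must contain a success.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 10 trajectories, 3 of them successful within budget.
p1 = pass_at_k(10, 3, 1)  # 0.3
p5 = pass_at_k(10, 3, 5)  # 1 - 21/252
p8 = pass_at_k(10, 3, 8)  # 1.0
```

A benchmark-level score would then average these task-wise values, recomputing $c$ for each $k$ because the per-trajectory budget $B/k$ shrinks as $k$ grows.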

#### A.5.2 Judge reliability validation for HLE and HealthBench

Two of our five primary benchmarks, HLE and HealthBench, are scored with an LLM judge, GPT-4o-mini at temperature 0, rather than a deterministic verifier. A potential confound for the parallel-scaling results in [Section 3.3.2](https://arxiv.org/html/2606.17930#S3.SS3.SSS2) is that, if the judge has a per-submission false-positive rate $p_{\mathrm{FP}}$, then $N$ independent parallel attempts give a worst-case spurious inflation of $1-(1-p_{\mathrm{FP}})^{N}$, which grows with $N$. To bound this, we re-scored a stratified subsample of trajectories with two stronger judges, `openai/gpt-5.4` and `anthropic/claude-opus-4-6`, using the exact in-experiment grader template for each benchmark, `inspect_evals.hle.judge.JUDGE_PROMPT` and `inspect_evals.healthbench.scorer.GRADER_TEMPLATE`. The subsample is 6 models $\times$ 2 conditions $\times$ 20 sample IDs $\times$ up to 3 trajectories per bin: 228 trajectories (2,374 submissions) on HLE and 175 trajectories (537 submissions) on HealthBench. Re-running the pipeline with GPT-4o-mini itself as the re-scoring judge reproduces the in-experiment scores to within expected non-determinism: 99.5% agreement on HLE and Pearson $r=0.87$ on HealthBench. Any disagreement reported below is therefore attributable to judge choice rather than pipeline drift.

##### HLE: in\-experiment judge is conservative; confound is bounded below the effect\.

The two strong judges agree on 2,370 of 2,373 paired HLE submissions (99.9%), and we treat their consensus as a high-confidence reference label. Against this consensus, GPT-4o-mini produces 1 false positive in 2,186 actually incorrect submissions, a false-positive rate of $0.05\%$ with 95% Wilson CI $[0.01\%, 0.26\%]$, and misses 22 of 184 correct submissions, a false-negative rate of $12.0\%$. With $p_{\mathrm{FP}}=5\times 10^{-4}$, the worst-case independent-error parallel-scaling inflation is $0.5\%$ at $N=10$ and $4.9\%$ at $N=100$. This is an order of magnitude below the observed parallel gains on HLE ([Table 7](https://arxiv.org/html/2606.17930#S3.T7)). Errors are also strongly clustered at the sample level, with index of dispersion 7.81, and 95.7% of disagreements come from 3 of the 20 sample IDs. This further reduces the effective number of independent dice rolls and tightens the bound.
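The bound and the confidence interval above can be checked with a short calculation. The sketch below uses the standard Wilson score interval with $z=1.96$ and the inflation formula $1-(1-p_{\mathrm{FP}})^{N}$ from the text; the function names are ours, and whether the authors used exactly this $z$ value is an assumption.

```python
from math import sqrt


def worst_case_inflation(p_fp, n_attempts):
    """Probability that at least one of n independent attempts is a false positive."""
    return 1.0 - (1.0 - p_fp) ** n_attempts


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Figures quoted in the text: 1 false positive in 2,186 incorrect submissions.
low, high = wilson_interval(1, 2186)
inflation_10 = worst_case_inflation(5e-4, 10)    # roughly 0.5%
inflation_100 = worst_case_inflation(5e-4, 100)  # roughly 4.9%
```

The interval comes out at roughly 0.01% to 0.26%, matching the reported values, and the inflation grows almost linearly in $N$ while $N\,p_{\mathrm{FP}}$ stays small.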

##### HealthBench: judge noise is large but unbiased\.

HealthBench is continuously scored and noisier. Pairwise Pearson $r$ between strong judges is $0.78$, and mean per-judge scores differ by up to $0.15$. Restricting to the 271 submissions on which the two strong judges agree within $0.15$, the signed error $\Delta=(\text{GPT-4o-mini})-(\text{strong-judge consensus})$ has mean $-0.015$ and standard deviation $0.239$. Tail inflations, $\Delta>0.20$, occur on 17.0% of submissions and are statistically indistinguishable from tail deflations, $\Delta<-0.20$, which occur on 18.1%. This is consistent with symmetric noise rather than directional bias. Because the parallel-scaling statistic averages over many submissions, the symmetric per-submission noise cancels in expectation, and per-sample clustering (index of dispersion 3.14) further reduces the effective number of independent attempts. The HLE result above, with a per-submission false-positive rate too small to drive meaningful inflation, provides the strongest non-confound test; taken together, the parallel-scaling gains on HLE and HealthBench cannot be explained by judge unreliability.