@dair_ai: Outstanding paper on long-horizon agents. (bookmark it) Similar to humans, how do you make agents persist on a difficul…

X AI KOLs Following 06/04/26, 04:20 PM Papers

Summary

AutoLab is a new benchmark evaluating 17 frontier models on 36 expert-curated long-horizon tasks (system optimization, model development, CUDA kernels, puzzles), finding that persistence—not initial attempt quality—is the dominant predictor of success. Claude-opus-4.6 led all categories, while most other models terminated prematurely or exhausted budgets with minimal progress.

Outstanding paper on long-horizon agents. (bookmark it) Similar to humans, how do you make agents persist on a difficult task, and how is that useful? And which models today work well on this? This new work, AutoLab, explores this question and how encoding persistence in agents is beneficial for tasks such as auto research and engineering tasks. Can a model keep improving an artifact for hours, under a strict wall-clock budget, the way real research and engineering actually work? Results: AutoLab hands agents 36 expert-curated tasks across system optimization, model development, CUDA kernels, and puzzles, each starting from a correct but deliberately suboptimal baseline. Across 17 frontier models, the dominant predictor of success was not the quality of the first attempt. It was persistence, repeatedly benchmarking, editing, and folding in empirical feedback. It appears that Claude-opus-4.6 sustained that loop well. Most of the other models quit early or burned the budget, making almost no progress. Paper: https://arxiv.org/abs/2606.05080 Learn to build effective AI agents in our academy: https://academy.dair.ai

Original Article


Cached at: 06/05/26, 02:20 AM


AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Source: https://arxiv.org/html/2606.05080

Zhangchen Xu1,11*, Junda Chen4, Yue Huang5, Dongfu Jiang8,10, Jiefeng Chen12, Hang Hua13, Zijian Wu7, Zheyuan Liu5, Zexue He2, Lichi Li14, Shizhe Diao10, Jiaxin Pei2, Jinsung Yoon12, Hao Zhang4, Mengdi Wang6, Radha Poovendran1, Misha Sra3, Alex Pentland2,9, Zichen Chen2,3,11*

Abstract

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals that the dominant predictor of success is not the quality of an agent’s initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts to accelerate research toward truly capable long-horizon agents.

Code: autolabhq/autolab
Website: autolab.moe

Figure 1: AutoLab benchmarks frontier models across 36 tasks spanning 4 categories, here for the 11 provider flagships (one model per provider). *Top:* models sorted left-to-right by Avg@3 (solid bar); the translucent extension reaches Best@3. *Bottom:* four rose charts, one per category, where each petal’s length is that model’s Avg@3 in the category. claude-opus-4.6 leads all four categories and the overall ranking; the runner-up rotates by category.

1 Introduction

Frontier LLM agents are increasingly deployed on tasks that play out over hours rather than minutes, from post-training models (Rank et al., 2026) and optimizing low-level systems (Chi et al., 2026) to running open-ended research loops (Novikov et al., 2025; Karpathy, 2026). Progress on such tasks is iterative: it comes from inspecting an artifact, proposing a change, running experiments, measuring the outcome, and refining over many cycles, not from a single correct answer. Sustaining this loop over a long horizon requires managing time, compute, and noisy empirical signals. Short, single-shot evaluations are not designed to test whether today’s frontier models can do so.

Current evaluations largely overlook this regime. Static, single-turn coding benchmarks primarily test model knowledge and one-shot coding (Jain et al., 2025a; Zhuo et al., 2025). Another wave of agentic benchmarks has extended to short, interactive trajectories (Mialon et al., 2023; Liu et al., 2024; Jimenez et al., 2024; Merrill and others, 2026). Only lately have a few benchmarks begun to explore hour-long, closed-loop optimization (Ouyang and others, 2025; Nathani et al., 2025; Mang et al., 2025; Lupidi et al., 2026). However, these efforts remain limited in both scale and generality.

Two major obstacles have kept progress on sustained long-horizon optimization slow. First, the most impressive demonstrations of empirical optimization, such as AlphaEvolve (Novikov et al., 2025) and Karpathy’s AutoResearch agent (Karpathy, 2026), are tightly coupled to heavily engineered, model-specific harnesses, tools, and search strategies. This co-design makes it difficult to isolate the underlying model’s true contribution. Second, existing long-horizon benchmarks are narrow in scope, each targeting a single domain such as ML engineering (Rank et al., 2026; Starace et al., 2025), systems and kernel optimization (Ouyang and others, 2025; Mang et al., 2025), or real-world engineering (Chi et al., 2026). Crucially, none of these benchmarks simultaneously offers broad coverage across scientific and engineering domains while maintaining high difficulty and resistance to saturation.

To close this gap, we introduce AutoLab, a benchmark for ultra long-horizon closed-loop optimization with LLM agents. Each task in AutoLab provides a correct but deliberately suboptimal baseline and challenges the agent to improve it iteratively within a strict wall-clock budget. AutoLab comprises 36 executable tasks across four categories: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Its design rests on three commitments: (1) tasks must demand sustained empirical iteration over long horizons; (2) scoring must be continuous and well-calibrated, rewarding partial progress across heterogeneous metrics; and (3) evaluation must be hack-resistant, enforced through sealed evaluators, correctness gates, immutable-file checks, and adversarial auditing.

Long-horizon optimization represents a distinct capability that cannot be reduced to agentic coding ability. This is clearly demonstrated by our main evaluation (Figure 1), which consumed a total of 2,544 wall-clock hours and 8.60 billion tokens. claude-opus-4.6 leads every sub-domain, reaching an Avg@3 of 0.68 versus 0.50 for the next-best model. Many otherwise strong models, including gpt-5.4, fail for reasons unrelated to raw coding ability: some terminate after minimal exploration, while others exhaust their entire budget without producing a valid final solution (Section 4). Our trajectory analysis further shows that final performance correlates more strongly with persistence than with one-shot solution quality: agents that repeatedly benchmark, edit, and incorporate empirical feedback throughout the trajectory achieve substantially better outcomes. These findings suggest that persistence, time awareness, and empirical search will be central to future autonomous research agents.

In summary, we make the following key contributions:

• A high-quality benchmark. We introduce AutoLab, the first benchmark designed specifically for ultra long-horizon closed-loop optimization across diverse domains.
• Large-scale evaluation. We conduct a systematic evaluation of 17 state-of-the-art models, including four proprietary frontier models, using a fixed, standardized harness under identical experimental conditions.
• In-depth trajectory analysis and insights. Through comprehensive analysis of all trajectories (including manual inspection of 302 zero-score rollouts), we reveal key behavioral limitations, most notably a lack of time awareness (premature termination versus budget exhaustion). We further show that the dominant predictor of final performance is not the quality of an agent’s initial solution, but its persistence in iterative refinement.

2 The AutoLab Benchmark

We present AutoLab, a benchmark for evaluating frontier models on research and engineering tasks whose horizons are measured in hours rather than minutes. Its design is organized around three commitments. Tasks must be ultra long-horizon, demanding sustained empirical iteration across many cycles rather than a single-shot patch; scoring must be continuous and calibrated, going beyond pass/fail to support fine-grained relative comparison across heterogeneous metrics such as runtime, perplexity, and parameter count, and to resist saturation as frontier capabilities advance; and verification must be hack-resistant, since performance benchmarks expose a far larger attack surface for shortcuts than patch-style benchmarks. The remainder of this section formalizes the task specification (Sec. 2.1), describes how tasks are sourced and quality-controlled (Sec. 2.2), and reports the final composition of AutoLab (Sec. 2.3).

Figure 2: AutoLab task formulation and evaluation pipeline.

2.1 Task Formulation

A task in AutoLab consists of an instruction, an environment, a verifier, a reference solution, and a wall-clock budget (Figure 2). The instruction is a natural-language description of the optimization target. The environment is a containerized sandbox (either CPU or single-GPU depending on the workload) that ships with a codebase containing a working but unoptimized baseline implementation, together with a local evaluation script that the agent may invoke during development. The verifier is the held-out evaluation suite that produces the final score. The reference solution is a human-written implementation that anchors the scoring scale and is never exposed to the agent. The budget bounds the wall-clock time available for the agent to read the codebase, modify it, run it, and iterate.

During evaluation, the agent receives the instruction and environment inside the sandbox, and must produce a modified implementation that the verifier can evaluate within the allotted budget. Tasks are inherently interactive: the agent may freely edit the codebase, execute the implementation, profile its performance, invoke the local evaluation script, inspect intermediate outputs, and iteratively refine its solution. At the end of the episode, the verifier runs the modified implementation on held-out inputs and reports a metric, which is then mapped to a continuous score relative to the reference solution.
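The five components of a task can be pictured as a small record. The following is only an illustrative sketch: the class names, field names, and the toy values are assumptions made for this example and are not taken from the AutoLab codebase.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """Hypothetical container for the five task components named in the text."""
    instruction: str                   # natural-language optimization target
    environment: str                   # e.g. a sandbox image tag (made-up format)
    verifier: Callable[[str], float]   # workspace path -> raw metric on held-out inputs
    reference_solution: str            # kept away from the agent; anchors the score scale
    budget_seconds: float              # wall-clock budget for the whole episode


@dataclass
class Episode:
    """Tracks how much of the wall-clock budget is left."""
    task: Task
    started_at: float

    def remaining(self, now: float) -> float:
        # Never report negative time once the budget is exhausted.
        return max(0.0, self.task.budget_seconds - (now - self.started_at))


# Toy instance; every value here is invented for illustration.
toy_task = Task(
    instruction="Minimise the runtime of solve() without changing its output.",
    environment="example/cpu-sandbox:latest",
    verifier=lambda workspace: 123.0,
    reference_solution="reference/solve.c",
    budget_seconds=2 * 3600,
)
episode = Episode(task=toy_task, started_at=0.0)
print(episode.remaining(now=3600.0))
```

Keeping the remaining budget as an explicit quantity is one simple way to make the time-awareness question discussed later in the paper concrete.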

Baselines and references. The baselines in AutoLab tasks are correct but suboptimal, representing where a working but unoptimized first implementation might land. Reference solutions, in contrast, are required to improve the metric by a non-trivial margin (typically at least an order of magnitude on system-optimization tasks, and a clear statistical gain on model-development tasks), so that every task has genuine headroom for an agent to discover.

Scoring. Let $m(x)$ denote the raw metric value achieved by an implementation $x$ (e.g., runtime, validation perplexity, throughput, or parameter count), and let $m_{\mathcal{B}}$ and $m_{\mathcal{R}}$ denote the metric values attained by the baseline and reference solutions, respectively. AutoLab employs two anchored scoring schemes, both normalized to the interval $[0,1]$ and anchored at the baseline and reference performance levels:

• Log-stretch. For performance-optimization tasks, where meaningful improvements frequently span orders of magnitude, we adopt a logarithmic scoring scheme:
$$s(x)=\mathrm{clip}\!\left(\tfrac{1}{2}\cdot\frac{\log(m_{\mathcal{B}}/m(x))}{\log(m_{\mathcal{B}}/m_{\mathcal{R}})},\,0,\,1\right) \tag{1}$$
(with the directional analogue for metrics where higher values are better). A minimum-improvement gate ensures that $s(x)=0$ until the agent exceeds the baseline, yielding $s=0$ at the baseline, $s=0.5$ at the reference, and values approaching $1.0$ as performance nears the practical optimum. This gate prevents submissions that make no meaningful improvement over the baseline from receiving partial credit. We note that both $m_{\mathcal{B}}$ and $m_{\mathcal{R}}$ for performance-optimization tasks are sandbox-dependent quantities that have been carefully calibrated for the specific sandbox environments and hardware configurations used throughout this benchmark.
• Linear. For tasks with naturally bounded quality metrics, we use linear interpolation between the two anchors:
$$s(x)=\mathrm{clip}\!\Big(\frac{m_{\mathcal{B}}-m(x)}{m_{\mathcal{B}}-m_{\mathcal{R}}},\,0,\,1\Big) \tag{2}$$
(again with the directional analogue when higher is better). Thus, $s=0$ at the baseline and $s=1.0$ at the reference.

The specific choice of $m_{\mathcal{B}}$ and $m_{\mathcal{R}}$, along with any task-specific feasibility gates, is detailed in Appendix A.2. Anchored relative scoring serves two important purposes: it enables meaningful aggregation across tasks whose native units are otherwise incommensurable, and unlike binary pass/fail benchmarks, it rewards genuine partial progress. The latter property is particularly crucial at AutoLab’s level of difficulty, where the majority of agent submissions lie between the baseline and the reference solution.
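As a rough illustration of the two anchored schemes in Eqs. (1) and (2), the sketch below implements both for a lower-is-better metric. The function names and the runtimes are made up for this example, and any task-specific feasibility or minimum-improvement thresholds from the appendix are not modelled; only the clipping at the baseline is.

```python
import math


def clip(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Clamp value into [low, high]."""
    return max(low, min(high, value))


def log_stretch_score(m_x: float, m_base: float, m_ref: float) -> float:
    """Eq. (1), lower-is-better: 0 at the baseline, 0.5 at the reference."""
    return clip(0.5 * math.log(m_base / m_x) / math.log(m_base / m_ref))


def linear_score(m_x: float, m_base: float, m_ref: float) -> float:
    """Eq. (2), lower-is-better: 0 at the baseline, 1 at the reference."""
    return clip((m_base - m_x) / (m_base - m_ref))


# Hypothetical runtimes in ms (illustrative numbers, not benchmark data).
baseline_ms, reference_ms = 1000.0, 100.0
for candidate_ms in (2000.0, 1000.0, 100.0, 10.0):
    print(
        candidate_ms,
        round(log_stretch_score(candidate_ms, baseline_ms, reference_ms), 3),
        round(linear_score(candidate_ms, baseline_ms, reference_ms), 3),
    )
```

With these numbers, a candidate that is slower than the baseline scores 0 under both schemes, matching the reference gives 0.5 under log-stretch but 1.0 under linear, and a further 10x speedup beyond the reference is needed to reach 1.0 under log-stretch. That asymmetry is what leaves headroom above the reference on performance tasks.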

Wall-clock budget. Wall-clock budgets range from 2 hours for the smallest puzzle tasks to 12 hours for end-to-end LLM development tasks. Budgets are chosen to balance two competing goals: preserving realistic development workflows, which often require substantial execution and iteration time, while keeping evaluation costs tractable and reproducible at benchmark scale. Accordingly, some model training tasks are intentionally designed around smaller models and shorter training steps rather than frontier-scale training runs, allowing agents to complete multiple optimization iterations within the time budget. This further challenges the agent’s ability to allocate time effectively across exploration, execution, and iteration.

2.2 Benchmark Construction

Task Collection. AutoLab tasks were contributed by senior researchers and engineers. Contributors were asked to draw tasks from real engineering or research problems they had personally encountered, from low-level CUDA and C optimization to end-to-end vision-language model post-training. We deliberately prioritize realism and diversity over difficulty for its own sake: a task earns inclusion because it captures a workflow that practitioners actually undertake, not because it has been calibrated to be artificially hard.

Quality Control. Each task underwent a multi-round audit before inclusion. Inspired by Merrill and others (2026), we audit each task against four criteria tailored to AutoLab’s continuous-scoring, performance-oriented setting: validity (a higher score requires a higher-quality implementation, not artifacts of the measurement procedure or weakened correctness checks); solvability (the reference solution reliably reaches the target score within the stated budget on the target hardware); integrity (the agent cannot pass by hacking the scoring function); and measurement stability (the metric variance across repeated runs of the reference is small enough that observed score differences are attributable to the implementation rather than to noise). Each task was reviewed by at least 2 experts independent of the original contributor and a format audit agent; tasks that failed any criterion were either revised or rejected.

Anti-Reward Hacking. Performance benchmarks expose a broader attack surface than patch-style benchmarks. AutoLab mitigates this risk in five ways. First, the verifier is sealed: the agent is given access to a local evaluation script, but not to the held-out test inputs or reference outputs used for final scoring. Second, ML tasks include a correctness gate that must pass before the optimization metric is recorded, with gate inputs drawn from a distribution disjoint from anything visible during development. Third, we run a dedicated adversarial agent explicitly prompted to discover shortcuts or reward hacks during task construction; any task that can be solved without genuine improvement to the target metric is either patched or removed. Fourth, critical files that should remain immutable are SHA-pinned, and any unauthorized modification immediately results in a zero score. In addition, we continuously analyze agent trajectories across different models during evaluation. When new forms of reward hacking or verifier exploitation are discovered, we patch the corresponding verifier and re-validate affected tasks to maintain benchmark integrity over time.
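The SHA-pinning idea in the fourth measure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AutoLab verifier: the helper names, the choice of SHA-256, and the protected file name `main.c` are all hypothetical.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def pin_files(root: Path, protected: list[str]) -> dict[str, str]:
    """Record digests of protected files when the task is constructed."""
    return {rel: sha256_of(root / rel) for rel in protected}


def gated_score(root: Path, pins: dict[str, str], raw_score: float) -> float:
    """Return raw_score only if every pinned file is present and unmodified."""
    for rel, digest in pins.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != digest:
            return 0.0  # modified or deleted protected file -> zero score
    return raw_score


# Demo on a throwaway workspace; "main.c" is a made-up protected file.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / "main.c").write_text("int main(void) { return 0; }\n")
    pins = pin_files(workspace, ["main.c"])
    score_untouched = gated_score(workspace, pins, raw_score=0.7)
    (workspace / "main.c").write_text("int main(void) { return 1; }\n")
    score_tampered = gated_score(workspace, pins, raw_score=0.7)

print(score_untouched, score_tampered)
```

In a real setup the pinned digests would need to live outside the agent-writable sandbox, otherwise the agent could simply rewrite the pins along with the files.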

Figure 3: Task distribution of AutoLab.

2.3 Benchmark Composition

The final benchmark comprises 36 tasks across four categories. Model Development (7 tasks) covers the full LLM pipeline, including pretraining scaling laws, RL post-training, SFT data selection, parameter-efficient fine-tuning, world-model training, and online serving optimization. System Optimization (15 tasks) focuses on low-level performance engineering of systems primitives such as kernels, sorting, hashing, search, compression, regular expressions, and cryptography in C, Rust, Go, and Python. Puzzle & Challenge (10 tasks) consists of algorithmic problems built around a single key insight, including combinatorial reductions, sorting networks, ISA-level scheduling, adversarial constructions, and adaptive coding. CUDA (4 tasks) targets GPU kernel optimization for cryptographic primitives, point-cloud registration, and compression. The complete list of tasks appears in Figure 3. Detailed descriptions of each task are provided in Appendix A.1.

3 Benchmark Results

3.1 Experimental Setup

Models. We evaluate a range of state-of-the-art proprietary and open-weight models in AutoLab. Proprietary models include claude-opus-4.6 (Anthropic, 2026), gemini-3.1-pro (Google DeepMind, 2026), gpt-5.4 (OpenAI, 2026), and grok-4-20 (xAI, 2026). On the open-weight side, we evaluate qwen-3.6-plus (Qwen, 2026b), deepseek-v4-pro (Deepseek, 2026), glm-5 (Zeng et al., 2026), kimi-k2.6 (Moonshot AI, 2026), hunyuan-3-preview (Tencent, 2026), mimo-v2.5-pro (Xiaomi, 2026b), and minimax-m2.7 (MiniMax, 2026b). For ablation analyses, we additionally evaluate older or smaller variants from several of these families: kimi-k2.5 (Moonshot AI, 2025), minimax-m2.5 (MiniMax, 2026a), mimo-v2-pro (Xiaomi, 2026a), mimo-v2.5 (Xiaomi, 2026c), deepseek-v4-flash (Deepseek, 2026), and qwen-3.5-plus (Qwen, 2026a). We do not test small open-weight models ($<$200B parameters) due to the difficulty of the benchmark. The specific API provider used to access each model is listed in Table 5 in Appendix B.1 for better reproducibility.

Evaluation metrics. We evaluate every (model, task) pair with three independent rollouts and report three complementary metrics. Avg@3 averages the per-run score across the three trials, capturing typical performance. Best@3 takes the maximum of the three trials, reflecting an agent’s ceiling. Dominance measures a model’s head-to-head win rate against all other models using Avg@3 scores. Formally, let $\mathcal{M}$ be the set of evaluated models and $\mathcal{T}$ the set of tasks, and let $s_{m,t}$ denote model $m$’s Avg@3 score on task $t$. Then

$$\mathrm{Dominance}(m)=\frac{1}{|\mathcal{T}|\cdot(|\mathcal{M}|-1)}\sum_{t\in\mathcal{T}}\sum_{\begin{subarray}{c}o\in\mathcal{M}\\ o\neq m\end{subarray}}\Bigl(\mathbf{1}[s_{m,t}>s_{o,t}]+\tfrac{1}{2}\,\mathbf{1}[s_{m,t}=s_{o,t}]\Bigr) \tag{3}$$

Thus, $\mathrm{Dominance}(m)\in[0,1]$, where $1$ means the model strictly outperforms every other model on every task and $0.5$ corresponds to average performance across models. This metric provides a robust, tournament-style view that is largely insensitive to hardware variance and differences in per-task reward design, while being less sensitive to a small number of high-leverage tasks.
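To make the three metrics concrete, here is a small sketch that computes Avg@3, Best@3, and the Dominance score of Eq. (3) from per-trial scores. The model names, task names, and scores are invented for illustration, and the function names are not from the AutoLab harness.

```python
from __future__ import annotations

from statistics import mean


def avg_at_k(trial_scores: list[float]) -> float:
    """Avg@k: mean score over the independent rollouts."""
    return mean(trial_scores)


def best_at_k(trial_scores: list[float]) -> float:
    """Best@k: maximum score over the independent rollouts."""
    return max(trial_scores)


def dominance(avg: dict[str, dict[str, float]], model: str) -> float:
    """Eq. (3): head-to-head win rate on Avg@k, with ties counted as half a win."""
    rivals = [name for name in avg if name != model]
    tasks = list(avg[model])
    credit = 0.0
    for task in tasks:
        for rival in rivals:
            mine, theirs = avg[model][task], avg[rival][task]
            if mine > theirs:
                credit += 1.0
            elif mine == theirs:
                credit += 0.5
    return credit / (len(tasks) * len(rivals))


# Hypothetical per-trial scores for three models on two tasks.
trials = {
    "model_a": {"task_1": [0.9, 0.7, 0.8], "task_2": [0.5, 0.5, 0.5]},
    "model_b": {"task_1": [0.2, 0.4, 0.3], "task_2": [0.5, 0.5, 0.5]},
    "model_c": {"task_1": [0.0, 0.0, 0.0], "task_2": [0.1, 0.2, 0.3]},
}
avg = {m: {t: avg_at_k(s) for t, s in per_task.items()} for m, per_task in trials.items()}
for name in trials:
    print(name, dominance(avg, name), best_at_k(trials[name]["task_1"]))
```

In this toy example the Dominance values come out to 0.875, 0.625, and 0.0, which average to 0.5 across the three models, as expected for a pairwise win-rate metric with ties split evenly.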

Implementation Details. Following Terminal-Bench (Merrill and others, 2026), we use the Harbor framework (Harbor Framework Team, 2026) as the unified evaluation harness, and the terminus-2 agent by default across all models. While specialized harnesses may further improve performance, we leave such optimizations to future work, and provide an early pilot comparison with two alternative agent harnesses, pi-mono (Zechner, 2026) and an optimized mini-swe-agent (Yang et al., 2024), in Section 4.3. CPU-only tasks run inside a local Docker sandbox, and GPU tasks run on Modal (https://modal.com) cloud sandboxes provisioned with H100 and L40S GPUs. The local CPU sandbox runs on a workstation with an AMD Ryzen 9 9950X (16 cores / 32 threads) and 64 GB of RAM, and per-task CPU and memory caps are enforced via the task’s metadata. In total, the evaluation of AutoLab consumed 2,544 wall-clock hours and 8.60 billion tokens.

3.2 Main Results

Table 1: Per-category results on the AutoLab benchmark. For each sub-domain we report Avg@3, Best@3, and Dominance. Per-column best scores are shown in bold and runner-up scores are underlined (computed on the main set only). Main-set rows are ordered by overall Avg@3.

Overall Performance. Table 1 presents the Avg@3, Best@3, and Dominance scores for all evaluated models across the 36 tasks. claude-opus-4.6 leads the benchmark by a substantial margin, achieving an Avg@3 of 0.68 and a Dominance score of 0.93. The performance gap to the second-place model, gemini-3.1-pro (Avg@3 of 0.50), is large and highlights a clear separation among frontier models on long-horizon iterative improvement tasks. Other proprietary frontier models such as gpt-5.4 and grok-4-20 rank in the lower half of the leaderboard. We attribute this primarily to their tendency toward premature termination (see Section 4) rather than insufficient raw capability. Among open-weight models, kimi-k2.6 (0.46), mimo-v2.5-pro (0.45), and glm-5 (0.43) form a strong and tight cluster. Notably, smaller models such as mimo-v2.5 and deepseek-v4-flash (both under 400B parameters) remain highly competitive. In particular, deepseek-v4-flash (0.37) performs on par with the much larger deepseek-v4-pro (0.38) overall, and even surpasses it on CUDA kernel optimization and algorithmic puzzle tasks. Detailed per-task results are provided in Appendix B.2.

Performance by Category. The breakdown across the four task categories (Model Development, System Optimization, CUDA, and Puzzle & Challenge) reveals distinct model strengths. claude-opus-4.6 leads in all four categories by a wide margin, with its largest advantage appearing on CUDA tasks, where most other models score near zero. gemini-3.1-pro performs best on puzzle tasks but lags significantly on CUDA and model development tasks, consistent with its relatively short rollouts (median of 12 steps versus 57 steps for claude-opus-4.6). Open-weight models show their strongest results on system optimization tasks, while CUDA kernel optimization remains a notable weakness across this group.

Figure 4: Self-reported runtime of each model’s best flash_attention rollout as a function of wall-clock time. Lower is better. Dashed lines indicate the task baseline (750 ms) and the reference solution (100 ms). Numbers in the legend report the end-to-end speedup achieved relative to the task baseline.

Case Study: Flash Attention Optimization. To illustrate divergent optimization behaviors, we analyze the best rollout of each model on the flash_attention task, a two-hour CPU kernel optimization challenge requiring a tiled attention kernel in single-threaded C. All models begin from the same baseline runtime of approximately 750 ms, but their trajectories diverge sharply (Figure 4). claude-opus-4.6 steadily reduces runtime to 18 ms through 44 feedback-driven iterations over roughly 40 minutes, achieving a 42.4× speedup and surpassing the reference solution (100 ms). In contrast, several strong models plateau near or above the reference: kimi-k2.6, gemini-3.1-pro, and glm-5 reach 50–80 ms but fail to improve further.

Reasoning-heavy models such as mimo-v2.5-pro and deepseek-v4-pro exhibit a distinct failure mode. They spend the majority of their budget on prolonged per-step thinking rather than command execution, which severely delays the first benchmark and limits the total number of edit-and-rerun iterations. Consequently, deepseek-v4-pro’s final submitted solution is not its best result within the trajectory, as it times out before fully exploiting promising directions. Additionally, qwen-3.6-plus briefly reached a strong intermediate result (better than its final submission) but discarded it after incorrectly judging the solution as illegal. At the low end, grok-4-20 and gpt-5.4 show minimal progress, with grok-4-20 running the evaluation script only once before early termination. This case study highlights a recurring pattern on AutoLab: high final performance demands not only strong initial coding ability, but sustained, measurement-driven iteration coupled with effective time awareness and self-verification.

4 Analysis

4.1 Cost Analysis

Figure 5 examines the relationship between models’ average overall score and three measures of resource utilization: average agent steps (left panel), average agent runtime in hours (middle panel), and average inference cost in USD (right panel). A clear positive correlation is evident in the left and middle panels between average overall score and both the number of agent steps and total wall-clock runtime. claude-opus-4.6 stands out prominently as an outlier, requiring substantially more steps than most other models while achieving the highest average score. In contrast, the short-horizon termination behavior discussed earlier is clearly visible in the middle panel (agent runtime): models such as gpt-5.4 and grok-4-20 cluster at markedly lower runtimes, indicating that they terminate trials prematurely rather than continuing to iterate. This early stopping directly limits their optimization potential and largely explains why several proprietary frontier models rank low in the benchmark.

On the inference-cost dimension (right panel), higher performance generally incurs higher costs. Several open-weight models, particularly deepseek-v4-flash and mimo-v2.5-pro, achieve competitive performance at substantially lower inference costs, highlighting promising avenues for cost-efficient model and agent design.

Figure 5: Relationship between models’ Avg@3 and three measures of resource utilization. Left: average agent steps. Middle: average agent runtime in hours. Right: average inference cost in USD.

Figure 6: Distribution of zero-score rollouts by failure mode across models. For each model we manually categorized all rollouts that received a score of 0 into four mutually exclusive failure modes.

4.2 Failure Case Analysis

Despite the progress of models on AutoLab tasks, a substantial fraction of rollouts still receive a score of zero. Understanding the underlying causes of these failures is critical for identifying the current limitations of models and agents on ultra long-horizon tasks. To move beyond aggregate performance metrics and analyze why agents fail, we manually inspected all 302 zero-score rollouts across the 11 evaluated models and grouped them into four mutually exclusive failure modes. These categories together account for the entire set and are defined as follows:

• Timeout / Context Exhaustion. The agent never produces a final submission within the time budget. This includes both the standard AgentTimeoutError from the harness and individual LLM calls that hang for 1,500+ seconds due to long reasoning.
• Capability Gap. The agent submits a solution, but the verifier gives it a score of 0. This covers incorrect outputs, sub-threshold scores, early give-ups with no improvement over the baseline, or missing required files (e.g., no LoRA adapter provided in LoRA training tasks).
• Instruction Violation. The submitted solution breaks explicit task constraints (e.g., using banned APIs like cudaMemcpy on ntt_butterfly_cuda, importing disallowed modules, modifying protected reference files, or leaving extra files in the workspace). The verifier scores these as 0 regardless of correctness.
• Others. Upstream issues unrelated to the agent, such as internal server errors, malformed responses, or sandbox crashes caused by illegal operations.

Figure 6 shows the distribution of these failure modes per model. In what follows, we summarize two key findings.

Models struggle to calibrate exploration with the remaining time budget. Models exhibit a pronounced lack of time awareness, falling into two distinct behavioral patterns: some terminate far too early, while others continue iterating until the budget is exhausted without ever submitting a final solution. A timeout-dominated group, including deepseek-v4-pro, hunyuan-3-preview, and qwen-3.6-plus, frequently fails to reach submission, instead consuming the entire budget through excessive iteration and repeated re-prompting. At the opposite extreme, gpt-5.4 and grok-4-20 often submit after only minimal exploration, resulting in consistently low scores despite substantial remaining budget.

In addition, we observe a failure mode that is exclusive to the open-weight models in our roster: the agent enters an extremely long reasoning chain and exhausts the two-hour budget after emitting only a handful of actions. This shows up for kimi-k2.6 on toy_isa_opt (all three trials time out at 2–11 agent steps, scoring 0) and most starkly for deepseek-v4-pro on the CUDA subset (9 of 12 trials submit fewer than 10 actions before the agent timeout). By contrast, none of the closed-weight models exhibits this pattern.

Instruction violations persist even among strong closed-source models. Although they represent only a modest fraction of all failures, instruction violations are heavily concentrated in gemini-3.1-pro (5 cases) and glm-5 (4 cases). Notably, a single task (ntt_butterfly_cuda) accounts for half of all violations. These findings, along with model-specific patterns observed in gemini-3.1-pro and grok-4-20, suggest that robust instruction following remains a significant challenge even for capable frontier models.

4.3 Harness Ablation Analysis

The choice of agent harness is frequently treated as a mere implementation detail when reporting model capabilities. To examine the validity of this assumption, we re-evaluated four models (mimo-v2.5, gpt-5.4, deepseek-v4-flash, and kimi-k2.6) on two alternative harnesses: mini-swe-agent (Yang et al., 2024) and pi-mono (Zechner, 2026). We used 25 CPU tasks from system_optimization and puzzle_and_challenge and compared the results to our default terminus-2 harness. Since the original mini-swe-agent was designed primarily for patch-based editing rather than sustained iterative optimization, models tended to submit prematurely. We therefore augmented it with a custom system prompt (shown below) that explicitly encourages aggressive, persistent performance engineering and discourages early termination.

Modified mini-swe-agent system prompt

You are an aggressive performance-engineering agent operating in a Linux container. The codebase under /app is already correct. Your job is to minimise a metric described in the task instruction. Higher quality optimisation = higher reward.

Workflow:
1. Read the task instruction carefully. Note the metric, the baseline score, the reference (target) score, and any rules.
2. Read the existing solve.* and main.* files to understand the harness.
3. Run the existing build/test once to confirm baseline.
4. Iteratively optimise. After each change: build, run the local verifier, check the metric. If correctness fails, fix it. If the metric improved, keep going; try further optimisations. If it regressed, revert and try something else.
5. Aim for the reference score (or better). DO NOT submit at first pass; keep optimising until you have spent a meaningful fraction of your budget OR you reach the reference.
6. Only submit your solution once you have exhausted reasonable optimisation ideas AND verified the final solution still passes correctness.
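The workflow in this prompt amounts to a keep-or-revert loop bounded by a wall-clock budget. The following is a minimal sketch of that loop, not the harness's actual implementation: `propose`, `reserve_frac`, and the injectable `clock` are hypothetical names introduced here for illustration, and `propose` stands in for one "edit, build, run the local verifier" round.

```python
import time
from typing import Callable, Tuple


def optimise_under_budget(
    baseline: float,
    propose: Callable[[float], float],
    reference: float,
    budget_s: float,
    reserve_frac: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, int]:
    """Minimise a metric with a keep-or-revert loop under a time budget.

    `propose(best)` is a hypothetical callback: it applies one candidate
    change, runs the verifier, and returns the candidate's metric.
    """
    start = clock()
    # Hold back a slice of the budget so there is always time left to submit.
    deadline = start + budget_s * (1.0 - reserve_frac)
    best, steps = baseline, 0
    while clock() < deadline and best > reference:
        candidate = propose(best)
        steps += 1
        if candidate < best:
            best = candidate  # improvement: keep it and continue
        # regression: "revert" by leaving `best` unchanged
    return best, steps
```

The two exit conditions mirror steps 5 and 6 of the prompt: stop when the reference is reached or when the usable budget is spent, rather than after the first passing attempt.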

Figure 7: Per-task scores across three harnesses (terminus-2, pi-mono, mini-swe-agent*; * denotes our custom optimisation-oriented system prompt), with one panel per model. Thin colored lines trace a single task’s Avg@3 across harnesses; bold lines and large markers show per-harness means (also labeled).

Figure 7 reveals that harness-induced variance can be comparable to model-induced variance. Mean scores for the same model shift by as much as $\Delta = 0.43$ across harnesses (e.g., kimi-k2.6: $0.21 \to 0.64$ from pi-mono to mini-swe-agent*). Relative rankings are non-transitive: pi-mono favors gpt-5.4, mini-swe-agent* favors kimi-k2.6, and terminus-2 lies roughly in between. Moreover, even within a fixed (model, harness) pair, per-task scores exhibit large spread, indicating substantial task-level reordering across harnesses. To ensure a fair evaluation, we selected terminus-2, one of the most widely adopted and well-established agent harnesses in current coding agent evaluations (Jimenez et al., 2024), as the default harness for AutoLab. This provides a strong, standardized baseline for long-horizon agent evaluation.

Figure 8: Score versus compute spend across three harnesses on the 25-task subset. Each polyline connects the three harness points for a single model. Marker shapes denote harnesses (● terminus-2, ■ pi-mono, ▲ mini-swe-agent*). Cost is shown on a log scale. Higher and further left is Pareto-better.

To further disentangle score differences from compute usage, Figure 8 examines the score–compute trade-off across the same 25-task subset. Three key patterns emerge:

(i) Harness choice is also a spending choice. Per-trial inference cost varies by more than 5× across harnesses for the same model (e.g., kimi-k2.6: $0.40 under pi-mono vs. $2.05 under mini-swe-agent*), primarily because different harnesses encourage vastly different iteration efforts before termination.

(ii) Score–cost rankings diverge from score-only rankings. Smaller or less capable base models paired with persistent harnesses (e.g., deepseek-v4-flash on mini-swe-agent*, achieving 0.54 at ~$0.07/trial) can dominate more capable models on the cost-adjusted Pareto frontier, even though they score lower in isolation.

(iii) Harness design amplifies or dampens specific model strengths. Under the iterative mini-swe-agent*, the less capable models deepseek-v4-flash and mimo-v2.5 benefit the most (+0.13 and +0.20 relative to terminus-2), while gpt-5.4 slightly declines (−0.03). In contrast, under the lightweight pi-mono, the pattern reverses: only gpt-5.4 maintains solid performance (0.50, the highest in that harness), while the other three models collapse to 0.21–0.27. In short, the iterative harness (i.e., mini-swe-agent*) lets the agent recover through trial-and-error what a strong single-shot reasoner would solve in one pass; models weaker at one-shot reasoning therefore gain the most, while a model that excels at one-shot reasoning gains little or loses ground.
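To make the notion of a cost-adjusted Pareto frontier concrete, here is a small sketch that filters (cost, score) points down to the non-dominated ones. It uses only the three (model, harness) points whose cost and score are both quoted in the text above; the function name and tuple layout are illustrative choices, and the $0.07 figure is approximate in the source.

```python
from typing import List, Tuple

# (label, cost in USD per trial, mean score)
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep points that no other point beats on both cost and score."""
    frontier = []
    for name, cost, score in points:
        dominated = any(
            c <= cost and s >= score and (c < cost or s > score)
            for _, c, s in points
        )
        if not dominated:
            frontier.append((name, cost, score))
    return sorted(frontier, key=lambda p: p[1])  # cheapest first


points = [
    ("kimi-k2.6 / pi-mono", 0.40, 0.21),
    ("kimi-k2.6 / mini-swe-agent*", 2.05, 0.64),
    ("deepseek-v4-flash / mini-swe-agent*", 0.07, 0.54),  # cost is approximate
]
frontier = pareto_frontier(points)
for name, cost, score in frontier:
    print(f"{name}: ${cost:.2f}/trial, score {score:.2f}")
```

On these three points, kimi-k2.6 under pi-mono is dominated (it costs more and scores less than deepseek-v4-flash under mini-swe-agent*), which is the kind of reordering pattern (ii) describes.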

Taken together, these findings suggest that harness design itself is a promising direction for future research: carefully tuned harnesses, by offering more iteration headroom for smaller models and tighter, high-quality patch loops for stronger instruction-followers, could close a substantial portion of the performance gap between weak and strong base models without any changes to the underlying models.

4.4 More Analysis

In Appendix C, we present two more complementary analyses that provide additional insight into agent performance on AutoLab. First, we analyze generational improvements within model families while holding the harness fixed, and observe modest gains in most cases, although these improvements are uneven across sub-domains. Second, we study across-trial stability and find that capability and reliability are correlated but distinct dimensions. claude-opus-4.6 performs strongly on both axes, whereas many lower-performing models also exhibit substantial variance, making single-trial evaluation unreliable and artificially inflating the perceived strength of noisy models when only Best@3 is considered.
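The effect of reporting only Best@3 can be illustrated with a toy comparison. This sketch assumes Avg@3 is the mean and Best@3 the maximum over three trials; the two score lists are hypothetical and are not taken from the paper's results.

```python
from statistics import mean, pstdev
from typing import Sequence


def avg_at_k(scores: Sequence[float]) -> float:
    """Mean score over the k trials (assumed definition of Avg@k)."""
    return mean(scores)


def best_at_k(scores: Sequence[float]) -> float:
    """Maximum score over the k trials (assumed definition of Best@k)."""
    return max(scores)


# Hypothetical per-trial scores for two models on one task.
trials = {
    "steady": [0.60, 0.62, 0.58],
    "noisy": [0.05, 0.10, 0.70],
}

for name, scores in trials.items():
    print(
        f"{name}: Avg@3={avg_at_k(scores):.2f} "
        f"Best@3={best_at_k(scores):.2f} std={pstdev(scores):.2f}"
    )
```

Under Best@3 the noisy model ranks first, while under Avg@3 the steady model does, which is why the two statistics are best read together with a spread measure.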

5 Related Work

Static and short-horizon agent benchmarks.

The majority of public frontier-model evaluations are still single-turn or terminal-state. Coding suites such as HumanEval (Chen et al., 2021), LiveCodeBench (Jain et al., 2025a), BigCodeBench (Zhuo et al., 2025), and SWE-bench (Jimenez et al., 2024) score one-shot generation or one-edit-one-submission, while saturation-resistant variants like MMLU-Pro (Wang et al., 2024) and LiveBench (White et al., 2025) raise difficulty without changing the framing. Multi-step agent benchmarks such as GAIA (Mialon et al., 2023), AgentBench (Liu et al., 2024), and Terminal-Bench (Merrill et al., 2026) extend trajectories to several tool calls, but still grade only a final artifact or terminal state. None of them elevates sustained empirical iteration under a quantitative metric to the primary object of evaluation.

Long-horizon optimization and research-agent benchmarks.

A growing family of benchmarks studies agents on realistic multi-hour ML and engineering workflows. MLE-Bench (Chan et al., 2024), RE-Bench (Wijk et al., 2024), PaperBench (Starace et al., 2025), PostTrainBench (Rank et al., 2026), and AIRS-Bench (Lupidi et al., 2026) target ML research pipelines, while KernelBench (Ouyang et al., 2025), FrontierCS (Mang et al., 2025), and Frontier-Eng (Chi et al., 2026) probe systems, kernel, and real-world engineering optimization. These works represent important progress toward agentic research, but each focuses on a narrow domain and typically grades only the final score or final patch, leaving the trajectory itself unmeasured. AutoLab departs from this convention by spanning four heterogeneous categories (system optimization, puzzles, model development, and CUDA kernels) under a single calibrated, hack-resistant scoring protocol.

Closed-loop agent frameworks and training environments.

Numerous frameworks have been proposed for closed-loop software engineering (SWE-agent (Yang et al., 2024), OpenHands (Wang et al., 2024), Aider (Gauthier and Contributors, 2024)) and open-ended scientific iteration (The AI Scientist (Lu et al., 2024)). Gym-style environments such as SWE-Gym (Pan et al., 2024), R2E-Gym (Jain et al., 2025b), and MLGym (Nathani et al., 2025) further enable iterative training and evaluation on engineering and ML tasks. The most striking demonstrations of sustained empirical optimization, including AlphaEvolve (Novikov et al., 2025) and Karpathy’s AutoResearch agent (Karpathy, 2026), are tightly coupled to bespoke harnesses, tools, and search strategies, which makes the underlying model’s contribution difficult to isolate from system-level engineering. AutoLab instead fixes the harness (terminus-2), action interface, task definitions, budget rules, and scoring function across all evaluated models, enabling direct, apples-to-apples comparison of long-horizon optimization capability under identical conditions.

6 Conclusion

We introduced AutoLab, a benchmark for evaluating frontier models on ultra long-horizon research and engineering tasks that require sustained iteration over hours rather than minutes. By enforcing ultra long-horizon tasks, continuous calibrated scoring, and strong anti-hacking safeguards, AutoLab reveals that raw capability alone is insufficient for these tasks: the dominant predictor of success is an agent’s willingness to persistently evaluate, edit, and iterate over extended horizons. claude-opus-4.6 demonstrates this convincingly, achieving a commanding lead through long, steady optimization trajectories while most other frontier models, including several proprietary models, terminate prematurely or exhaust their budgets without submitting. These results highlight the critical need for future models and agents to prioritize time awareness, persistence, and more effective harness design. We release the full benchmark, evaluation harness, and task artifacts to accelerate progress toward truly capable ultra long-horizon agents.

Limitations and Broader Impact

AutoLab focuses on executable system and machine learning engineering workflows, and should therefore be understood as a benchmark for measurable auto-research rather than for scientific discovery in its broadest sense. Because long-horizon evaluation inherently depends on multi-hour execution, API interactions, GPU workloads, and the surrounding execution stack, AutoLab reports trajectory analysis, resource consumption, and final performance jointly rather than treating benchmark score as a standalone metric. We hope this protocol supports more diagnostic, cost-aware, and reproducible evaluation of auto-research agents as the field moves from static answers toward iterative empirical work.

Acknowledgments

The authors thank Professor Erik Brynjolfsson (Stanford University) for valuable discussions that helped sharpen the framing of this work, and colleagues across our affiliated institutions for constructive feedback on earlier drafts.

References

Anthropic (2026) System card: claude opus 4.6. https://www.anthropic.com/claude-opus-4-6-system-card
J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patwardhan, L. Weng, and A. Mądry (2024) MLE-bench: evaluating machine learning agents on machine learning engineering. arXiv:2410.07095.
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021) Evaluating large language models trained on code. arXiv:2107.03374.
Y. Chi, D. Hong, D. Jiang, T. Luo, K. Yang, B. Zhang, Z. Cao, X. Fan, B. He, H. Hao, et al. (2026) Frontier-eng: benchmarking self-evolving agents on real-world engineering tasks with generative optimization. arXiv preprint arXiv:2604.12290.
Deepseek (2026) DeepSeek v4 preview release. https://api-docs.deepseek.com/news/news260424
P. Gauthier and A. Contributors (2024) Aider: AI pair programming in your terminal. https://aider.chat/ (accessed 2026-04-29).
Google DeepMind (2026) Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/
Harbor Framework Team (2026) Harbor: A framework for evaluating and optimizing agents and models in container environments.
N. Jain, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica (2025a) Livecodebench: holistic and contamination free evaluation of large language models for code. In International Conference on Learning Representations, Vol. 2025, pp. 58791–58831.
N. Jain, J. Singh, M. Shetty, L. Zheng, K. Sen, and I. Stoica (2025b) R2E-Gym: procedural environments and hybrid verifiers for scaling open-weights SWE agents. arXiv:2504.07164.
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan (2024) SWE-bench: can language models resolve real-world GitHub issues? In International Conference on Learning Representations.
A. Karpathy (2026) Autoresearch: AI agents running research experiments. https://github.com/karpathy/autoresearch (accessed 2026-04-29).
X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang (2024) AgentBench: evaluating LLMs as agents. In International Conference on Learning Representations.
C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha (2024) The AI scientist: towards fully automated open-ended scientific discovery. arXiv:2408.06292.
A. Lupidi, B. Gauri, T. S. Foster, B. Al Omari, D. Magka, A. Pepe, A. Audran-Reiss, M. Aghamelu, N. Baldwin, L. Cipolina-Kun, J. Gagnon-Audet, C. H. Leow, S. Lefdal, H. Mossalam, A. Moudgil, S. Nazir, E. Tewolde, I. Urrego, J. Armengol Estape, A. Budhiraja, G. Chaurasia, A. Charnalia, D. Dunfield, K. Hambardzumyan, D. Izcovich, M. Josifoski, I. Mediratta, K. Niu, P. Pathak, M. Shvartsman, E. Toledo, A. Protopopov, R. Raileanu, A. Miller, T. Shavrina, J. Foerster, and Y. Bachrach (2026) AIRS-Bench: a suite of tasks for frontier AI research science agents. arXiv:2602.06855.
Q. Mang, W. Chai, Z. Li, H. Mao, S. Zhou, A. Du, H. Li, S. Liu, E. Chen, Y. Wang, et al. (2025) FrontierCS: evolving challenges for evolving intelligence. arXiv preprint arXiv:2512.15699.
M. A. Merrill et al. (2026) Terminal-bench: benchmarking agents on hard, realistic tasks in command line interfaces. arXiv:2601.11868.
G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom (2023) GAIA: a benchmark for general AI assistants. arXiv:2311.12983.
MiniMax (2026a) MiniMax m2.5: built for real-world productivity. https://www.minimax.io/news/minimax-m25
MiniMax (2026b) MiniMax m2.7: early echoes of self-evolution. https://www.minimax.io/news/minimax-m27-en
Moonshot AI (2025) Kimi k2.5. https://github.com/MoonshotAI/Kimi-K2.5
Moonshot AI (2026) Kimi k2.6. https://www.kimi.com/ai-models/kimi-k2-6
D. Nathani, L. Madaan, N. Roberts, N. Bashlykov, A. Menon, V. Moens, A. Budhiraja, D. Magka, V. Vorotilov, G. Chaurasia, D. Hupkes, R. S. Cabral, T. Shavrina, J. Foerster, Y. Bachrach, W. Y. Wang, and R. Raileanu (2025) MLGym: a new framework and benchmark for advancing AI research agents. In Conference on Language Modeling.
A. Novikov, N. Vu, M. Eisenberger, E. Dupont, P. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. R. Ruiz, A. Mehrabian, M. P. Kumar, A. See, S. Chaudhuri, G. Holland, A. Davies, S. Nowozin, P. Kohli, and M. Balog (2025) AlphaEvolve: a coding agent for scientific and algorithmic discovery. arXiv:2506.13131.
OpenAI (2026) GPT-5.4 thinking system card. https://openai.com/index/gpt-5-4-thinking-system-card/
A. Ouyang et al. (2025) KernelBench: can LLMs write efficient GPU kernels? In International Conference on Machine Learning.
J. Pan, X. Wang, G. Neubig, N. Jaitly, H. Ji, A. Suhr, and Y. Zhang (2024) Training software engineering agents and verifiers with SWE-Gym. arXiv:2412.21139.
Qwen (2026a) Qwen3.5: towards native multimodal agents. https://qwen.ai/blog?id=qwen3.5
Qwen (2026b) Qwen3.6-plus: towards real world agents. https://qwen.ai/blog?id=qwen3.6
B. Rank, H. Bhatnagar, A. Prabhu, S. Eisenberg, K. Nguyen, M. Bethge, and M. Andriushchenko (2026) PostTrainBench: can LLM agents automate LLM post-training? arXiv:2603.08640.
G. Starace, O. Jaffe, D. Sherburn, J. Aung, J. S. Chan, L. Maksin, R. Dias, E. Mays, B. Kinsella, W. Thompson, J. Ahmad, T. Wang, T. Patwardhan, K. Shah, A. Mądry, L. Weng, and N. Chowdhury (2025) PaperBench: evaluating AI’s ability to replicate AI research. In International Conference on Machine Learning.
Tencent (2026) Hy3 preview: the first step in rebuilding the hy model. https://hy.tencent.com/research/hy3
X. Wang et al. (2024) OpenHands: an open platform for AI software developers as generalist agents. arXiv:2407.16741.
Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen (2024) MMLU-Pro: a more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track.
C. White, S. Dooley, M. Roberts, A. Pal, B. Feuer, S. Jain, R. Shwartz-Ziv, N. Jain, K. Saifullah, S. Dey, Shubh-Agrawal, S. S. Sandha, S. Naidu, C. Hegde, Y. LeCun, T. Goldstein, W. Neiswanger, and M. Goldblum (2025) LiveBench: a challenging, contamination-limited LLM benchmark. In International Conference on Learning Representations.
H. Wijk, T. Lin, J. Becker, S. Jawhar, N. Parikh, T. Broadley, L. Chan, M. Chen, J. Clymer, J. Dhyani, E. Ericheva, K. Garcia, B. Goodrich, N. Jurkovic, M. Kinniment, A. Lajko, S. Nix, L. Sato, W. Saunders, M. Taran, B. West, and E. Barnes (2024) RE-Bench: evaluating frontier AI R&D capabilities of language model agents against human experts. arXiv:2411.15114.
xAI (2026) Grok 4.20. https://www.mindstudio.ai/models/grok-4-20
Xiaomi (2026a) Xiaomi mimo-v2-pro. https://mimo.xiaomi.com/mimo-v2-pro
Xiaomi (2026b) Xiaomi mimo-v2.5-pro. https://mimo.xiaomi.com/mimo-v2-5-pro/
Xiaomi (2026c) Xiaomi mimo-v2.5. https://mimo.xiaomi.com/mimo-v2-5/
J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, and K. Narasimhan (2024) SWE-agent: agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems.
M. Zechner (2026) Pi-mono: ai agent toolkit. https://github.com/badlogic/pi-mono (GitHub repository).
A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie, et al. (2026) Glm-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763.
T. Y. Zhuo, M. C. Vu, J. Chim, H. Hu, W. Yu, R. Widyasari, I. N. B. Yusuf, H. Zhan, J. He, I. Paul, et al. (2025) Bigcodebench: benchmarking code generation with diverse function calls and complex instructions. In International Conference on Learning Representations, Vol. 2025, pp. 66602–66656.

Appendix A Task Specifications

A.1 Task Descriptions

This subsection lists all 36 tasks in AutoLab, grouped by category. Each entry gives the implementation language, a difficulty tier (1 = textbook-classic optimization with well-known techniques; 2 = bespoke, domain-specific, or research-style), and a short description.

System Optimization (15 tasks).

aes128_ctr (C, tier 1). Optimizes AES-128 CTR mode encryption of 256 MiB in single-threaded C using hardware acceleration and SIMD intrinsics. The challenge is maximizing throughput by leveraging AES-NI instructions and pipelining to minimize latency-bound operations.

agent_tool_routing (Python, tier 2). Optimizes a Python lexical router that returns the top-10 tool schemas for natural-language agent queries while satisfying $\mathrm{MRR}@10 \geq 0.82$ and $\mathrm{Recall}@10 \geq 0.94$. The challenge is replacing a full-scan baseline with a weighted inverted index using IDF and stopword filtering, all in single-threaded stdlib Python with no external packages.

bm25_search_go (Go, tier 2). Optimizes BM25 search engine queries over a synthetic corpus using goroutines and stdlib. The challenge is implementing efficient inverted indexes and ranking algorithms to minimize query latency while managing concurrent goroutine overhead.

bvh_raytracer (C++, tier 1). Builds a BVH for a 4 096-triangle scene and optimizes ray-triangle intersection for $638 \times 638$ primary rays. The challenge is constructing a cache-friendly BVH hierarchy and using SIMD vectorization to batch ray intersection tests while minimizing traversal overhead.

concurrent_kv_wal (Go, tier 2). Optimizes a write-ahead log (WAL) backed key-value store handling mixed read/write/query workloads from 4 concurrent goroutines. The challenge is reducing lock contention and I/O bottlenecks while maintaining consistency and durability guarantees.

fft_rust (Rust, tier 2). Optimizes the Fast Fourier Transform of a 32 768-length real signal in pure Rust. The challenge is implementing cache-efficient FFT algorithms with proper memory layout and minimizing complex number arithmetic operations.

flash_attention (C, tier 1). Optimizes scaled dot-product attention ($n = 4096$, $d = 64$) while avoiding full $n \times n$ matrix allocation through tiled computation. The challenge is implementing memory-efficient attention with SIMD softmax to minimize bandwidth and maintain numerical stability.
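The tiled computation mentioned for flash_attention relies on an online softmax: a running maximum and running normalizer are updated tile by tile, so the full $n \times n$ score matrix is never stored. Below is a minimal pure-Python sketch of that idea, written for clarity rather than speed; it is not the task's reference solution, and the function names and tile size are illustrative assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def attention_naive(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Reference: materialises one full row of scores per query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(s)
        p = [math.exp(x - m) for x in s]
        z = sum(p)
        out.append([sum(pi * v[j] for pi, v in zip(p, V)) / z
                    for j in range(len(V[0]))])
    return out


def attention_tiled(Q: Matrix, K: Matrix, V: Matrix, tile: int = 4) -> Matrix:
    """Online softmax: only `tile` scores are alive at any time."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        m = float("-inf")          # running max of scores seen so far
        l = 0.0                    # running sum of exp(score - m)
        acc = [0.0] * len(V[0])    # running weighted sum of V rows
        for start in range(0, len(K), tile):
            scores = [scale * sum(a * b for a, b in zip(q, k))
                      for k in K[start:start + tile]]
            m_new = max(m, max(scores))
            corr = math.exp(m - m_new)  # rescale old state to the new max
            l *= corr
            acc = [a * corr for a in acc]
            for s, v in zip(scores, V[start:start + tile]):
                p = math.exp(s - m_new)
                l += p
                acc = [a + p * x for a, x in zip(acc, v)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

Subtracting the running maximum before exponentiating is what keeps the computation numerically stable, and the rescaling factor `corr` is what makes the per-tile partial results combine into the same output as the naive version.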

gaussian_blur (C, tier 1). Applies a $17 \times 17$ Gaussian blur to a $4096 \times 4096$ image 5 times sequentially in single-threaded C. The challenge is exploiting separable filtering and SIMD vectorization while maintaining optimal cache locality across multiple passes.

hash_join (C, tier 1). Executes an inner equi-join of 20K and 5M row tables on an integer key using hash-join in single-threaded C. The challenge is building cache-friendly hash tables with minimal collisions and using SIMD prefetching to hide memory latency.

levenshtein_distance (C, tier 1). Optimizes a C levenshtein() routine to compute exact edit distances over 1 000 000 deterministic lowercase ASCII string pairs of lengths 16–64. The challenge is replacing the $O(l_a \cdot l_b)$ two-row DP with a Myers bit-parallel algorithm that packs each row into a uint64_t and performs $\sim$8 branchless 64-bit ops per column, while staying single-threaded and bit-exact.
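For levenshtein_distance, the contrast between the two-row DP and the bit-parallel approach can be sketched as follows. This is an illustrative Python version of the standard Myers/Hyyrö bit-vector recurrence for global edit distance, not the task's C reference; Python integers are unbounded, so the explicit `mask` plays the role that the fixed width of a `uint64_t` would play in C for strings of length at most 64.

```python
def levenshtein_dp(a: str, b: str) -> int:
    """Two-row dynamic programme, O(len(a) * len(b)) time."""
    prev = list(range(len(a) + 1))
    for j, cb in enumerate(b, 1):
        cur = [j]
        for i, ca in enumerate(a, 1):
            cur.append(min(prev[i] + 1, cur[i - 1] + 1,
                           prev[i - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_myers(a: str, b: str) -> int:
    """Bit-parallel edit distance: one column of the DP per loop step."""
    m = len(a)
    if m == 0:
        return len(b)
    peq = {}                       # per-character match bitmasks of `a`
    for i, ch in enumerate(a):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    mask = (1 << m) - 1
    last = 1 << (m - 1)
    pv, mv, score = mask, 0, m     # vertical +1 / -1 deltas, D[m][0] = m
    for ch in b:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 deltas
        mh = pv & xh                    # horizontal -1 deltas
        if ph & last:
            score += 1
        elif mh & last:
            score -= 1
        ph = ((ph << 1) | 1) & mask     # top row grows by 1 per column
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score
```

The loop body contains no data-dependent branching apart from the score update, which is why a C version with 64-bit words maps to a handful of integer instructions per column.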

radix_sort (C, tier 1). Sorts 50M random uint32 values as fast as possible using radix sort in single-threaded C. The challenge is maximizing memory bandwidth efficiency through careful radix digit selection, buffering strategies, and minimizing cache misses.
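As a reference point for the radix_sort entry, a least-significant-digit radix sort over 32-bit keys can be sketched in a few lines. The 8-bit digit width (four passes) is an assumption chosen for illustration; the task description leaves digit selection to the agent, and a competitive C version would differ substantially in buffering and memory layout.

```python
from typing import Iterable, List


def radix_sort_u32(values: Iterable[int], bits: int = 8) -> List[int]:
    """LSD radix sort for unsigned 32-bit integers (stable per pass)."""
    mask = (1 << bits) - 1
    src = list(values)
    for shift in range(0, 32, bits):
        # Histogram of the current digit.
        counts = [0] * (mask + 1)
        for v in src:
            counts[(v >> shift) & mask] += 1
        # Exclusive prefix sum turns counts into starting offsets.
        total = 0
        for d in range(len(counts)):
            counts[d], total = total, total + counts[d]
        # Stable scatter into the destination buffer.
        dst = [0] * len(src)
        for v in src:
            d = (v >> shift) & mask
            dst[counts[d]] = v
            counts[d] += 1
        src = dst
    return src
```

Each pass reads and writes the whole array once, so the digit width trades the number of passes against the size of the histogram and the scatter's cache behaviour.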

regex_engine (Rust, tier 2). Compiles regex patterns and searches 100K haystacks using pure Rust without external crates. The challenge is building efficient NFA/DFA structures with backtracking optimization to minimize pattern matching latency across large text volumes.

sha256_throughput (C, tier 1). Hashes a 512 MiB buffer with SHA-256 as fast as possible in single-threaded C using intrinsics. The challenge is maximizing throughput by exploiting parallelism within SHA-256 compression rounds and maintaining optimal instruction pipelining.

sstable_compaction_rs (Rust, tier 2). Optimizes LSM-style SSTable compaction pipeline merging sorted tables in pure Rust. The challenge is implementing efficient multi-way merge with prefix compression and block boundary optimization to minimize I/O and CPU overhead.

z_order_range_scan (Rust, tier 2). Optimizes a Rust 2D rectangular range-count index over a 16-bit coordinate multiset answering many count_in(xlo, ylo, xhi, yhi) queries per case. The challenge is replacing the linear per-query scan with a Z-order/Morton spatial key sort plus a skip-step forward scan that leaps over coordinate runs outside the query rectangle, all in safe stdlib Rust with no threading or unsafe.
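The Z-order key used in z_order_range_scan interleaves the bits of the two 16-bit coordinates into one 32-bit key. The sketch below shows the key construction and a bounded scan over sorted keys; it assumes inclusive rectangle bounds and deliberately omits the skip-step that jumps over out-of-rectangle runs, so it illustrates the data layout rather than the full optimization.

```python
import bisect
from typing import List, Tuple


def part1by1(x: int) -> int:
    """Spread the low 16 bits of x so they occupy the even bit positions."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x


def compact1by1(z: int) -> int:
    """Inverse of part1by1: gather the even bit positions."""
    z &= 0x55555555
    z = (z | (z >> 1)) & 0x33333333
    z = (z | (z >> 2)) & 0x0F0F0F0F
    z = (z | (z >> 4)) & 0x00FF00FF
    z = (z | (z >> 8)) & 0x0000FFFF
    return z


def morton_encode(x: int, y: int) -> int:
    return part1by1(x) | (part1by1(y) << 1)


def morton_decode(z: int) -> Tuple[int, int]:
    return compact1by1(z), compact1by1(z >> 1)


def count_in(sorted_keys: List[int], xlo: int, ylo: int,
             xhi: int, yhi: int) -> int:
    """Count points in the inclusive rectangle (no skip-step)."""
    lo = bisect.bisect_left(sorted_keys, morton_encode(xlo, ylo))
    hi = bisect.bisect_right(sorted_keys, morton_encode(xhi, yhi))
    n = 0
    for z in sorted_keys[lo:hi]:
        x, y = morton_decode(z)
        if xlo <= x <= xhi and ylo <= y <= yhi:
            n += 1
    return n
```

The bounded scan is valid because the Morton key is monotone in each coordinate, so every point inside the rectangle has a key between the keys of its lower-left and upper-right corners; points in that key range that fall outside the rectangle are what the skip-step is meant to avoid visiting.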

Puzzle and Challenge (10 tasks).

adaptive_compression (Python, tier 2). Implements a byte-level Predictor in /app/predictor.py whose predict/update loop minimizes average bits-per-byte across 9 hidden families of sequences (Markov chains, periodic, bracketed, recurrences, regime switches, RLE, random), with a 5.0 bpb baseline (order-1 Markov) and a 3.8 bpb reference (PPM-style blended context mixer with match model and period detector). The challenge is that no single model wins on every family, so the agent must build an adaptive context-mixing compressor that tracks higher-order statistics, detects periodicity, and switches regimes online without ever emitting an invalid distribution.

adversarial_splay (Python, tier 2). Writes 4 096 key accesses (each in $[0, 1023]$) to /app/accesses.txt that maximize the rotation count of a deterministic top-down Sleator–Tarjan splay tree built by inserting $0..1023$ in order, against a uniform-random baseline of 48 656 rotations and a bit-reversal-cycle reference of 67 008 rotations. The challenge is to design a sequence that simultaneously defeats the Sequential, Working-Set, and Dynamic-Finger access bounds, which requires reasoning about how zig / zig-zig / zig-zag steps reshape the tree across the full 4 096-step amortized trajectory.

discover_sorting (Python, tier 1). Implements generate_network(16) in /app/solve.py to emit a correct 16-input sorting network with as few comparators as possible, against a Batcher bitonic baseline of 80 comparators and a known-optimal reference of 60 (Dobbelaere 2019). The challenge is that constructing networks below the bitonic count requires either reproducing the literature’s hand-tuned optimal layouts or running a careful combinatorial search verified by the 0–1 principle over all 65 536 binary inputs.
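The two ingredients named in discover_sorting, a bitonic baseline and 0–1 principle verification, can be sketched together. The generator below emits a bitonic network in sorting-network form (every comparator places the smaller value on the lower wire), and the checker evaluates all $2^n$ binary inputs at once by storing one big-integer bitmask per wire. Function names are illustrative and the task's actual verifier may work differently.

```python
from typing import List, Tuple

Comparator = Tuple[int, int]  # (low wire, high wire)


def bitonic_network(n: int) -> List[Comparator]:
    """Bitonic sorting network for n a power of two."""
    comps: List[Comparator] = []
    k = 2
    while k <= n:
        # Merge step: compare each wire with its mirror inside the block.
        for start in range(0, n, k):
            for i in range(k // 2):
                comps.append((start + i, start + k - 1 - i))
        # Half-cleaners on progressively smaller blocks.
        j = k // 4
        while j >= 1:
            for start in range(0, n, 2 * j):
                for i in range(j):
                    comps.append((start + i, start + i + j))
            j //= 2
        k *= 2
    return comps


def sorts_all_binary_inputs(n: int, comps: List[Comparator]) -> bool:
    """0-1 principle: a network sorts everything iff it sorts all 0/1 inputs.

    Bit x of wires[i] holds the value on wire i for binary input x.
    """
    size = 1 << n
    wires = [
        int("".join("1" if (x >> i) & 1 else "0"
                    for x in range(size - 1, -1, -1)), 2)
        for i in range(n)
    ]
    for i, j in comps:
        a, b = wires[i], wires[j]
        wires[i], wires[j] = a & b, a | b   # min on wire i, max on wire j
    # Sorted means: no input leaves a 1 above a 0 on adjacent wires.
    return all((wires[i] & ~wires[i + 1]) == 0 for i in range(n - 1))


net16 = bitonic_network(16)
print(len(net16), sorts_all_binary_inputs(16, net16))
```

Because each comparator becomes one AND and one OR over 65 536-bit integers, checking a 16-input candidate costs a few hundred big-integer operations, which makes it cheap enough to call inside a search loop.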

fredkin_sort_network (custom, tier 2). Rewrites /app/circuit.txt to implement a 4-input 2-bit-wide stable reversible sorting network using only x/cx/ccx/fred gates, restoring all scratch wires to zero, against a 128-gate baseline and an 88-gate streamlined reference. The challenge is that reversible computation forbids destructive overwrites, so every comparator must be built from carefully shared intermediate terms and then fully uncomputed, making gate reduction a tight routing-and-cleanup puzzle.

resnet_bit_flip (Python, tier 2). Implements find_bit_flips() in /app/solve.py to return the smallest set of float32 bit edits to a frozen $\sim$77K-parameter MiniResNet that drives CIFAR-10 top-1 accuracy below 12% (baseline 95 flips via top-$K$ sign-bit flipping, reference 40 flips via 1P-DNL $|w| \cdot |\nabla|$ saliency from Guzmán et al. 2025). The challenge is that the $|w| \leq 10$ magnitude cap and finite-output requirement rule out brute-force exponent attacks, so the agent must combine gradient/saliency analysis with careful search to identify the few weight bits whose flip cascades through the network into class collapse.

safety_router (Python, tier 1). Optimizes /app/train.py to train a 2-layer MLP refusal router with as few parameters as possible while passing private-split gates of accuracy $\geq 0.64$, unsafe recall $\geq 0.66$, and safe recall $\geq 0.57$, against a 16 641-parameter baseline (128-hidden) and a 2 081-parameter reference. The challenge is shrinking the hidden width and tuning thresholds aggressively without violating the asymmetric refusal-vs-answer recall gates, since under-compressing wastes parameters but over-compressing collapses one of the class recalls below threshold and zeroes the reward.
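The two parameter counts quoted for safety_router are consistent with a simple accounting, shown below as a sketch. The input dimension of 128, the single output logit, and the hidden width of 16 for the reference are inferences from the arithmetic, not values stated in the task description.

```python
def mlp_params(d_in: int, hidden: int, d_out: int = 1) -> int:
    """Weights plus biases of a 2-layer MLP: d_in -> hidden -> d_out."""
    return d_in * hidden + hidden + hidden * d_out + d_out


# Assumed input dimension 128 and one output logit (inferred, not stated).
baseline = mlp_params(128, 128)   # 128-hidden baseline
reference = mlp_params(128, 16)   # hidden width 16 is an inference
print(baseline, reference)
```

Under these assumptions the count is $130h + 1$ for hidden width $h$, so almost all of the savings come from shrinking $h$.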

smallest_game_player (Python, tier 2). Implements train()/predict() in /app/solve.py to learn optimal moves for $4 \times 4$ gravity Connect-3 with as few learnable parameters as possible while keeping hidden-test accuracy $\geq 95\%$, against a 17 924-parameter tiny-transformer baseline and a 913-parameter weight-shared per-column scoring reference. The challenge is that explicit minimax/search and hardcoded lookup tables are forbidden, so the agent must hand-design tactical features and a weight-shared scoring head compact enough to encode the game’s tactics in under a thousand parameters.

stack_machine_golf (custom, tier 2). Rewrites /app/program.stk to compute a 256-element integer dot product on a register-less stack VM in as few executed instructions as possible, against a 5 132-instruction memory-scratched-loop baseline and a 3 530-instruction $4\times$-unrolled stack-resident-accumulator reference. The challenge is that the pure stack ISA has no registers, so every loop-counter and pointer manipulation costs extra dup/swap/over instructions, and instruction-count gains require carefully unrolling and keeping the running accumulator on the stack across the loop body.

toy_isa_opt (custom, tier 2). Rewrites /app/program.s to compute a 512-element integer dot product on the PINC scoreboard pipeline in as few simulated cycles as possible, against a 9 220-cycle naive single-accumulator baseline and a 2 954-cycle reference using $4\times$ unrolling with 4 independent accumulators and interleaved loads. The challenge is that the in-order pipeline stalls on RAW hazards through 5-cycle loads and 5-cycle muls, so cycle reduction requires software pipelining, multi-accumulator unrolling, and careful instruction interleaving to fully hide latency.

vliw_scheduler (custom, tier 1). Implements vliw_schedule() in /app/solve.c to pack 3 000 adversarially ordered ALU/MUL/MEM ops (ending in 60 chains of 10 dependent muls) into 3-slot VLIW bundles obeying 1/3/4-cycle latencies, against a 4 080-cycle one-op-per-bundle baseline and a 1 300-cycle critical-path list-scheduling reference. The challenge is that the long MUL chains form the critical path but appear last in input order, so a competitive scheduler must build the full dependence DAG, compute latency-weighted heights, and use priority-driven list scheduling to fill all three slots per cycle.
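The approach described for vliw_scheduler (dependence DAG, latency-weighted heights, priority-driven list scheduling) can be sketched on a toy instance. This version assumes that any op may occupy any of the three slots, that the 1/3/4-cycle latencies map to ALU/MUL/MEM in that order, and that ops are given in topological order; the task's real slot rules and input format are not specified here, and the example DAG is hypothetical.

```python
from typing import List, Sequence

# Assumed mapping of the 1/3/4-cycle latencies to op kinds.
LATENCY = {"ALU": 1, "MUL": 3, "MEM": 4}


def list_schedule(kinds: Sequence[str], deps: Sequence[Sequence[int]],
                  width: int = 3) -> List[int]:
    """Return the issue cycle of each op.

    deps[i] lists the ops whose results op i consumes; every dependency
    index is assumed to be smaller than i (topological input order).
    """
    n = len(kinds)
    succs: List[List[int]] = [[] for _ in range(n)]
    for i, ds in enumerate(deps):
        for d in ds:
            succs[d].append(i)
    # Height = longest latency-weighted path from this op to any sink.
    height = [0] * n
    for i in reversed(range(n)):
        height[i] = LATENCY[kinds[i]] + max(
            (height[s] for s in succs[i]), default=0)
    issue = [-1] * n
    remaining = set(range(n))
    cycle = 0
    while remaining:
        ready = [
            i for i in remaining
            if all(issue[d] >= 0 and issue[d] + LATENCY[kinds[d]] <= cycle
                   for d in deps[i])
        ]
        ready.sort(key=lambda i: (-height[i], i))  # critical path first
        for i in ready[:width]:
            issue[i] = cycle
            remaining.discard(i)
        cycle += 1
    return issue


# Toy input: six independent ALU ops, then two chains of three MULs.
kinds = ["ALU"] * 6 + ["MUL"] * 6
deps = [[] for _ in range(6)] + [[], [6], [7], [], [9], [10]]
issue = list_schedule(kinds, deps)
makespan = max(issue[i] + LATENCY[kinds[i]] for i in range(len(kinds)))
print(issue, makespan)
```

Even though the MUL chains appear last in the input, their heights put them at the front of the ready list, so the chain heads issue in cycle 0 and the ALU ops fill the remaining slots around them.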

Model Development (7 tasks).

data_select_ifevalPython, tier 2Selects up to 5 000 training samples from a 50 000-sample pool spanning 19 sources to maximize IFEval prompt-strict accuracy after LoRA fine-tuning Qwen2.5-3B-Instruct on a single H100 within 8 hours. The challenge is that the base model is already instruction-tuned so most pool data is unhelpful or harmful, and the agent must reason about which sources (persona_if, coconot, wildchat, math, code, multilingual, safety) actually transfer to format-constraint following without being able to inspect IFEval directly.

flux2_klein_loraPython, tier 1Trains a LoRA adapter on the FLUX.2 klein 9B DiT with musubi-tuner over 15 concept images on a single L40S (48 GB) within 4 hours, optimizing a composite of CLIP image similarity, DINO structural similarity, and CLIP text alignment across 8 eval prompts. The challenge is that the shipped config OOMs and the agent must jointly fix the OOM (gradient checkpointing, blocks-to-swap, batch size, resolution), correct the model version, and tune LoRA rank, learning rate, and timestep sampling for FLUX.2’s flow-matching schedule.

grpo_multisourcePython, tier 1Fine-tunes Qwen2.5-VL-7B (4-bit) with GRPO over Geometry3K, MathVision, and ChartQA to maximize MathVista accuracy on a single L40S in 8 hours, subject to a retention gate that zeros the score if general VQA drops more than 10%. The challenge is mixing three heterogeneous visual-math sources, designing reward functions that are robust to formatting variance, and tuning GRPO hyperparameters (lr, num_generations, LoRA rank, KL) so the policy generalizes to MathVista without forgetting base capabilities.

llm_online_serving (Python, tier 2). Optimizes the SimpleLLM online serving engine for gpt-oss-20b (21B MoE, 3.6B active) on a single H100 to maximize a 50/50 blend of throughput and inverse mean completion time over 96 Poisson-arriving requests within 2 hours. The challenge is editing only the engine layer (continuous batching, slot management, async request queue) while kernels and the model are frozen, which forces gains from CUDA graph capture across batch sizes, prefill chunk budgeting, and scheduler tuning while still passing a correctness test.

moving_mnist_world_model (Python, tier 1). Trains a video next-frame prediction model from scratch on 8 000 Moving MNIST clips (20 frames, $64{\times}64$) to maximize PSNR over autoregressive 10-step rollouts on a held-out test split, using a single H100 within 4 hours. The challenge is selecting an architecture and training schedule that produces sharp multi-step rollouts without ground-truth conditioning, since naive ConvLSTMs blur quickly under autoregressive error accumulation and the test split is unseen at design time.

multilingual_ocr (Python, tier 1). Fine-tunes DeepSeek-OCR (3B) with Unsloth LoRA on 9 000 mixed Persian and Bengali synthetic-print images to minimize average character error rate across 400 held-out images on a single L40S within 8 hours. The challenge is balancing two scripts with very different glyph shapes under a single LoRA, choosing rank, learning rate, and step count to escape a weak $r{=}16$ / 60-step baseline without overfitting on the synthetic distribution.

scaling_law (Python, tier 1). Pretrains a transformer language model from scratch on WikiText-103 (∼118M tokens, GPT-2 tokenizer) with LitGPT to minimize test perplexity on a single H100 within 12 hours, with eval fixed at seq_length=1024. The challenge is choosing a compute-optimal point in the (model size, tokens, steps) space, modernizing the architecture (RMSNorm, SwiGLU, GQA, RoPE), and configuring bf16 plus torch.compile, cosine LR with warmup, and batch settings to close the perplexity gap from a ∼95 pythia-14m baseline to a ∼23 reference at ∼179M Llama-style parameters.

CUDA (4 tasks).

huffman_canonical_decode_cuda (CUDA, tier 2). Decodes 2 048 independent canonical-Huffman bitstreams (each 65 536 bytes of payload, max 16-bit codes) on a single H100 by writing a custom kernel with shared-memory primary/fallback LUTs, warp cooperation, and rolling bit registers. The challenge is that canonical Huffman decoding is inherently serial within a stream (each symbol’s length depends on the previous), so reaching the reference (∼3 ms vs. 56 ms baseline) requires multi-level shared-memory LUTs, warp-cooperative output packing, and unrolled multi-symbol-per-iteration loops without using Thrust, CUB, or any compression library.
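
To make the lookup-table idea concrete, here is a small serial sketch in Python. It is not the reference kernel: it uses a single flat table indexed by the next max_len bits instead of the primary/fallback split, and it ignores warp cooperation entirely. All function names are hypothetical and the bit order (most significant bit first) is an assumption.

```python
def canonical_codes(lengths):
    """Assign canonical codes from {symbol: code_length}."""
    codes, code, prev_len = {}, 0, 0
    # Canonical order: shorter codes first, ties broken by symbol.
    for sym, ln in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= ln - prev_len
        codes[sym] = (code, ln)
        code += 1
        prev_len = ln
    return codes


def build_lut(codes):
    """Flat table: every max_len-bit window maps to (symbol, code_length)."""
    max_len = max(ln for _, ln in codes.values())
    table = [None] * (1 << max_len)
    for sym, (code, ln) in codes.items():
        base = code << (max_len - ln)
        for i in range(1 << (max_len - ln)):
            table[base + i] = (sym, ln)
    return table, max_len


def encode(symbols, codes):
    """Pack codes MSB-first into one integer; returns (bits, bit_count)."""
    bits = nbits = 0
    for s in symbols:
        code, ln = codes[s]
        bits = (bits << ln) | code
        nbits += ln
    return bits, nbits


def decode(bits, nbits, table, max_len, count):
    """Decode `count` symbols with one table lookup per symbol."""
    bits <<= max_len                  # zero padding so the last peek is in range
    total = nbits + max_len
    mask = (1 << max_len) - 1
    pos, out = 0, []
    for _ in range(count):
        window = (bits >> (total - pos - max_len)) & mask
        sym, ln = table[window]
        out.append(sym)
        pos += ln                     # the serial dependency: next offset needs ln
    return out


codes = canonical_codes({"a": 1, "b": 2, "c": 3, "d": 3})
table, max_len = build_lut(codes)
message = list("abacadab")
bits, nbits = encode(message, codes)
print(decode(bits, nbits, table, max_len, len(message)) == message)
```

The `pos += ln` line is where the serial dependence shows up: the offset of the next symbol is unknown until the current lookup completes.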

icp_correspondence_step_cuda (CUDA, tier 2). Runs one Iterative-Closest-Point correspondence step on a single H100 over a source cloud of $N{=}200\,000$ and target cloud of $M{=}500\,000$ fp32 points, returning the $3{\times}3$ cross-covariance $H$, squared-error sum, and valid-pair count using a prebuilt KD-tree with $d_{\mathrm{max}}=0.05$ distance gating. The challenge is that brute-force $O(NM)$ nearest-neighbor (NN) search is far too slow (∼65 ms baseline), so reaching the ∼0.28 ms reference requires register-stack KD-tree DFS with AABB pruning and ordered traversal, on-GPU centroid computation, and warp-cooperative fp64 covariance reduction with __shfl_down_sync, all without Thrust/CUB/cuBLAS or any spatial-index library.

msm_pippenger_bls12_381_cuda (CUDA, tier 2). Computes a BLS12-381 G1 multi-scalar multiplication $Q=\sum s_{i}P_{i}$ over $N{=}262\,144$ affine points (Montgomery-form 384-bit coordinates) and 256-bit scalars on a single H100, the dominant kernel in zkSNARK provers like Groth16 and PLONK. The challenge is that naive double-and-add is ∼9× slower than reference (532 ms vs. 59 ms), so reaching the target requires implementing the Pippenger bucket method ($c{=}8$, 32 windows × 255 buckets) with chunked accumulate / tree-reduce / descending running-sum window combine and CIOS Montgomery field arithmetic in inline PTX, with no third-party finite-field, ECC, or zkSNARK library and no in-kernel allocations.
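
The bucket method itself is independent of the curve, so its structure can be shown with a toy group. The sketch below is an assumption-laden illustration, not the task's kernel: plain Python integers under addition stand in for G1 points, so "point addition" is `+` and the identity is 0. It keeps the parameters quoted above ($c{=}8$, 32 windows of 255 live buckets for 256-bit scalars); the function names are hypothetical.

```python
import random


def msm_naive(scalars, points):
    """Reference result: sum of s_i * P_i in the toy group."""
    return sum(s * p for s, p in zip(scalars, points))


def msm_pippenger(scalars, points, c=8, scalar_bits=256):
    """Bucket-method MSM using only group additions and doublings."""
    num_windows = (scalar_bits + c - 1) // c      # 32 windows for c = 8
    mask = (1 << c) - 1                           # 255 non-zero digits
    window_sums = []
    for w in range(num_windows):
        # Accumulate: each point goes into the bucket of its c-bit digit.
        buckets = [0] * (mask + 1)                # bucket 0 is never used
        for s, p in zip(scalars, points):
            digit = (s >> (w * c)) & mask
            if digit:
                buckets[digit] += p
        # Descending running sum yields sum_d d * buckets[d] without
        # any scalar multiplication.
        running = total = 0
        for d in range(mask, 0, -1):
            running += buckets[d]
            total += running
        window_sums.append(total)
    # Window combine: from the most significant window down,
    # double c times and add the next window sum.
    acc = 0
    for total in reversed(window_sums):
        for _ in range(c):
            acc = acc + acc
        acc += total
    return acc


random.seed(0)
scalars = [random.getrandbits(256) for _ in range(50)]
points = [random.getrandbits(64) for _ in range(50)]   # toy "points"
print(msm_pippenger(scalars, points) == msm_naive(scalars, points))
```

With real curve points the additions become elliptic-curve additions over the base field, which is where the Montgomery arithmetic mentioned above comes in.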

ntt_butterfly_cuda (CUDA, tier 1). Applies an in-place batched forward Number Theoretic Transform over the Goldilocks prime field ($p=2^{64}-2^{32}+1$) on a single H100 for batch=256 rows of $n{=}65\,536$ uint64 elements, producing natural-order output bit-exactly matching the reference. The challenge is that a textbook iterative Cooley–Tukey kernel with per-thread twiddle re-derivation and integer modulo runs ∼85× slower than reference (110 ms vs. 1.28 ms), so reaching the target requires a precomputed twiddle table, shared-memory tiled butterflies for the early stages with global-memory butterflies for the rest, and Goldilocks-specific 128-to-64-bit modular reduction without integer division, all without cuFFT, Thrust, CUB, or any FFT library.
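
The division-free reduction rests on two identities that follow directly from $p=2^{64}-2^{32}+1$: $2^{64}\equiv 2^{32}-1$ and $2^{96}\equiv -1 \pmod p$. A hedged Python sketch of the arithmetic is shown below; the function names are hypothetical, and because Python integers are unbounded, the carry and borrow handling that a uint64 kernel needs is replaced here by two conditional corrections.

```python
P = 2**64 - 2**32 + 1          # Goldilocks prime
MASK32 = 2**32 - 1
MASK64 = 2**64 - 1


def reduce128(x):
    """Reduce 0 <= x < 2**128 modulo P without division or %."""
    lo = x & MASK64             # bits 0..63
    hi_lo = (x >> 64) & MASK32  # bits 64..95, weight 2^64 == 2^32 - 1
    hi_hi = x >> 96             # bits 96..127, weight 2^96 == -1
    t = lo + hi_lo * MASK32 - hi_hi
    # t lies in (-2^32, 2P), so one correction in each direction suffices.
    if t < 0:
        t += P
    if t >= P:
        t -= P
    return t


def mul_mod(a, b):
    """Field multiplication for 0 <= a, b < P."""
    return reduce128(a * b)


print(mul_mod(P - 1, P - 1) == pow(P - 1, 2, P))
```

A butterfly then needs one `mul_mod` by the twiddle factor plus a modular add and subtract, none of which involve an integer division.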

A.2 Per-Task Scoring Anchors and Gates

The two scoring schemes, anchored linear and log-stretch, are formally defined in Section 2.1. Both schemes are clipped to the interval $[0,1]$ and saturate to 0 whenever the agent fails the correctness check. In this section we provide, for every task, the specific metric, its direction (↓ for lower-is-better, ↑ for higher-is-better), the baseline anchor $m_{\mathcal{B}}$, the reference anchor $m_{\mathcal{R}}$, and any task-specific feasibility gate that must be satisfied before a positive score is awarded.

Two parameter-count tasks in the Puzzle & Challenge category (smallest_game_player and safety_router) employ the degenerate linear form $s(x)=\mathrm{clip}\big((m_{\mathcal{B}}-m(x))/m_{\mathcal{B}},0,1\big)$, which is mathematically equivalent to anchored linear scoring with an implicit reference anchor of $m_{\mathcal{R}}=0$ (zero parameters). The $m_{\mathcal{R}}$ values listed for these tasks in Table 3 therefore represent the documented strong solution rather than the scoring anchor itself. The resnet_bit_flip task additionally imposes a feasibility gate requiring the corrupted model’s accuracy to fall below 12% before any positive reward is granted. All system-optimization and CUDA tasks use log-stretch scoring on a speed-up metric with a “must beat baseline” gate.
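
As a worked illustration of the linear family, the sketch below scores a metric against its two anchors. The general form used here, $\mathrm{clip}\big((m_{\mathcal{B}}-m)/(m_{\mathcal{B}}-m_{\mathcal{R}}),0,1\big)$, is an assumption inferred from the degenerate case above (setting $m_{\mathcal{R}}=0$ recovers it); the authoritative definition is the one in Section 2.1. The function name and the `feasible` flag are hypothetical, and log-stretch scoring is deliberately not reproduced.

```python
def anchored_linear(m, m_base, m_ref, feasible=True):
    """Assumed anchored linear score: 0 at the baseline, 1 at the reference.

    `feasible` stands in for the correctness check and any task gate;
    a failed gate yields 0 regardless of the metric.
    """
    if not feasible:
        return 0.0
    raw = (m_base - m) / (m_base - m_ref)
    return min(1.0, max(0.0, raw))


# discover_sorting anchors from Table 3: baseline 80, reference 60 comparators.
print(round(anchored_linear(63, 80, 60), 2))       # 63 comparators

# Degenerate form for smallest_game_player: implicit reference anchor of 0.
print(round(anchored_linear(913, 17924, 0), 3))    # the documented 913-parameter solution
```

Because numerator and denominator change sign together, the same expression also covers a higher-is-better metric whose reference lies above its baseline.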

Table 2: System-optimization (15 tasks) and CUDA (4 tasks), all using log-stretch scoring with a “must beat baseline” improvement gate. The values $m_{\mathcal{B}}$ and $m_{\mathcal{R}}$ are the empirically measured runtimes of the baseline and reference implementations on the benchmark’s standardized hardware and sandbox environments (seconds for system-optimization tasks, milliseconds for CUDA tasks).

| Task | Metric | Dir. | $m_{\mathcal{B}}$ | $m_{\mathcal{R}}$ | Category |
|---|---|---|---|---|---|
| aes128_ctr | runtime_seconds | ↓ | 3.0 | 0.10 | system-opt |
| agent_tool_routing | runtime_seconds | ↓ | 3.85 | 0.40 | system-opt |
| bm25_search_go | runtime_seconds | ↓ | 2.1 | 0.03 | system-opt |
| bvh_raytracer | runtime_seconds | ↓ | 3.8 | 0.030 | system-opt |
| concurrent_kv_wal | runtime_seconds | ↓ | 9.5 | 1.1 | system-opt |
| fft_rust | runtime_seconds | ↓ | 10.0 | 0.001 | system-opt |
| flash_attention | runtime_seconds | ↓ | 0.75 | 0.10 | system-opt |
| gaussian_blur | runtime_seconds | ↓ | 12.0 | 0.25 | system-opt |
| hash_join | runtime_seconds | ↓ | 20.0 | 0.04 | system-opt |
| levenshtein_distance | runtime_seconds | ↓ | 2.0845 | 0.3107 | system-opt |
| radix_sort | runtime_seconds | ↓ | 4.5 | 0.35 | system-opt |
| regex_engine | runtime_seconds | ↓ | 1.5 | 0.37 | system-opt |
| sha256_throughput | runtime_seconds | ↓ | 2.5 | 0.15 | system-opt |
| sstable_compaction_rs | runtime_seconds | ↓ | 0.099 | 0.041 | system-opt |
| z_order_range_scan | runtime_seconds | ↓ | 2.0 | 0.02 | system-opt |
| huffman_canonical_decode_cuda | runtime_ms | ↓ | 56.0 | 3.112 | CUDA |
| icp_correspondence_step_cuda | runtime_ms | ↓ | 64.65 | 0.28 | CUDA |
| msm_pippenger_bls12_381_cuda | runtime_ms | ↓ | 532.0 | 58.94 | CUDA |
| ntt_butterfly_cuda | runtime_ms | ↓ | 109.8 | 1.28 | CUDA |

Table 3: Puzzle-and-challenge tasks (10). All use anchored linear scoring unless otherwise noted; $m_{\mathcal{B}}$ and $m_{\mathcal{R}}$ denote the baseline and reference targets.

| Task | Metric | Dir. | $m_{\mathcal{B}}$ | $m_{\mathcal{R}}$ | Scoring family |
|---|---|---|---|---|---|
| discover_sorting | comparator_count | ↓ | 80 | 60 | anchored linear |
| fredkin_sort_network | gate_count | ↓ | 128 | 88 | anchored linear |
| stack_machine_golf | instruction_count | ↓ | 5 132 | 3 530 | anchored linear |
| toy_isa_opt | cycles | ↓ | 9 220 | 2 954 | anchored linear |
| vliw_scheduler | cycles | ↓ | 4 080 | 1 300 | anchored linear |
| smallest_game_player | total_params | ↓ | 17 924 | 913 | anchored linear (implicit $m_{\mathcal{R}}=0$) |
| safety_router | total_params | ↓ | 16 641 | 2 081 | anchored linear (implicit $m_{\mathcal{R}}=0$) |
| resnet_bit_flip | bits_flipped | ↓ | 81 | 1 | anchored linear (12% accuracy gate) |
| adaptive_compression | bits_per_byte | ↓ | 5.0 | 3.8 | log-stretch, 5% gate |
| adversarial_splay | rotations | ↑ | 48 656 | 67 008 | log-stretch, 1% gate |

Table 4: Model-development tasks (7), all using anchored linear scoring. † For flux2_klein_lora, the baseline anchor is the empirically measured no-LoRA quality score (≈0.49); the task configuration lists 0.0 as the OOM-crash floor. ‡ For llm_online_serving, the metric is a 50/50 throughput/latency composite that equals 1.0 at the baseline by construction.

Appendix B More on Experiments

B.1 More on Experimental Setups

Table 5 summarizes the developing organizations and API providers for all models evaluated on AutoLab.

Table 5: Models evaluated in AutoLab, with their developing organization and API provider.

| Organization | Model | API Provider |
|---|---|---|
| Main set | | |
| Alibaba | qwen-3.6-plus | TokenRouter |
| Anthropic | claude-opus-4.6 | Azure AI Foundry |
| DeepSeek | deepseek-v4-pro | TokenRouter |
| Google DeepMind | gemini-3.1-pro | TokenRouter |
| MiniMax | minimax-m2.7 | MiniMax |
| Moonshot AI | kimi-k2.6 | Cloudflare |
| OpenAI | gpt-5.4 | Azure AI Foundry |
| Tencent | hunyuan-3-preview | OpenRouter |
| xAI | grok-4-20 | xAI |
| Xiaomi | mimo-v2.5-pro | Xiaomi |
| Zhipu AI | glm-5 | TokenRouter |
| Ablation set | | |
| Alibaba | qwen-3.5-plus | TokenRouter |
| DeepSeek | deepseek-v4-flash | TokenRouter |
| MiniMax | minimax-m2.5 | MiniMax |
| Moonshot AI | kimi-k2.5 | Azure AI Foundry |
| Xiaomi | mimo-v2-pro | Xiaomi |
| Xiaomi | mimo-v2.5 | Xiaomi |

B.2 Detailed Experimental Results

Tables 6 and 7 report the per-task average score (Avg@3) and best score (Best@3) across three independent trials for each of the 11 frontier models in our main set, grouped by sub-domain. Cell shading uses the same diverging green to pink scale as Table 1, centred at 0.5, so individual task strengths and weaknesses are visible at a glance.

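| Task | Claude Opus 4.6 | Gemini 3.1 Pro | Kimi K2.6 | MiMo V2.5 Pro | GLM 5 | DeepSeek V4 Pro | GPT 5.4 | Grok 4-20 | Hunyuan 3 Preview | MiniMax M2.7 | Qwen 3.6 Plus |
|---|---|---|---|---|---|---|---|---|---|---|---|
| System Optimization (15 tasks) | | | | | | | | | | | |
| aes128_ctr | 0.75 | 0.69 | 0.70 | 0.65 | 0.70 | 0.37 | 0.55 | 0.52 | 0.67 | 0.00 | 0.03 |
| agent_tool_routing | 0.65 | 0.43 | 0.37 | 0.30 | 0.45 | 0.25 | 0.13 | 0.22 | 0.15 | 0.36 | 0.38 |
| bm25_search_go | 0.83 | 0.54 | 0.64 | 0.51 | 0.51 | 0.60 | 0.50 | 0.51 | 0.32 | 0.50 | 0.56 |
| bvh_raytracer | 0.44 | 0.39 | 0.41 | 0.40 | 0.28 | 0.37 | 0.13 | 0.36 | 0.37 | 0.07 | 0.11 |
| concurrent_kv_wal | 0.96 | 0.76 | 0.56 | 0.47 | 0.63 | 0.83 | 0.53 | 0.53 | 0.32 | 0.28 | 0.25 |
| fft_rust | 0.00 | 0.55 | 0.39 | 0.36 | 0.53 | 0.57 | 0.52 | 0.53 | 0.37 | 0.35 | 0.55 |
| flash_attention | 0.85 | 0.39 | 0.65 | 0.43 | 0.57 | 0.33 | 0.32 | 0.44 | 0.48 | 0.43 | 0.35 |
| gaussian_blur | 0.71 | 0.49 | 0.64 | 0.54 | 0.20 | 0.49 | 0.34 | 0.28 | 0.28 | 0.09 | 0.00 |
| hash_join | 0.81 | 0.66 | 0.70 | 0.66 | 0.68 | 0.70 | 0.61 | 0.58 | 0.65 | 0.58 | 0.67 |
| levenshtein_distance | 0.58 | 0.53 | 0.52 | 0.08 | 0.12 | 0.49 | 0.47 | 0.00 | 0.04 | 0.08 | 0.01 |
| radix_sort | 0.71 | 0.64 | 0.46 | 0.67 | 0.62 | 0.69 | 0.54 | 0.63 | 0.66 | 0.22 | 0.68 |
| regex_engine | 1.00 | 0.41 | 0.37 | 0.02 | 0.12 | 0.00 | 0.19 | 0.08 | 0.07 | 0.08 | 0.00 |
| sha256_throughput | 0.46 | 0.35 | 0.44 | 0.26 | 0.18 | 0.13 | 0.14 | 0.15 | 0.09 | 0.14 | 0.13 |
| sstable_compaction_rs | 0.83 | 0.27 | 0.70 | 0.40 | 0.61 | 0.72 | 0.18 | 0.59 | 0.62 | 0.39 | 0.52 |
| z_order_range_scan | 0.45 | 0.20 | 0.31 | 0.17 | 0.41 | 0.31 | 0.32 | 0.26 | 0.35 | 0.15 | 0.10 |
| Puzzle & Challenge (10 tasks) | | | | | | | | | | | |
| adaptive_compression | 0.68 | 0.54 | 0.37 | 0.28 | 0.23 | 0.31 | 0.43 | 0.14 | 0.00 | 0.23 | 0.49 |
| adversarial_splay | 0.61 | 0.54 | 0.56 | 0.51 | 0.55 | 0.00 | 0.36 | 0.17 | 0.58 | 0.35 | 0.35 |
| discover_sorting | 1.00 | 0.95 | 0.90 | 0.90 | 0.85 | 0.57 | 0.85 | 0.85 | 0.25 | 0.00 | 0.57 |
| fredkin_sort_network | 0.96 | 0.31 | 0.47 | 0.21 | 0.60 | 0.29 | 0.33 | 0.62 | 0.36 | 0.00 | 0.00 |
| resnet_bit_flip | 0.63 | 0.90 | 0.32 | 0.92 | 0.06 | 0.60 | 0.32 | 0.32 | 0.26 | 0.35 | 0.53 |
| safety_router | 0.99 | 0.99 | 0.66 | 0.96 | 0.64 | 0.66 | 0.00 | 0.31 | 0.00 | 0.31 | 0.66 |
| smallest_game_player | 0.61 | 0.00 | 0.00 | 0.09 | 0.00 | 0.00 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 |
| stack_machine_golf | 1.00 | 1.00 | 0.67 | 0.79 | 0.48 | 1.00 | 0.16 | 0.96 | 0.38 | 0.19 | 0.00 |
| toy_isa_opt | 0.98 | 1.00 | 0.00 | 0.84 | 0.97 | 0.59 | 0.89 | 0.97 | 0.65 | 0.80 | 0.00 |
| vliw_scheduler | 1.00 | 0.94 | 0.84 | 0.85 | 0.53 | 0.31 | 0.83 | 0.99 | 1.00 | 0.39 | 0.00 |
| Model Development (7 tasks) | | | | | | | | | | | |
| data_select_ifeval | 0.64 | 0.49 | 0.48 | 0.38 | 0.28 | 0.46 | 0.09 | 0.49 | 0.23 | 0.43 | 0.28 |
| flux2_klein_lora | 0.48 | 0.07 | 0.00 | 0.17 | 0.22 | 0.00 | 0.28 | 0.09 | 0.00 | 0.29 | 0.06 |
| grpo_multisource | 0.84 | 0.79 | 0.59 | 0.81 | 0.82 | 0.85 | 0.83 | 0.00 | 0.57 | 0.93 | 0.84 |
| llm_online_serving | 0.00 | 0.00 | 0.00 | 0.01 | 0.03 | 0.04 | 0.31 | 0.34 | 0.00 | 0.00 | 0.28 |
| moving_mnist_world_model | 0.68 | 0.32 | 0.22 | 0.30 | 0.30 | 0.19 | 0.16 | 0.02 | 0.28 | 0.19 | 0.18 |
| multilingual_ocr | 0.89 | 0.69 | 0.98 | 0.88 | 0.88 | 0.63 | 0.44 | 0.00 | 0.83 | 0.70 | 0.45 |
| scaling_law | 0.85 | 0.14 | 0.65 | 0.78 | 0.61 | 0.30 | 0.33 | 0.00 | 0.00 | 0.43 | 0.63 |
| CUDA (4 tasks) | | | | | | | | | | | |
| huffman_canonical_decode | 0.45 | 0.38 | 0.20 | 0.11 | 0.11 | 0.00 | 0.00 | 0.09 | 0.02 | 0.00 | 0.00 |
| icp_correspondence_step | 0.55 | 0.52 | 0.51 | 0.45 | 0.50 | 0.00 | 0.44 | 0.28 | 0.15 | 0.27 | 0.00 |
| msm_pippenger_bls12_381 | 0.16 | 0.00 | 0.10 | 0.00 | 0.21 | 0.00 | 0.10 | 0.10 | 0.00 | 0.00 | 0.00 |
| ntt_butterfly | 0.37 | 0.00 | 0.17 | 0.15 | 0.15 | 0.00 | 0.00 | 0.25 | 0.11 | 0.22 | 0.00 |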

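Table 6: Per-task Avg@3 results. Per-row best is bold and runner-up underlined.

| Task | Claude Opus 4.6 | Gemini 3.1 Pro | Kimi K2.6 | MiMo V2.5 Pro | GLM 5 | DeepSeek V4 Pro | GPT 5.4 | Grok 4-20 | Hunyuan 3 Preview | MiniMax M2.7 | Qwen 3.6 Plus |
|---|---|---|---|---|---|---|---|---|---|---|---|
| System Optimization (15 tasks) | | | | | | | | | | | |
| aes128_ctr | 0.78 | 0.70 | 0.70 | 0.70 | 0.75 | 0.60 | 0.57 | 0.70 | 0.71 | 0.00 | 0.03 |
| agent_tool_routing | 0.71 | 0.50 | 0.49 | 0.35 | 0.51 | 0.32 | 0.16 | 0.36 | 0.23 | 0.39 | 0.47 |
| bm25_search_go | 1.00 | 0.59 | 0.65 | 0.53 | 0.52 | 0.64 | 0.51 | 0.52 | 0.51 | 0.51 | 0.58 |
| bvh_raytracer | 0.47 | 0.40 | 0.43 | 0.42 | 0.43 | 0.42 | 0.40 | 0.40 | 0.41 | 0.22 | 0.31 |
| concurrent_kv_wal | 0.96 | 0.77 | 0.92 | 0.76 | 1.00 | 0.93 | 0.86 | 0.70 | 0.96 | 0.83 | 0.75 |
| fft_rust | 0.00 | 0.60 | 0.59 | 0.55 | 0.54 | 0.58 | 0.52 | 0.53 | 0.59 | 0.53 | 0.58 |
| flash_attention | 0.94 | 0.69 | 0.70 | 0.58 | 0.61 | 0.53 | 0.52 | 0.44 | 0.51 | 0.51 | 0.55 |
| gaussian_blur | 0.74 | 0.64 | 0.65 | 0.61 | 0.60 | 0.58 | 0.36 | 0.28 | 0.31 | 0.26 | 0.00 |
| hash_join | 1.00 | 0.70 | 0.71 | 0.69 | 0.70 | 0.70 | 0.66 | 0.60 | 0.69 | 0.59 | 0.71 |
| levenshtein_distance | 0.59 | 0.57 | 0.56 | 0.13 | 0.12 | 0.50 | 0.48 | 0.01 | 0.12 | 0.13 | 0.01 |
| radix_sort | 0.72 | 0.68 | 0.69 | 0.68 | 0.65 | 0.70 | 0.57 | 0.65 | 0.70 | 0.66 | 0.68 |
| regex_engine | 1.00 | 0.43 | 0.88 | 0.05 | 0.14 | 0.00 | 0.58 | 0.25 | 0.10 | 0.21 | 0.00 |
| sha256_throughput | 0.46 | 0.46 | 0.46 | 0.46 | 0.26 | 0.13 | 0.42 | 0.17 | 0.14 | 0.14 | 0.13 |
| sstable_compaction_rs | 0.84 | 0.74 | 0.77 | 0.59 | 0.73 | 0.77 | 0.54 | 0.59 | 0.67 | 0.65 | 0.81 |
| z_order_range_scan | 0.49 | 0.31 | 0.41 | 0.28 | 0.43 | 0.36 | 0.39 | 0.30 | 0.42 | 0.24 | 0.30 |
| Puzzle & Challenge (10 tasks) | | | | | | | | | | | |
| adaptive_compression | 0.71 | 0.69 | 0.47 | 0.61 | 0.34 | 0.53 | 0.58 | 0.26 | 0.00 | 0.39 | 0.52 |
| adversarial_splay | 0.65 | 0.56 | 0.57 | 0.56 | 0.56 | 0.00 | 0.57 | 0.50 | 0.59 | 0.56 | 0.55 |
| discover_sorting | 1.00 | 1.00 | 1.00 | 1.00 | 0.85 | 0.85 | 0.85 | 0.85 | 0.75 | 0.00 | 0.85 |
| fredkin_sort_network | 1.00 | 0.93 | 0.73 | 0.40 | 0.67 | 0.47 | 0.87 | 0.67 | 0.53 | 0.00 | 0.00 |
| resnet_bit_flip | 0.99 | 0.95 | 0.96 | 0.97 | 0.19 | 0.97 | 0.96 | 0.66 | 0.78 | 0.55 | 0.86 |
| safety_router | 0.99 | 0.99 | 0.99 | 0.98 | 0.99 | 0.99 | 0.00 | 0.94 | 0.00 | 0.92 | 0.99 |
| smallest_game_player | 0.99 | 0.00 | 0.00 | 0.28 | 0.00 | 0.00 | 0.96 | 0.00 | 0.00 | 0.00 | 0.00 |
| stack_machine_golf | 1.00 | 1.00 | 1.00 | 1.00 | 0.55 | 1.00 | 0.19 | 1.00 | 0.88 | 0.37 | 0.00 |
| toy_isa_opt | 1.00 | 1.00 | 0.00 | 0.98 | 1.00 | 0.90 | 0.90 | 0.97 | 0.98 | 0.97 | 0.00 |
| vliw_scheduler | 1.00 | 1.00 | 0.98 | 0.98 | 0.94 | 0.94 | 0.96 | 1.00 | 1.00 | 0.60 | 0.00 |
| Model Development (7 tasks) | | | | | | | | | | | |
| data_select_ifeval | 0.86 | 0.58 | 0.73 | 0.66 | 0.54 | 0.95 | 0.21 | 0.77 | 0.40 | 0.66 | 0.56 |
| flux2_klein_lora | 0.72 | 0.20 | 0.00 | 0.37 | 0.66 | 0.00 | 0.41 | 0.13 | 0.00 | 0.86 | 0.11 |
| grpo_multisource | 0.89 | 0.82 | 0.89 | 0.84 | 0.84 | 0.87 | 0.84 | 0.00 | 0.87 | 0.98 | 0.89 |
| llm_online_serving | 0.00 | 0.00 | 0.00 | 0.02 | 0.10 | 0.11 | 0.81 | 0.49 | 0.00 | 0.00 | 0.83 |
| moving_mnist_world_model | 0.75 | 0.40 | 0.38 | 0.45 | 0.38 | 0.40 | 0.39 | 0.07 | 0.44 | 0.39 | 0.29 |
| multilingual_ocr | 0.96 | 0.93 | 1.00 | 0.92 | 1.00 | 1.00 | 0.71 | 0.00 | 0.95 | 0.89 | 0.90 |
| scaling_law | 0.88 | 0.42 | 0.71 | 1.00 | 0.69 | 0.59 | 0.45 | 0.00 | 0.00 | 0.67 | 0.71 |
| CUDA (4 tasks) | | | | | | | | | | | |
| huffman_canonical_decode | 0.48 | 0.45 | 0.37 | 0.32 | 0.27 | 0.00 | 0.00 | 0.14 | 0.07 | 0.00 | 0.00 |
| icp_correspondence_step | 0.60 | 0.53 | 0.52 | 0.51 | 0.50 | 0.00 | 0.48 | 0.33 | 0.45 | 0.31 | 0.00 |
| msm_pippenger_bls12_381 | 0.48 | 0.00 | 0.31 | 0.00 | 0.32 | 0.00 | 0.31 | 0.31 | 0.00 | 0.00 | 0.00 |
| ntt_butterfly | 0.58 | 0.00 | 0.51 | 0.46 | 0.46 | 0.00 | 0.00 | 0.34 | 0.31 | 0.44 | 0.00 |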

Table 7: Per-task Best@3 results. Per-row best is bold and runner-up underlined. Columns are in the same order as the Avg@3 table (by overall Avg@3).

Appendix C More on Analysis

C.1 Model Generations

We next examine within-provider generation improvements while holding the harness at terminus-2. Figure 9 compares four old-to-new pairs: Qwen 3.5 Plus to 3.6 Plus, MiMo v2 Pro to v2.5 Pro, MiniMax M2.5 to M2.7, and Kimi K2.5 to K2.6.

Three of the four pairs show modest gains. MiMo improves the most, followed by MiniMax and Kimi. Qwen 3.6 Plus is the only generation that regresses, dropping 0.09 on Avg@3 and 0.12 on Best@3. This decline is consistent with the category breakdown in Table 1: while Qwen 3.6 Plus retains a strong Model Development score (0.88), its performance on CUDA, Puzzle & Challenge, and System Optimization collapses to or near zero. We also observe that newer flagship models do not improve every category uniformly. For instance, MiniMax M2.7 improves overall but still trails the median on CUDA, while MiMo v2.5 Pro gains on Model Development yet loses ground on CUDA. Thus, provider-level generation lifts do not guarantee uniform gains across sub-domains.

Figure 9: Per-provider generation deltas on AutoLab across all 36 tasks. Three of the four pairs (MiMo, MiniMax, Kimi) exhibit modest gains from the older variant (left) to the newer one (right). Qwen is shown with the newer 3.6 Plus on the left to highlight its regression relative to 3.5 Plus.

C.2 Stability Analysis

Across-trial stability is a dimension distinct from raw capability. A model that achieves 0.85 on one trial but only 0.20 on the other two has an Avg@3 of about 0.42, the same as a model that consistently scores 0.42 across all three trials, yet their reliability differs substantially. Reporting only the mean therefore masks this variance. Table 8 quantifies across-trial dispersion for the 11 main-set models.

For each (model, task) pair, we compute four complementary stability metrics from the three independent trials: the mean per-task standard deviation $\bar{\sigma}$, the mean per-task range $\bar{R}$, the coefficient of variation $\text{CV}=\bar{\sigma}/\text{Avg@3}$, and the normalized dispersion $\bar{\sigma}/\text{Best@3}$. All four metrics show consistent trends.
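
The four metrics can be computed from raw trial scores as in the sketch below. Several details are assumptions rather than statements from the paper: the population standard deviation (`pstdev`) is used, since the text does not say whether the sample or population form was chosen; Avg@3 and Best@3 are taken as the mean over tasks of the per-task mean and per-task maximum; and the function name and input layout are hypothetical.

```python
from statistics import mean, pstdev


def stability(trials_by_task):
    """trials_by_task: {task_name: [score_trial1, score_trial2, score_trial3]}."""
    trials = list(trials_by_task.values())
    sigma_bar = mean(pstdev(t) for t in trials)       # mean per-task std dev
    range_bar = mean(max(t) - min(t) for t in trials)  # mean per-task range
    avg3 = mean(mean(t) for t in trials)
    best3 = mean(max(t) for t in trials)
    return {
        "sigma_bar": sigma_bar,
        "range_bar": range_bar,
        "cv": sigma_bar / avg3,
        "sigma_over_best": sigma_bar / best3,
        "avg3": avg3,
        "best3": best3,
    }


# Two hypothetical one-task models with the same Avg@3 but different spread.
spiky = stability({"task": [0.9, 0.3, 0.3]})
steady = stability({"task": [0.5, 0.5, 0.5]})
print(round(spiky["avg3"], 2), round(steady["avg3"], 2))
print(round(spiky["cv"], 2), round(steady["cv"], 2))
```

Both toy models report the same mean, while every dispersion metric separates them, which is the distinction the stability analysis is after.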

Table 8: Across-trial stability for the 11 frontier models. $\bar{\sigma}$ denotes the mean per-task standard deviation, $\bar{R}$ the mean per-task range, CV the coefficient of variation ($\bar{\sigma}/\text{Avg@3}$), and the final column normalizes $\bar{\sigma}$ by Best@3. Lower values indicate higher stability. Best entry per column is in bold and runner-up is underlined. Rows are ordered by descending Best@3 to match Table 1.

Three observations stand out:

(i) Stability and capability are correlated but not identical. claude-opus-4.6 is both the highest-scoring and the most stable model ($\bar{\sigma}=0.099$), exhibiting less than half the dispersion of other top-5 models. At the lower end, three of the four weakest models (hunyuan-3-preview, minimax-m2.7, qwen-3.6-plus) also show high variance (CV ≥ 0.43). The relationship is not strictly monotonic: gemini-3.1-pro and grok-4-20 are notably more stable than peers with similar Avg@3, while kimi-k2.6 is mid-tier in performance but among the noisiest models.

(ii) Single-trial evaluation is unreliable for high-variance models. For models with CV ≥ 0.40 (deepseek-v4-pro, hunyuan-3-preview, minimax-m2.7, qwen-3.6-plus), the mean across-trial range reaches 0.28–0.34 on a $[0,1]$ scale. A single rollout can therefore land anywhere within roughly one-third of the full score range, making single-shot rankings highly unreliable. We recommend using Avg@3 (or more trials) as the primary metric for such models.

(iii) Best@3 over-credits noisy models. The gap between Best@3 and Avg@3 widens with increasing $\bar{\sigma}$ (Pearson $r=0.84$). Relying solely on Best@3 therefore inflates the apparent capability of high-variance models more than that of stable ones, which is why we report both metrics in the main leaderboard.
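
The correlation in observation (iii) relates two per-model quantities: the dispersion $\bar{\sigma}$ and the gap Best@3 − Avg@3. A minimal sketch of that computation follows. The four data points are hypothetical and are not taken from Table 8, so the value printed here is unrelated to the reported $r=0.84$; the helper name `pearson` is likewise an illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Hypothetical per-model values (NOT from the paper).
sigma_bar = [0.10, 0.15, 0.20, 0.25]
avg3 = [0.70, 0.50, 0.45, 0.30]
best3 = [0.78, 0.64, 0.60, 0.52]

gap = [b - a for a, b in zip(avg3, best3)]   # how much Best@3 exceeds Avg@3
r = pearson(sigma_bar, gap)
print(round(r, 2))
```

A coefficient near 1 on real data would mean that noisier models gain the most from best-of-three reporting, which is the stated reason for publishing both metrics.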

