AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Hugging Face Daily Papers

Summary

AutoLab introduces a benchmark for evaluating long-horizon iterative optimization capabilities of frontier models across diverse domains. Results show that persistence and time awareness are more critical than initial performance, with claude-opus-4.6 demonstrating strong capabilities while many models terminate prematurely.

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts, to accelerate research toward truly capable long-horizon agents.

Cached at: 06/04/26, 03:41 AM

Source: https://huggingface.co/papers/2606.05080

Abstract

AutoLab benchmark evaluates long-horizon iterative optimization capabilities of frontier models across diverse domains, revealing that persistent iteration and time awareness are more critical than initial performance quality.

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals the dominant predictor of success is not the quality of an agent’s initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts, to accelerate research toward truly capable long-horizon agents.
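
The closed-loop process the abstract describes (start from a correct but suboptimal baseline, then repeatedly edit, benchmark, and keep what measurably helps until a wall-clock budget expires) can be illustrated with a minimal Python sketch. This is not AutoLab's evaluation harness: the names `optimize`, `propose`, `score`, `budget_s`, and `clock` are hypothetical, and the toy task in the usage section is invented purely for illustration.

```python
import random
import time


def optimize(baseline, propose, score, budget_s, clock=time.monotonic):
    """Iterate propose -> benchmark -> keep-if-better until the budget expires.

    Hypothetical sketch, not AutoLab's harness. Higher scores are better.
    """
    deadline = clock() + budget_s
    best, best_score = baseline, score(baseline)
    history = [best_score]  # best score seen after each measured attempt
    while clock() < deadline:  # time awareness: stop only when the budget is spent
        candidate = propose(best)           # edit the current best artifact
        candidate_score = score(candidate)  # benchmark the edit
        if candidate_score > best_score:    # keep only measured improvements
            best, best_score = candidate, candidate_score
        history.append(best_score)
    return best, best_score, history


def toy_score(x):
    # Toy objective (an assumption for this demo): peaks at 0 when x == 3.0.
    return -((x - 3.0) ** 2)


if __name__ == "__main__":
    rng = random.Random(0)

    def toy_propose(x):
        # Small random edit to the current best value.
        return x + rng.uniform(-0.5, 0.5)

    best, best_score, history = optimize(0.0, toy_propose, toy_score, budget_s=0.05)
    print(f"best x = {best:.3f}, score = {best_score:.4f}, attempts = {len(history) - 1}")
```

In this sketch, an agent that stops after its first attempt would return with only the first entry or two of `history`, whereas one that keeps iterating until `deadline` uses the whole budget. That mirrors the abstract's contrast between persistent iteration and premature termination, though the real tasks involve far richer edits and benchmarks than this toy loop.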

Similar Articles

@dair_ai: Outstanding paper on long-horizon agents. (bookmark it) Similar to humans, how do you make agents persist on a difficul…

X AI KOLs Following

AutoLab is a new benchmark evaluating 17 frontier models on 36 expert-curated long-horizon tasks (system optimization, model development, CUDA kernels, puzzles), finding that persistence—not initial attempt quality—is the dominant predictor of success. Claude-opus-4.6 led all categories, while most other models terminated prematurely or exhausted budgets with minimal progress.

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

arXiv cs.LG

Introduces LongDS, a benchmark for evaluating LLM agents on long-horizon, multi-turn data analysis tasks. Evaluations show that even the best models achieve only 48.45% accuracy, with performance dropping sharply over turns, highlighting that maintaining analytical state is the key bottleneck.