@OkhayIea: Everyone's racing to build "AI scientists." So we asked a blunt question: Can today's best coding agents beat the publi…
Summary
Introduces NatureBench, a cross-disciplinary benchmark of 90 tasks from Nature papers to test AI coding agents, finding the best agent (Claude Opus 4.7) surpasses SOTA on only 17.8% of tasks and often succeeds by reducing science to supervised ML rather than genuine discovery.
Cached at: 06/25/26, 09:16 AM
Everyone’s racing to build “AI scientists.” So we asked a blunt question: Can today’s best coding agents beat the published SOTA of real Nature papers — on their own, no web search, with the original method hidden?
Introducing NatureBench: 90 tasks distilled from Nature-family papers. The best agent (Claude Opus 4.7) surpasses SOTA on just 17.8% of them.
And here’s the uncomfortable part — when agents do win, they mostly win by quietly reducing science to supervised ML, not by discovering anything new. The bottleneck isn’t coding or understanding the task; it’s choosing the right method and going deep enough.
Benchmark + NatureGym pipeline + public leaderboard, all open. Come run your agent. [huggingface] https://huggingface.co/papers/2606.24530…
[leaderboard] https://frontisai.github.io/NatureBench/
w/ @Tsinghua_Uni @FrontisAI
Paper page - NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?
Source: https://huggingface.co/papers/2606.24530 Published on Jun 23
#2 Paper of the day
Abstract
NatureBench presents a cross-disciplinary benchmark of 90 scientific tasks derived from Nature publications to assess AI coding agents’ ability to achieve discovery rather than just reproduction, revealing that current agents primarily rely on methodological translation rather than genuine scientific innovation.
We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research benchmarks. Evaluating ten frontier agent configurations under a strict web-search-disabled protocol, we find that the strongest model surpasses SOTA on only 17.8% of tasks under the g>0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding. We release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction. Code: https://github.com/FrontisAI/NatureBench
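As a rough illustration of how a headline figure like 17.8% follows from a per-task threshold, here is a minimal sketch. It assumes each task yields a single scalar score g and that a task counts as surpassing SOTA when g > 0.1; this excerpt does not define g, so the function name `surpass_rate` and the score values below are hypothetical and not taken from the paper. On 90 tasks, 17.8% corresponds to 16 tasks (16/90 ≈ 0.178).

```python
from typing import Sequence


def surpass_rate(gaps: Sequence[float], threshold: float = 0.1) -> float:
    """Fraction of tasks whose score strictly exceeds the threshold.

    `gaps` holds one hypothetical g value per task; how g is actually
    computed is not specified in this excerpt.
    """
    if not gaps:
        raise ValueError("need at least one task")
    wins = sum(1 for g in gaps if g > threshold)
    return wins / len(gaps)


# Made-up scores for illustration only: 16 tasks above the threshold,
# 74 tasks at or below it, for 90 tasks in total.
example_gaps = [0.25] * 16 + [0.0] * 74
rate = surpass_rate(example_gaps)
print(f"{rate:.1%}")  # 17.8%
```

The strict inequality mirrors the "g>0.1" wording, so a task scoring exactly 0.1 would not count as a win in this sketch.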
Get this paper in your agent:
hf papers read 2606.24530
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Datasets citing this paper: 1
FrontisAI/NatureBench • Updated about 1 hour ago
Similar Articles
NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?
NatureBench is a cross-disciplinary benchmark of 90 scientific tasks from Nature publications, designed to evaluate AI coding agents' ability to achieve genuine discovery. Current agents succeed mainly through methodological translation, not scientific innovation.
AI Coding Agents Can Reproduce Social Science Findings
This paper introduces SocSci-Repro-Bench, a benchmark of 221 tasks to evaluate AI coding agents' ability to reproduce social science findings from original data and code. It finds that frontier agents like Claude Code and Codex can reproduce a large share of results, with Claude substantially outperforming Codex, and that results are not primarily driven by memorization.
Benchmarking AI Agents for Addressing Scientific Challenges Across Scales
Introduces SciAgentArena, a benchmark of ~200 tasks for evaluating AI agents in real scientific research. Finds that agents are effective for well-specified data-analysis workflows but struggle with novel insights and open-ended exploration.
PaperBench: Evaluating AI’s Ability to Replicate AI Research
OpenAI introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research using 20 ICML 2024 papers broken down into 8,316 gradable tasks. The best-performing model (Claude 3.5 Sonnet) achieves only a 21% replication score, below human PhD-level performance, highlighting current limitations in autonomous research capabilities.
@IntologyAI: Can coding agents do research? We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D proble…
IntologyAI releases NanoGPT-Bench, an internal benchmark to evaluate coding agents on AI R&D tasks. Current agents recover only 9.3% of human progress, mostly through hyperparameter tuning, highlighting gaps in algorithmic research capabilities.