@OkhayIea: Everyone's racing to build "AI scientists." So we asked a blunt question: Can today's best coding agents beat the publi…

X AI KOLs Timeline 06/24/26, 06:52 AM Papers

benchmark ai-agents coding-agents nature-papers scientific-discovery supervised-learning evaluation

Summary

Introduces NatureBench, a cross-disciplinary benchmark of 90 tasks from Nature papers to test AI coding agents, finding the best agent (Claude Opus 4.7) surpasses SOTA on only 17.8% of tasks and often succeeds by reducing science to supervised ML rather than genuine discovery.

Everyone's racing to build "AI scientists." So we asked a blunt question: Can today's best coding agents beat the published SOTA of real Nature papers — on their own, no web search, with the original method hidden? Introducing NatureBench: 90 tasks distilled from Nature-family papers. The best agent (Claude Opus 4.7) surpasses SOTA on just 17.8% of them. And here's the uncomfortable part — when agents do win, they mostly win by quietly reducing science to supervised ML, not by discovering anything new. The bottleneck isn't coding or understanding the task; it's choosing the right method and going deep enough. Benchmark + NatureGym pipeline + public leaderboard, all open. Come run your agent. [huggingface] https://huggingface.co/papers/2606.24530… [leaderboard] https://frontisai.github.io/NatureBench/ w/ @Tsinghua_Uni @FrontisAI

Original Article


Cached at: 06/25/26, 09:16 AM


Paper page - NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

Source: https://huggingface.co/papers/2606.24530 Published on Jun 23

#2 Paper of the day

Abstract

NatureBench presents a cross-disciplinary benchmark of 90 scientific tasks derived from Nature publications to assess AI coding agents’ ability to achieve discovery rather than just reproduction, revealing that current agents primarily rely on methodological translation rather than genuine scientific innovation.

We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research benchmarks. Evaluating ten frontier agent configurations under a strict web-search-disabled protocol, we find that the strongest model surpasses SOTA on only 17.8% of tasks under the g>0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding. We release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction. Code: https://github.com/FrontisAI/NatureBench
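As a rough illustration of the headline number, the sketch below computes a surpass rate over a set of per-task scores. The abstract does not define g, so here it is treated as a hypothetical per-task gain score compared against the quoted 0.1 threshold; the function name, task IDs, and score values are all invented for illustration and are not taken from the NatureBench code. With 90 tasks, 17.8% is arithmetically consistent with 16 tasks clearing the threshold (16/90 ≈ 17.8%), which is the case the example reproduces.

```python
from typing import Mapping

# Threshold quoted in the abstract ("g>0.1"). What g measures is NOT
# defined in the text; here it is just an opaque per-task score.
G_THRESHOLD = 0.1


def surpass_rate(gains: Mapping[str, float], threshold: float = G_THRESHOLD) -> float:
    """Return the fraction of tasks whose score strictly exceeds the threshold."""
    if not gains:
        raise ValueError("no tasks given")
    wins = sum(1 for g in gains.values() if g > threshold)
    return wins / len(gains)


# Hypothetical scores (assumption): 16 tasks above the threshold, 74 not.
example = {f"task_{i:02d}": 0.25 for i in range(16)}
example.update({f"task_{i:02d}": 0.0 for i in range(16, 90)})

rate = surpass_rate(example)
print(f"{rate:.1%}")  # 16 of 90 tasks -> 17.8%
```

The comparison is strict, so a task scoring exactly 0.1 would not count as surpassing SOTA under this reading of the criterion.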


Get this paper in your agent:

hf papers read 2606.24530

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Datasets citing this paper: 1

FrontisAI/NatureBench



Similar Articles

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

AI Coding Agents Can Reproduce Social Science Findings

Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

PaperBench: Evaluating AI’s Ability to Replicate AI Research

@IntologyAI: Can coding agents do research? We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D proble…
