@OkhayIea: Everyone's racing to build "AI scientists." So we asked a blunt question: Can today's best coding agents beat the publi…


Summary

Introduces NatureBench, a cross-disciplinary benchmark of 90 tasks distilled from Nature-family papers to test AI coding agents. The best agent (Claude Opus 4.7) surpasses the published SOTA on only 17.8% of tasks, and it often succeeds by reducing science to supervised ML rather than by genuine discovery.

Everyone's racing to build "AI scientists." So we asked a blunt question: Can today's best coding agents beat the published SOTA of real Nature papers on their own, with no web search and the original method hidden?

Introducing NatureBench: 90 tasks distilled from Nature-family papers. The best agent (Claude Opus 4.7) surpasses SOTA on just 17.8% of them.

And here's the uncomfortable part: when agents do win, they mostly win by quietly reducing science to supervised ML, not by discovering anything new. The bottleneck isn't coding or understanding the task; it's choosing the right method and going deep enough.

Benchmark + NatureGym pipeline + public leaderboard, all open. Come run your agent.

[huggingface] https://huggingface.co/papers/2606.24530…

[leaderboard] https://frontisai.github.io/NatureBench/

w/ @Tsinghua_Uni @FrontisAI

Cached at: 06/25/26, 09:16 AM


Paper page - NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

Source: https://huggingface.co/papers/2606.24530 Published on Jun 23

#2 Paper of the day

Abstract

NatureBench presents a cross-disciplinary benchmark of 90 scientific tasks derived from Nature publications to assess AI coding agents’ ability to achieve discovery rather than just reproduction, revealing that current agents primarily rely on methodological translation rather than genuine scientific innovation.

We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research benchmarks. Evaluating ten frontier agent configurations under a strict web-search-disabled protocol, we find that the strongest model surpasses SOTA on only 17.8% of tasks under the g>0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding. We release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction. Code: https://github.com/FrontisAI/NatureBench
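As a rough illustration of the headline number, the sketch below counts how many tasks clear a threshold and converts that to a rate. The abstract does not define g in this excerpt, so the code simply assumes g is some per-task gain score relative to the published SOTA. The function name `surpass_rate` and the sample gain values are hypothetical, not taken from the paper or the NatureBench code. The only grounded fact is the arithmetic: with 90 tasks, 17.8% corresponds to 16 tasks.

```python
from typing import Sequence


def surpass_rate(gains: Sequence[float], threshold: float = 0.1) -> float:
    """Return the fraction of tasks whose gain strictly exceeds the threshold.

    `gains` is an assumed per-task score; how the paper defines g is not
    stated in the abstract, so treat this as a placeholder.
    """
    if not gains:
        raise ValueError("gains must be non-empty")
    return sum(g > threshold for g in gains) / len(gains)


# Hypothetical data: 16 of 90 tasks clear the bar, 74 do not.
gains = [0.25] * 16 + [0.0] * 74
rate = surpass_rate(gains)
print(f"{rate:.1%}")  # 17.8%
```

The strict comparison mirrors the "g>0.1" wording: a task with a gain of exactly 0.1 would not count as surpassing SOTA in this sketch.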


Get this paper in your agent:

hf papers read 2606.24530

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Datasets citing this paper: 1

FrontisAI/NatureBench


Similar Articles

AI Coding Agents Can Reproduce Social Science Findings

arXiv cs.CL

This paper introduces SocSci-Repro-Bench, a benchmark of 221 tasks to evaluate AI coding agents' ability to reproduce social science findings from original data and code. It finds that frontier agents like Claude Code and Codex can reproduce a large share of results, with Claude substantially outperforming Codex, and that results are not primarily driven by memorization.

PaperBench: Evaluating AI’s Ability to Replicate AI Research

OpenAI Blog

OpenAI introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research using 20 ICML 2024 papers broken down into 8,316 gradable tasks. The best-performing model (Claude 3.5 Sonnet) achieves a replication score of only 21%, below human PhD-level performance, highlighting current limitations in autonomous research capabilities.