ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Hugging Face Daily Papers

Summary

ResearchClawBench is a benchmark for evaluating end-to-end autonomous scientific research across 40 tasks from 10 domains. It reveals that current AI agents and LLMs achieve low re-discovery scores: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7.

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

Cached at: 06/09/26, 08:44 AM

Source: https://huggingface.co/papers/2606.07591

Published on May 28

#2 Paper of the day

Abstract

ResearchClawBench evaluates autonomous scientific research capabilities across 40 tasks from 10 domains using expert-curated criteria and reveals current limitations in re-discovery accuracy among AI agents and LLMs.

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.
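To make the idea of rubrics that "decompose the target scientific artifacts into weighted criteria" more concrete, here is a minimal Python sketch of how a weighted rubric could be aggregated into a task score and then averaged across tasks. Everything in it is an assumption for illustration: the abstract does not state the actual scoring formula, the score scale, or the rubric schema, so the weighted mean, the 0 to 100 scale, the `Criterion` fields, and the criterion names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric item. Hypothetical schema, not the benchmark's own."""
    name: str
    weight: float  # relative importance assigned by the rubric author
    score: float   # grader's judgment in [0, 1] (assumed scale)


def task_score(criteria: list[Criterion]) -> float:
    """Weighted mean of criterion scores, rescaled to 0-100 (assumed)."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    weighted_sum = sum(c.weight * c.score for c in criteria)
    return 100.0 * weighted_sum / total_weight


def benchmark_mean(task_scores: list[float]) -> float:
    """Unweighted mean over tasks (assumed aggregation)."""
    if not task_scores:
        raise ValueError("need at least one task score")
    return sum(task_scores) / len(task_scores)


# Hypothetical rubric for a single task; names and numbers are made up.
example_rubric = [
    Criterion("experimental protocol", weight=2.0, score=0.5),
    Criterion("supporting evidence", weight=1.0, score=0.0),
    Criterion("core finding", weight=1.0, score=1.0),
]
```

With this sketch, `task_score(example_rubric)` gives partial credit in proportion to the weight of each satisfied criterion, so an agent that gets the protocol half right and the core finding fully right, but provides no matching evidence, lands in the middle of the scale rather than at zero.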

Get this paper in your agent:

hf papers read 2606.07591

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper: 1

#### InternScience/ResearchClawBench (Benchmark)

Similar Articles

PaperBench: Evaluating AI’s Ability to Replicate AI Research

OpenAI Blog

OpenAI introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 papers, which are broken down into 8,316 gradable tasks. The best-performing model (Claude 3.5 Sonnet) achieves a replication score of only 21%, below human PhD-level performance, highlighting current limitations in autonomous research capabilities.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Hugging Face Daily Papers

WildClawBench evaluates language and vision-language models on realistic long-horizon tasks using actual CLI environments with real tools. The benchmark reveals that even the best model achieves only 62.2% accuracy, indicating long-horizon agent evaluation remains challenging.