ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research
Summary
ResearchClawBench is a benchmark for evaluating end-to-end autonomous scientific research across 40 tasks from 10 domains. It shows that current AI agents and LLMs achieve low re-discovery accuracy: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7.
Source: https://huggingface.co/papers/2606.07591 Published on May 28
#2 Paper of the day
Abstract
ResearchClawBench evaluates autonomous scientific research capabilities across 40 tasks from 10 domains using expert-curated criteria and reveals current limitations in re-discovery accuracy among AI agents and LLMs.
AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.
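To make the idea of weighted rubric criteria concrete, here is a minimal sketch of how a per-task score could be aggregated. It is an illustration only: the abstract does not state the benchmark's actual aggregation formula, so the weighted-average rule, the 0 to 100 scale, the `Criterion` type, the `rubric_score` function, and the example criteria are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not the benchmark's schema)."""
    name: str      # short description of what is being checked
    weight: float  # relative importance assigned by the rubric author
    score: float   # judged fulfilment in [0, 1]


def rubric_score(criteria: list[Criterion]) -> float:
    """Weighted average of criterion scores, scaled to 0-100 (assumed scale)."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    weighted_sum = sum(c.weight * c.score for c in criteria)
    return 100.0 * weighted_sum / total_weight


# Invented example: the core finding is only half recovered, the experimental
# protocol does not match, and one supporting figure is fully reproduced.
example = [
    Criterion("core finding reproduced", weight=3.0, score=0.5),
    Criterion("experimental protocol matches", weight=2.0, score=0.0),
    Criterion("key figure supported by evidence", weight=1.0, score=1.0),
]
print(round(rubric_score(example), 1))
```

Under this assumed rule, a heavily weighted criterion that is missed pulls the task score down sharply, which is one way failures such as a protocol mismatch or a missing scientific core could dominate the averages.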
Get this paper in your agent:
hf papers read 2606.07591
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Datasets citing this paper: 1
InternScience/ResearchClawBench (Benchmark)
Similar Articles
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
AutoResearchClaw is a multi-agent autonomous research system that improves scientific discovery through structured debate, self-healing execution, and human collaboration, outperforming previous systems on the ARC-Bench benchmark by 54.7%.
PaperBench: Evaluating AI’s Ability to Replicate AI Research
OpenAI introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research, using 20 ICML 2024 papers with 8,316 gradable tasks. The best-performing model (Claude 3.5 Sonnet) achieves only a 21% replication score, below human PhD-level performance, highlighting current limitations in autonomous research capabilities.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
WildClawBench evaluates language and vision-language models on realistic long-horizon tasks using actual CLI environments with real tools. The benchmark reveals that even the best model achieves only 62.2% accuracy, indicating long-horizon agent evaluation remains challenging.
ClawBench: Can AI agents complete everyday online tasks?
ClawBench is a benchmark for evaluating AI agents on everyday online tasks. This V2 update brings improvements or new tasks.