ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

benchmark autonomous-research scientific-research ai-agents evaluation llms

Summary

ResearchClawBench is a benchmark for evaluating end-to-end autonomous scientific research across 40 tasks from 10 domains. It shows that current AI agents and LLMs achieve low re-discovery scores: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7.

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

Original Article


Cached at: 06/09/26, 08:44 AM


Source: https://huggingface.co/papers/2606.07591

Published on May 28

#2 Paper of the day

Abstract

ResearchClawBench evaluates autonomous scientific research capabilities across 40 tasks from 10 domains using expert-curated criteria and reveals current limitations in re-discovery accuracy among AI agents and LLMs.

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.
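The abstract says rubrics decompose each target artifact into weighted criteria but does not give the aggregation formula. The sketch below shows one plausible reading, a weighted mean of per-criterion scores averaged over tasks. The names `Criterion`, `task_score`, and `benchmark_mean`, the per-criterion score range of 0 to 1, and the 0 to 100 reporting scale are all assumptions made for illustration, not the benchmark's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One hypothetical rubric item for a task."""
    description: str
    weight: float  # relative importance assigned by the rubric author
    score: float   # assumed: judged fraction satisfied, in [0, 1]


def task_score(criteria: list[Criterion]) -> float:
    """Weighted mean of criterion scores, scaled to 0-100 (assumed scale)."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric must have positive total weight")
    weighted = sum(c.weight * c.score for c in criteria)
    return 100.0 * weighted / total_weight


def benchmark_mean(task_scores: list[float]) -> float:
    """Unweighted average over tasks (assumed; tasks could also be weighted)."""
    if not task_scores:
        raise ValueError("need at least one task score")
    return sum(task_scores) / len(task_scores)


if __name__ == "__main__":
    # Made-up rubric values, not taken from the benchmark.
    rubric = [
        Criterion("reproduces the main result", 2.0, 0.5),
        Criterion("uses the target experimental protocol", 1.0, 0.0),
        Criterion("figure matches the reported trend", 1.0, 0.0),
    ]
    print(task_score(rubric))
```

Under this reading, a heavily weighted criterion that is only half satisfied dominates the task score, which is consistent with the reported failure modes (protocol mismatch, missing scientific core) pulling averages down.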


Get this paper in your agent:

hf papers read 2606.07591

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Datasets citing this paper: 1

InternScience/ResearchClawBench (Benchmark)


Similar Articles

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

PaperBench: Evaluating AI’s Ability to Replicate AI Research

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

ClawBench: Can AI agents complete everyday online tasks?
