PaperBench: Evaluating AI’s Ability to Replicate AI Research

OpenAI Blog Papers

Summary

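OpenAI introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, and rubrics break that work into 8,316 individually gradable tasks. The best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. On a subset of the benchmark attempted by top ML PhDs, models do not yet outperform the human baseline, which highlights the current limits of autonomous research capabilities.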


Cached at: 04/20/26, 02:53 PM

# PaperBench: Evaluating AI’s Ability to Replicate AI Research

Source: [https://openai.com/index/paperbench/](https://openai.com/index/paperbench/)

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments.

For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge’s performance by creating a separate benchmark for judges.

We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We [open-source](https://github.com/openai/preparedness/tree/main/project/paperbench) our code to facilitate future research in understanding the AI engineering capabilities of AI agents.
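The text above says that rubrics decompose each replication task hierarchically into gradable sub-tasks, but it does not say how individual grades roll up into a replication score. The following is a minimal sketch of one plausible scheme, assuming that each leaf receives a pass/fail grade and that each parent's score is the weighted average of its children's scores. The names `RubricNode`, `score`, and `count_leaves`, the weights, and the example requirements are all hypothetical and are not taken from the PaperBench code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree (not the PaperBench schema)."""
    name: str
    weight: float = 1.0             # assumed importance relative to siblings
    passed: Optional[bool] = None   # pass/fail grade, meaningful only on leaves
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Return a score in [0, 1]; assumed rule: weighted average of children."""
    if not node.children:
        # Leaf: an ungraded leaf (passed is None) counts as failed.
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"children of {node.name!r} have no positive weight")
    weighted = sum(child.weight * score(child) for child in node.children)
    return weighted / total_weight


def count_leaves(node: RubricNode) -> int:
    """Count the individually gradable tasks (leaves) under a node."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


# Illustrative rubric for a single made-up paper.
root = RubricNode("replicate paper", children=[
    RubricNode("develop codebase", weight=2.0, children=[
        RubricNode("data loader implemented", passed=True),
        RubricNode("training loop implemented", passed=False),
    ]),
    RubricNode("experiments executed", weight=1.0, passed=False),
    RubricNode("reported result reproduced", weight=1.0, passed=True),
])

paper_score = score(root)
print(f"{count_leaves(root)} gradable tasks, score {paper_score:.2f}")
# prints: 4 gradable tasks, score 0.50
```

Under this assumed scheme, a benchmark-level figure such as an average replication score would be the mean of the per-paper root scores. A judge, human or LLM-based, only needs to grade leaves, which keeps each grading decision small and checkable.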

Similar Articles

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Hugging Face Daily Papers

ResearchClawBench is a benchmark for evaluating end-to-end autonomous scientific research across 40 tasks from 10 domains, revealing that current AI agents and LLMs achieve low re-discovery accuracy, with Claude Code averaging 21.5 and Claude-Opus-4.7 averaging 20.7 out of a possible score.

AI Coding Agents Can Reproduce Social Science Findings

arXiv cs.CL

This paper introduces SocSci-Repro-Bench, a benchmark of 221 tasks to evaluate AI coding agents' ability to reproduce social science findings from original data and code. It finds that frontier agents like Claude Code and Codex can reproduce a large share of results, with Claude substantially outperforming Codex, and that results are not primarily driven by memorization.