PaperBench: Evaluating AI’s Ability to Replicate AI Research

OpenAI Blog Papers

Summary

OpenAI introduces PaperBench, a benchmark that measures AI agents' ability to replicate state-of-the-art AI research: agents must reproduce 20 ICML 2024 Spotlight and Oral papers, graded against 8,316 individually gradable tasks. The best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of only 21%, below the human PhD baseline, highlighting the current limits of autonomous research capabilities.

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research.

# PaperBench: Evaluating AI’s Ability to Replicate AI Research

Source: [https://openai.com/index/paperbench/](https://openai.com/index/paperbench/)

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge’s performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We [open-source](https://github.com/openai/preparedness/tree/main/project/paperbench) our code to facilitate future research in understanding the AI engineering capabilities of AI agents.
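
The hierarchical rubric plus judge-based grading is the core of the evaluation. The exact rubric schema and aggregation rule are defined in the open-sourced repository; purely as an illustration of the idea (a paper decomposed into gradable leaf criteria whose grades roll up into one replication score), a minimal Python sketch might look like the following. The class name, example criteria, weights, and the weighted-average rule are assumptions for illustration, not the actual PaperBench implementation.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a hierarchical rubric: internal nodes group
# sub-tasks, leaves are individually gradable criteria in [0, 1].
# Names, weights, and the weighted-average rule are illustrative
# assumptions, not the actual PaperBench schema.

@dataclass
class RubricNode:
    name: str
    weight: float = 1.0            # relative importance among siblings
    grade: float | None = None     # leaf grade, e.g. assigned by an LLM judge
    children: list["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        """Weighted average of child scores; a leaf returns its own grade."""
        if not self.children:
            return self.grade if self.grade is not None else 0.0
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total_weight


# Toy rubric for one paper: two top-level sub-tasks, each decomposed
# into leaf criteria graded independently.
rubric = RubricNode("paper", children=[
    RubricNode("code development", weight=2.0, children=[
        RubricNode("training loop implemented", grade=1.0),
        RubricNode("proposed loss function implemented", grade=0.0),
    ]),
    RubricNode("experiment execution", weight=1.0, children=[
        RubricNode("main experiment runs end to end", grade=1.0),
    ]),
])

print(f"Replication score: {rubric.score():.1%}")  # (2.0*0.5 + 1.0*1.0) / 3.0 ≈ 66.7%
```

In the real benchmark, the leaf grades would come from the LLM-based judge inspecting an agent's submitted codebase and execution results rather than being hard-coded as above.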

Similar Articles

ProgramBench (5 minute read)

TLDR AI

ProgramBench is a new benchmark that evaluates AI agents' ability to reconstruct complete software projects from compiled binaries and documentation without access to source code or decompilation tools.

Evaluating AI’s ability to perform scientific research tasks

OpenAI Blog

OpenAI introduces FrontierScience, a new benchmark for measuring expert-level AI scientific capabilities across physics, chemistry, and biology, with GPT-5.2 achieving 77% on olympiad-style tasks and 25% on research-style tasks. The paper presents early evidence that GPT-5 meaningfully accelerates real scientific workflows, shortening work from weeks to hours while establishing metrics for tracking progress toward AI-accelerated science.