PaperBench: Evaluating AI’s Ability to Replicate AI Research

OpenAI Blog 04/02/25, 10:15 AM Papers

Summary

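OpenAI introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, and rubrics break the work into 8,316 individually gradable tasks. The best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of only 21.0%, and models do not yet outperform a baseline set by top ML PhDs on a subset of the benchmark, highlighting current limits on autonomous research capabilities.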

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research.

Cached at: 04/20/26, 02:53 PM

# PaperBench: Evaluating AI’s Ability to Replicate AI Research

Source: [https://openai.com/index/paperbench/](https://openai.com/index/paperbench/)

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge’s performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We [open-source](https://github.com/openai/preparedness/tree/main/project/paperbench) our code to facilitate future research in understanding the AI engineering capabilities of AI agents.
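To make the idea of a hierarchical rubric concrete, here is a minimal sketch of how a tree of sub-tasks could be rolled up into a single replication score. The abstract does not say how grades are aggregated, so the scheme below is an assumption for illustration: leaves are graded pass/fail, and each parent takes the weighted average of its children. The names `RubricNode`, `count_leaves`, and `score`, and the example requirements and weights, are all hypothetical and are not taken from the PaperBench codebase.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node in a hypothetical rubric tree (not the PaperBench schema)."""
    requirement: str
    weight: float = 1.0
    children: List["RubricNode"] = field(default_factory=list)
    passed: Optional[bool] = None  # set by a grader, on leaf nodes only

    def is_leaf(self) -> bool:
        return not self.children


def count_leaves(node: RubricNode) -> int:
    """Count the individually gradable tasks (the leaves) under a node."""
    if node.is_leaf():
        return 1
    return sum(count_leaves(child) for child in node.children)


def score(node: RubricNode) -> float:
    """Assumed aggregation: a leaf scores 1.0 or 0.0, and a parent scores
    the weighted average of its children."""
    if node.is_leaf():
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    if total_weight == 0:
        return 0.0
    weighted = sum(child.weight * score(child) for child in node.children)
    return weighted / total_weight


# Made-up rubric for a single paper, with grades already assigned.
rubric = RubricNode("Replicate the paper", children=[
    RubricNode("Implement the method", weight=2.0, children=[
        RubricNode("Model architecture matches the paper", passed=True),
        RubricNode("Training loop runs to completion", passed=False),
    ]),
    RubricNode("Reproduce the main result", weight=1.0, passed=False),
])

print(count_leaves(rubric))     # 3 gradable leaves
print(round(score(rubric), 3))  # 0.333
```

Under this assumed scheme, a benchmark-level figure such as the reported average replication score would be the mean of the per-paper root scores, and the grader that fills in `passed` could be a human or an LLM-based judge.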

Similar Articles

ProgramBench (5 minute read)

Evaluating AI’s ability to perform scientific research tasks

I built a benchmark for AI “memory” in coding agents. looking for others to beat it.

@JeremyNguyenPhD: "I left 3 AI agents alone with a research problem overnight. They came back with 72 peer-reviewed papers" -- @ProfJieDi…

@AiwithYasir: Just IN: This paper from Stanford and Harvard explains why most “agentic AI” systems feel impressive in demos and then …
