research-replication

PaperBench: Evaluating AI’s Ability to Replicate AI Research

OpenAI Blog · 2025-04-02

OpenAI introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 papers, with the work broken down into 8,316 individually gradable tasks. The best-performing model (Claude 3.5 Sonnet) achieves a replication score of only 21%, below human PhD-level performance, which highlights the current limits of autonomous research capabilities.
