PaperBench: Evaluating AI’s Ability to Replicate AI Research
Summary
OpenAI introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research: agents must reproduce the results of 20 ICML 2024 papers, graded against 8,316 individually gradable tasks. The best-performing model (Claude 3.5 Sonnet) achieves a replication score of only 21%, below the human PhD-level baseline, highlighting the current limitations of autonomous research capabilities.
Similar Articles
ProgramBench (5 minute read)
ProgramBench is a new benchmark that evaluates AI agents' ability to reconstruct complete software projects from compiled binaries and documentation without access to source code or decompilation tools.
Evaluating AI’s ability to perform scientific research tasks
OpenAI introduces FrontierScience, a new benchmark for measuring expert-level AI scientific capabilities across physics, chemistry, and biology, with GPT-5.2 achieving 77% on olympiad-style tasks and 25% on research-style tasks. The paper presents early evidence that GPT-5 meaningfully accelerates real scientific workflows, shortening work from weeks to hours while establishing metrics for tracking progress toward AI-accelerated science.
I built a benchmark for AI “memory” in coding agents. Looking for others to beat it.
A developer created a new benchmark called continuity-benchmarks to test AI coding agents' ability to stay consistent with project rules during active development, addressing gaps in existing memory benchmarks, which focus on semantic recall rather than real-time architectural consistency and multi-session behavior.
@JeremyNguyenPhD: "I left 3 AI agents alone with a research problem overnight. They came back with 72 peer-reviewed papers" -- @ProfJieDi…
Professor Jie Ding open-sourced Autoresearch and WorldSeed, AI agent frameworks capable of autonomously reviewing 72 peer-reviewed papers overnight to address a research problem.
@AiwithYasir: Just IN: This paper from Stanford and Harvard explains why most “agentic AI” systems feel impressive in demos and then …
A paper from Stanford and Harvard researchers argues that agentic AI systems fail in real-world deployment not because they lack intelligence, but due to fundamental issues that cause demo performance to collapse in practice.