@IntologyAI: Can coding agents do research? We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D proble…


Summary

IntologyAI releases NanoGPT-Bench, an internal benchmark to evaluate coding agents on AI R&D tasks. Current agents recover only 9.3% of human progress, mostly through hyperparameter tuning, highlighting gaps in algorithmic research capabilities.

Can coding agents do research?

We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D problem with months of human progress.

Codex, Claude Code, and Autoresearch recover only 9.3% of human progress, mostly tuning hyperparams & ignoring algorithmic research.

NanoGPT-Bench is built on the NanoGPT Speedrun, a popular LLM pretraining competition to minimize the training time of a GPT-2 style model. Existing human submissions constitute nearly 2 years of work. To control for dependencies and contamination in frontier models, we standardize evaluation to a 5-month window of world records. Evaluation is fully autonomous and end-to-end, with no human intervention or internet access.
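The post does not spell out how "percent of human progress recovered" is computed. One plausible reading, sketched below as an assumption rather than the benchmark's actual definition, is the share of the human speedup (from the window's starting record to its final record) that the agent's best run achieves. The function name, its parameters, and the times in the usage line are all hypothetical and are not NanoGPT-Bench results.

```python
def recovered_fraction(baseline_time: float, record_time: float, agent_time: float) -> float:
    """Share of the human speedup that an agent recovers (assumed definition).

    baseline_time: training time of the record at the start of the window
    record_time:   training time of the best human record at the end of the window
    agent_time:    training time of the agent's best valid run
    All three must use the same unit (for example seconds).
    """
    human_gain = baseline_time - record_time
    if human_gain <= 0:
        raise ValueError("record_time must be faster than baseline_time")
    # 0.0 means no improvement over the baseline, 1.0 means matching the human record.
    return (baseline_time - agent_time) / human_gain


# Made-up times for illustration only: humans cut 200 s to 150 s, the agent reaches 195 s.
print(f"{recovered_fraction(200.0, 150.0, 195.0):.1%}")
```

Under this reading, a score near 9.3% would mean the agents closed less than a tenth of the gap that human submissions closed over the same window.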

Cached at: 05/20/26, 06:25 AM


Similar Articles

PaperBench: Evaluating AI’s Ability to Replicate AI Research

OpenAI Blog

OpenAI introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research. Agents must reproduce 20 ICML 2024 papers, broken down into 8,316 gradable tasks. The best-performing model (Claude 3.5 Sonnet) achieves a replication score of only 21%, below human PhD-level performance, highlighting current limitations in autonomous research capabilities.

AI Coding Agents Can Reproduce Social Science Findings

arXiv cs.CL

This paper introduces SocSci-Repro-Bench, a benchmark of 221 tasks to evaluate AI coding agents' ability to reproduce social science findings from original data and code. It finds that frontier agents like Claude Code and Codex can reproduce a large share of results, with Claude substantially outperforming Codex, and that results are not primarily driven by memorization.

ProgramBench (5 minute read)

TLDR AI

ProgramBench is a new benchmark that evaluates AI agents' ability to reconstruct complete software projects from compiled binaries and documentation without access to source code or decompilation tools.