@IntologyAI: Can coding agents do research? We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D proble…
Summary
IntologyAI releases NanoGPT-Bench, an internal benchmark to evaluate coding agents on AI R&D tasks. Current agents recover only 9.3% of human progress, mostly through hyperparameter tuning, highlighting gaps in algorithmic research capabilities.
Cached at: 05/20/26, 06:25 AM
Can coding agents do research?
We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D problem with months of human progress
Codex, Claude Code, Autoresearch recover only 9.3% of human progress, mostly tuning hyperparams & ignoring algorithmic research
NanoGPT-Bench is built on the NanoGPT Speedrun, a popular LLM pretraining competition to minimize the training time of a GPT-2 style model. Existing human submissions constitute nearly 2 years of work. To control for dependencies and contamination in frontier models, we standardize evaluation to a 5-month window of world records. Evaluation is fully autonomous and end-to-end, with no human intervention or internet access.
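The post does not define how "9.3% of human progress" is computed. One plausible reading, offered only as an assumption, is the share of the human training-time reduction over the evaluation window that an agent reproduces. The sketch below illustrates that reading; the function name, its parameters, and the example numbers are hypothetical and are not taken from NanoGPT-Bench.

```python
def fraction_recovered(baseline_s: float, agent_s: float, human_best_s: float) -> float:
    """Share of the human speedup that an agent reproduces.

    Hypothetical metric, assumed for illustration:
      baseline_s   -- training time of the record at the start of the window
      agent_s      -- training time the agent reaches from that starting point
      human_best_s -- training time of the last human record in the window
    """
    human_gain = baseline_s - human_best_s
    if human_gain <= 0:
        raise ValueError("human record must be faster than the baseline")
    return (baseline_s - agent_s) / human_gain


# Made-up numbers for illustration only; they are not benchmark results.
print(f"{fraction_recovered(200.0, 190.0, 100.0):.1%}")
```

Under this reading, a score of 0% means the agent never improved on the starting record and 100% means it matched the final human record in the window.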
Similar Articles
PaperBench: Evaluating AI’s Ability to Replicate AI Research
OpenAI introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research. Agents must reproduce 20 ICML 2024 papers, graded across 8,316 individual tasks. The best-performing model (Claude 3.5 Sonnet) achieves a replication score of only 21%, below human PhD-level performance, highlighting current limitations in autonomous research capabilities.
@ChengleiSi: Excited to share these preliminary results on our internal autoresearch system @Recursive_SI, where we achieve SOTA on …
Recursive's automated AI research system achieves state-of-the-art results on NanoChat, NanoGPT Speedrun, and GPU kernel benchmarks by automating the research loop without task-specific adaptations. The team is open-sourcing the artifacts for further inspection.
AI Coding Agents Can Reproduce Social Science Findings
This paper introduces SocSci-Repro-Bench, a benchmark of 221 tasks to evaluate AI coding agents' ability to reproduce social science findings from original data and code. It finds that frontier agents like Claude Code and Codex can reproduce a large share of results, with Claude substantially outperforming Codex, and that results are not primarily driven by memorization.
ProgramBench (5 minute read)
ProgramBench is a new benchmark that evaluates AI agents' ability to reconstruct complete software projects from compiled binaries and documentation without access to source code or decompilation tools.
@KLieret: You can evaluate on ProgramBench yourself: https://github.com/facebookresearch/ProgramBench/… We will open the leaderbo…
ProgramBench is a new benchmark that tests AI agents' ability to reconstruct a complete codebase from a compiled binary and its documentation. The leaderboard will open for submissions soon.