@IntologyAI: Can coding agents do research? We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D proble…

X AI KOLs Following 05/19/26, 03:49 PM Tools

coding-agents ai-research benchmark nanoGPT evaluation llm-pretraining

Summary

IntologyAI releases NanoGPT-Bench, an internal benchmark to evaluate coding agents on AI R&D tasks. Current agents recover only 9.3% of human progress, mostly through hyperparameter tuning, highlighting gaps in algorithmic research capabilities.

Can coding agents do research?

We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D problem with months of human progress.

Codex, Claude Code, Autoresearch recover only 9.3% of human progress, mostly tuning hyperparams & ignoring algorithmic research.

NanoGPT-Bench is built on the NanoGPT Speedrun, a popular LLM pretraining competition to minimize the training time of a GPT-2 style model. Existing human submissions constitute nearly 2 years of work. To control for dependencies and contamination in frontier models, we standardize evaluation to a 5-month window of world records. Evaluation is fully autonomous and end-to-end, with no human intervention or internet access.
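The post does not say how "9.3% of human progress" is computed. One plausible reading, offered here only as an assumption, is the share of the training-time reduction between the first and last world record in the evaluation window that an agent reproduces when started from the first record. The sketch below illustrates that reading; the function name `fraction_recovered` and all timings are hypothetical and are not taken from NanoGPT-Bench.

```python
def fraction_recovered(baseline_s: float, human_best_s: float, agent_s: float) -> float:
    """Share of the baseline-to-human-record speedup that an agent reproduces.

    ASSUMPTION: this is an illustrative definition, not the benchmark's
    published metric. All times are training times in seconds.
    """
    human_gain = baseline_s - human_best_s
    if human_gain <= 0:
        raise ValueError("human record must be faster than the baseline")
    # 0.0 means no improvement over the baseline, 1.0 means the agent
    # matched the human record; a negative value means it got slower.
    return (baseline_s - agent_s) / human_gain


# Made-up timings chosen only to illustrate the arithmetic.
example = fraction_recovered(baseline_s=300.0, human_best_s=200.0, agent_s=290.7)
print(f"{example:.1%}")
```

Under this reading, a score near zero means the agent's final run is barely faster than the starting record, which matches the post's description of agents mostly tuning hyperparameters.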


Cached at: 05/20/26, 06:25 AM


