@AlexGDimakis: I am very excited about this research: We show 2 things: 1. If you just do random sampling (i.e. you try to solve a pro…


Summary

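This research compares AI coding agents (like Claude-Code and Codex) with human expert coders on long-horizon tasks from the AtCoder Heuristic Contest. After a few hours, agents scale like memoryless random sampling, with Elo growing linearly in log(test-time-compute), while humans scale super-linearly. The author reads this as evidence that humans do continual learning while solving a problem, and as a key limitation of current AI on tasks that take many days.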


Cached at: 06/17/26, 03:47 AM

I am very excited about this research: We show 2 things:

  1. If you just do random sampling (i.e. you try to solve a problem k times independently, and keep the best) your Elo scaling will be linear in log(test-time-compute). Agents like Claude-Code and Codex scale like that after a few hours.
  2. We compare human expert coders to coding agents on the same tasks (from AtCoder Heuristic Contest). The exciting finding is that humans scale super-linearly. This is evidence that humans do continual learning, while they are solving a problem! I.e. they learn more about the coding problem they are trying to solve and scale fundamentally better compared to randomly trying things in a memoryless fashion.
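
To make the random-sampling baseline in point 1 concrete, here is a minimal sketch of best-of-k sampling. The names `best_of_k`, `noisy_attempt`, and `mean_best_of_k` are hypothetical, and modelling one attempt as a standard Gaussian score is an assumption for illustration, not the setup used in the research.

```python
import random

def best_of_k(attempt, k, rng):
    """Run `attempt` k times independently and keep the highest score."""
    return max(attempt(rng) for _ in range(k))

def noisy_attempt(rng):
    # Hypothetical stand-in for one agent run: the score is drawn from a
    # fixed distribution, so nothing is learned between attempts.
    return rng.gauss(0.0, 1.0)

def mean_best_of_k(k, trials=2000, seed=0):
    """Average best-of-k score over many repetitions (Monte Carlo estimate)."""
    rng = random.Random(seed)
    total = sum(best_of_k(noisy_attempt, k, rng) for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    for k in (1, 4, 16, 64):
        print(k, round(mean_best_of_k(k), 3))
```

Each attempt is memoryless here: later tries get no information from earlier ones, which is the property the post contrasts with human problem solving.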

This is empirical evidence that supports what many of us have felt for a while: unless we solve continual learning we will not be able to outperform humans in tasks that take many days. Current coding agents are not able to do this.

It’s Claude Code, so it can do whatever it wants. But it doesn’t.

It’s an easy proof: assume the performance on a task is a zero-mean Gaussian with variance 1, and sample it k times. Random sampling takes the max performance over the k tries. Elo can be computed from the probability that k1 trials beat k2 trials, i.e. that the max of k1 Gaussians is bigger than the max of k2 Gaussians. Elo(k) = constant + log(k)
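
The proof sketch can be checked numerically. Among k1 + k2 independent draws from the same continuous distribution, each draw is equally likely to be the overall maximum, so the best of k1 beats the best of k2 with probability k1 / (k1 + k2). Plugging that into the standard logistic Elo model (assumed here, with its usual 400-point scale) gives a gap of 400 * log10(k1 / k2), which is the "constant + log(k)" form up to scaling. The function names below are hypothetical; this is a sketch under those assumptions, not the authors' code.

```python
import math
import random

def win_probability(k1, k2):
    """Exact P(best of k1 draws > best of k2 draws) for i.i.d. continuous scores."""
    return k1 / (k1 + k2)

def elo_gap(k1, k2):
    """Elo difference implied by that win probability under the logistic Elo model."""
    p = win_probability(k1, k2)
    return 400.0 * math.log10(p / (1.0 - p))

def simulated_win_rate(k1, k2, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability using standard Gaussians."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        a = max(rng.gauss(0.0, 1.0) for _ in range(k1))
        b = max(rng.gauss(0.0, 1.0) for _ in range(k2))
        wins += a > b
    return wins / trials

if __name__ == "__main__":
    print(win_probability(4, 2), simulated_win_rate(4, 2))
    # Doubling k adds the same Elo increment each time: linear in log(k).
    for k in (2, 4, 8, 16):
        print(k, round(elo_gap(k, 1), 1))
```

Because the win probability depends only on the ratio k1 / k2, the result does not rely on the Gaussian assumption; any continuous score distribution gives the same Elo gap.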
