@AlexGDimakis: I am very excited about this research: We show 2 things: 1. If you just do random sampling (i.e. you try to solve a pro…
Summary
This research compares AI coding agents (like Claude-Code and Codex) with human expert coders on long-horizon tasks, showing that humans scale super-linearly due to continual learning while agents plateau, highlighting a key limitation of current AI in extended problem-solving.
Cached at: 06/17/26, 03:47 AM
I am very excited about this research: We show 2 things:
- If you just do random sampling (i.e. you try to solve a problem k times independently and keep the best), your Elo will scale linearly in log(test-time compute). Agents like Claude Code and Codex scale like that after a few hours.
- We compare human expert coders to coding agents on the same tasks (from AtCoder Heuristic Contest). The exciting finding is that humans scale super-linearly. This is evidence that humans do continual learning while they are solving a problem! That is, they learn more about the coding problem they are trying to solve, and so scale fundamentally better than randomly trying things in a memoryless fashion.
This is empirical evidence that supports what many of us have felt for a while: unless we solve continual learning we will not be able to outperform humans in tasks that take many days. Current coding agents are not able to do this.
It’s Claude Code, so it can do whatever it wants. But it doesn’t.
It’s an easy proof: assume the performance on a task is a zero-mean Gaussian with variance 1. Sample it k times. Random sampling takes the max performance over the k tries. Elo can be computed from the probability of k1 trials beating k2, i.e. the probability that the max of k1 Gaussians is bigger than the max of k2 Gaussians. By symmetry that probability is k1/(k1+k2), so the win odds are k1/k2 and, up to a constant scale factor, Elo(k) = constant + log(k).
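The argument above can be checked numerically. The sketch below is not from the thread; it assumes the standard Elo logistic convention (400 points per factor of 10 in win odds), and the names `win_probability` and `elo_gap` are hypothetical. It draws best-of-k1 and best-of-k2 samples from a unit Gaussian, estimates how often the first beats the second, and compares the result with k1/(k1+k2), which holds for any i.i.d. continuous scores because each of the k1+k2 draws is equally likely to be the largest.

```python
import math
import random


def win_probability(k1, k2, trials=20000, seed=0):
    """Monte Carlo estimate of P(best of k1 draws > best of k2 draws), N(0, 1) scores."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        best1 = max(rng.gauss(0.0, 1.0) for _ in range(k1))
        best2 = max(rng.gauss(0.0, 1.0) for _ in range(k2))
        wins += best1 > best2  # True counts as 1
    return wins / trials


def elo_gap(p):
    """Elo difference implied by win probability p (assumed 400-point, base-10 convention)."""
    return 400.0 * math.log10(p / (1.0 - p))


if __name__ == "__main__":
    for k1, k2 in [(2, 1), (4, 1), (8, 2), (16, 4)]:
        p_sim = win_probability(k1, k2)
        p_exact = k1 / (k1 + k2)            # symmetry argument
        gap_exact = 400.0 * math.log10(k1 / k2)  # log-linear in k
        print(f"k1={k1:2d} k2={k2:2d}  simulated p={p_sim:.3f}  exact p={p_exact:.3f}  "
              f"simulated Elo gap={elo_gap(p_sim):.1f}  exact Elo gap={gap_exact:.1f}")
```

Under these assumptions the Elo gap depends only on the ratio k1/k2, so each doubling of independent attempts adds the same fixed number of rating points, which is the log-linear scaling described in the thread.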