@rohanpaul_ai: Meta paper shows that coding agents get much better when they reuse short summaries of past attempts instead of raw log…
Summary
A Meta paper shows that coding agents improve significantly when they reuse short summaries of past attempts instead of raw logs, achieving strong gains on SWE-Bench and Terminal-Bench with Claude 4.5 Opus.
Cached at: 05/23/26, 04:10 PM
Meta paper shows that coding agents get much better when they reuse short summaries of past attempts instead of raw logs.
In other words, stronger coding agents need not just more attempts but better ways to remember those attempts.
That sounds obvious until you look at what an agent actually produces: not an answer, but a messy trail of file reads, shell commands, errors, partial fixes, and abandoned ideas.
The paper’s idea is to turn each full attempt into a compact summary of the main guess, partial progress, and failure points, then use those summaries both to pick the best attempts and to guide new ones.
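To make the idea concrete, here is a minimal sketch of what such a compact summary could look like as a data structure. The class name, field names, and example values are hypothetical, chosen to mirror the three ingredients named above (main guess, partial progress, failure points); the paper's actual summary format may differ, and in practice a model would write the summary from the raw rollout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AttemptSummary:
    """Compact record of one full agent attempt (illustrative field names)."""
    main_guess: str
    partial_progress: List[str] = field(default_factory=list)
    failure_points: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Flatten the summary into a few lines of text a later attempt can read."""
        lines = [f"Main guess: {self.main_guess}"]
        lines += [f"Progress: {p}" for p in self.partial_progress]
        lines += [f"Failed at: {f}" for f in self.failure_points]
        return "\n".join(lines)


# Made-up example values, standing in for what a summarizer might produce.
example = AttemptSummary(
    main_guess="Off-by-one in the pagination helper",
    partial_progress=["Reproduced the failing test"],
    failure_points=["Patch broke the empty-page case"],
)
print(example.render())
```

The point of the sketch is only the size difference: a few lines like these stand in for a trail of file reads, shell commands, and errors that may run to thousands of lines.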
Test-time scaling breaks when the model cannot compare its own past work.
For short answers, ranking is easy.
For long-horizon coding, the bottleneck shifts from generation to representation.
Once rollouts become summaries, two useful things happen.
The system can run tournament-style selection over small groups of candidates, which works better than forcing one giant comparison, and it can feed the best summaries back into a fresh round of attempts instead of starting blind.
The authors test this on two hard coding benchmarks by running many attempts in parallel, selecting promising summaries with a tournament-style voting method, and then launching fresh attempts that can read the selected summaries first.
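The select-then-retry loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the function names, the group size, and the prompt layout are all hypothetical, and `pick_best` is a toy stand-in for whatever model-based vote the authors use to choose a winner within each small group.

```python
from typing import Callable, List, Sequence


def tournament_select(
    summaries: Sequence[str],
    pick_best: Callable[[Sequence[str]], int],
    group_size: int = 4,
    max_survivors: int = 2,
) -> List[str]:
    """Split candidates into small groups, keep each group's winner, and
    repeat until at most `max_survivors` remain (may return fewer)."""
    if group_size < 2 or max_survivors < 1:
        raise ValueError("group_size must be >= 2 and max_survivors >= 1")
    pool = list(summaries)
    while len(pool) > max_survivors:
        groups = [pool[i:i + group_size] for i in range(0, len(pool), group_size)]
        # pick_best returns the index of the winner inside its group.
        pool = [group[pick_best(group)] for group in groups]
    return pool


def build_next_round_prompt(task: str, selected: Sequence[str]) -> str:
    """Prepend the selected summaries so a fresh attempt does not start blind."""
    notes = "\n".join(f"- {s}" for s in selected)
    return f"Task: {task}\n\nNotes from earlier attempts:\n{notes}\n"


# Toy judge: made-up scores replace the model-based vote.
scores = {"s0": 0.1, "s1": 0.7, "s2": 0.3, "s3": 0.2,
          "s4": 0.5, "s5": 0.4, "s6": 0.9, "s7": 0.6}


def toy_judge(group: Sequence[str]) -> int:
    return max(range(len(group)), key=lambda i: scores[group[i]])


selected = tournament_select(list(scores), toy_judge)
prompt = build_next_round_prompt("Fix the failing test", selected)
print(prompt)
```

Each call to the judge only ever sees `group_size` summaries at once, which is the property the post highlights: small comparisons instead of one giant comparison across every candidate.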
The results are strong, with Claude 4.5 Opus rising from 70.9% to 77.6% on SWE-Bench Verified and from 46.9% to 59.1% on Terminal-Bench v2.0.
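For a sense of scale, the reported scores can be turned into gains with a few lines of arithmetic. The numbers are the ones quoted above; the helper name `gains` is just for illustration, and the relative figure is a derived quantity, not one reported in the post.

```python
from typing import Tuple

# (baseline %, with summary reuse %) as quoted in the post.
results = {
    "SWE-Bench Verified": (70.9, 77.6),
    "Terminal-Bench v2.0": (46.9, 59.1),
}


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return absolute, relative


for name, (before, after) in results.items():
    abs_gain, rel_gain = gains(before, after)
    print(f"{name}: +{abs_gain} points ({rel_gain}% relative)")
```

The absolute gains work out to 6.7 points on SWE-Bench Verified and 12.2 points on Terminal-Bench v2.0, so the improvement is larger on the benchmark with the lower baseline.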
The key takeaway is that better test-time scaling for long-horizon coding agents depends less on making more attempts than on storing experience in a form the agent can actually reuse.
Paper Link – arxiv.org/abs/2604.16529
Paper Title: “Scaling Test-Time Compute for Agentic Coding”
Satya Nadella reveals how Microsoft is applying the concept of “Lean for knowledge work” internally with AI.
He discusses the internal ROI on AI investment and how to leverage the cost-reduction effect of AI.
The approach borrows from Toyota’s manufacturing efficiency principles and applies them to white-collar operations powered by AI.
For example, Microsoft spends approximately $4 billion per year on customer support operations. By deploying AI agents for front-end deflection (resolving issues before they reach human agents) and for real-time reasoning assistance to support staff, the company is dramatically reducing costs in areas like Xbox and Azure support.
From the “Bg2 Pod” YouTube channel (link in comment)
Similar Articles
@rohanpaul_ai: Brilliant new paper from Meta, CMU and other labs. Shows that coding agents improve faster by manufacturing their own s…
A new paper from Meta, CMU, and other labs presents Self-play SWE-RL, a method where coding agents train themselves by manufacturing and fixing bugs in real codebases, achieving significant gains on SWE-bench benchmarks without relying on human-written tasks.
@rohanpaul_ai: This Meta + Stanford + Illinois survey paper argues that AI agents work better when code becomes their main working lay…
This survey paper from Meta, Stanford, and Illinois argues that AI agents perform better when code is used as their primary working layer, treating code as the environment for reasoning, action, and modeling. The authors introduce the concept of an 'agent harness' encompassing tools, memory, sandboxes, and feedback loops.
SWE Context Bench just proved something I think a lot of coding agent users already feel
A new benchmark paper 'SWE Context Bench' tests whether coding agents can reuse knowledge across tasks, highlighting a gap in existing benchmarks that only evaluate isolated problem-solving. The author discusses solutions like external memory and mentions tools such as langmem, mem0, supermemory, and Greplica.
EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions
Introduces EvoCode-Bench, a benchmark of 26 stateful coding tasks across 227 rounds that evaluates coding agents in multi-turn iterative interactions, revealing that single-round performance overestimates multi-round capabilities by 22–40 points.
@rohanpaul_ai: New Meta, Stanford, Google and many other top labs paper proposes AutoResearchClaw. Shows that automated research impro…
A new paper from Meta, Stanford, and Google introduces AutoResearchClaw, which improves automated research by integrating failure recovery, debate, and selective human input. It outperforms AI Scientist v2 by 54.7% on ARC-Bench and reveals that autonomy is enhanced when constrained by process rather than given unlimited freedom.