@rohanpaul_ai: Meta paper shows that coding agents get much better when they reuse short summaries of past attempts instead of raw log…

X AI KOLs Following 05/23/26, 02:29 PM Papers

coding-agents test-time-scaling meta summarization agent-memory swe-bench terminal-bench

Summary

A Meta paper shows that coding agents improve significantly when they reuse short summaries of past attempts instead of raw logs, achieving strong gains on SWE-Bench and Terminal-Bench with Claude 4.5 Opus.

Original Article

Cached at: 05/23/26, 04:10 PM

Meta paper shows that coding agents get much better when they reuse short summaries of past attempts instead of raw logs.

In other words, stronger coding agents need not just more attempts, but better ways to remember those attempts.

That sounds obvious until you look at what an agent actually produces: not an answer, but a messy trail of file reads, shell commands, errors, partial fixes, and abandoned ideas.

The paper’s idea is to turn each full attempt into a compact summary of the main guess, partial progress, and failure points, then use those summaries both to pick the best attempts and to guide new ones.
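To make the shape of such a summary concrete, here is a minimal sketch of what one record might look like. The class name, field names, and the `max_chars` cap are assumptions made for illustration; the post only says a summary covers the main guess, partial progress, and failure points.

```python
from dataclasses import dataclass, field


@dataclass
class AttemptSummary:
    """Compact record of one agent attempt (hypothetical field names)."""
    main_guess: str                                        # the hypothesis the attempt pursued
    partial_progress: list[str] = field(default_factory=list)
    failure_points: list[str] = field(default_factory=list)

    def render(self, max_chars: int = 600) -> str:
        """Flatten the summary into short text that a later attempt could read."""
        lines = [f"Guess: {self.main_guess}"]
        lines += [f"Progress: {item}" for item in self.partial_progress]
        lines += [f"Failed: {item}" for item in self.failure_points]
        # Hard cap so a summary never grows back into a raw log.
        return "\n".join(lines)[:max_chars]
```

The point of the cap is the same as the point of the post: the record stays small enough to compare against other attempts and to paste into a new prompt.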

Test-time scaling breaks when the model cannot compare its own past work.

For short answers, ranking is easy.

For long-horizon coding, the bottleneck shifts from generation to representation.

Once rollouts become summaries, two useful things happen.

The system can run tournament-style selection over small groups of candidates, which works better than forcing one giant comparison, and it can feed the best summaries back into a fresh round of attempts instead of starting blind.
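One way to picture selection over small groups is the sketch below. It is not the paper's algorithm: the voting scheme, the `judge` callable (which in practice would presumably be a model call), and the parameters `group_size`, `rounds`, `keep`, and `seed` are all assumptions.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def tournament_vote(
    summaries: Sequence[str],
    judge: Callable[[List[str]], int],  # hypothetical: returns index of the best item in a group
    group_size: int = 4,
    rounds: int = 8,
    keep: int = 2,
    seed: int = 0,
) -> List[str]:
    """Pick `keep` summaries by repeated small-group comparisons."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    rng = random.Random(seed)
    order = list(range(len(summaries)))
    votes: Counter = Counter()
    for _ in range(rounds):
        rng.shuffle(order)
        # Compare a few candidates at a time instead of all of them at once.
        for start in range(0, len(order), group_size):
            group = order[start:start + group_size]
            winner = judge([summaries[i] for i in group])
            votes[group[winner]] += 1
    # Most group wins first; ties go to the earlier candidate.
    ranked = sorted(range(len(summaries)), key=lambda i: (-votes[i], i))
    return [summaries[i] for i in ranked[:keep]]
```

Each call to `judge` sees at most `group_size` summaries, so no single comparison has to hold every candidate at once.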

The authors test this on two hard coding benchmarks by running many attempts in parallel, selecting promising summaries with a tournament-style voting method, and then launching fresh attempts that can read the selected summaries first.
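The loop described above (parallel attempts, summarize, select, then fresh attempts that read the selected summaries) could be wired together roughly as follows. This is only a sketch: `run_attempt`, `summarize`, and `select` are hypothetical callables standing in for the agent, the summarizer, and the selection step, and the post does not say how the final patch is chosen, so the function just returns the last selected summaries.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def scale_with_summaries(
    task: str,
    run_attempt: Callable[[str, List[str]], str],   # hypothetical: (task, hints) -> raw attempt log
    summarize: Callable[[str], str],                # hypothetical: raw log -> short summary
    select: Callable[[List[str]], List[str]],       # hypothetical: summaries -> chosen summaries
    n_parallel: int = 4,
    n_rounds: int = 2,
) -> List[str]:
    """Run rounds of parallel attempts, passing selected summaries forward."""
    hints: List[str] = []  # the first round starts blind
    for _ in range(n_rounds):
        with ThreadPoolExecutor(max_workers=n_parallel) as pool:
            # Every attempt in this round reads the same selected summaries.
            logs = list(pool.map(lambda _i: run_attempt(task, hints), range(n_parallel)))
        summaries = [summarize(log) for log in logs]
        hints = select(summaries)
    return hints
```

With `n_rounds=1` this reduces to plain parallel sampling plus selection; the second round is where earlier experience gets reused.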

The results are strong, with Claude 4.5 Opus rising from 70.9% to 77.6% on SWE-Bench Verified and from 46.9% to 59.1% on Terminal-Bench v2.0.
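For scale, the reported numbers work out to gains of 6.7 and 12.2 percentage points. The short script below computes those, plus two derived views (relative gain and share of previous failures removed) that are not from the post, only arithmetic on its four figures. The helper name `gain_stats` is made up for this example.

```python
# Scores as reported in the post: (before, after), in percent.
results = {
    "SWE-Bench Verified": (70.9, 77.6),
    "Terminal-Bench v2.0": (46.9, 59.1),
}


def gain_stats(before: float, after: float) -> tuple:
    """Return (absolute points, relative gain %, % of previous failures removed)."""
    absolute = after - before
    relative = absolute / before * 100
    failures_removed = absolute / (100 - before) * 100
    return round(absolute, 1), round(relative, 1), round(failures_removed, 1)


for name, (before, after) in results.items():
    print(name, gain_stats(before, after))
```

On both benchmarks the gain amounts to roughly 23% of the tasks that previously failed.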

The paper's main takeaway is that better test-time scaling for long-horizon coding agents depends less on making more attempts than on storing experience in a form the agent can actually reuse.

Paper Link – arxiv.org/abs/2604.16529

Paper Title: “Scaling Test-Time Compute for Agentic Coding”

Similar Articles

@rohanpaul_ai: Brilliant new paper from Meta, CMU and other labs. Shows that coding agents improve faster by manufacturing their own s…

@rohanpaul_ai: This Meta + Stanford + Illinois survey paper argues that AI agents work better when code becomes their main working lay…

SWE Context Bench just proved something I think a lot of coding agent users already feel

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

@rohanpaul_ai: New Meta, Stanford, Google and many other top labs paper proposes AutoResearchClaw. Shows that automated research impro…
