Scaling Test-Time Compute for Agentic Coding

Hugging Face Daily Papers

Summary

A test-time scaling framework for agentic coding that compresses rollout trajectories into structured summaries and uses Recursive Tournament Voting and Parallel-Distill-Refine to boost Claude-4.5-Opus to 77.6% on SWE-Bench Verified.

Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that can be directly compared, ranked or refined. Long-horizon coding agents violate this premise: each attempt produces an extended trajectory of actions, observations, errors, and partial progress taken by the agent. In this setting, the main challenge is no longer generating more attempts, but representing prior experience in a form that can be effectively selected from and reused. We propose a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. Our framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal trace details. This representation enables two complementary forms of inference-time scaling. For parallel scaling, we introduce Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons. For sequential scaling, we adapt Parallel-Distill-Refine (PDR) to the agentic setting by conditioning new rollouts on summaries distilled from prior attempts. Our method consistently improves the performance of frontier coding agents across SWE-Bench Verified and Terminal-Bench v2.0. For example, by using our method Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). Our results suggest that test-time scaling for long-horizon agents is fundamentally a problem of representation, selection, and reuse.
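The parallel-scaling component, Recursive Tournament Voting, can be sketched as a recursive narrowing loop over rollout summaries. The sketch below is illustrative only: the `judge` callback (in practice an LLM prompted to compare a small group of summaries) and the group size are assumptions, not the paper's exact procedure.

```python
from typing import Callable, List

def recursive_tournament_voting(
    summaries: List[str],
    judge: Callable[[List[str]], int],
    group_size: int = 4,
) -> str:
    """Recursively narrow a population of rollout summaries.

    `judge` receives a small group of summaries and returns the index
    of the most promising one; in the paper's setting this would be an
    LLM comparison over structured rollout summaries.
    """
    pool = list(summaries)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool), group_size):
            group = pool[i:i + group_size]
            if len(group) == 1:
                next_round.append(group[0])  # bye: advances unopposed
            else:
                next_round.append(group[judge(group)])
        pool = next_round
    return pool[0]

# Toy judge standing in for an LLM comparator: prefer the longest summary.
winner = recursive_tournament_voting(
    ["short", "a medium summary", "the longest candidate summary", "tiny"],
    judge=lambda group: max(range(len(group)), key=lambda i: len(group[i])),
    group_size=2,
)
print(winner)  # → "the longest candidate summary"
```

Small-group comparisons keep each judging prompt bounded regardless of how many rollouts are generated, which is why the tournament is recursive rather than a single all-pairs ranking.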

Cached at: 04/23/26, 07:47 AM

Source: https://huggingface.co/papers/2604.16529




Similar Articles

Claude Code: Best practices for agentic coding

Anthropic Engineering

This article outlines best practices for using Claude Code, an agentic coding environment by Anthropic. It emphasizes managing context windows, providing verification criteria for code, and separating exploration from execution to improve performance.

Consider running a bigger quant if possible

Reddit r/LocalLLaMA

A user reports that switching from a highly-compressed IQ4_XS quant to the larger IQ4_NL_XL quant of Qwen 3.6 dramatically improves agentic-coding accuracy, despite lower tok/s, urging others to favor bigger quants when VRAM allows.

KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit

Hacker News Top

A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.