Scaling Test-Time Compute for Agentic Coding
Summary
A test-time scaling framework for agentic coding that compresses rollout trajectories into structured summaries and uses recursive voting/PDR to boost Claude-4.5-Opus to 77.6% on SWE-Bench Verified.
Source: https://huggingface.co/papers/2604.16529
Abstract
Test-time scaling framework for agentic coding uses compact trajectory representations and recursive voting/parallel-distill-refine methods to improve long-horizon task performance.
Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that can be directly compared, ranked, or refined. Long-horizon coding agents violate this premise: each attempt produces an extended trajectory of actions, observations, errors, and partial progress taken by the agent. In this setting, the main challenge is no longer generating more attempts, but representing prior experience in a form that can be effectively selected from and reused. We propose a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. Our framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal trace details. This representation enables two complementary forms of inference-time scaling. For parallel scaling, we introduce Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons. For sequential scaling, we adapt Parallel-Distill-Refine (PDR) to the agentic setting by conditioning new rollouts on summaries distilled from prior attempts. Our method consistently improves the performance of frontier coding agents across SWE-Bench Verified and Terminal-Bench v2.0. For example, with our method Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and from 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). Our results suggest that test-time scaling for long-horizon agents is fundamentally a problem of representation, selection, and reuse.
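The tournament structure behind RTV can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `compare_group` judge stands in for whatever comparison the authors use (e.g. an LLM prompted to pick the most promising rollout summary in a small group), and the group size and scoring here are assumptions for the toy example.

```python
import random

def rtv_select(summaries, compare_group, group_size=4, seed=0):
    """Recursive Tournament Voting (sketch): repeatedly partition the
    candidate rollout summaries into small groups, keep each group's
    winner as chosen by `compare_group`, and recurse on the winners
    until a single summary remains."""
    rng = random.Random(seed)
    pool = list(summaries)
    while len(pool) > 1:
        rng.shuffle(pool)  # randomize group composition each round
        winners = []
        for i in range(0, len(pool), group_size):
            group = pool[i:i + group_size]
            # In practice this would be a judge (e.g. an LLM) comparing
            # the structured summaries; here it is an arbitrary callable.
            winners.append(compare_group(group))
        pool = winners
    return pool[0]

# Toy judge: prefer the summary with the highest mock score
# (a hypothetical stand-in for an LLM's pairwise/group preference).
def pick_best(group):
    return max(group, key=lambda s: s["score"])

candidates = [{"id": i, "score": i % 7} for i in range(16)]
best = rtv_select(candidates, pick_best)
```

With 16 candidates and groups of 4, this runs two tournament rounds (16 → 4 → 1). The appeal over a single flat vote is that the judge only ever compares a handful of summaries at a time, which keeps each comparison's context small even when the rollout population is large.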
Similar Articles
Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows
UCSC-led team reveals that coding agents (GPT-5.4, Claude Opus 4.6) exploit public test labels under user pressure, introduces AgentPressureBench with 34 tasks and 1326 trajectories showing 403 exploitative runs, and demonstrates prompt-based mitigation cuts exploitation from 100% to 8.3%.
Claude Code: Best practices for agentic coding
This article outlines best practices for using Claude Code, an agentic coding environment by Anthropic. It emphasizes managing context windows, providing verification criteria for code, and separating exploration from execution to improve performance.
Consider running a bigger quant if possible
A user reports that switching from a highly-compressed IQ4_XS quant to the larger IQ4_NL_XL quant of Qwen 3.6 dramatically improves agentic-coding accuracy, despite lower tok/s, urging others to favor bigger quants when VRAM allows.
BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models
This paper introduces BitCal-TTS, a runtime controller that improves accuracy and reduces premature halting in quantized reasoning models by calibrating confidence signals during test-time scaling.
KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit
A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.