Scaling Test-Time Compute for Agentic Coding
Summary
A test-time scaling framework for agentic coding that compresses rollout trajectories into structured summaries and uses recursive voting/PDR to boost Claude-4.5-Opus to 77.6% on SWE-Bench Verified.
Source: https://huggingface.co/papers/2604.16529
Abstract
Test-time scaling framework for agentic coding uses compact trajectory representations and recursive voting/parallel-distill-refine methods to improve long-horizon task performance.
Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that can be directly compared, ranked, or refined. Long-horizon coding agents violate this premise: each attempt produces an extended trajectory of actions, observations, errors, and partial progress taken by the agent. In this setting, the main challenge is no longer generating more attempts, but representing prior experience in a form that can be effectively selected from and reused. We propose a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. Our framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal trace details. This representation enables two complementary forms of inference-time scaling. For parallel scaling, we introduce Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons. For sequential scaling, we adapt Parallel-Distill-Refine (PDR) to the agentic setting by conditioning new rollouts on summaries distilled from prior attempts. Our method consistently improves the performance of frontier coding agents across SWE-Bench Verified and Terminal-Bench v2.0. For example, with our method Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and from 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). Our results suggest that test-time scaling for long-horizon agents is fundamentally a problem of representation, selection, and reuse.
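The tournament structure behind RTV can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `compare_group` judge stands in for whatever comparison the authors use (e.g. an LLM prompted to pick the most promising rollout summary in a small group), and the group size and scoring here are assumptions for the toy example.

```python
import random

def rtv_select(summaries, compare_group, group_size=4, seed=0):
    """Recursive Tournament Voting (sketch): repeatedly partition the
    candidate rollout summaries into small groups, keep each group's
    winner as chosen by `compare_group`, and recurse on the winners
    until a single summary remains."""
    rng = random.Random(seed)
    pool = list(summaries)
    while len(pool) > 1:
        rng.shuffle(pool)  # randomize group composition each round
        winners = []
        for i in range(0, len(pool), group_size):
            group = pool[i:i + group_size]
            # In practice this would be a judge (e.g. an LLM) comparing
            # the structured summaries; here it is an arbitrary callable.
            winners.append(compare_group(group))
        pool = winners
    return pool[0]

# Toy judge: prefer the summary with the highest mock score
# (a hypothetical stand-in for an LLM's pairwise/group preference).
def pick_best(group):
    return max(group, key=lambda s: s["score"])

candidates = [{"id": i, "score": i % 7} for i in range(16)]
best = rtv_select(candidates, pick_best)
```

With 16 candidates and groups of 4, this runs two tournament rounds (16 → 4 → 1). The appeal over a single flat vote is that the judge only ever compares a handful of summaries at a time, which keeps each comparison's context small even when the rollout population is large.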
Similar Articles
Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows
UCSC-led team reveals that coding agents (GPT-5.4, Claude Opus 4.6) exploit public test labels under user pressure, introduces AgentPressureBench with 34 tasks and 1326 trajectories showing 403 exploitative runs, and demonstrates prompt-based mitigation cuts exploitation from 100% to 8.3%.
Claude Code: Best practices for agentic coding
This article outlines best practices for using Claude Code, an agentic coding environment by Anthropic. It emphasizes managing context windows, providing verification criteria for code, and separating exploration from execution to improve performance.
Consider running a bigger quant if possible
A user reports that switching from a highly-compressed IQ4_XS quant to the larger IQ4_NL_XL quant of Qwen 3.6 dramatically improves agentic-coding accuracy, despite lower tok/s, urging others to favor bigger quants when VRAM allows.
BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models
This paper introduces BitCal-TTS, a runtime controller that improves accuracy and reduces premature halting in quantized reasoning models by calibrating confidence signals during test-time scaling.
KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit
A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.