To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair

Hugging Face Daily Papers 06/25/26, 12:00 AM Papers

program-repair llm-agents code-execution cost-effectiveness empirical-study swe-bench agent-traces

Summary

This paper empirically analyzes the cost-effectiveness of code execution in LLM-based program repair agents, finding that execution is used heavily but often indiscriminately, and that restricting execution can save significant cost with minimal impact on repair success.

LLM-based agents for program repair are increasingly built on a "generate-run-revise" paradigm, iteratively executing tests to evaluate and refine patches. This execution-based approach has become standard practice in state-of-the-art systems. However, executions can be time-consuming and expensive, yet their impact on these agents remains underexplored. In this paper, we conduct a two-stage empirical study over execution behavior in LLM-based program repair. To characterize execution behavior at scale, we first analyze 7,745 agent traces from SWE-bench leaderboard submissions. Second, we evaluate 3,000 end-to-end repair attempts across 200 SWE-bench instances and three agents (Claude Code, Codex, and the open-source OpenCode) under four execution paradigms, which allows for a fine-grained comparison of performance and cost. Our analysis reveals three key observations: (1) Code execution is used across all agents and models analyzed, with an average of 8.8 test runs per task. Execution behavior varies substantially across agents and models, with frequency ranging from 2 to 19 per task, and late-stage executions consistently achieve higher success rates than early-stage ones. (2) Execution restrictions have little effect on repair success: on commercial agents with SOTA models the resolve-rate gap between Prohibited and Unrestricted is only 1.25 percentage points and not statistically significant, while Prohibited saves substantial token and wall-clock cost. (3) Execution benefit is concentrated rather than uniform. These patterns suggest that current agents apply execution indiscriminately, paying its cost on instances where it provides little benefit. Execution, therefore, should be treated as a resource with an explicit cost-benefit tradeoff, not a default capability.

Original Article

Cached at: 06/29/26, 06:04 PM

Source: https://huggingface.co/papers/2606.26978

Abstract

LLM-based program repair agents frequently use execution-based testing but show inconsistent efficiency, with execution costs outweighing benefits in many cases.

LLM-based agents for program repair are increasingly built on a “generate-run-revise” paradigm, iteratively executing tests to evaluate and refine patches. This execution-based approach has become standard practice in state-of-the-art systems. However, executions can be time-consuming and expensive, yet their impact on these agents remains underexplored. In this paper, we conduct a two-stage empirical study over execution behavior in LLM-based program repair. To characterize execution behavior at scale, we first analyze 7,745 agent traces from SWE-bench leaderboard submissions. Second, we evaluate 3,000 end-to-end repair attempts across 200 SWE-bench instances and three agents (Claude Code, Codex, and the open-source OpenCode) under four execution paradigms, which allows for a fine-grained comparison of performance and cost. Our analysis reveals three key observations: (1) Code execution is used across all agents and models analyzed, with an average of 8.8 test runs per task. Execution behavior varies substantially across agents and models, with frequency ranging from 2 to 19 per task, and late-stage executions consistently achieve higher success rates than early-stage ones. (2) Execution restrictions have little effect on repair success: on commercial agents with SOTA models the resolve-rate gap between Prohibited and Unrestricted is only 1.25 percentage points and not statistically significant, while Prohibited saves substantial token and wall-clock cost. (3) Execution benefit is concentrated rather than uniform. These patterns suggest that current agents apply execution indiscriminately, paying its cost on instances where it provides little benefit. Execution, therefore, should be treated as a resource with an explicit cost-benefit tradeoff, not a default capability.
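The comparison in observation (2) can be illustrated with a minimal sketch. It assumes per-instance resolve outcomes and token counts are available for two paradigms run on the same instances. The abstract does not say which significance test the authors used, so the exact McNemar test here is an assumption chosen because the outcomes are paired. All function names and all data values are hypothetical and are not taken from the paper.

```python
from math import comb

def resolve_rate(outcomes):
    """Fraction of instances resolved; outcomes is a list of booleans."""
    return sum(outcomes) / len(outcomes)

def mcnemar_exact_p(a, b):
    """Two-sided exact McNemar p-value for paired boolean outcomes.

    Assumption: this test is a stand-in, not the paper's stated method.
    """
    # Only discordant pairs (resolved under exactly one paradigm) matter.
    only_a = sum(1 for x, y in zip(a, b) if x and not y)
    only_b = sum(1 for x, y in zip(a, b) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def fraction_saved(baseline_costs, restricted_costs):
    """Relative cost reduction of the restricted run versus the baseline."""
    return 1 - sum(restricted_costs) / sum(baseline_costs)

# Hypothetical toy data for 8 instances (NOT results from the paper).
unrestricted = [True, True, True, False, True, True, False, True]
prohibited = [True, True, False, False, True, True, False, True]
unrestricted_tokens = [200, 100, 150, 250, 50, 100, 100, 50]
prohibited_tokens = [100, 50, 100, 150, 50, 50, 50, 50]

gap_pp = (resolve_rate(unrestricted) - resolve_rate(prohibited)) * 100
p_value = mcnemar_exact_p(unrestricted, prohibited)
saved = fraction_saved(unrestricted_tokens, prohibited_tokens)
print(f"gap = {gap_pp:.2f} pp, p = {p_value:.3f}, tokens saved = {saved:.0%}")
```

In a setup like this, a small gap in percentage points combined with a large p-value and a sizable cost reduction is the pattern the abstract describes for Prohibited versus Unrestricted.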
