PaT: Planning-after-Trial for Efficient Test-Time Code Generation
Summary
This paper introduces PaT (Planning-after-Trial), an adaptive test-time computation strategy for code generation that reduces inference costs by approximately 69% while maintaining performance comparable to larger models.
View Cached Full Text
Cached at: 05/11/26, 06:56 AM
# PaT: Planning-after-Trial for Efficient Test-Time Code Generation
Source: [https://arxiv.org/html/2605.07248](https://arxiv.org/html/2605.07248)
Youngsik Yoon1, Sungjae Lee1, Seockbean Song2, Siwei Wang3, Wei Chen3, Jungseul Ok1,2 1Department of Computer Science and Engineering, POSTECH, South Korea 2Graduate School of Artificial Intelligence, POSTECH, South Korea 3Microsoft Research Asia, Beijing, China \{ysyoon97, sungjaelee25, shinebobo, jungseul\.ok\}@postech\.ac\.kr, \{siweiwang, weic\}@microsoft\.com
###### Abstract
Beyond training\-time optimization, scaling test\-time computation has emerged as a key paradigm to extend the reasoning capabilities of Large Language Models \(LLMs\)\. However, most existing methods adopt a rigid Planning\-before\-Trial \(PbT\) policy, which inefficiently allocates test\-time compute by incurring planning overhead even on directly solvable problems\. We propose Planning\-after\-Trial \(PaT\), an adaptive policy for code generation that invokes a planner only upon verification failure\. This adaptive policy naturally enables a heterogeneous model configuration: a cost\-efficient model handles generation attempts, while a powerful model is reserved for targeted planning interventions\. Empirically, across multiple benchmarks and model families, our approach significantly advances the cost\-performance Pareto frontier\. Notably, our heterogeneous configuration achieves performance comparable to a large homogeneous model while reducing inference cost by approximately 69%\.
PaT: Planning\-after\-Trial for Efficient Test\-Time Code Generation
Youngsik Yoon1, Sungjae Lee1, Seockbean Song2, Siwei Wang3, Wei Chen3, Jungseul Ok1,2††thanks:Corresponding author\.1Department of Computer Science and Engineering, POSTECH, South Korea2Graduate School of Artificial Intelligence, POSTECH, South Korea3Microsoft Research Asia, Beijing, China\{ysyoon97, sungjaelee25, shinebobo, jungseul\.ok\}@postech\.ac\.kr,\{siweiwang, weic\}@microsoft\.com
Figure 1:Cost \(↓\\downarrow\) \- Pass@1 \(↑\\uparrow\) trade\-off across diverse sizes\.We plot the average Pass@1 across foundational benchmarks \(HumanEvalChenet al\.\([2021](https://arxiv.org/html/2605.07248#bib.bib3)\), MBPPAustinet al\.\([2021](https://arxiv.org/html/2605.07248#bib.bib4)\)and their EvalPlusLiuet al\.\([2023](https://arxiv.org/html/2605.07248#bib.bib19)\)variants\) against the relative inference cost\. PaT consistently advances the Pareto frontier across model sizes \(Qwen34B,8B,14B and 32B\{\}\_\{\\text\{4B,8B,14B and 32B\}\}\)\. Detailed results are provided in Section[4\.1\.1](https://arxiv.org/html/2605.07248#S4.SS1.SSS1)and Table[1](https://arxiv.org/html/2605.07248#S4.T1)\.Figure 2:Comparison with existing methods and PaT \(Ours\)\.Problems are grouped by difficulty \(easy, mid, and hard\)\. Boxes denote the key components: a Generator \(creates code\), a Planner \(decomposes the problem\), and an Executor \(verifies the solution\)\.\(a\) Standarddirectly generate and execute; works on easy problems but often fails on harder ones\.\(b\) FunCoder \(PbT\)always plans first, so planning cost is paid even when unnecessary\.\(c\) PaT \(Ours\)trials first and plans only on failure; solves easy problems cheaply and hard problems adaptively\.## 1Introduction
Large Language Models \(LLMs\) have achieved remarkable success in code generation, driven by massive model and data scaling\(Chenet al\.,[2021](https://arxiv.org/html/2605.07248#bib.bib3); Liet al\.,[2022](https://arxiv.org/html/2605.07248#bib.bib22); OpenAI,[2023](https://arxiv.org/html/2605.07248#bib.bib23)\)\. Beyond scaling, conventional approaches focused on training\-time strategies, such as Supervised Fine\-Tuning \(SFT\)\(Roziereet al\.,[2023](https://arxiv.org/html/2605.07248#bib.bib46); Luoet al\.,[2023](https://arxiv.org/html/2605.07248#bib.bib47)\)and Reinforcement Learning from Human Feedback \(RLHF\)\(Ouyanget al\.,[2022](https://arxiv.org/html/2605.07248#bib.bib45)\)\. Complementing these efforts, recent research scales test\-time computation to handle complex algorithmic logic\(Weiet al\.,[2022](https://arxiv.org/html/2605.07248#bib.bib10); Yaoet al\.,[2023](https://arxiv.org/html/2605.07248#bib.bib11); Snellet al\.,[2025](https://arxiv.org/html/2605.07248#bib.bib33)\)\. In particular, advancements have been made to simulate human\-like coding behavior by decomposing complex problems into manageable sub\-problems\(Chenet al\.,[2024a](https://arxiv.org/html/2605.07248#bib.bib1); Leet al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib40)\)rather than generating the solution directly\.
However, existing test\-time scaling code generation methods often incur prohibitive inference cost\. This is primarily due to their heavy reliance on explicit planning modules\(Chenet al\.,[2024a](https://arxiv.org/html/2605.07248#bib.bib1); Leet al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib40)\)or iterative repair and debugging loops\(Shinnet al\.,[2023](https://arxiv.org/html/2605.07248#bib.bib27); Zhonget al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib2)\), which consume significant computational resources\. As illustrated in Figure[1](https://arxiv.org/html/2605.07248#S0.F1), previous state\-of\-the\-art methods such as FunCoder\(Chenet al\.,[2024a](https://arxiv.org/html/2605.07248#bib.bib1)\)consume excessive test\-time resources, often making a small model \(Qwen34B\{\}\_\{\\text\{4B\}\}\) with this method more expensive than a much larger model \(Qwen332B\{\}\_\{\\text\{32B\}\}\) with standard inference, even with inferior performance\. This indicates that previous methods underperform standard scaling in their cost\-performance ratio\.
This inefficiency stems from ignoring the intrinsic difficulty distribution of the workload\. As shown in Figure[1](https://arxiv.org/html/2605.07248#S0.F1), a small model \(Qwen34B\{\}\_\{\\text\{4B\}\}\) already achieves 76% Pass@1 using standard inference, suggesting that a substantial fraction of tasks can be solved without complex interventions\. Only a minority of hard instances truly require expensive planning\. Nonetheless, many prior approaches\(Chenet al\.,[2024a](https://arxiv.org/html/2605.07248#bib.bib1); Leet al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib40); Jianget al\.,[2024b](https://arxiv.org/html/2605.07248#bib.bib26)\)follow a rigid “Planning\-before\-Trial” \(PbT\) policy, wasting resources by indiscriminately applying expensive planning to these tractable problems\.
To address this, we introduce “Planning\-after\-Trial” \(PaT\), an adaptive method that inverts PbT\. Unlike PbT, which enforces planning for every problem, PaT initiates with standard inference and leverages execution feedback to verify this solution\. As illustrated in Figure[2](https://arxiv.org/html/2605.07248#S0.F2), expensive decomposition is triggered only when the initial attempt fails\. By strictly limiting planning to problems that demonstrably require it, PaT concentrates computational resources precisely on hard instances while keeping simple ones cheap\. Consequently, PaT significantly lowers the average inference cost while improving performance, thereby advancing the cost\-performance Pareto frontier as shown in Figure[1](https://arxiv.org/html/2605.07248#S0.F1)\.
The efficiency of PaT is further amplified within a heterogeneous model configuration\. This enables a strategic division of labor, where a cost\-efficient model is assigned to the high\-volume “generator” role and a powerful one to the infrequent “planner” role\. This alignment reflects the intuition that planning requiring more complex reasoning \(e\.g\.,problem decomposition\) is better handled by stronger models, while the resulting decomposed sub\-tasks can often be implemented proficiently by smaller models\.
We empirically validate our claims across diverse model families and a comprehensive suite of benchmarks\. Our results first establish that PaT consistently outperforms the state\-of\-the\-art PbT baseline, FunCoder\(Chenet al\.,[2024a](https://arxiv.org/html/2605.07248#bib.bib1)\), achieving higher performance across all model scales while using, on average, only 60% of the inference cost\. Furthermore, we show that the heterogeneous model configuration further enhances this efficiency\. For instance, a small generator model guided by a powerful planner achieves competitive performance, with a <1% gap to a larger model in the homogeneous setting, while incurring only 31% of the cost\. These findings confirm that PaT’s adaptive policy establishes a new, superior cost\-performance frontier\. Collectively, our work demonstrates that scaling test\-time compute need not be uniform, proving that unnecessary overhead can be effectively eliminated while maintaining or even enhancing performance\.
Our contributions are summarized as follows:
- •We propose Planning\-after\-Trial \(PaT\), an adaptive policy that invokes planning only upon verification failure to avoid unnecessary overhead\.
- •We pair PaT with a heterogeneous model configuration, combining small models for generation and large models for planning to maximize efficiency\.
- •We provide a comprehensive evaluation across model families and benchmarks, showing that our approach consistently advances the cost\-performance Pareto frontier\.
## 2Related work
##### Code Generation with LLMs\.
To advance code generation, research has prioritized enhancing intrinsic capabilities via post\-training\(Dubeyet al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib20); Zhuet al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib21)\)and self\-improvement frameworks like rStar\-Coder\(Liuet al\.,[2025](https://arxiv.org/html/2605.07248#bib.bib42)\), SPIN\(Chenet al\.,[2024b](https://arxiv.org/html/2605.07248#bib.bib44)\), and SCoRE\(Kumaret al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib43)\)\. In parallel, significant attention has shifted toward scaling test\-time reasoning and computation given models\. Key approaches utilize few\-shot prompting\(Brownet al\.,[2020](https://arxiv.org/html/2605.07248#bib.bib17)\)and diversity sampling\(Chenet al\.,[2021](https://arxiv.org/html/2605.07248#bib.bib3)\), often refined by test\-driven verification\(Chenet al\.,[2022a](https://arxiv.org/html/2605.07248#bib.bib6); Shinnet al\.,[2023](https://arxiv.org/html/2605.07248#bib.bib27)\), iterative refinement\(Wanget al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib36); Zhanget al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib37); Zhonget al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib2)\), or retrieval from external bases\(Liet al\.,[2022](https://arxiv.org/html/2605.07248#bib.bib22); Zhanget al\.,[2023](https://arxiv.org/html/2605.07248#bib.bib30)\)\. To tackle higher complexity, recent systems have further incorporated hierarchical decomposition\(Chenet al\.,[2024a](https://arxiv.org/html/2605.07248#bib.bib1); Leet al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib40)\)and natural language planning\(Wanget al\.,[2025a](https://arxiv.org/html/2605.07248#bib.bib35)\)\. While these approaches improve robustness, they often apply sophisticated interventions uniformly across problems, incurring unnecessary overhead for simple cases\. Notably, AdaCoder\(Zhuet al\.,[2025](https://arxiv.org/html/2605.07248#bib.bib34)\)dynamically selects refinement strategies but still relies on iterative repair rather than structural decomposition\. In contrast, PaT optimizes test\-time computation, reserving complex interventions strictly for verified failures using execution feedback as a precise trigger\.
##### Decomposition and Structured Reasoning\.
To tackle complex problems beyond the reach of direct generation, divide\-and\-conquer strategies have evolved from implicit reasoning steps in Chain\-of\-Thought\(Weiet al\.,[2022](https://arxiv.org/html/2605.07248#bib.bib10)\)and Tree\-of\-Thoughts\(Yaoet al\.,[2023](https://arxiv.org/html/2605.07248#bib.bib11)\)to explicit planning paradigms\. This evolution spans diverse domains, ranging from sequential solving in Least\-to\-Most prompting\(Zhouet al\.,[2023](https://arxiv.org/html/2605.07248#bib.bib25)\)and mathematical reasoning with Program of Thoughts\(Chenet al\.,[2022b](https://arxiv.org/html/2605.07248#bib.bib9)\), to using a society of models for complex tasks\(Junejaet al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib31)\)\. In the code domain, structured pipelines like Self\-Plan\(Jianget al\.,[2024b](https://arxiv.org/html/2605.07248#bib.bib26)\)and FunCoder\(Chenet al\.,[2024a](https://arxiv.org/html/2605.07248#bib.bib1)\)explicitly decompose problems before implementation\. However, these approaches predominantly adhere to a rigid Planning\-before\-Trial \(PbT\) policy, where planning is invoked unconditionally prior to execution\. While recent work attempts to mitigate this inefficiency by learning an adaptive policy for invoking planning\(Paglieriet al\.,[2025](https://arxiv.org/html/2605.07248#bib.bib32)\), such approaches introduce additional training complexity and auxiliary models\. In contrast, PaT offers a simpler, learning\-free alternative: it utilizes the definitive execution signal as a reactive trigger, avoiding both the universal cost of PbT and the overhead of training separate policy models\.
## 3PaT: Planning after Trial
##### Problem Formulation\.
We model a code generation instance as a specificationxx\(e\.g\., natural language description\) and seek a programℱ\\mathcal\{F\}that satisfiesxx\. LetMGM\_\{G\}denote a generator model andMPM\_\{P\}a planner model\. For any specificationxx, the generatorMGM\_\{G\}produces a direct candidate implementation, denotedf^\\hat\{f\}\. The plannerMPM\_\{P\}, when invoked, produces a decomposition plan consisting of a new top\-level implementation,f^\\hat\{f\}, and a set of subproblem specifications\{xi\}\\\{x\_\{i\}\\\}\. The final programℱ\\mathcal\{F\}is constructed byComposethat merges an implementationf^\\hat\{f\}with a set of verified helper functionsHH\. For verification, we construct a test set𝒯\(x\)=\{\(inj,outj\)\}j=1t\\mathcal\{T\}\(x\)=\\\{\(\\mathrm\{in\}\_\{j\},\\mathrm\{out\}\_\{j\}\)\\\}\_\{j=1\}^\{t\}and execute programs in a sandboxed Python runtime\. We evaluateℱ\\mathcal\{F\}with
Evaluate\(ℱ,𝒯\(x\)\)=∑j=1t𝟏\[ℱ\(inj\)=outj\],\\displaystyle\\\!\\\!\\\!\\textsc\{Evaluate\}\\big\(\\mathcal\{F\},\\mathcal\{T\}\(x\)\\big\)\\\!=\\\!\\\!\\sum\_\{j=1\}^\{t\}\\mathbf\{1\}\\\!\\big\[\\mathcal\{F\}\(\\mathrm\{in\}\_\{j\}\)\\\!=\\\!\\mathrm\{out\}\_\{j\}\\big\],\\\!\(1\)i\.e\.,the number of tests passed\.
Algorithm 1Planning after Trial \(PaT\)1:Input:Problem
xx, Set of Helper Functions
HH,
2:Generator
MGM\_\{G\}, Planner
MPM\_\{P\}
3:Output:Generated Program
ℱ\\mathcal\{F\}
4:PaT:
5:
f^←MG\(x;H\)\\hat\{f\}\\leftarrow M\_\{G\}\(x;H\)⊳\\trianglerightBest\-of\-NNtrial
6:
𝒯\(x\)←GenerateTests\(x\)\\mathcal\{T\}\(x\)\\leftarrow\\textsc\{GenerateTests\}\(x\)
7:
p←Evaluate\(Compose\(f^,H\),𝒯\(x\)\)p\\leftarrow\\textsc\{Evaluate\}\(\\textsc\{Compose\}\(\\hat\{f\},H\),\\mathcal\{T\}\(x\)\)
8:if
p=\|𝒯\(x\)\|p=\|\\mathcal\{T\}\(x\)\|then
9:
ℱ←Compose\(f^,H\)\\mathcal\{F\}\\leftarrow\\textsc\{Compose\}\(\\hat\{f\},H\)
10:return
ℱ\\mathcal\{F\}⊳\\trianglerightTrial success
11:endif
12:whileTruedo⊳\\trianglerightUntil success or plateau
13:
f^prev,pprev←f^,p\\hat\{f\}\_\{\\mathrm\{prev\}\},p\_\{\\mathrm\{prev\}\}\\leftarrow\\hat\{f\},p
14:
f^,\{xi\}←MP\(x;H\)\\hat\{f\},\\\{x\_\{i\}\\\}\\leftarrow M\_\{P\}\(x;H\)⊳\\trianglerightPlanner invoked
15:foreach
xix\_\{i\}not implemented in
HHdo
16:
f^i←PaT\(xi,H,MG,MP\)\\hat\{f\}\_\{i\}\\leftarrow\\textsc\{PaT\}\(x\_\{i\},H,M\_\{G\},M\_\{P\}\)
17:
H←H∪\{f^i\}H\\leftarrow H\\cup\\\{\\hat\{f\}\_\{i\}\\\}
18:endfor
19:
p←Evaluate\(Compose\(f^,H\),𝒯\(x\)\)p\\leftarrow\\textsc\{Evaluate\}\(\\textsc\{Compose\}\(\\hat\{f\},H\),\\mathcal\{T\}\(x\)\)
20:if
p=\|𝒯\(x\)\|p=\|\\mathcal\{T\}\(x\)\|then⊳\\trianglerightFinal success
21:return
Compose\(f^,H\)\\textsc\{Compose\}\(\\hat\{f\},H\)
22:endif
23:if
p≤pprevp\\leq p\_\{\\mathrm\{prev\}\}then⊳\\trianglerightPlateau condition
24:return
Compose\(f^prev,H\)\\textsc\{Compose\}\(\\hat\{f\}\_\{\\mathrm\{prev\}\},H\)
25:endif
26:endwhile
### 3\.1Adaptive Planning via Failure Feedback
PaT executes a simple yet effective failure\-triggered policy\. Given the problem specificationxx, the generatorMGM\_\{G\}first attempts a Best\-of\-NNtrial by sampling multiple candidate solutions\. We verify these candidates on a test set𝒯\(x\)\\mathcal\{T\}\(x\)\. If any candidate passes all tests, the pipeline terminates and returns that solution immediately without incurring any decomposition overhead\. However, if all candidates fail, this consistent failure serves as a strong signal that the problem complexity exceeds the generator’s direct reasoning capability, rather than being a simple implementation error\. Only in this case, PaT invokes the plannerMPM\_\{P\}to decomposexxinto subproblems\{xi\}\\\{x\_\{i\}\\\}, which are solved recursively using the same trial\-first policy\.
When all subproblems have verified solutions,i\.e\.,Evaluate\(f^i,𝒯\(xi\)\)=\|𝒯\(xi\)\|\\textsc\{Evaluate\}\(\\hat\{f\}\_\{i\},\\mathcal\{T\}\(x\_\{i\}\)\)=\|\\mathcal\{T\}\(x\_\{i\}\)\|for allii, PaT composes them into a parent\-level candidate and re\-verifies against𝒯\(x\)\\mathcal\{T\}\(x\)\. If this composite solution passes, the process terminates successfully\. Otherwise, the planner is invoked again on the original problem specificationxx\. For this subsequent planning attempt, the planner is provided with the original problem and the set of previously successful subsolutions\{f^i\}\\\{\\hat\{f\}\_\{i\}\\\}as additional context\. This allows the planner to make a more informed decomposition decision while reusing already successful components, further enhancing cost\-efficiency\.
### 3\.2Test Cases and Verification
Relying solely on provided public test cases is often insufficient\. Benchmarks typically provide only basic scenarios that lack critical edge cases\(Chenet al\.,[2021](https://arxiv.org/html/2605.07248#bib.bib3)\), or in some cases like MBPP, provide none at all\(Austinet al\.,[2021](https://arxiv.org/html/2605.07248#bib.bib4)\)\. To ensure robust operation in such environments, we explicitly generate test cases𝒯\(x\)\\mathcal\{T\}\(x\), producing an average of 6\.7 test cases per problem\.
Qwen34BQwen38BQwen314BQwen332B0202040406060808010010010\.410\.4%14\.514\.5%6\.26\.2%6\.16\.1%14\.614\.6%8\.98\.9%8\.98\.9%12\.212\.2%11\.611\.6%20\.920\.9%14\.614\.6%15\.215\.2%63\.463\.4%55\.755\.7%70\.370\.3%66\.566\.5%Percentage \(%\)0123\+Figure 3:Distribution of incorrect test cases \(false positives\) per problem on HumanEval\.The percentage of problems where the generated test cases contain 0, 1, 2, or 3\+ incorrect test cases, given an average of 6\.7 generated test cases per problem\.However, prior works note that generated test cases can be noisy or incorrect\(Chenet al\.,[2022a](https://arxiv.org/html/2605.07248#bib.bib6); Wanget al\.,[2025b](https://arxiv.org/html/2605.07248#bib.bib38); Prasadet al\.,[2025](https://arxiv.org/html/2605.07248#bib.bib39)\)\. To mitigate this, most previous works adopt consensus\-based scoring to identify the most robust output\(Chenet al\.,[2022a](https://arxiv.org/html/2605.07248#bib.bib6),[2024a](https://arxiv.org/html/2605.07248#bib.bib1)\)\. This approach evaluates a pool of candidate outputs against generated test inputs, selecting the one that achieves the broadest consensus\. However, such consensus\-based scoring is ill\-suited for our framework, as its objective of selecting the most likely correct outputs contrasts with PaT’s need for a definitive binary signal \(success or failure\) to trigger its adaptive escalation to planning\.
We therefore separate the success criterion from the termination rule\. The success criterion is strict: a candidate is accepted only if it passes all test cases in𝒯\(x\)\\mathcal\{T\}\(x\)\. This reliance on a strict signal is feasible because, as shown in Figure[3](https://arxiv.org/html/2605.07248#S3.F3), generated test cases are completely error\-free for the majority of problems \(e\.g\., 63\.4% on HumanEval with Qwen34B\{\}\_\{\\text\{4B\}\}\)\. Furthermore, even in noisy instances, errors are typically limited to 1\-2 false positives, with severe noise \(3\+ errors\) concentrated in a small minority\. To handle these residual noisy test cases and prevent unproductive recursion, we track the pass countp\(t\)=Evaluate\(f^,𝒯\(x\)\)p^\{\(t\)\}=\\textsc\{Evaluate\}\(\\hat\{f\},\\mathcal\{T\}\(x\)\)at iterationttand apply a plateau rule: ifp\(t\)≤p\(t−1\)p^\{\(t\)\}\\leq p^\{\(t\-1\)\}then halt and return the best\-known solution\. We find this heuristic to be practically effective, as it prevents the model from overfitting to potential false positives by terminating the search when performance saturates\.
### 3\.3Heterogeneous Model Configuration
To further enhance efficiency, we adopt a heterogeneous model configuration that assigns distinct models to the two key roles inherent in PaT: the generator and the planner\. These roles place different demands on model capability\. The generator’s task, which is to produce a solution for a well\-scoped and often manageable subproblem, can be effectively handled by a cost\-efficient small language model \(sLM\)\. In contrast, the planner’s task, which is to understand the nuances of a complex specification and propose a helpful decomposition, benefits from the advanced reasoning of a high\-performance large language model \(LLM\)\. Building on this observation, we instantiate PaT with an sLM as the generator and an LLM as the planner\.
However, achieving efficiency requires more than simply “using a smaller model as the generator\.” While decomposition is less expensive than full generation, it still incurs a planning token cost\. We must therefore navigate a critical trade\-off: an overly weak sLM triggers frequent failures, necessitating repeated interventions by the planner, which raises the total cost\. Conversely, an overly strong sLM reduces the need for planning but increases the baseline cost of every trial, eroding the benefits of heterogeneity\. Therefore, maximizing efficiency requires empirically tuning the generator selection to strike a balance between minimizing baseline costs and avoiding excessive planning overhead\. To provide a rigorous foundation for these observations, we provide a formal analysis of this trade\-off in Appendix[A](https://arxiv.org/html/2605.07248#A1)\.
## 4Experiments
In this section, we evaluate PaT in terms of both performance and cost efficiency\. We conduct experiments on two settings: a homogeneous setting \(Section[4\.1](https://arxiv.org/html/2605.07248#S4.SS1)\), and a heterogeneous setting \(Section[4\.2](https://arxiv.org/html/2605.07248#S4.SS2)\)\. Below we describe the experimental setup used in our evaluation, including benchmarks, LLMs, evaluation metrics, and baselines\.
##### Benchmarks\.
We evaluate on established code generation benchmarks\.HumanEval\(Chenet al\.,[2021](https://arxiv.org/html/2605.07248#bib.bib3)\)andMBPP\(Austinet al\.,[2021](https://arxiv.org/html/2605.07248#bib.bib4)\)measure foundational code generation\. For both, we use EvalPlus\(Liuet al\.,[2023](https://arxiv.org/html/2605.07248#bib.bib19)\)to obtain expanded and more robust unit tests \(i\.e\.,HumanEval\+andMBPP\+\)\. For challenging tasks, we usexCodeEval\(Khanet al\.,[2023](https://arxiv.org/html/2605.07248#bib.bib5)\)benchmark, partitioning instances into four categories \(Easy, Mid, Hard, and Expert\) following FunCoder’s rating scheme\.
##### LLMs\.
We primarily adopt the Qwen3 family \(4B, 8B, 14B, and 32B\)\(Yanget al\.,[2025](https://arxiv.org/html/2605.07248#bib.bib8)\)as our main testbed, since it provides fine\-grained scaling steps within a single model family\. To evaluate the cross\-family generalization of our method, we additionally conduct experiments on Llama\-3\.1\-8B\-Instruct\(Dubeyet al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib20)\)and DeepSeek\-Coder\-V2\-Lite\-Instruct \(≈\\approx16B\)\(Zhuet al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib21)\)\. For the cost analysis, we calculate inference costs based on public market pricing as detailed in Appendix[E](https://arxiv.org/html/2605.07248#A5)\.
##### Evaluation Metric\.
We report two primary metrics:Pass@1andLLM cost\. Pass@1 is our primary performance metric, reflecting the ability to generate a correct solution on the first attempt, which aligns with real\-world efficiency goals\. LLM cost for each experiment is computed directly from the per\-token prices in Table[7](https://arxiv.org/html/2605.07248#A4.T7)\.
##### Baselines\.
To validate the effectiveness of PaT, we compare it against the following state\-of\-the\-art algorithms for scaling test\-time computation:
- •Standard\(Brownet al\.,[2020](https://arxiv.org/html/2605.07248#bib.bib17)\)uses few\-shot prompting, giving several in\-context examples to the model and asking it to generate a solution; we follow the original prompting protocol to measure baseline quality without post\-hoc filtering or decomposition\.
- •Best\-of\-N\(Chenet al\.,[2021](https://arxiv.org/html/2605.07248#bib.bib3)\)generates multiple candidates \(N=5\) and selects the best one; we include it to distinguish the gains of our adaptive planning from those achievable simply by scaling inference compute\.
- •CodeT\(Chenet al\.,[2022a](https://arxiv.org/html/2605.07248#bib.bib6)\)performs consensus\-based selection by executing solutions against automatically generated test cases; it serves as a representative benchmark for the efficacy of test\-driven verification methods\.
- •FunCoder\(Chenet al\.,[2024a](https://arxiv.org/html/2605.07248#bib.bib1)\)implements a static divide\-and\-conquer pipeline with a fixed hierarchy, solves subproblems, and assembles the final solution; we compare against it to highlight the efficiency gains of our adaptive, failure\-driven policy over rigid Planning\-before\-Trial strategies\.
For the sake of reproducibility and clarity, a detailed account of our implementation is included in Appendix[C](https://arxiv.org/html/2605.07248#A3)–[G](https://arxiv.org/html/2605.07248#A7)\.
### 4\.1Homogeneous Setting
In this section, we compare results from running PaT using a homogeneous language model against baselines\. Evaluations are performed on the foundational benchmarks \(HumanEval and MBPP; Sec\.[4\.1\.1](https://arxiv.org/html/2605.07248#S4.SS1.SSS1)\) and the challenging benchmark \(xCodeEval; Sec\.[4\.1\.2](https://arxiv.org/html/2605.07248#S4.SS1.SSS2)\)\.
Table 1:Performance and cost comparison on foundational benchmarks\.We report Pass@1 across four benchmarks \(HumanEval, HumanEval\+, MBPP, and MBPP\+\) and their average \(‘Avg\.’\)\. Cost is normalized relative to the Standard baseline \(1\.00\)\. Best results are inbold\.#### 4\.1\.1Foundational Benchmarks
We evaluate PaT and baselines on the foundational benchmarks \(HumanEval and MBPP\) and their EvalPlus variants\. As shown in Table[1](https://arxiv.org/html/2605.07248#S4.T1), PaT consistently outperforms all baselines across all tested model families and scales\. The effectiveness of this policy is best illustrated by its ability to dramatically improve the capability of smaller models\. For instance, PaT enables Qwen34B\{\}\_\{\\text\{4B\}\}to achieve an average Pass@1 of 83\.13%, which is remarkably similar to the 83\.35% achieved by Standard on Qwen332B\{\}\_\{\\text\{32B\}\}, a model eight times larger\. This demonstrates that PaT is not merely an incremental improvement but a powerful policy that fundamentally alters the performance curve of a given model\.
Beyond its superior performance, PaT also shows critical advantages in cost\-efficiency\. As detailed in Table[1](https://arxiv.org/html/2605.07248#S4.T1), PaT consistently achieves higher performance than FunCoder, the previous state\-of\-the\-art hierarchical method, while consuming, on average 60% of its cost\. This efficiency gain can be explained by examining Standard performance\. On these foundational benchmarks, Standard solves, on average, 76% of problems directly, demonstrating that a large majority of tasks do not require decomposition\. FunCoder’s rigid PbT policy is forced to pay a cost of planning on all of these problems, incurring its expensive planning overhead even when it is not needed\. In contrast, PaT only invokes its planner on the much smaller fraction of problems \(24%\) that actually fail the initial, cheaper trial\. By avoiding this universal planning overhead, PaT establishes a new and more effective cost\-performance frontier for code generation\.
Table 2:Performance breakdown on xCodeEval by difficulty\.We report Pass@1 scores for each difficulty category \(Easy, Mid, Hard, and Expert\) and the overall score\.EasyMidHardExpert020204040606080801001006\.86\.86\.019\.519\.525\.425\.424\.624\.638\.638\.644\.844\.871\.271\.2DifficultyProportion \(%\)FunCoderPaT\(a\)Decomposition rate024681002020404060608080100100Cost \(10−210^\{\-2\}$\)Proportion \(%\)EasyMidHardExpert\(b\)Cost \(FunCoder\)024681002020404060608080100100Cost \(10−210^\{\-2\}$\)Proportion \(%\)EasyMidHardExpert\(c\)Cost \(PaT\)
Figure 4:Adaptive decomposition probability and cost analysis for Qwen34B\{\}\_\{\\text\{4B\}\}on xCodeEval\.\(a\) Decomposition rate of FunCoder and PaT by problem difficulty\. Per\-difficulty cost distribution \(solid line\) and average cost \(vertical dashed line\) on \(b\) FunCoder and \(c\) PaT\.\.\.\.
\-Ifasinglefunctionistoohardtosolve,youcandecomposeitintomultiplesmallerfunctions\.
\.\.\.
\(a\)Decomposition prompt for FunCoder
\.\.\.
\-Thepreviousattempttodirectimplementthetargetfunctionisfailed,indicatingitsoveralllogicmightbetoocomplextoimplementdirectly\.
\-Therefore,youmustdecomposeitintomultiplesmaller,manageablehelperfunctions\.
\.\.\.
\(b\)Planning prompt for PaT
Figure 5:Comparison of planning prompts\.Excerpts from planning prompts for \(a\) FunCoder and \(b\) PaT\.
#### 4\.1\.2Challenging Benchmark
On the challenging xCodeEval benchmark, PaT again achieves higher performance than all baselines across all model variants\. However, we observe an interesting cost dynamic: for smaller models like Qwen34B\{\}\_\{\\text\{4B\}\}and Llama3\.18B\{\}\_\{\\text\{8B\}\}, PaT incurs a higher cost than FunCoder\. This is not a sign of inefficiency but a direct consequence of PaT’s adaptive strategy\. Less capable models fail more frequently on xCodeEval’s difficult problems, and PaT correctly interprets these failures as signals to invest in decomposition\. This strategic escalation, while costly, is precisely what allows these smaller models to overcome challenges where rigid policies like FunCoder’s fall short\.
For a deeper analysis, we examine the behavior of Qwen34B\{\}\_\{\\text\{4B\}\}in Figure[4](https://arxiv.org/html/2605.07248#S4.F4), which reveals a fundamental limitation of pre\-emptive planning\. While both PaT and FunCoder increase decomposition for harder problems, PaT exhibits a much more dynamic response compared to FunCoder’s more restrained increase\. Remarkably, despite this aggressive decomposition, the cost difference remains marginal because FunCoder incurs a planning overhead on 100% of problems, whereas PaT invokes the planner exclusively on failed instances\. This reinforces a key insight: planning is a high\-cost resource that must be allocated reactively, relying on the signal of failure to distinguish necessary investment from wasteful overhead\.
This behavioral divergence stems from the distinct prompt strategies shown in Figure[5](https://arxiv.org/html/2605.07248#S4.F5)\. While FunCoder relies on the model’s subjective assessment of complexity, PaT leverages a grounded feedback signal derived from trial failure to issue an imperative command\. This explicit directive proves far more effective at triggering necessary decomposition than a conditional instruction, ensuring the model does not underestimate the task’s difficulty\.
### 4\.2Heterogeneous Setting
Our discussion in Section[3\.3](https://arxiv.org/html/2605.07248#S3.SS3)established the potential for enhanced cost\-efficiency in a heterogeneous model configuration\. To empirically validate this, we conduct experiments on the foundational benchmarks, pairing a powerful planner \(Qwen332B\{\}\_\{\\text\{32B\}\}\) with a series of smaller generator models\. The results, presented in Table[3](https://arxiv.org/html/2605.07248#S4.T3), provide strong support for our analysis\. Pairing a Qwen38B\{\}\_\{\\text\{8B\}\}generator with a Qwen332B\{\}\_\{\\text\{32B\}\}planner achieves competitive performance \(an average Pass@1 of87\.39%87\.39\\%\), with a <1% gap to the homogeneous Qwen332B\{\}\_\{\\text\{32B\}\}, while reducing the relative cost to just0\.310\.31\.
The superior cost\-benefit trade\-off of the heterogeneous approach is visualized in Figure[6](https://arxiv.org/html/2605.07248#S4.F6)\. In this graph, the slope of the curve represents the performance return on additional cost\. The heterogeneous model configurations exhibit a significantly steeper slope, demonstrating that upgrading only the planner is a highly capital\-efficient strategy, yielding substantial performance gains for a marginal increase in cost\. This confirms our central hypothesis: because PaT invokes the planner infrequently, reserving a powerful model for this critical but rare task is the most cost\-effective way to enhance the overall system’s capability\.
Table 3:Performance \(Pass@1\) and relative cost resultsfor homogeneous and heterogeneous PaT configurations\. The cost is normalized relative to the homogeneous Qwen332B\{\}\_\{\\text\{32B\}\}model, which is set to1\.001\.00\.0\.20\.20\.60\.611828286869090\(4B, 4B\)\(8B, 8B\)\(14B, 14B\)\(32B, 32B\)\(4B, 32B\)\(8B, 32B\)\(14B, 32B\)CostPass@1 \(%\)HomogeneousHeterogeneousFigure 6:Trade\-off curve for heterogeneous configurations\.Corresponding to Table[3](https://arxiv.org/html/2605.07248#S4.T3), this plot visualizes the Pass@1 performance vs\. relative cost\. Labels denote \(Generator, Planner\) sizes\. Employing a powerful Qwen332B\{\}\_\{\\text\{32B\}\}planner \(dashed line\) yields a significant performance gain for a marginal increase in cost compared to Homogeneous configurations \(solid line\)\.
## 5Conclusion
In this work, we introduced Planning\-after\-Trial \(PaT\), an adaptive policy that addresses the critical challenge of high inference costs in LLM\-based code generation\. By inverting the conventional Planning\-before\-Trial \(PbT\) policy, PaT avoids unnecessary planning overhead on simple problems and strategically allocates computational resources only when a direct attempt fails\. We demonstrated that this adaptive nature is uniquely synergistic with a heterogeneous configuration, allowing for the decoupling of planning and generation costs\. Comprehensive experiments confirmed that PaT outperforms existing state\-of\-the\-art baselines in homogeneous settings across diverse models and benchmarks\. Furthermore, we validated that our heterogeneous model configuration establishes a superior cost\-performance frontier compared to homogeneous setups\. By enabling principled test\-time computation scaling, PaT provides a practical and effective framework for building more scalable and cost\-efficient code generation systems\.
## Limitations
Relying on self\-generated test cases inherently introduces noise and potential inefficiency arising from unnecessary planning for correctly solved problems\. While integrating emerging high\-fidelity generation methods\(Wanget al\.,[2025b](https://arxiv.org/html/2605.07248#bib.bib38); Prasadet al\.,[2025](https://arxiv.org/html/2605.07248#bib.bib39)\)offers synergistic potential, achieving a perfectly noise\-free oracle remains unlikely; thus, robust mechanisms like our plateau heuristic remain essential for practical deployment\. Regarding computational resources, our heterogeneous configuration entails a higher static memory footprint\. However, this design aligns with efficient architectures like Mixture\-of\-Experts \(MoE\)\(Jianget al\.,[2024a](https://arxiv.org/html/2605.07248#bib.bib41)\), prioritizing the significant reduction of average dynamic inference cost and user\-facing latency for the majority of tasks\.
## Acknowledgements
This work was supported by Institute of Information & Communications Technology Planning & Evaluation \(IITP\) grants funded by the Korea government \(MSIT\) \(RS\-2019\-II191906, Artificial Intelligence Graduate School Program \(POSTECH\); RS\-2024\-00457882, AI Research Hub Project; IITP\-2026\-RS\-2024\-00437866, Information Technology Research Center \(ITRC\); RS\-2026\-25511821, Development of Personalized Media Service Recommendation and Generative Technology\)\. It was also supported by the MSIT under the Global Research Support Program in the Digital Field \(RS\-2024\-00436680\) supervised by the IITP\. This project is supported by Microsoft Research Asia\.
## AI writing assistance
We utilized large language models solely for the purpose of refining the clarity, grammar, and style of the manuscript under the scope of “Assistance purely with the language of the paper\.” All scientific claims, experimental designs, and empirical results presented in this paper are the original work of the authors\.
## References
- J\. Austin, A\. Odena, M\. Nye, M\. Bosma, H\. Michalewski, D\. Dohan, E\. Jiang, C\. Cai, M\. Terry, Q\. Le,et al\.\(2021\)Program synthesis with large language models\.arXiv preprint arXiv:2108\.07732\.Cited by:[Figure 1](https://arxiv.org/html/2605.07248#S0.F1),[§3\.2](https://arxiv.org/html/2605.07248#S3.SS2.p1.1),[§4](https://arxiv.org/html/2605.07248#S4.SS0.SSS0.Px1.p1.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.Advances in neural information processing systems33,pp\. 1877–1901\.Cited by:[Appendix F](https://arxiv.org/html/2605.07248#A6.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1),[1st item](https://arxiv.org/html/2605.07248#S4.I1.i1.p1.1)\.
- B\. Chen, F\. Zhang, A\. Nguyen, D\. Zan, Z\. Lin, J\. Lou, and W\. Chen \(2022a\)Codet: code generation with generated tests\.arXiv preprint arXiv:2207\.10397\.Cited by:[Appendix F](https://arxiv.org/html/2605.07248#A6.SS0.SSS0.Px3.p1.3),[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2605.07248#S3.SS2.p2.1),[3rd item](https://arxiv.org/html/2605.07248#S4.I1.i3.p1.1)\.
- J\. Chen, H\. Tang, Z\. Chu, Q\. Chen, Z\. Wang, M\. Liu, and B\. Qin \(2024a\)Divide\-and\-conquer meets consensus: unleashing the power of functions in code generation\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,Cited by:[Appendix F](https://arxiv.org/html/2605.07248#A6.SS0.SSS0.Px4.p1.3),[Appendix G](https://arxiv.org/html/2605.07248#A7.p1.1),[§1](https://arxiv.org/html/2605.07248#S1.p1.1),[§1](https://arxiv.org/html/2605.07248#S1.p2.2),[§1](https://arxiv.org/html/2605.07248#S1.p3.1),[§1](https://arxiv.org/html/2605.07248#S1.p6.1),[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2605.07248#S3.SS2.p2.1),[4th item](https://arxiv.org/html/2605.07248#S4.I1.i4.p1.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. D\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman,et al\.\(2021\)Evaluating large language models trained on code\.arXiv preprint arXiv:2107\.03374\.Cited by:[Appendix D](https://arxiv.org/html/2605.07248#A4.SS0.SSS0.Px1.p1.1),[Appendix F](https://arxiv.org/html/2605.07248#A6.SS0.SSS0.Px2.p1.3),[Figure 1](https://arxiv.org/html/2605.07248#S0.F1),[§1](https://arxiv.org/html/2605.07248#S1.p1.1),[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2605.07248#S3.SS2.p1.1),[2nd item](https://arxiv.org/html/2605.07248#S4.I1.i2.p1.1),[§4](https://arxiv.org/html/2605.07248#S4.SS0.SSS0.Px1.p1.1)\.
- W\. Chen, X\. Ma, X\. Wang, and W\. W\. Cohen \(2022b\)Program of thoughts prompting: disentangling computation from reasoning for numerical reasoning tasks\.arXiv preprint arXiv:2211\.12588\.Cited by:[Appendix D](https://arxiv.org/html/2605.07248#A4.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Chen, Y\. Deng, H\. Yuan, K\. Ji, and Q\. Gu \(2024b\)Self\-play fine\-tuning converts weak language models to strong language models\.arXiv preprint arXiv:2401\.01335\.Cited by:[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan,et al\.\(2024\)The llama 3 herd of models\.arXiv e\-prints,pp\. arXiv–2407\.Cited by:[Appendix E](https://arxiv.org/html/2605.07248#A5.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2605.07248#S4.SS0.SSS0.Px2.p1.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Roux, A\. Mensch, B\. Savary, C\. Bamford, D\. S\. Chaplot, D\. d\. l\. Casas, E\. B\. Hanna, F\. Bressand,et al\.\(2024a\)Mixtral of experts\.arXiv preprint arXiv:2401\.04088\.Cited by:[Limitations](https://arxiv.org/html/2605.07248#Sx1.p1.1)\.
- X\. Jiang, Y\. Dong, L\. Wang, Z\. Fang, Q\. Shang, G\. Li, Z\. Jin, and W\. Jiao \(2024b\)Self\-planning code generation with large language models\.ACM Transactions on Software Engineering and Methodology33\(7\),pp\. 1–30\.Cited by:[§1](https://arxiv.org/html/2605.07248#S1.p3.1),[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px2.p1.1)\.
- G\. Juneja, S\. Dutta, and T\. Chakraborty \(2024\)LM2\\texttt\{LM\}^\{\\texttt\{2\}\}: A simple society of language models solves complex reasoning\.arXiv preprint arXiv:2404\.02255\.Cited by:[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\)Scaling laws for neural language models\.arXiv preprint arXiv:2001\.08361\.Cited by:[Assumption 3](https://arxiv.org/html/2605.07248#Thmassumption3.p1.2.2)\.
- M\. A\. M\. Khan, M\. S\. Bari, X\. L\. Do, W\. Wang, M\. R\. Parvez, and S\. Joty \(2023\)Xcodeeval: a large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval\.arXiv preprint arXiv:2303\.03004\.Cited by:[Appendix D](https://arxiv.org/html/2605.07248#A4.SS0.SSS0.Px4.p1.1),[§4](https://arxiv.org/html/2605.07248#S4.SS0.SSS0.Px1.p1.1)\.
- A\. Kumar, V\. Zhuang, R\. Agarwal, Y\. Su, J\. D\. Co\-Reyes, A\. Singh, K\. Baumli, S\. Iqbal, C\. Bishop, R\. Roelofs,et al\.\(2024\)Training language models to self\-correct via reinforcement learning\.arXiv preprint arXiv:2409\.12917\.Cited by:[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Le, H\. Chen, A\. Saha, A\. Gokul, D\. Sahoo, and S\. Joty \(2024\)CodeChain: towards modular code generation through chain of self\-revisions with representative sub\-modules\.InICLR,Cited by:[§1](https://arxiv.org/html/2605.07248#S1.p1.1),[§1](https://arxiv.org/html/2605.07248#S1.p2.2),[§1](https://arxiv.org/html/2605.07248#S1.p3.1),[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Li, D\. Choi, J\. Chung, N\. Kushman, J\. Schrittwieser, R\. Leblond, T\. Eccles, J\. Keeling, F\. Gimeno, A\. Dal Lago,et al\.\(2022\)Competition\-level code generation with alphacode\.Science378\(6624\),pp\. 1092–1097\.Cited by:[§1](https://arxiv.org/html/2605.07248#S1.p1.1),[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Liu, C\. S\. Xia, Y\. Wang, and L\. Zhang \(2023\)Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation\.Advances in Neural Information Processing Systems36,pp\. 21558–21572\.Cited by:[Appendix D](https://arxiv.org/html/2605.07248#A4.SS0.SSS0.Px3.p1.1),[Figure 1](https://arxiv.org/html/2605.07248#S0.F1),[§4](https://arxiv.org/html/2605.07248#S4.SS0.SSS0.Px1.p1.1)\.
- Y\. Liu, L\. L\. Zhang, Y\. Zhu, B\. Dong, X\. Zhou, N\. Shang, F\. Yang, and M\. Yang \(2025\)RStar\-coder: scaling competitive code reasoning with a large\-scale verified dataset\.arXiv preprint arXiv:2505\.21297\.Cited by:[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Luo, C\. Xu, P\. Zhao, Q\. Sun, X\. Geng, W\. Hu, C\. Tao, J\. Ma, Q\. Lin, and D\. Jiang \(2023\)Wizardcoder: empowering code large language models with evol\-instruct\.arXiv preprint arXiv:2306\.08568\.Cited by:[§1](https://arxiv.org/html/2605.07248#S1.p1.1)\.
- OpenAI \(2023\)GPT\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[§1](https://arxiv.org/html/2605.07248#S1.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.Advances in neural information processing systems35,pp\. 27730–27744\.Cited by:[§1](https://arxiv.org/html/2605.07248#S1.p1.1)\.
- D\. Paglieri, B\. Cupiał, J\. Cook, U\. Piterbarg, J\. Tuyls, E\. Grefenstette, J\. N\. Foerster, J\. Parker\-Holder, and T\. Rocktäschel \(2025\)Learning when to plan: efficiently allocating test\-time compute for llm agents\.arXiv preprint arXiv:2509\.03581\.Cited by:[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Prasad, E\. Stengel\-Eskin, J\. C\. Chen, Z\. Khan, and M\. Bansal \(2025\)Learning to generate unit tests for automated debugging\.arXiv preprint arXiv:2502\.01619\.Cited by:[§3\.2](https://arxiv.org/html/2605.07248#S3.SS2.p2.1),[Limitations](https://arxiv.org/html/2605.07248#Sx1.p1.1)\.
- B\. Roziere, J\. Gehring, F\. Gloeckle, S\. Sootla, I\. Gat, X\. E\. Tan, Y\. Adi, J\. Liu, R\. Sauvestre, T\. Remez,et al\.\(2023\)Code llama: open foundation models for code\.arXiv preprint arXiv:2308\.12950\.Cited by:[§1](https://arxiv.org/html/2605.07248#S1.p1.1)\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.Advances in Neural Information Processing Systems36,pp\. 8634–8652\.Cited by:[Appendix D](https://arxiv.org/html/2605.07248#A4.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.07248#S1.p2.2),[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1)\.
- C\. V\. Snell, J\. Lee, K\. Xu, and A\. Kumar \(2025\)Scaling llm test\-time compute optimally can be more effective than scaling parameters for reasoning\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.07248#S1.p1.1)\.
- E\. Z\. Wang, F\. Cassano, C\. Wu, Y\. Bai, W\. Song, V\. Nath, Z\. Han, S\. M\. Hendryx, S\. Yue, and H\. Zhang \(2025a\)Planning in natural language improves llm search for code generation\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Wang, Z\. Liu, S\. Wang, G\. Cui, N\. Ding, Z\. Liu, and G\. Yu \(2024\)INTERVENOR: prompting the coding ability of large language models with the interactive chain of repair\.InACL \(Findings\),Cited by:[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1)\.
- W\. Wang, C\. Yang, Z\. Wang, Y\. Huang, Z\. Chu, D\. Song, L\. Zhang, A\. R\. Chen, and L\. Ma \(2025b\)Testeval: benchmarking large language models for test case generation\.InFindings of the Association for Computational Linguistics: NAACL 2025,pp\. 3547–3562\.Cited by:[§3\.2](https://arxiv.org/html/2605.07248#S3.SS2.p2.1),[Limitations](https://arxiv.org/html/2605.07248#Sx1.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§1](https://arxiv.org/html/2605.07248#S1.p1.1),[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[Appendix E](https://arxiv.org/html/2605.07248#A5.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2605.07248#S4.SS0.SSS0.Px2.p1.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\)Tree of thoughts: deliberate problem solving with large language models\.Advances in neural information processing systems36,pp\. 11809–11822\.Cited by:[§1](https://arxiv.org/html/2605.07248#S1.p1.1),[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px2.p1.1)\.
- F\. Zhang, B\. Chen, Y\. Zhang, J\. Keung, J\. Liu, D\. Zan, Y\. Mao, J\. Lou, and W\. Chen \(2023\)Repocoder: repository\-level code completion through iterative retrieval and generation\.arXiv preprint arXiv:2303\.12570\.Cited by:[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Zhang, W\. Cheng, Y\. Wu, and W\. Hu \(2024\)A pair programming framework for code generation via multi\-plan exploration and feedback\-driven refinement\.InASE,Cited by:[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Zhong, Z\. Wang, and J\. Shang \(2024\)Debug like a human: a large language model debugger via verifying runtime execution step by step\.InFindings of the Association for Computational Linguistics ACL 2024,pp\. 851–870\.Cited by:[§1](https://arxiv.org/html/2605.07248#S1.p2.2),[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1)\.
- D\. Zhou, N\. Sch"arli, L\. Hou, J\. Wei, N\. Scales, X\. Wang, D\. Schuurmans, C\. Cui, O\. Bousquet, Q\. Le,et al\.\(2023\)Least\-to\-most prompting enables complex reasoning in large language models\.arXiv preprint arXiv:2205\.10625\.Cited by:[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px2.p1.1)\.
- Q\. Zhu, D\. Guo, Z\. Shao, D\. Yang, P\. Wang, R\. Xu, Y\. Wu, Y\. Li, H\. Gao, S\. Ma,et al\.\(2024\)Deepseek\-coder\-v2: breaking the barrier of closed\-source models in code intelligence\.arXiv preprint arXiv:2406\.11931\.Cited by:[Appendix E](https://arxiv.org/html/2605.07248#A5.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2605.07248#S4.SS0.SSS0.Px2.p1.1)\.
- Y\. Zhu, C\. Liu, X\. He, X\. Ren, Z\. Liu, R\. Pan, and H\. Zhang \(2025\)AdaCoder: an adaptive planning and multi\-agent framework for function\-level code generation\.arXiv preprint arXiv:2504\.04220\.Cited by:[§2](https://arxiv.org/html/2605.07248#S2.SS0.SSS0.Px1.p1.1)\.
## Appendix ATheoretical Analysis
In this section, we provide a formal theoretical analysis supporting the cost\-efficiency of the Heterogeneous Model Configuration in Section[3\.3](https://arxiv.org/html/2605.07248#S3.SS3)\. We first establish the notations and cost model, then prove the existence of an efficient heterogeneous policy \(Theorem[1](https://arxiv.org/html/2605.07248#Thmtheorem1)\)\. We subsequently extend our analysis to consider the asymptotic cost for problems of large complexitykkand derive the optimal cost for the generator sLM under scaling laws\.
##### Preliminaries and Notations\.
We denote the problem complexity byk\>0k\>0\. We consider a set of modelsM∈\{L,s\}M\\in\\\{L,s\\\}, whereLLrepresents a large, capable model \(used as the Planner\) andssrepresents a smaller, cost\-efficient model \(used as the Generator\)\. For each modelMM, letpMp\_\{M\}denote its problem\-solving capability \(the maximum complexity it can solve in a single attempt\) andcMc\_\{M\}denote its unit inference cost\. Consistent with the heterogeneous setting, we assume0<ps<pL0<p\_\{s\}<p\_\{L\}and0<cs<cL0<c\_\{s\}<c\_\{L\}\.
Based on these notations, we define the core assumptions that characterize our cost model\.
###### Assumption 1\(Generation cost\)\.
A generation call byMGM\_\{G\}on a problem of complexitykkincurs
CostGeneration=\{kcMG,k≤pMG\(success\),pMGcMG,k\>pMG\(failed\)\.\\displaystyle\\text\{Cost\}\_\{\\text\{Generation\}\}=\\begin\{cases\}k\\,c\_\{M\_\{G\}\},\\\!\\\!\\\!\\\!&k\\leq p\_\{M\_\{G\}\}\\;\\text\{\(success\)\},\\\\\[2\.0pt\] p\_\{M\_\{G\}\}\\,c\_\{M\_\{G\}\},\\\!\\\!\\\!\\\!&k\>p\_\{M\_\{G\}\}\\;\\text\{\(failed\)\}\.\\end\{cases\}\(2\)
###### Assumption 2\(Planning cost\)\.
A Planner\-producednn\-way plan maps a problem of sizekktonnindependent subproblems of sizekn\\frac\{k\}\{n\}and incurs a fixed per\-subproblem overheadDMPD\_\{M\_\{P\}\}, so that
CostPlanning=nDMP\.\\displaystyle\\text\{Cost\}\_\{\\text\{Planning\}\}\\;=\\;nD\_\{M\_\{P\}\}\.\(3\)
To formally connect a model’s capabilitypMp\_\{M\}and its unit costcMc\_\{M\}, we adopt a standard assumption from the literature on neural scaling laws\.
###### Assumption 3\(Scaling law\)\.
We assume that a model’s capabilitypMp\_\{M\}and its unit costcMc\_\{M\}are related by a power\-law scaling relation, consistent with prior work\(Kaplanet al\.,[2020](https://arxiv.org/html/2605.07248#bib.bib14)\):
pM=αcMβ,α\>0,1≥β\>0\.\\displaystyle p\_\{M\}=\\alpha\\,c\_\{M\}^\{\\beta\},\\qquad\\alpha\>0,\\ 1\\geq\\beta\>0\.\(4\)The constraint0<β≤10<\\beta\\leq 1reflects the principle of diminishing returns\.
With these definitions established, we now formally state the condition under which the heterogeneous policy outperforms the homogeneous one\.
###### Theorem 1\(Existence of an efficient sLM capability\)\.
Letk∼Unif\(0,pL\]k\\sim\\mathrm\{Unif\}\(0,p\_\{L\}\]\. Under Assumptions[1](https://arxiv.org/html/2605.07248#Thmassumption1),[2](https://arxiv.org/html/2605.07248#Thmassumption2), and[3](https://arxiv.org/html/2605.07248#Thmassumption3), if the total planning overhead satisfies
nDL<\(12−1n2\)pLcL,\\displaystyle nD\_\{L\}\\;<\\;\\Big\(\\tfrac\{1\}\{2\}\-\\tfrac\{1\}\{n^\{2\}\}\\Big\)\\,p\_\{L\}c\_\{L\},\(5\)then there existsps∈\(0,pL\)p\_\{s\}\\in\(0,p\_\{L\}\)for which PaT with a Heterogeneous Model Configuration has strictly lower expected cost than an LLM\-only policy\.
###### Proof\.
First, for the homogeneous LLM\-only policy, the cost for any problemk≤pLk\\leq p\_\{L\}iskcLkc\_\{L\}\. The expected cost is therefore:
𝔼\[CostHomogeneous\]\\displaystyle\\mathbb\{E\}\[\\text\{Cost\}\_\{\\text\{Homogeneous\}\}\]=1pL∫0pLkcL𝑑k=1pLpL2cL2=pLcL2\\displaystyle=\\frac\{1\}\{p\_\{L\}\}\\int\_\{0\}^\{p\_\{L\}\}kc\_\{L\}\\;dk=\\frac\{1\}\{p\_\{L\}\}\\frac\{p\_\{L\}^\{2\}c\_\{L\}\}\{2\}=\\frac\{p\_\{L\}c\_\{L\}\}\{2\}\(6\)
Next, for the Heterogeneous policy, we consider two cases based on the capability of the small model,psp\_\{s\}\. Ifps≥kp\_\{s\}\\geq k, the generatorsssucceeds, incurring a cost ofkcskc\_\{s\}\. Else,i\.e\.,ps<kp\_\{s\}<k, the generatorssfails \(costpscsp\_\{s\}c\_\{s\}\), the plannerLLis invoked \(costnDLnD\_\{L\}\), and then the generatorsssolves thennsubproblems of sizekn\\frac\{k\}\{n\}\. Considerps≥pLnp\_\{s\}\\geq\\frac\{p\_\{L\}\}\{n\}, we can calculate the cost in this casepscs\+nDL\+kcsp\_\{s\}c\_\{s\}\+nD\_\{L\}\+kc\_\{s\}\. The expected cost is the sum of the integrals over these two ranges, divided bypLp\_\{L\}:
𝔼\[CostHeterogeneous\]\\displaystyle\\mathbb\{E\}\[\\text\{Cost\}\_\{\\text\{Heterogeneous\}\}\]=1pL\(∫0pskcs𝑑k\+∫pspLpscs\+nDL\+kcsdk\)\\displaystyle=\\frac\{1\}\{p\_\{L\}\}\\biggl\(\\int\_\{0\}^\{p\_\{s\}\}\\\!\\\!kc\_\{s\}\\;dk\+\\\!\\int\_\{p\_\{s\}\}^\{p\_\{L\}\}\\\!\\\!p\_\{s\}c\_\{s\}\+\\\!nD\_\{L\}\+\\\!kc\_\{s\}\\;dk\\biggr\)\(7\)=1pL\(ps2cs2\+pscs\(pL−ps\)\+nDL\(pL−ps\)\\displaystyle=\\frac\{1\}\{p\_\{L\}\}\\biggl\(\\frac\{p\_\{s\}^\{2\}c\_\{s\}\}\{2\}\+p\_\{s\}c\_\{s\}\(p\_\{L\}\-p\_\{s\}\)\+nD\_\{L\}\(p\_\{L\}\-p\_\{s\}\)\+\(pL2−ps2\)cs2\)\\displaystyle\\;\\;\\;\\;\\;\\;\\;\\;\\;\\;\\;\\;\\;\+\\frac\{\(p\_\{L\}^\{2\}\-p\_\{s\}^\{2\}\)c\_\{s\}\}\{2\}\\biggr\)\(8\)=pLcs2\+pL−pspL\(pscs\+nDL\)\\displaystyle=\\frac\{p\_\{L\}c\_\{s\}\}\{2\}\+\\frac\{p\_\{L\}\-p\_\{s\}\}\{p\_\{L\}\}\\left\(p\_\{s\}c\_\{s\}\+nD\_\{L\}\\right\)\(9\)
To analyze a concrete scenario, let’s assume we can choose an sLM such that its capability is a fraction of the LLM’s,i\.e\.,ps=pLnp\_\{s\}=\\frac\{p\_\{L\}\}\{n\}\. Using the scaling law from Assumption[3](https://arxiv.org/html/2605.07248#Thmassumption3)\(pM=αcMβp\_\{M\}=\\alpha c\_\{M\}^\{\\beta\}\), we can relate the costs:cs=\(psα\)1/β=\(pLnα\)1/β=n−1/βcLc\_\{s\}=\\left\(\\frac\{p\_\{s\}\}\{\\alpha\}\\right\)^\{1/\\beta\}=\\left\(\\frac\{p\_\{L\}\}\{n\\alpha\}\\right\)^\{1/\\beta\}=n^\{\-1/\\beta\}c\_\{L\}\.
Substituting these into the heterogeneous cost equation[9](https://arxiv.org/html/2605.07248#A1.E9):
𝔼\[CostHeterogeneous\]\\displaystyle\\mathbb\{E\}\[\\text\{Cost\}\_\{\\text\{Heterogeneous\}\}\]\(10\)=pLcs2\+pL−pspL\(pscs\+nDL\)\\displaystyle=\\frac\{p\_\{L\}c\_\{s\}\}\{2\}\+\\frac\{p\_\{L\}\-p\_\{s\}\}\{p\_\{L\}\}\\left\(p\_\{s\}c\_\{s\}\+nD\_\{L\}\\right\)\(11\)=pLn−1/βcL2\+n−1n\(pLnn−1/βcL\+nDL\)\\displaystyle=\\frac\{p\_\{L\}n^\{\-1/\\beta\}c\_\{L\}\}\{2\}\+\\frac\{n\-1\}\{n\}\\left\(\\frac\{p\_\{L\}\}\{n\}n^\{\-1/\\beta\}c\_\{L\}\+nD\_\{L\}\\right\)\(12\)=\(12\+n−1n2\)n−1/βpLcL\+\(n−1\)DL\\displaystyle=\\left\(\\frac\{1\}\{2\}\+\\frac\{n\-1\}\{n^\{2\}\}\\right\)n^\{\-1/\\beta\}p\_\{L\}c\_\{L\}\+\(n\-1\)D\_\{L\}\(13\)
For the heterogeneous policy to be strictly more efficient, the difference must be positive:
0\\displaystyle 0\>𝔼\[CostHeterogeneous−CostHomogeneous\]\\displaystyle\>\\mathbb\{E\}\[\\text\{Cost\}\_\{\\text\{Heterogeneous\}\}\-\\text\{Cost\}\_\{\\text\{Homogeneous\}\}\]\(14\)=\(n−1\)DL\+\(12\+n−1n2\)n−1βpLcL−pLcL2\.\\displaystyle=\\\!\(n\-1\)D\_\{L\}\\\!\+\\\!\\left\(\\\!\\frac\{1\}\{2\}\\\!\+\\\!\\frac\{n\-1\}\{n^\{2\}\}\\\!\\right\)n^\{\-\\frac\{1\}\{\\beta\}\}p\_\{L\}c\_\{L\}\\\!\-\\\!\\frac\{p\_\{L\}c\_\{L\}\}\{2\}\.\(15\)
Rearranging to solve for the planning overheadnDnDgives the condition stated in the theorem:
nDL\\displaystyle nD\_\{L\}<nn−1\(pLcL2−\(12\+n−1n2\)n−1βpLcL\)\\displaystyle<\\frac\{n\}\{n\-1\}\\\!\\left\(\\\!\\frac\{p\_\{L\}c\_\{L\}\}\{2\}\\\!\-\\\!\\left\(\\\!\\frac\{1\}\{2\}\\\!\+\\\!\\frac\{n\-1\}\{n^\{2\}\}\\\!\\right\)n^\{\-\\frac\{1\}\{\\beta\}\}p\_\{L\}c\_\{L\}\\\!\\right\)\\\!\(16\)=\(n−n1−1β2\(n−1\)−n−1−1β\)pLcL\.\\displaystyle=\\left\(\\frac\{n\-n^\{1\-\\frac\{1\}\{\\beta\}\}\}\{2\(n\-1\)\}\-n^\{\-1\-\\frac\{1\}\{\\beta\}\}\\right\)p\_\{L\}c\_\{L\}\.\(17\)
As a specific intuitive case, if we consider linear scaling whereβ=1\\beta=1, the condition simplifies to:
nDL<\(12−1n2\)pLcL\\displaystyle nD\_\{L\}<\\left\(\\frac\{1\}\{2\}\-\\frac\{1\}\{n^\{2\}\}\\right\)p\_\{L\}c\_\{L\}\(18\)
This shows that as long as the total planning cost is less than the savings achieved by the heterogeneous configuration, a more efficient sLM exists\. ∎
###### Theorem 2\(Asymptotic efficiency of the heterogeneous configuration\)\.
For any sufficiently complex taskkk, the Heterogeneous strategy is asymptotically more cost\-efficient than the Homogeneous strategy, provided that the cost of decomposition satisfies the following condition:
DL<cL−cs1ps−1pL\.\\displaystyle D\_\{L\}<\\frac\{c\_\{L\}\-c\_\{s\}\}\{\\frac\{1\}\{p\_\{s\}\}\-\\frac\{1\}\{p\_\{L\}\}\}\.\(19\)
###### Proof\.
We prove the theorem by analyzing the asymptotic cost of each strategy for a large problem of complexitykk\. LethM=⌈k/pM⌉h\_\{M\}=\\lceil k/p\_\{M\}\\rceilbe the number of recursive division levels for a modelMM\. The total number of division operations is1\+n\+n2\+…\+nhM−1=nhM−1n−11\+n\+n^\{2\}\+\.\.\.\+n^\{h\_\{M\}\-1\}=\\frac\{n^\{h\_\{M\}\}\-1\}\{n\-1\}\. For a largekk, we can approximate it asnhM−1n−1≈kpM\(n−1\)\\frac\{n^\{h\_\{M\}\}\-1\}\{n\-1\}\\approx\\frac\{k\}\{p\_\{M\}\(n\-1\)\}\.
The total cost for a modelMMis the sum of three components: the cumulative cost of failures at each division stepkpM\(n−1\)pMcM\\frac\{k\}\{p\_\{M\}\(n\-1\)\}p\_\{M\}c\_\{M\}, the cumulative cost of decompositionkpM\(n−1\)nDL\\frac\{k\}\{p\_\{M\}\(n\-1\)\}nD\_\{L\}, and the final cost of conquering the sub\-problemskcMkc\_\{M\}\. Summing these components gives the total asymptotic cost:
nkn−1\(cM\+DLpM\)\.\\displaystyle\\frac\{nk\}\{n\-1\}\\left\(c\_\{M\}\+\\frac\{D\_\{L\}\}\{p\_\{M\}\}\\right\)\.\(20\)
For the heterogeneous setting to be more cost\-efficient than the homogeneous one, it requires:
nkn−1\(cs\+DLps\)<nkn−1\(cL\+DLpL\)\.\\displaystyle\\frac\{nk\}\{n\-1\}\\left\(c\_\{s\}\+\\frac\{D\_\{L\}\}\{p\_\{s\}\}\\right\)<\\frac\{nk\}\{n\-1\}\\left\(c\_\{L\}\+\\frac\{D\_\{L\}\}\{p\_\{L\}\}\\right\)\.\(21\)
Thenkn−1\\frac\{nk\}\{n\-1\}term cancels\. Rearranging the remaining terms to solve forDLD\_\{L\}yields the condition stated in the theorem:
DL<cL−cs1ps−1pL\.\\displaystyle D\_\{L\}<\\frac\{c\_\{L\}\-c\_\{s\}\}\{\\frac\{1\}\{p\_\{s\}\}\-\\frac\{1\}\{p\_\{L\}\}\}\.\(22\)
∎
GeneratorPlannerHumanEvalMBPPAvg\.Δ\\DeltaAvg\.costbaseplusbaseplusQwen332B\{\}\_\{\\text\{32B\}\}Qwen332B\{\}\_\{\\text\{32B\}\}93\.9084\.1588\.5786\.8688\.37\-1\.00Qwen314B\{\}\_\{\\text\{14B\}\}Qwen314B\{\}\_\{\\text\{14B\}\}91\.4683\.5485\.1482\.2986\.18\- 2\.190\.47Qwen332B\{\}\_\{\\text\{32B\}\}93\.2985\.9886\.2984\.5787\.53\- 0\.840\.49Qwen38B\{\}\_\{\\text\{8B\}\}Qwen38B\{\}\_\{\\text\{8B\}\}90\.8582\.3285\.1484\.0085\.58\- 2\.790\.25Qwen332B\{\}\_\{\\text\{32B\}\}93\.9085\.3786\.2984\.0087\.39\- 0\.980\.31Qwen34B\{\}\_\{\\text\{4B\}\}Qwen34B\{\}\_\{\\text\{4B\}\}89\.6382\.3282\.2978\.2983\.13\- 5\.240\.14Qwen332B\{\}\_\{\\text\{32B\}\}92\.0784\.7682\.2980\.0084\.78\- 3\.400\.18Table 4:Full results on heterogeneous setting\.We report the average cost relative to Qwen332B\{\}\_\{\\text\{32B\}\}is normalized to 1\.00\.This theorem provides a theoretical basis for the efficiency of heterogeneous model configuration\. It indicates that the heterogeneous configuration becomes more cost\-efficient when the cost of decomposition \(DLD\_\{L\}\) is less than the savings generated by executing the sub\-problems with a more cost\-effective model \(ss\)\. This provides a formal rationale for using the heterogeneous configuration and shows that the PaT policy is a structured approach to allocating computational resources\.
###### Theorem 3\(Optimal Generator cost under scaling laws\)\.
The cost of the optimal small model,cs∗c\_\{s\}^\{\*\}, that minimizes the asymptotic cost of the heterogeneous configuration is given by the following closed\-form solution:
cs∗=min\{\(β⋅DLα\)1β\+1,cL\}\\displaystyle c\_\{s\}^\{\*\}=\\min\\left\\\{\\left\(\\frac\{\\beta\\cdot D\_\{L\}\}\{\\alpha\}\\right\)^\{\\frac\{1\}\{\\beta\+1\}\},c\_\{L\}\\right\\\}\(23\)
###### Proof\.
The asymptotic cost of the heterogeneous configuration is proportional to the coefficientcs\+DLpsc\_\{s\}\+\\frac\{D\_\{L\}\}\{p\_\{s\}\}\. Substituting the scaling law givescs\+DLαcsβc\_\{s\}\+\\frac\{D\_\{L\}\}\{\\alpha c\_\{s\}^\{\\beta\}\}\. To find the minimum, we take the derivative with respect tocsc\_\{s\}and set it to zero:
1−βDLαcs−β−1=0\.\\displaystyle 1\-\\frac\{\\beta D\_\{L\}\}\{\\alpha\}c\_\{s\}^\{\-\\beta\-1\}=0\.\(24\)Solving forcsc\_\{s\}yields the unconstrained optimum\. The final solution is capped atcLc\_\{L\}to respect the problem’s practical constraints\. ∎
This theorem provides a practical, closed\-form solution for the cost of the optimal generator model in the asymptotic regime\. The result serves as a powerful heuristic to approximately estimate the most cost\-effective smaller model to use in real\-world heterogeneous configurations\.
##### Empirical Validation\.
To validate the practical utility of Theorem[3](https://arxiv.org/html/2605.07248#Thmtheorem3), we applied the formula to our HumanEval experimental data using Qwen3 models\. We modeled model capabilityppagainst normalized input token costsc∈\{0\.11,0\.18,0\.35,0\.70\}c\\in\\\{0\.11,0\.18,0\.35,0\.70\\\}using the scaling lawp=αcβp=\\alpha c^\{\\beta\}\. Fitting this to our observed data yieldedα≈1722\.2\\alpha\\approx 1722\.2andβ≈0\.12\\beta\\approx 0\.12, with an average planning costDL≈1270\.73D\_\{L\}\\approx 1270\.73derived from the heterogeneous \(4B\+32B\) experiments\. Substituting these parameters into Theorem[3](https://arxiv.org/html/2605.07248#Thmtheorem3)yields a theoretical optimal costcs∗≈0\.114c\_\{s\}^\{\*\}\\approx 0\.114\. This value is remarkably close to the actual cost of the 4B model \(c=0\.11c=0\.11\)\. While our main experiments \(Figure[6](https://arxiv.org/html/2605.07248#S4.F6)\) suggest the 8B model offers a strong performance\-cost balance, the 4B model is indeed the strictly cost\-optimal generator as predicted\. This alignment confirms that despite simplified assumptions, our theoretical model successfully captures the underlying cost dynamics and serves as a practical guideline for selecting the initial sLM size for hyperparameter search\.
## Appendix BAdditional Results
### B\.1Full version of Table[3](https://arxiv.org/html/2605.07248#S4.T3)
Table[4](https://arxiv.org/html/2605.07248#A1.T4)presents the full results for our heterogeneous setting experiments across all four foundational benchmarks \(HumanEval, HumanEval\+, MBPP, and MBPP\+\)\. In this configuration, we pair a series of smaller generator models with a powerful, fixed planner \(Qwen332B\{\}\_\{\\text\{32B\}\}\)\. The table provides the detailed Pass@1 scores and relative costs for each combination, which were aggregated and visualized in Figure[6](https://arxiv.org/html/2605.07248#S4.F6)in the main body of the paper\.
### B\.2Wall\-Clock Latency Analysis
To complement the token\-cost analysis in the main paper, we measure end\-to\-end wall\-clock latency in an isolated single\-GPU environment to avoid interference from shared resources\. Table[5](https://arxiv.org/html/2605.07248#A2.T5)reports the mean per\-problem latency \(in seconds\) on HumanEval using Qwen3 models\.
Table 5:Mean wall\-clock latency per problem \(seconds\) on HumanEval\.Measured in an isolated environment using Qwen3 models\. Qwen34B\+32B\{\}\_\{\\text\{4B\}\+\\text\{32B\}\}denotes the heterogeneous model configuration using a 4B generator and 32B planner\.PaT incurs roughly 3×\\timesoverhead over Standard, while FunCoder incurs 5×\\times, confirming that PaT’s adaptive policy reduces not only token cost but also real\-world execution time\. Notably, the heterogeneous configuration \(4B\+32B\) achieves a latency of 11\.90s, nearly identical to the 4B\-only PaT \(11\.14s\) and faster than the 32B Standard baseline \(14\.21s\)\. This demonstrates that reserving the larger model exclusively for planning keeps wall\-clock time dominated by the cheaper generator\.
Table 6:Compute ROI across models and methods on foundational benchmarks\.ROI is defined asΔAvg/\(Cost−1\)\\Delta\\text\{Avg\}/\(\\text\{Cost\}\-1\), measuring Pass@1 improvement per unit of additional cost over Standard\. Higher is better\.
### B\.3Compute Return on Investment
To provide a unified metric for the cost\-performance trade\-off, we define the compute return on investment \(ROI\) as the performance gain per unit of additional cost over Standard:
ROI=ΔAvgCost−1,\\text\{ROI\}=\\frac\{\\Delta\\text\{Avg\}\}\{\\text\{Cost\}\-1\},\(25\)whereΔAvg\\Delta\\text\{Avg\}is the average Pass@1 improvement over Standard across benchmarks, and Cost is the normalized cost reported in Table[1](https://arxiv.org/html/2605.07248#S4.T1)\(Standard = 1\)\. A higher ROI indicates a more efficient use of additional compute\. Table[6](https://arxiv.org/html/2605.07248#A2.T6)reports the ROI for all methods\.
PaT achieves an average ROI of 1\.58, approximately 2\.4×\\timeshigher than FunCoder \(0\.67\) and 4\.6×\\timeshigher than CodeT \(0\.34\)\. This confirms that PaT’s adaptive policy consistently delivers superior returns on additional test\-time investment across all model scales and families\.
## Appendix CImplementation Details
##### Generation\.
The trial phase of our PaT policy incorporates a Best\-of\-N strategy to maximize the chance of a direct success\. For each problem specification, the generator modelMGM\_\{G\}produces 5 candidate solutions, using a temperature of 0\.8 to encourage diverse outputs\. Each of these 5 candidates is then verified against the test set\. If any of the candidates passes all test cases, the process terminates successfully, and that solution is returned\. Only if all 5 candidates fail the verification step does the policy escalate to the planning phase\. The generation is retried up to 3 times if it fails to produce a parsable output\.
##### Planning\.
If the trial fails, the planning phase is triggered\. The planner modelMPM\_\{P\}is prompted to decompose the problem specificationxxby rewriting the main function to call new unimplemented helper functions\. A single decomposition plan is generated with a temperature of 0\.2\. From the resulting Python code block, we parse the new function signatures and docstrings, which become the specifications for the subproblems\. If the output is not a valid code block, this planning step is retried up to 3 times, and the maximum recursion depth is limited to 3 to prevent overly complex solutions\. Once all subproblems are recursively solved, their solutions are composed into a final program and verified against the original test suite forxx\. If this composite solution still fails, PaT initiates a re\-planning loop\. The planner is invoked again on the original problemxx, but this time it is provided with the set of previously successful helper functions as additional context, enabling a more informed and cost\-efficient re\-planning attempt\.
##### Test case generation & verification\.
Our verification process requires a test suite𝒯\(x\)\\mathcal\{T\}\(x\)for each problemxx, which we construct in two stages\. First, we process any example test cases provided directly in the problem description\. Then, to ensure a comprehensive and robust evaluation, we augment the initial suite by prompting an LLM to generate additional test cases based on the problem description, a technique inspired by CodeT\. This test generation process is performed consistently with a temperature of0\.20\.2and is retried up to33times\.
## Appendix DBenchmark Details
##### HumanEval
\(Chenet al\.,[2021](https://arxiv.org/html/2605.07248#bib.bib3)\)is a foundational dataset for evaluating the functional correctness of generated code\. It consists of 164 hand\-written programming problems, each including a function signature, a detailed docstring, and a set of hidden unit tests for evaluation\. This benchmark is the standard for measuring the code generation capabilities of a model\.
##### MBPP
\(Chenet al\.,[2022b](https://arxiv.org/html/2605.07248#bib.bib9)\)is a larger, crowd\-sourced dataset\. A critical challenge with the original MBPP dataset is that the prompts contain the ground\-truth test cases, which can cause label leakage, particularly for baselines that perform selection or refinement\. To ensure a fair comparison, we follow the setup ofShinnet al\.\([2023](https://arxiv.org/html/2605.07248#bib.bib27)\)\. For our experiments, we use a representative subset of 175 problems sampled from the full dataset\.
##### EvalPlus
\(Liuet al\.,[2023](https://arxiv.org/html/2605.07248#bib.bib19)\)ensures a rigorous and reliable evaluation by augmenting the standard test suites of both HumanEval and MBPP\. HumanEval and MBPP sometimes pass solutions that are functionally correct on the provided tests but fail on more subtle edge cases\. EvalPlus mitigates this risk by automatically generating a much larger and more comprehensive set of unit tests, providing a more robust measure of a candidate solution’s true correctness\.
##### xCodeEval
\(Khanet al\.,[2023](https://arxiv.org/html/2605.07248#bib.bib5)\)is more challenging, competition\-level problems, which is sourced from the CodeForces platform\. For our experiments, we sample a subset of 500 problems, after first filtering out any instances with incomplete or invalid test cases\. A key feature of xCodeEval is its difficulty labels, which allow us to partition problems into four categories \(Easy, Mid, Hard, and Expert\) for a more fine\-grained analysis of policy behavior\.
Table 7:Token pricing \(USD per million tokens\)\.The values represent the cost for input and output, respectively\.
## Appendix ELLMs Details
We selected a range of state\-of\-the\-art open\-source language models to evaluate our PaT policy across different architectures and scales\. All experiments were conducted using NVIDIA A 6000 GPUs, with models served via the vLLM framework at float16 precision\. Due to its memory size, the Qwen332B\{\}\_\{\\text\{32B\}\}model was run using 2\-way tensor parallelism across two A6000 GPUs\. Table[7](https://arxiv.org/html/2605.07248#A4.T7)reports input and output token prices \(USD per million tokens\) collected from public provider listings111[https://artificialanalysis\.ai](https://artificialanalysis.ai/)\(accessed 2025\-09\-25\)\.\. Since there is no official price available for DeepSeek\-Coder\-V2\-Lite\-Instruct, we use the pricing policy of DeepSeek\-Coder\-V2 as a proxy\.
##### Qwen3
\(Yanget al\.,[2025](https://arxiv.org/html/2605.07248#bib.bib8)\)As our primary model family, we use the Qwen 3 series in four sizes: 4B, 8B, 14B, and 32B\. For all experiments, we used the base, pre\-trained versions of these models\. Additionally, for all experiments with the Qwen3 model family, we prepended the /no\_think command to all prompts\.
##### Llama\-3\.1
\(Dubeyet al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib20)\)To test the cross\-family generalization of our approach, we use Llama\-3\.1\-8B\-Instruct\. This is the instruction\-tuned version of the Llama\-3\.1 8B model, optimized for following user commands and prompts\.
##### DeepSeek\-Coder\-V2\-Lite
\(Zhuet al\.,[2024](https://arxiv.org/html/2605.07248#bib.bib21)\)To evaluate on a model specifically fine\-tuned for code generation, we use DeepSeek\-Coder\-V2\-Lite\-Instruct\. This is the instruction\-tuned “Lite” version of the DeepSeek\-Coder\-V2 family, with approximately 16B parameters\.
Table 8:Baseline details\.Hyperparameter details for baselines and PaT\.
## Appendix FBaseline Details
##### Standard
\(Brownet al\.,[2020](https://arxiv.org/html/2605.07248#bib.bib17)\)represents the most direct approach to code generation, establishing the base capability of a model\. It performs a single, one\-time generation attempt to produce the entire program from the problem specification\. For our code generation tasks, we follow a few\-shot prompting protocol, providing the model with a small number of in\-context examples to guide its output\. We generate a single candidate solution with a low temperature of0\.30\.3to produce the most probable and deterministic result\. This baseline serves to measure the model’s raw performance without any complex strategies like sampling, refinement, or decomposition\.
##### Best\-of\-N
\(Chenet al\.,[2021](https://arxiv.org/html/2605.07248#bib.bib3)\)aims to improve performance by exploring a diverse set of potential solutions through sampling\. For each problem, it generatesNNcandidate programs using a high temperature to encourage variety\. To ensure a fair comparison with the trial phase of our PaT policy, we use the same sampling parameters: we setN=5N=5and use a temperature of0\.80\.8\. Each of the 5 candidates is then executed against the provided test suite, and the one that passes the most tests is selected as the final solution\. This method increases the probability of finding a correct solution at the cost of generating and evaluating multiple candidates\. Since MBPP does not have a test suite, we deterministically submit the first of the 5 generated candidates\.
##### CodeT
\(Chenet al\.,[2022a](https://arxiv.org/html/2605.07248#bib.bib6)\)is utilized from the implementation provided by the authors of FunCoder to ensure a faithful reproduction of their setup\. The process begins by sampling a pool of candidate solutions\. Following the hyperparameter settings of FunCoder, we used a larger sample size ofN=11N=11with a temperature of0\.80\.8, which diverges from PaT and Best\-of\-N’sN=5N=5\. This is because CodeT’s consensus\-based ranking mechanism is highly dependent on a large and diverse candidate pool to function effectively; using a smaller sample size would artificially weaken this baseline\.
##### FunCoder
\(Chenet al\.,[2024a](https://arxiv.org/html/2605.07248#bib.bib1)\)serves as our primary baseline for the Planning\-before\-Trial \(PbT\) policy, and we use the official implementation to ensure a fair and accurate comparison\. The method operates as a rigid two\-stage pipeline\. First, a planner model decomposes the problem into a complete plan of helper functions using a temperature of0\.80\.8\. Only after this static plan is finalized does it proceed to the second stage, where it solves each subproblem using a consensus\-based mechanism similar to CodeT, withN=10N=10and a temperature of0\.80\.8\. This rigid, plan\-first approach, where the plan is fixed regardless of generation outcomes, provides a direct contrast to our adaptive PaT policy\.
## Appendix GPrompt Details
Our prompting strategy largely follows the one provided in the FunCoder\(Chenet al\.,[2024a](https://arxiv.org/html/2605.07248#bib.bib1)\)implementation to ensure a fair comparison\. We use the same few\-shot examples and sampling methods for both the generation \(conquer\) and planning \(divide\) phases\.
### G\.1Prompt for Generation
Youareaprogrammingcopilot,youcansolveaproblembywritingPythonfunctions\.Yourtaskisto:
\-Foreveryturn,youneedtowriteaPythonfunctionthatreturnstheanswer,basedoncurrentcode\(notcodeinchathistory\)andproblemdescription\.
\-Donotmodifyfunctionname,argnames,docstringingivenfunctions\.
\-Considerreusingexistingfunctionsthatarealreadyimplemented\.
\-Youcanimportlibrariestobettersolvetheproblem\.
<User\>:
CurrentCode:
‘‘‘python
defprime\_factor\(x:int\)\-\>list:
"""getalistofprimefactorsofnumber$x$"""
ret=\[\]
i=1
whilei\*i<=x:
i\+=1
ifx%i==0andis\_prime\(i\):
ret\.append\(i\)
returnret
defis\_prime\(x:int\)\-\>bool:
"""determine$x$isaprimenumberornot"""
ifx<2:
returnFalse
foriinrange\(2,int\(x\*\*0\.5\)\+1\):
ifx%i==0:
returnFalse
returnTrue
defget\_common\(a:list,b:list\)\-\>list:
"""getcommonelementintwolist$a$and$b$"""
ret=\[\]
foritemina:
ifiteminb:
ret\.append\(item\)
returnret
defsum\_common\_factors\(a:int,b:int\)\-\>int:
"""Returnthesumofallcommonprimefactorsof$a$and$b$"""
raiseNotImplementedError\(\)
’’’
Let’sthinkstepbystepandimplementthefollowingmethod‘sum\_common\_factors‘usingexistingfunctionstosolve:
"Returnthesumofallcommonprimefactorsof$a$and$b$"
<Assistant\>:
First,Ineedtogettheprimefactorsof$a$and$b$\.
Second,Icanuse‘for‘looptofindcommonelementintwofactorslist\.
Finally,sumthecommonfactorlistandreturntheanswer\.
Hereisthe‘sum\_common\_factors‘function:
‘‘‘python
defsum\_common\_factors\(a:int,b:int\)\-\>int:
"""Computethesumofallcommonprimefactorsof$a$and$b$"""
factors\_a=prime\_factor\(a\)
factors\_b=prime\_factor\(b\)
common\_factors=get\_common\(factors\_a,factors\_b\)
returnsum\(common\_factors\)
’’’
<User\>:
CurrentCode:
‘‘‘python
\{prev\_code\}
’’’
Let’sthinkstepbystepandimplementthefollowingmethod‘\{cur\_func\_name\}‘usingexistingfunctionstosolve:
"\{cur\_func\_doc\}"
### G\.2Prompt for Planning
Youareaprogrammingcopilot,youcansolveaproblembywritingPythonfunctions\.Yourtaskisto:
\-Thepreviousattempttodirectimplementthetargetfunctionisfailed,indicatingitsoveralllogicmightbetoocomplextoimplementdirectly\.
\-Foreveryturn,youneedtowriteaPythonfunctionthatreturnstheanswerbasedonCurrentCode\(notcodeinchathistory\)\.
\-Donotmodifyfunctionname,argnames,docstringingivenfunctions\.
\-Youcanimportlibrariestobettersolvetheproblem\.
\-Youcanleavenewfunctionunimplementedfornow,butwritethefunctionattheendofthecodeandcommentwhatthefunctiondoes\.
\-Therefore,youmustdecomposeitintomultiplesmaller,manageablehelperfunctions\.
<User\>:
CurrentCode:
‘‘‘python
defsum\_common\_factors\(a:int,b:int\)\-\>int:
"""Computethesumofallcommonprimefactorsof$a$and$b$"""
raiseNotImplementedError\(\)
’’’
Let’sthinkstepbystepandcompletethefollowingPythonfunction‘sum\_common\_factors‘thatsolves:
"Computethesumofallcommonprimefactorsof$a$and$b$"
<Assistant\>:
First,Ineedtogettheprimefactorsof$a$and$b$\.
Second,Icanuse‘for‘looptofindcommonelementintwofactorslist\.
Finally,sumthecommonfactorlistandreturntheanswer\.
Hereisthe‘sum\_common\_factors‘function:
‘‘‘python
defsum\_common\_factors\(a:int,b:int\)\-\>int:
"""Computethesumofallcommonprimefactorsof$a$and$b$"""
factors\_a=prime\_factor\(a\)
factors\_b=prime\_factor\(b\)
common\_factors=get\_common\(factors\_a,factors\_b\)
returnsum\(common\_factors\)
defprime\_factor\(x:int\)\-\>list:
"""getalistofprimefactorsofnumber$x$"""
raiseNotImplementedError\(\)
defget\_common\(a:list,b:list\)\-\>list:
"""getcommonelementintwolist$a$and$b$"""
raiseNotImplementedError\(\)
’’’
<User\>:
CurrentCode:
‘‘‘python
defsum\_common\_factors\(a:int,b:int\)\-\>int:
"""Computethesumofallcommonprimefactorsof$a$and$b$"""
factors\_a=prime\_factor\(a\)
factors\_b=prime\_factor\(b\)
common\_factors=get\_common\(factors\_a,factors\_b\)
returnsum\(common\_factors\)
defget\_common\(a:list,b:list\)\-\>list:
"""getcommonelementintwolist$a$and$b$"""
raiseNotImplementedError\(\)
’’’
Let’sthinkstepbystepandcompletethefollowingPythonfunction‘get\_common‘thatsolves:
"getcommonelementintwolist$a$and$b$"
<Assistant\>:
Hereisthe‘get\_common‘function:
‘‘‘python
defget\_common\(a:list,b:list\)\-\>list:
"""getcommonelementintwolist$a$and$b$"""
ret=\[\]
foritemina:
ifiteminb:
ret\.append\(item\)
returnret
’’’
<User\>:
CurrentCode:
‘‘‘python
\{prev\_code\}
’’’
Let’sthinkstepbystepandcompletethefollowingPythonfunction‘\{cur\_func\_name\}‘thatsolves:
"\{cur\_func\_doc\}"Similar Articles
Efficient Test-time Inference for Generative Planning Models
This paper introduces OCLGen, a compute-efficient test-time search algorithm that integrates generative planning models with a classical Open-Closed List framework, improving solution quality across combinatorial planning domains.
LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling
LoopCoder-v2 proposes Parallel Loop Transformers (PLT) for efficient test-time computation scaling in code generation, showing that two loops yield significant gains while more loops cause diminishing returns and positional mismatch costs.
A complementary study on PlanGPT: Evaluation with defined Performance Metrics and comparison with a planner
This paper presents a complementary evaluation of PlanGPT, a large language model for automated planning, using plan cost and plan generation time metrics, and finds that PlanGPT performs no better than a greedy search strategy.
Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures
This paper introduces Test-Time Personalization (TTP), a framework that improves LLM personalization by scaling inference-time computation through candidate sampling and reward-based selection. It diagnoses failure modes in standard reward models and proposes a probabilistic personalized reward model to mitigate them.
Scaling Test-Time Compute for Agentic Coding
A test-time scaling framework for agentic coding that compresses rollout trajectories into structured summaries and uses recursive voting/PDR to boost Claude-4.5-Opus to 77.6% on SWE-Bench Verified.