Efficient Test-time Inference for Generative Planning Models

arXiv cs.AI 06/02/26, 04:00 AM Papers
planning generative-models test-time-inference search ai-planning icml algorithmic-innovation
Summary
This paper introduces OCLGen, a compute-efficient test-time search algorithm that integrates generative planning models with a classical Open-Closed List framework, improving solution quality across combinatorial planning domains.
arXiv:2606.00618v1 Announce Type: new Abstract: Generative models have emerged as a powerful paradigm for AI planning, yet their performance remains constrained by the training data distribution. One approach is to improve generated solutions during inference by scaling test-time compute. A more efficient alternative is to optimize the inference process itself. In this paper, we show that a modified version of a classical Open-Closed List (OCL) search provides just such an efficient inference procedure. Our algorithm synergizes two learned components: a generative model that performs fast rollouts from intermediate states and a heuristic model that prioritizes among candidate reasoning paths. Key contributions include novel exploration control mechanisms and integration of learned models within the OCL framework. Across multiple combinatorial planning domains, our approach outperforms both neurosymbolic search baselines and classical solvers in computational efficiency and solution quality.
Original Article
View Cached Full Text
Cached at: 06/02/26, 03:47 PM
# Efficient Test-time Inference for Generative Planning Models with OCL Search
Source: [https://arxiv.org/html/2606.00618](https://arxiv.org/html/2606.00618)
###### Abstract

Generative models have emerged as a powerful paradigm for AI planning, yet their performance remains constrained by the training data distribution\. One approach is to improve generated solutions during inference by scaling test\-time compute\. A more efficient alternative is to optimize the inference process itself\. In this paper, we show that a modified version of a classical Open\-Closed List \(OCL\) search provides just such an efficient inference procedure\. Our algorithm synergizes two learned components: a generative model that performs fast rollouts from intermediate states and a heuristic model that prioritizes among candidate reasoning paths\. Key contributions include novel exploration control mechanisms and integration of learned models within the OCL framework\. Across multiple combinatorial planning domains, our approach outperforms both neurosymbolic search baselines and classical solvers in computational efficiency and solution quality\.

Machine Learning, ICML

## 1Introduction

Automated planning seeks action sequences that transform an initial state into one satisfying goal conditions\. Deep generative models have emerged as a promising paradigm for plan generation, offering fast synthesis across diverse domains\(Rossettiet al\.,[2024b](https://arxiv.org/html/2606.00618#bib.bib1)\)\. However, solution quality is bounded by the training data, and collecting large\-scale optimal datasets is often infeasible; even seemingly simple domains like Blocksworld are NP\-hard to solve optimally\(Slaney and Thiébaux,[2001](https://arxiv.org/html/2606.00618#bib.bib23)\)\.

An alternative is to spend additional compute at inference time\. Best\-of\-N sampling\(Stiennonet al\.,[2020](https://arxiv.org/html/2606.00618#bib.bib4)\)repeatedly queries a generative model and returns the best generation, but does not balance exploration and exploitation to find better solutions at lower cost\. Search algorithms like Monte Carlo Tree Search \(MCTS\) have become popular for test\-time compute in formal reasoning domains such as theorem proving\(Chenet al\.,[2024](https://arxiv.org/html/2606.00618#bib.bib3); Hubertet al\.,[2025](https://arxiv.org/html/2606.00618#bib.bib29)\)or code generation\(Antoniadeset al\.,[2025](https://arxiv.org/html/2606.00618#bib.bib30)\)\. However, MCTS methods typically employ UCT\-style selection \(e\.g\., PUCT;Rosin,[2011](https://arxiv.org/html/2606.00618#bib.bib36)\) with exploration incentives growing with parent visitation counts\. Because every traversal originates at the root, deeper nodes are visited less frequently, often yielding wide but shallow trees\. This depth imbalance is problematic for long\-horizon planning, where critical decisions deep in the tree remain underexplored\.

Open\-Closed List \(OCL\) algorithms\(Hartet al\.,[1968](https://arxiv.org/html/2606.00618#bib.bib32); Valenzano and Xie,[2016](https://arxiv.org/html/2606.00618#bib.bib33)\)like A\* select nodes globally and underpin state\-of\-the\-art symbolic planners\(Helmert,[2003](https://arxiv.org/html/2606.00618#bib.bib2)\)\. However, unlike MCTS where generative models can serve as rollout policies, standard OCL provides no mechanism for fast candidate generation nor does it compensate for inadmissible heuristics\. When trained on suboptimal demonstrations, learned heuristics overestimate cost\-to\-go, disproportionately inflating estimates for states far from the goal and causing greedy behavior that favors near\-goal nodes\. Moreover, querying generative models is computationally expensive, necessitating selective invocation\.

We presentOCLGen, a compute\-efficient test\-time search algorithm for autoregressive planning models that substantially improves plan quality\. Our approach addresses these challenges through three innovations: \(1\)*depth\-partitioned selection*maintains separate open lists per depth level, ensuring balanced exploration despite overestimating heuristics; \(2\)*truncated rollouts with adaptive expansion*uses the generative model for fast multi\-step synthesis while branching at low\-confidence decision points; and \(3\)*distributional heuristic estimation*ranks nodes by a lower\-percentile cost\-to\-go estimate, targeting their best attainable outcome\.

Experiments on four classical planning domains demonstrate thatOCLGenconverges to shorter plans significantly faster than baselines\. On problems with known optimal solutions, our method achieves 87\.3% optimality across domains, compared to 49\.8% for MCTS given the same compute budget\.OCLGenalso provides an effective foundation for iterative self\-improvement: after three refinement iterations, our method achieves 100% optimal plans on Blocksworld and 94\.7% on Sokoban, compared to 13\.5% and 51\.3%, respectively, for the base model with Best\-of\-N sampling given the same runtime budget\. In summary, our contributions include:

- •A modernized OCL framework with autoregressive models for fast rollout generation\.
- •A distributional approach for deploying overestimating heuristics in search\.
- •Comprehensive evaluation across four domains with ablations validating each design choice\.
- •Demonstration ofOCLGen’s effectiveness within a recursive self\-improvement framework\.

## 2Related Work

#### Generative Models for AI Planning

Deep generative models have emerged as a promising approach for planning by learning to generate solutions from collections of solved instances\. Recent work has explored both pre\-trained large language models \(LLMs\) and smaller domain\-specialized transformers for plan generation\. While LLMs can be fine\-tuned for planning tasks\(Pallaganiet al\.,[2023](https://arxiv.org/html/2606.00618#bib.bib5)\), concerns about inference costs and reliability have motivated training smaller transformers from scratch on planning data\. These approaches typically formulate planning as autoregressive sequence generation, predicting plans token by token\. For example, PlanGPT\(Rossettiet al\.,[2024b](https://arxiv.org/html/2606.00618#bib.bib1)\)trains a GPT2 model on plans from a domain\-independent planner by splitting individual actions into sequences of operator name and object tokens\. Extensions include integrating action validators during generation\(Rossettiet al\.,[2024a](https://arxiv.org/html/2606.00618#bib.bib7)\), repairing invalid plans via local search\(Tummoloet al\.,[2024](https://arxiv.org/html/2606.00618#bib.bib8)\), and leveraging symmetries for improved generalization\(Fritzscheet al\.,[2025](https://arxiv.org/html/2606.00618#bib.bib9)\)\. Our work adopts the model architecture and tokenization scheme from\(Rossettiet al\.,[2024b](https://arxiv.org/html/2606.00618#bib.bib1)\)and contributes an efficient inference algorithm\.

#### Test\-time Search

Test\-time inference in generative models is increasingly framed as a search over candidate solutions or reasoning paths\. Recent frameworks include Tree of Thoughts\(Yaoet al\.,[2023](https://arxiv.org/html/2606.00618#bib.bib13)\), which applies breadth\-first search or depth\-first search to deliberative reasoning, and rStar\(Guanet al\.,[2025](https://arxiv.org/html/2606.00618#bib.bib14)\), which leverages MCTS for multi\-step reasoning\. In formal domains, DeepSeek\-Prover\-V1\.5\(Xinet al\.,[2025](https://arxiv.org/html/2606.00618#bib.bib31)\)integrates MCTS with proof assistant feedback, while PG\-TD\(Zhanget al\.,[2023](https://arxiv.org/html/2606.00618#bib.bib15)\)combines AlphaZero\-style search with test case execution for verified code generation\. While most of these methods target general LLM reasoning, in this work we focus on smaller, domain\-specialized planning models, where the same search principles can substantially improve output quality\.

#### Self\-Improvement via Search

Test\-time search can serve as a standalone inference procedure or underpin recursive self\-improvement, where search\-generated solutions provide supervision for retraining\. Combining search and learning in a self\-improvement loop dates back to early checkers programs\(Samuel,[1959](https://arxiv.org/html/2606.00618#bib.bib37)\), and has since been applied to classical planning via heuristic search with learned neural heuristics\(Jabbari Arfaeeet al\.,[2011](https://arxiv.org/html/2606.00618#bib.bib38); Groshevet al\.,[2018](https://arxiv.org/html/2606.00618#bib.bib42)\)\. The idea gained widespread attention with AlphaGo\(Silveret al\.,[2016](https://arxiv.org/html/2606.00618#bib.bib39)\)and AlphaZero\(Silveret al\.,[2017](https://arxiv.org/html/2606.00618#bib.bib43)\), which coupled MCTS with deep networks trained via self\-play\. Similarly, the Expert Iteration \(ExIt\) framework\(Anthonyet al\.,[2017](https://arxiv.org/html/2606.00618#bib.bib41)\)formalizes this loop, where an*expert*search iteratively supervises an*apprentice*policy\. Analogous frameworks have recently been applied to fine\-tuning transformers for reasoning\(Lehnertet al\.,[2024](https://arxiv.org/html/2606.00618#bib.bib40); Zhanget al\.,[2024](https://arxiv.org/html/2606.00618#bib.bib44); Chenet al\.,[2024](https://arxiv.org/html/2606.00618#bib.bib3)\), theorem proving\(Xinet al\.,[2025](https://arxiv.org/html/2606.00618#bib.bib31)\), and multi\-step reasoning\(Guanet al\.,[2025](https://arxiv.org/html/2606.00618#bib.bib14)\)\. The generative planning framework inGieselmannet al\.\([2026](https://arxiv.org/html/2606.00618#bib.bib45)\)combines Best\-of\-N sampling with graph processing to refine a transformer in a recursive loop that iterates between data curation and model finetuning\. We build on this work, focusing on efficient test\-time search which further accelerates self\-improvement\.

## 3Preliminaries

### 3\.1Classical Planning in PDDL

We focus on classical single\-agent planning in deterministic, fully observable, discrete environments\. A planning problem is specified by a finite set of objects𝒪\\mathcal\{O\}and a finite set of predicates𝒫\\mathcal\{P\}, where each predicatep∈𝒫p\\in\\mathcal\{P\}has an associated arityarity\(p\)\\mathrm\{arity\}\(p\)\. Grounded atoms are formed by instantiating predicates with objects:ℱ=⋃p∈𝒫\{p\(o1,…,oarity\(p\)\)∣o1,…,oarity\(p\)∈𝒪\}\.\\mathcal\{F\}=\\bigcup\_\{p\\in\\mathcal\{P\}\}\\left\\\{\\,p\(o\_\{1\},\\ldots,o\_\{\\mathrm\{arity\}\(p\)\}\)\\mid o\_\{1\},\\ldots,o\_\{\\mathrm\{arity\}\(p\)\}\\in\\mathcal\{O\}\\,\\right\\\}\.A states∈𝒮=2ℱs\\in\\mathcal\{S\}=2^\{\\mathcal\{F\}\}represents the set of atoms that hold true in a given world configuration\. Actionsa∈𝒜a\\in\\mathcal\{A\}are grounded operatorsa=⟨pre\(a\),add\(a\),del\(a\)⟩a=\\langle\\mathrm\{pre\}\(a\),\\mathrm\{add\}\(a\),\\mathrm\{del\}\(a\)\\rangle, where preconditionspre\(a\)⊆ℱ\\mathrm\{pre\}\(a\)\\subseteq\\mathcal\{F\}, add effectsadd\(a\)⊆ℱ\\mathrm\{add\}\(a\)\\subseteq\\mathcal\{F\}, and delete effectsdel\(a\)⊆ℱ\\mathrm\{del\}\(a\)\\subseteq\\mathcal\{F\}determine applicability and outcome\. An actionaais applicable in statesswhenpre\(a\)⊆s\\mathrm\{pre\}\(a\)\\subseteq s, yielding successor stateF\(s,a\)=\(s∖del\(a\)\)∪add\(a\)F\(s,a\)=\(s\\setminus\\mathrm\{del\}\(a\)\)\\cup\\mathrm\{add\}\(a\)\. Goalssg∈𝒢⊆2ℱs\_\{g\}\\in\\mathcal\{G\}\\subseteq 2^\{\\mathcal\{F\}\}are partial state descriptions\. A statesssatisfies goalsgs\_\{g\}ifsg⊆ss\_\{g\}\\subseteq s\. A planτ=\(a0,…,aT−1\)\\tau=\(a\_\{0\},\\ldots,a\_\{T\-1\}\)is valid if each actionata\_\{t\}is applicable insts\_\{t\}, and executing the sequence froms0s\_\{0\}viast\+1=F\(st,at\)s\_\{t\+1\}=F\(s\_\{t\},a\_\{t\}\)reaches a final statesTs\_\{T\}satisfying the goal\. In this work, we define optimality as minimum plan length\.

The Planning Domain Definition Language \(PDDL\)\(McDermottet al\.,[1998](https://arxiv.org/html/2606.00618#bib.bib18); Fox and Long,[2003](https://arxiv.org/html/2606.00618#bib.bib19)\)serves as the standard formalism in AI planning, underpinning benchmarks such as the International Planning Competition\(Vallatiet al\.,[2015](https://arxiv.org/html/2606.00618#bib.bib20)\)\. A PDDL domain defines predicates, typed objects, and operators specified through preconditions and effects\. Individual problem instances declare concrete objects, an initial state, and goal conditions\. We use PDDL as the formal language to formulate problems, states, and plans while leveraging its structure to tokenize planning problems for autoregressive sequence generation\.

### 3\.2Autoregressive Planning Models

We investigate test\-time inference for generative modelsπθ\\pi\_\{\\theta\}that produce distributions over actions conditioned on an initial states0∈𝒮s\_\{0\}\\in\\mathcal\{S\}, a goal specificationsg∈𝒢s\_\{g\}\\in\\mathcal\{G\}, and the sequence of previous actionsa<t∈𝒜ta\_\{<t\}\\in\\mathcal\{A\}^\{t\}, i\.e\.,πθ\(at∣s0,sg,a<t\)\\pi\_\{\\theta\}\(a\_\{t\}\\mid s\_\{0\},s\_\{g\},a\_\{<t\}\)\. FollowingRossettiet al\.\([2024b](https://arxiv.org/html/2606.00618#bib.bib1)\), we modelπθ\\pi\_\{\\theta\}as a decoder\-only transformer\(Radfordet al\.,[2019](https://arxiv.org/html/2606.00618#bib.bib21)\)that represents states and actions by tokenizing PDDL descriptions of predicates, operator names and objects\. This representation enables the model to navigate the combinatorially large space of grounded actions using a compact, fixed\-size vocabulary\. Actions are split into tokens and generated autoregressively\.

To illustrate, consider the Blocksworld domain — a classical planning benchmark where the objective is to rearrange a set of blocks \(e\.g\.,b1,b2,b3\) into a specified goal configuration using four operators:pickupandputdownfor moving blocks to and from the table, andstackandunstackfor placing or removing blocks on top of one another\. In this domain, the grounded actionunstack b1 b2\(removing blockb1from atop blockb2\) is encoded as a sequence comprising the operator tokenunstackfollowed by argument tokensb1andb2\. At each generation step, the newly predicted token is concatenated to the input sequence for the subsequent forward pass\. The vocabulary is derived directly from PDDL domain and problem files, comprising predicate names, object identifiers, type declarations, and special delimiter tokens\. A domain\-specific upper bound on the number of objects ensures bounded representations\.

## 4Efficient Test\-time Search for Autoregressive Planning

Using the generative modelπθ\\pi\_\{\\theta\}defined in Section[3\.2](https://arxiv.org/html/2606.00618#S3.SS2), we develop an efficient test\-time search algorithm\. Our objective is to minimize plan length while generalizing across a distribution of problem instancesP𝒮×𝒢P\_\{\\mathcal\{S\}\\times\\mathcal\{G\}\}\. For each planning domain, a modelπθ\\pi\_\{\\theta\}is trained via supervised learning on𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}, a dataset of suboptimal plans generated by running a symbolic solver on a set of instances drawn fromP𝒮×𝒢P\_\{\\mathcal\{S\}\\times\\mathcal\{G\}\}\.

### 4\.1OCL Search with Generative Models

We introduceOCLGen, an efficient test\-time search algorithm for generative planning models built on the Open\-Closed List \(OCL\) framework\(Hartet al\.,[1968](https://arxiv.org/html/2606.00618#bib.bib32); Valenzano and Xie,[2016](https://arxiv.org/html/2606.00618#bib.bib33)\)\. This framework represents a general class of graph search algorithms that includes, e\.g\.,A∗A^\{\*\}\. OCL search maintains an open list of frontier nodes pending exploration and a closed list that tracks already\-visited states to prevent redundant expansions\. At each iteration, a nodennis selected from the open list𝒪\\mathcal\{O\}according to a priority functionf\(n\)=g\(n\)\+h\(n\)f\(n\)=g\(n\)\+h\(n\), whereg\(n\)g\(n\)denotes the incurred path cost andh\(n\)h\(n\)the heuristic cost\-to\-go\. The selected node is then moved to the closed list𝒞\\mathcal\{C\}and expanded to generate successors\. Crucially, the framework detects when a node is revisited via a shorter path from the root, updates itsg\(n\)g\(n\)value accordingly, and moves it back to𝒪\\mathcal\{O\}\.

#### Modifications\.

To adapt OCL search for generative planning models, we introduce several modifications to the standard framework\. First, we introduce*depth\-partitioned selection*, maintaining separate open lists per depth level \(determined byg\(n\)g\(n\)\) to balance exploration across the solution depth and counteract systematic heuristic overestimation\. Second, we perform node expansions via*truncated rollouts*using the generative modelπθ\\pi\_\{\\theta\}rather than single\-step transitions\. This allows us to rapidly generate sequences of successor nodes, which effectively prunes large or combinatorial action spaces\. Within each rollout, we further apply*adaptive expansion*, using the generative model’s token confidence to identify critical decision points\. These branching points are expanded immediately, broadening exploration when the model is uncertain and narrowing it when the model is confident\. Finally, we integrate a learned distributional heuristic modelhϕh\_\{\\phi\}to guide node selection\.

Altogether, these components modernize OCL graph search by combining even depth\-level coverage with heuristic value guidance, while dynamically constraining breadth based on model confidence\. We detail each modification below\.

### 4\.2Depth\-Partitioned Selection

Heuristic models optimized to predict the cost\-to\-goh\(n\)h\(n\)from suboptimal training data systematically overestimate the true cost\-to\-goh∗\(n\)h^\{\*\}\(n\)\. To observe how this biases selection, we model the resulting bias as multiplicative,h\(n\)≈αh∗\(n\)h\(n\)\\approx\\alpha\\,h^\{\*\}\(n\)for someα\>1\\alpha\>1\.

Consider a pathological case: two unvisited frontier nodes,n1n\_\{1\}andn2n\_\{2\}, situated on different branches but projected to yield solutions of identical total lengthLL\(i\.e\.,g\(n1\)\+h∗\(n1\)=g\(n2\)\+h∗\(n2\)=Lg\(n\_\{1\}\)\+h^\{\*\}\(n\_\{1\}\)=g\(n\_\{2\}\)\+h^\{\*\}\(n\_\{2\}\)=L\)\. Assuming uniform unit action costs, the occured costg\(n\)g\(n\)equals the node’s depthdd\. Supposen1n\_\{1\}is shallower thann2n\_\{2\}\(g\(n1\)<g\(n2\)g\(n\_\{1\}\)<g\(n\_\{2\}\)\)\. Under our bias model, the priority function evaluates to:

f\(n\)=g\(n\)\+α\(L−g\(n\)\)=αL−\(α−1\)g\(n\)\.f\(n\)=g\(n\)\+\\alpha\(L\-g\(n\)\)=\\alpha L\-\(\\alpha\-1\)\\,g\(n\)\.\(1\)Becausef\(n\)f\(n\)is strictly*decreasing*with respect tog\(n\)g\(n\), the scores satisfyf\(n2\)<f\(n1\)f\(n\_\{2\}\)<f\(n\_\{1\}\)\. Consequently, selecting the node with the lowestff\-score from a global open list𝒪\\mathcal\{O\}\(as in standardA∗A^\{\*\}\) strictly prefers the deeper noden2n\_\{2\}, despite both offering paths to equally good solutions\. The search thus systematically over\-commits to deeper, arbitrary branches, leaving promising shallower alternative routes—where course corrections are most valuable—unexpanded\.

To mitigate this, we partition the unvisited nodes by depth, maintaining a separate open list𝒪d\\mathcal\{O\}\_\{d\}for each depth leveldd\(whereg\(n\)=dg\(n\)=d\)\. At each iteration, we first select a depth leveldsd\_\{\\text\{s\}\}\(Figure[1](https://arxiv.org/html/2606.00618#S4.F1)a\), then extract a nodensn\_\{\\text\{s\}\}from𝒪ds\\mathcal\{O\}\_\{d\_\{s\}\}based on the heuristic estimatehϕh\_\{\\phi\}\(Figure[1](https://arxiv.org/html/2606.00618#S4.F1)b\)\. This two\-stage selection ensures even exploration across potential solution lengths\. Crucially, at a fixed depthdd, pairwise comparisons within𝒪d\\mathcal\{O\}\_\{d\}depend only on the ranking signalαh∗\(n\)\\alpha h^\{\*\}\(n\)\. Thus, our learned heuristichϕh\_\{\\phi\}provides a useful ranking signal within each depth level despite its absolute overestimation\.

We investigate two strategies for selectingdsd\_\{\\text\{s\}\}from\{0,1,…,dmax−1\}\\\{0,1,\\ldots,d\_\{\\max\}\-1\\\}, wheredmaxd\_\{\\max\}is the depth of the shallowest goal state found so far \(or the maximum graph depth if no solution exists\):*Uniform selection*samplesdsd\_\{\\text\{s\}\}uniformly at random, while*Scan selection*iterates sequentially fromd=0d=0todmax−1d\_\{\\max\}\-1, ensuring systematic coverage across all levels\.

![Refer to caption](https://arxiv.org/html/2606.00618v1/x1.png)Figure 1:Illustration of one iteration with OCLGen\. Note that we maintain a graph structure of states and transitions that is updated after every search iteration\.
### 4\.3Confidence\-based Node Expansion

Having selected a depth leveldsd\_\{\\text\{s\}\}and extracted the highest\-scoring nodens∈𝒪dsn\_\{s\}\\in\\mathcal\{O\}\_\{d\_\{s\}\}according to the heuristichϕh\_\{\\phi\}, we perform a truncated rollout starting fromnsn\_\{\\text\{s\}\}with the generative modelπθ\\pi\_\{\\theta\}\(Figure[1](https://arxiv.org/html/2606.00618#S4.F1)c\), producing a partial plan segment\(a0,a1,…\)\(a\_\{0\},a\_\{1\},\\ldots\)\. The rollout is capped at a fixed number of tokens, chosen shorter than the longest plan’s token sequence in the training data; this reduces computational overhead while still providing useful search guidance\. We then decide, action by action, where to branch\. For each generated action in the rollout, we compute a confidence value by taking the minimum probability assigned to any of the sampled tokens comprising that action \(operator and argument tokens\)\. We then apply*adaptive expansion*\(Figure[1](https://arxiv.org/html/2606.00618#S4.F1)d\): for each nodenin\_\{i\}and outgoing actionaia\_\{i\}along the rollout where the action’s confidence value falls below a thresholdτconf\\tau\_\{\\text\{conf\}\}, we generate valid successor states as in classical node expansion and move nodenin\_\{i\}to𝒞\\mathcal\{C\}\. Nodes whose outgoing action has confidence aboveτconf\\tau\_\{\\text\{conf\}\}are not expanded, concentrating computational resources on the critical decision points where exploration is most valuable\. This contrasts with best\-of\-N sampling, which redraws complete plans and repeatedly regenerates the high\-confidence segments shared across samples\. Low confidence flags the decision points worth exploring; one plausible source is that the training data contains multiple alternative solutions to similar problems, leading the model to spread probability across competing actions\.

To quantify the resulting savings, assume a constant branching factorbb\. Exhaustive expansion explores∑k=1dbk\\sum\_\{k=1\}^\{d\}b^\{k\}nodes to reach depthdd\. Adaptive expansion instead follows confident actions greedily \(local branching factor 1\) and branches exhaustively otherwise \(local branching factorbb\)\. If the model is confident at fractionγ∈\[0,1\]\\gamma\\in\[0,1\]of decision steps, the expected branching factor becomes

b~=\(1−γ\)b\+γ\\tilde\{b\}=\(1\-\\gamma\)\\,b\+\\gamma\(2\)and, assuming branching decisions are approximately independent across nodes, the expected number of explored nodes to depthddbecomes∑k=1db~k\\sum\_\{k=1\}^\{d\}\\tilde\{b\}^\{\\,k\}\. Atγ=0\\gamma=0, Equation \([2](https://arxiv.org/html/2606.00618#S4.E2)\) recovers full expansion; atγ=1\\gamma=1, the sum collapses todd\(a single path\)\. Because the node count grows asb~k\\tilde\{b\}^\{\\,k\}, even a modestγ\\gammacompounds into substantial savings with depth, making rollouts with adaptive expansion especially effective in domains with large combinatorial action spaces\.

All newly generated but unexpanded nodes are assigned to their respective depth\-partitioned open lists𝒪d\\mathcal\{O\}\_\{d\}\(or the closed list𝒞\\mathcal\{C\}if their state has already been visited\)\. Consistent with standard OCL search, we updategg\-costs and reassign nodes to the appropriate depth list𝒪d\\mathcal\{O\}\_\{d\}\.

### 4\.4Distributional Heuristic Model

We frame cost\-to\-go prediction as a discrete classification task over the set of possible remaining plan lengthsℒ=\{0,1,…,Lmax\}\\mathcal\{L\}=\\\{0,1,\\ldots,L\_\{\\max\}\\\}\. The model, parameterized byϕ\\phi, outputs a conditional probability distributionPϕ\(c∣s,sg\)P\_\{\\phi\}\(c\\mid s,s\_\{g\}\)representing the predicted probability that the true remaining plan length from statessto goalsgs\_\{g\}is exactlyc∈ℒc\\in\\mathcal\{L\}\. This model is trained via maximum likelihood estimation using a standard cross\-entropy loss on𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}\. The scalar heuristic valuehϕ\(n\)h\_\{\\phi\}\(n\)used downstream for search is extracted as a summary statistic from this distribution\.

Because the model is trained on suboptimal demonstrations,PϕP\_\{\\phi\}concentrates on suboptimal plan lengths, leaving the true optimal cost\-to\-goh∗h^\{\*\}in its lower tail\. Depth partitioning \(Section[4\.2](https://arxiv.org/html/2606.00618#S4.SS2)\) already neutralizes the bias that overestimation induces when comparing nodes across depths; the choice of summary statistic plays a complementary role within a depth\. Since search ultimately retains only the best plan reachable from a node, we wanthϕh\_\{\\phi\}to rank nodes by their best attainable completion rather than their typical, suboptimal one\. We therefore summarizePϕP\_\{\\phi\}by its lowerkk\-th percentile rather than the mode, targeting the optimistic tail\. While selecting thekk\-th percentile is not universally more optimistic than the mode for all distribution shapes, it empirically provides a useful ranking signal in our domains \(see Appendix[F\.3](https://arxiv.org/html/2606.00618#A6.SS3)\)\.

For unexpanded nodesnin\_\{i\}along a rollout \(where confidence exceededτconf\\tau\_\{\\text\{conf\}\}\), we minimize forward passes through the heuristic model by backing up the estimate from the nearest downstream expanded nodenjn\_\{j\}\. Specifically, we seth\(ni\)=hϕ\(nj\)\+g\(nj\)−g\(ni\)h\(n\_\{i\}\)=h\_\{\\phi\}\(n\_\{j\}\)\+g\(n\_\{j\}\)\-g\(n\_\{i\}\), unlessnin\_\{i\}is the final state of the rollout\.

### 4\.5Base Model Improvements via Action Compilation

Building onRossettiet al\.\([2024b](https://arxiv.org/html/2606.00618#bib.bib1)\), we introduce a new training data augmentation strategy that samples a random offset for each ground\-truth plan, computes the corresponding intermediate state using a formal plan validator\(Howeyet al\.,[2004](https://arxiv.org/html/2606.00618#bib.bib34)\), and constructs training examples from these intermediate states paired with their remaining action sequences\. This exposes the model to diverse starting states and shorter planning horizons, improving generalization necessary when calling the model on intermediate states within the search graph\. We apply this strategy when training both policy and heuristic models\. Empirical validation is provided in Appendix[F\.1](https://arxiv.org/html/2606.00618#A6.SS1)\.

### 4\.6Algorithm Overview

Algorithm 1Depth\-partitioned OCL for Generative ModelsInput: Initial states0s\_\{0\}, goalsgs\_\{g\}, iterationsnitersn\_\{\\text\{iters\}\}, initial rolloutsNinitN\_\{\\text\{init\}\}, generative modelπθ\\pi\_\{\\theta\}, heuristic modelhϕh\_\{\\phi\}

\# Initialize empty search graph

G←G\\leftarrowSearchGraph\(\)

\# Add root node to graph

G\.addG\\texttt\{\.add\}\(Node\(

s0s\_\{0\}, g =

0, h =

hϕ\(s0,sg\)h\_\{\\phi\}\(s\_\{0\},s\_\{g\}\)\)\)

\# Execute initial rollouts to populate graph

G←initial\_rolloutsG\\leftarrow\\texttt\{initial\\\_rollouts\}\(

GG,

πθ\\pi\_\{\\theta\},

s0s\_\{0\},

sgs\_\{g\},

NinitN\_\{\\text\{init\}\}\)

\# Main search loop

for

iiteri\_\{\\text\{iter\}\}in

1…niters1\\ldots n\_\{\\text\{iters\}\}do

\# Select a depth level

ds←select\_depthd\_\{\\text\{s\}\}\\leftarrow\\texttt\{select\\\_depth\}\(

GG,

iiteri\_\{\\text\{iter\}\}\)\(Fig\.[1](https://arxiv.org/html/2606.00618#S4.F1)a\)

\# Select node from depth\-specific open list

ns←select\_noden\_\{\\text\{s\}\}\\leftarrow\\texttt\{select\\\_node\}\(

GG,

dsd\_\{\\text\{s\}\}\)\(Fig\.[1](https://arxiv.org/html/2606.00618#S4.F1)b\)

\# Generate new nodes from rollout

Sroll←partial\_rolloutS\_\{\\text\{roll\}\}\\leftarrow\\texttt\{partial\\\_rollout\}\(

πθ\\pi\_\{\\theta\},

nsn\_\{\\text\{s\}\},

sgs\_\{g\}\)\(Fig\.[1](https://arxiv.org/html/2606.00618#S4.F1)c\)

\# Generate new child nodes by expansion

Sexp←adaptive\_expansionS\_\{\\text\{exp\}\}\\leftarrow\\texttt\{adaptive\\\_expansion\}\(

nsn\_\{\\text\{s\}\},

sgs\_\{g\},

SrollS\_\{\\text\{roll\}\}\)\(Fig\.[1](https://arxiv.org/html/2606.00618#S4.F1)d\)

\# Queryhϕh\_\{\\phi\}on expanded nodes; assign heuristics via backpropagation to remaining rollout nodes

ℋ←assign\_heuristics\\mathcal\{H\}\\leftarrow\\texttt\{assign\\\_heuristics\}\(

SexpS\_\{\\text\{exp\}\},

SrollS\_\{\\text\{roll\}\},

hϕh\_\{\\phi\},

sgs\_\{g\}\)

\# Add new nodes to graph

for

\(si,hi\)\(s\_\{i\},h\_\{i\}\)in

ℋ\\mathcal\{H\}do

G\.addG\\texttt\{\.add\}\(Node\(

sis\_\{i\}, g =

g\(parent\(si\)\)\+1g\(\\text\{parent\}\(s\_\{i\}\)\)\{\+\}1, h =

hih\_\{i\}\)\)

endfor

endfor

return

GG\.get\_best\_plan\(\)

We structureOCLGenas an anytime algorithm — one that returns a valid solution at any point and continues refining it as more compute becomes available\. Initially, we executeNinitN\_\{\\text\{init\}\}full rollouts froms0s\_\{0\}to establish baseline solutions and populate the search graph with goal\-reaching trajectories\. The algorithm then iteratively refines these solutions through depth\-partitioned search until a compute budget is reached \(e\.g\., a maximum number of iterations or wall\-clock time\)\. Algorithm[1](https://arxiv.org/html/2606.00618#alg1)provides pseudocode for the main search loop, and Figure[1](https://arxiv.org/html/2606.00618#S4.F1)illustrates each step visually\. Each iteration proceeds as follows: \(a\)select\_depthchooses a depth level to focus expansion; \(b\)select\_nodepicks a promising node from the corresponding open list; \(c\)partial\_rolloutgenerates a truncated trajectory from the selected node towardsgs\_\{g\}; \(d\)adaptive\_expansionuses rollout confidence to decide which intermediate states to expand; and \(e\) newly generated nodes are scored and added to the graph, withgg\-costs updated and nodes reassigned to their depth\-partitioned open lists𝒪d\\mathcal\{O\}\_\{d\}or the closed list𝒞\\mathcal\{C\}if their state has already been visited\. Upon termination, the algorithm returns the lowest\-cost plan found\.

## 5Experiments

We provide a large\-scale evaluation of test\-time inference methods for generative planning in classical AI planning domains\. Our experiments address the following questions:

1. 1\.DoesOCLGenproduce shorter plans than baseline methods for a given compute budget while maintaining high completion rates?
2. 2\.DoesOCLGensignificantly increase the proportion of generated solutions that are optimal?
3. 3\.Which components of our method contribute most to its effectiveness?
4. 4\.To what extent isOCLGensuitable as the basis for a model self\-improvement framework?

### 5\.1Test\-time Inference Benchmark

#### Planning Domains\.

We consider the following domains with varying challenges and levels of complexity:

- •Blocksworld: A classic domain where blocks must be rearranged into goal configurations\. Optimal planning is NP\-hard\(Slaney and Thiébaux,[2001](https://arxiv.org/html/2606.00618#bib.bib23)\)\.
- •Logistics: A transportation domain where packages are delivered between locations using trucks and airplanes across multiple cities\.
- •Labyrinth: A navigation domain on a grid where the agent can move between connected cells or shift entire rows/columns, with cells wrapping around edges\.
- •Sokoban: A puzzle domain where an agent pushes boxes to target locations while navigating around walls\. Solving Sokoban is known to be PSPACE\-complete\(Culberson,[1998](https://arxiv.org/html/2606.00618#bib.bib22)\)\.

#### Training and Evaluation Setup\.

For each domain, we trainπθ\\pi\_\{\\theta\}andhϕh\_\{\\phi\}on10510^\{5\}problem instances with suboptimal solutions generated by Fast Downward \(LAMA\-first configuration\(Richter and Westphal,[2010](https://arxiv.org/html/2606.00618#bib.bib24)\)\)\. Both models use the data augmentation described in Section[4\.5](https://arxiv.org/html/2606.00618#S4.SS5)\. We evaluate on 1000 held\-out test instances per domain\. We use the GPT\-2\(Radfordet al\.,[2019](https://arxiv.org/html/2606.00618#bib.bib21)\)architecture to implementπθ\\pi\_\{\\theta\}, and a smaller BERT encoder\(Devlinet al\.,[2019](https://arxiv.org/html/2606.00618#bib.bib46)\)forhϕh\_\{\\phi\}\. Both models are trained using the standard cross\-entropy loss\. Final models are selected based on the checkpoint with the lowest validation loss\. Further details on model architecture and training hyperparameters are provided in App\.[C](https://arxiv.org/html/2606.00618#A3)\.

#### Baselines\.

We compare against the following baselines:

- •MCTS \(full rollouts\): Monte Carlo Tree Search usingπθ\\pi\_\{\\theta\}for rollouts and with PUCT selection\. We usehϕh\_\{\\phi\}to compute values of unsuccessful rollouts\.
- •MCTS \(partial rollouts\): MCTS with shorter rollouts, usinghϕh\_\{\\phi\}to obtain a value for backpropagation from the last state of the rollout\.
- •OCL\-Anytime A\*: Classical A\* search usinghϕh\_\{\\phi\}as the heuristic, run in anytime mode\.
- •OCL\-GBFS: Greedy best\-first search usinghϕh\_\{\\phi\}as the heuristic for node prioritization\.
- •Best\-of\-N: Generates a set of plan sequences withπθ\\pi\_\{\\theta\}given a runtime limit and returns the shortest valid one\.
- •FD\-LAMA\-anytime: Fast Downward with LAMA configuration\(Richter and Westphal,[2010](https://arxiv.org/html/2606.00618#bib.bib24)\)which returns the best solution within the time limit\.
- •FD\-LAMA\-first: This baseline returns the first solution found by FD\-LAMA\-anytime\. This method generated our data for generative model pretraining\.
- •FD\-optimal: Fast Downward with the LM\-Cut heuristic \(48h timeout\), establishing reference sets of optimal plans for each domain\.

All learned methods use the same generative and heuristic models, isolating the effect of the inference algorithm\. A 10\-minute timeout is given per problem \(except for FD solvers\)\. ForOCLGen, we evaluate two different depth selection strategies*uniform*and*scan*\(Section[4\.2](https://arxiv.org/html/2606.00618#S4.SS2)\)\. We empirically determined an action confidence thresholdτconf=0\.95\\tau\_\{\\text\{conf\}\}\{=\}0\.95for all domains, except for Logistics where we useτconf=0\.2\\tau\_\{\\text\{conf\}\}\{=\}0\.2\. More details on hyperparameter choices for inference are given in Appendix[D](https://arxiv.org/html/2606.00618#A4)\.

Table 1:Benchmark on unseen problems \(1000 per domain\)\. Comp\.: Completion \(%\); Length: Mean plan length \(± std\. error\)\.
#### Plan Lengths\.

Table[1](https://arxiv.org/html/2606.00618#S5.T1)presents completion rates and plan lengths across all domains\.OCLGenachieves the best combination of completion rate and plan quality among all methods operating within the 10\-minute budget per problem instance\. BothOCLGenvariants attain near\-perfect completion rates \(99\.7–100%\) while producing consistently shorter plans than competing approaches\. Notably,OCLGenreduces average plan length by 19\.6% compared to MCTS with full rollouts on Blocksworld and by 17\.8% on Labyrinth\. While MCTS with partial rollouts and search\-based methods \(OCL\-Anyt\. A\*, OCL\-GBFS\) achieve competitive results on simpler domains, they fail to scale to the more challenging Logistics and Sokoban benchmarks, with completion rates dropping to 1\.3–37\.3%\. Best\-of\-N sampling achieves high completion rates but consistently yields longer plans thanOCLGenacross all domains\. The FD\-optimal solver, even with a 48\-hour time limit, solves only 16\.9–63% of instances on Blocksworld, Logistics, and Sokoban, underscoring the inherent difficulty of these domains\. Finally, the*uniform*and*scan*selection strategies yield comparable performance, while*uniform*achieves slightly shorter plans overall except for Sokoban\.

Table 2:Benchmark on subset of known optimal problems\. Optimal: num\. solved optimally; Length: Plan length \(num\. solved\)\.
#### Solution Optimality\.

Table[2](https://arxiv.org/html/2606.00618#S5.T2)examines plan statistics on the subset of instances for which optimal solutions are known \(solved by FD\-optimal within 48h\)\.OCLGenyields the most optimal solutions among all learning methods, solving 83\.8% \(528/630\) of Blocksworld, 61\.5% \(104/169\) of Logistics, 98\.8% \(988/1000\) of Labyrinth, and 77\.1% \(377/489\) of Sokoban with the*uniform*variant\. The*scan*variant performs comparably, marginally improving the Sokoban rate to 78\.5% \(384/489\)\. This represents a 2\.9×\\timesimprovement over MCTS with full rollouts on Blocksworld and a 1\.6×\\timesimprovement on Labyrinth\. Notably,OCLGen’s \(*uniform*\) average plan lengths on solved instances closely approach those of the FD\-optimal solver: within 1\.7% on Blocksworld \(30\.64 vs\. 30\.12\), 5\.2% on Logistics \(28\.73 vs\. 27\.31\), 0\.2% on Labyrinth \(12\.99 vs\. 12\.96\), and 3\.0% on Sokoban \(48\.45 vs\. 47\.03\)\. While OCL\-Anyt\. A\* achieves competitive optimality rates on Blocksworld \(336/630\) and Labyrinth \(975/1000\), it fails to solve the majority of Logistics and Sokoban instances within the time limit\. On this subset, FD\-LAMA\-anytime \(10min\) yields more optimal solutions in Blocksworld, Logistics, and Sokoban\. However, these instances favor lower\-complexity problems solvable by FD\-optimal within 48 hours\. Crucially, on the full test set—which spans higher\-complexity problems—OCLGenscales significantly better, producing shorter plans than FD\-LAMA\-anytime across all four domains \(Table[1](https://arxiv.org/html/2606.00618#S5.T1)\)\. Moreover, we later show that our method can distill its search results back into the model, surpassing FD\-LAMA\-anytime even on this subset after recursive self\-improvement \(Section[5\.4](https://arxiv.org/html/2606.00618#S5.SS4)\)\. Overall, these results demonstrate thatOCLGensignificantly increases the percentage of optimal solutions while maintaining high completion rates\.

![Refer to caption](https://arxiv.org/html/2606.00618v1/x2.png)\(a\)Blocksworld
![Refer to caption](https://arxiv.org/html/2606.00618v1/x3.png)\(b\)Logistics
![Refer to caption](https://arxiv.org/html/2606.00618v1/x4.png)\(c\)Labyrinth
![Refer to caption](https://arxiv.org/html/2606.00618v1/x5.png)\(d\)Sokoban

Figure 2:Plan length over time across all domains\.OCLGenrapidly converges to shorter plans compared to baseline methods\.![Refer to caption](https://arxiv.org/html/2606.00618v1/x6.png)\(a\)Blocksworld
![Refer to caption](https://arxiv.org/html/2606.00618v1/x7.png)\(b\)Logistics
![Refer to caption](https://arxiv.org/html/2606.00618v1/x8.png)\(c\)Labyrinth
![Refer to caption](https://arxiv.org/html/2606.00618v1/x9.png)\(d\)Sokoban

Figure 3:Completion rate over time across all domains\.
#### Plan Quality Convergence\.

Figure[2](https://arxiv.org/html/2606.00618#S5.F2)shows the average plan length and completion rates against runtime\. Note that average plan lengths initially increase as shorter problems tend to be solved first\. Across all domains,OCLGenachieves shorter plans significantly faster than baseline methods\. On Blocksworld, it reaches an average plan length of 50 steps within 30 seconds, a quality that MCTS and Best\-of\-N fail to match even after 5 minutes of runtime\. Similarly, in Labyrinth,OCLGenconverges to near\-optimal lengths \(∼\\sim13 steps\) in under 200 seconds, while MCTS requires the full 600\-second budget to plateau at a substantially higher length\. This pattern holds in Logistics and Sokoban, whereOCLGenconsistently achieves lower plan lengths after completion rates stabilized\. The steeper descent ofOCLGen’s curves indicates more effective use of compute: each additional second of runtime yields greater plan quality improvements compared to the baselines\. Importantly, as shown in Figure[3](https://arxiv.org/html/2606.00618#S5.F3), this improved plan quality does not come at the cost of completion rate\.OCLGenreaches near\-perfect completion rates within seconds on Blocksworld, Labyrinth, and Logistics, and remains competitive with MCTS on Sokoban\.

Table 3:Completion rate and plan length statistics ofOCLGen\(uniform\) and Best\-of\-N on unseen problems for which the training data generation with FD\-LAMA\-first \(tmax=20t\_\{\\text\{max\}\}\{=\}20min\) failed \(number of problems: 66 for Logistics and 250 for Sokoban\)\. Comp\.: Completion \(%\); Length: Mean plan length \(± std\. error\)\.

### 5\.2Generalization Beyond Training Data

Table[3](https://arxiv.org/html/2606.00618#S5.T3)evaluatesOCLGen\(uniform\) on problem instances where the training data planner \(FD\-LAMA with a 20\-minute budget\) failed to find a solution\.OCLGensolves 100% of these challenging Logistics instances and 93\.6% of Sokoban instances\. This indicates that the generative model learns transferable strategies rather than merely overfitting the training data\. Furthermore,OCLGenconsistently outperforms Best\-of\-N sampling on these difficult instances, improving completion rates by 1\.5 percentage points on Logistics and 2\.8 percentage points on Sokoban, while also producing shorter plans \(374\.85 vs\. 376\.09 on Logistics; 348\.94 vs\. 364\.27 on Sokoban\)\.

### 5\.3Ablations

To evaluate the contribution of each component inOCLGen, we systematically ablate three key design choices across all four domains: \(1\)depth partitioning, \(2\)adaptive expansion, and \(3\) thepercentile\-based cost\-to\-go estimator\. For each ablation, we remove or replace a single component while keeping all other settings fixed, usingOCLGen\(uniform\) as the base configuration\. Table[4](https://arxiv.org/html/2606.00618#S5.T4)reports completion rates and plan lengths across 1000 test instances per domain, again given a 10\-min runtime limit\. Each component provides incremental improvements that collectively drive the full method’s performance\.

Table 4:Ablation study across problem domains\. Comp\.: Completion rate \(%\); Length: Plan length \(mean±\\pmstd\. error\)\.#### Impact of Depth Partitioning\.

Table[4](https://arxiv.org/html/2606.00618#S5.T4)shows that removing depth\-based selection leads to notably longer plans, particularly on Blocksworld \(48\.81 vs\. 43\.88\) and Logistics \(159\.23 vs\. 155\.83\)\. Without depth partitioning, the search tends to over\-exploit deep branches, reducing the diversity of explored trajectories and ultimately yielding suboptimal solutions\.

#### Confidence\-based Adaptive Expansion\.

The adaptive expansion mechanism dynamically decides when to expand the search tree based on rollout confidence\. Removing this component yields comparable completion rates and overall longer plans \(e\.g\. 44\.35 vs\. 43\.88 on Blocksworld\)\. The adaptive threshold allowsOCLGento allocate more computational effort to uncertain regions of the search space while quickly committing to high\-confidence trajectories, leading to an improvement in plan quality\.

#### Influence of Heuristic Point Estimator\.

We compare our percentile\-based cost\-to\-go estimate against using the mode of the predicted distribution\. Table[4](https://arxiv.org/html/2606.00618#S5.T4)shows that replacing the percentile estimate with the mode results in marginally longer plans across all domains\. While the differences are modest, the percentile\-based estimator yields more optimistic cost\-to\-go estimates that counteract overestimation bias, improving plan quality on Blocksworld \(43\.88 vs\. 44\.68\) and Logistics \(155\.83 vs\. 156\.11\)\.

### 5\.4OCLGen for Model Self\-Improvement

We evaluate the use ofOCLGenfor recursive model self\-improvement\. Given a random subset of training problems, we run our search algorithm for a fixed runtime budget on each problem instance to compute improved plans, then finetune bothπθ\\pi\_\{\\theta\}andhϕh\_\{\\phi\}on this data\. This procedure is applied recursively, using updated model weights from the previous iteration to generate improved plans on newly sampled problems\. We compareOCLGen\(uniform\) and MCTS \(full rollouts\) fornloop=3n\_\{\\text\{loop\}\}\{=\}3iterations of self\-improvement in Blocksworld and Sokoban\. At each iteration, we sample 3000 problems and run the search method on each of these given a time limit of 3 minutes in Blocksworld and 5 minutes in Sokoban \(further details in Appendix[E](https://arxiv.org/html/2606.00618#A5)\)\.

Table 5:Self\-improvement plan statistics on the test dataset \(1000 samples\) using best\-of\-N sampling \(N=10N\{=\}10\)\. Comp\.: Completion rate \(%\); Length: Plan length \(mean±\\pmstd\. error\)\.Table[5](https://arxiv.org/html/2606.00618#S5.T5)reports the test set evaluation of the finetuned generative models after each self\-improvement iteration using Best\-of\-N sampling \(N=10N\{=\}10\)\. Both methods maintain near\-full completion on Blocksworld, butOCLGenconsistently produces shorter plans, providing a higher quality training signal\. On Sokoban, a similar pattern is observed, although completion rates drop for both methods, presumably because finetuning on improved plans reduces policy entropy, decreasing diversity within the Best\-of\-N batch\.

Table[6](https://arxiv.org/html/2606.00618#S5.T6)details the test set evaluation of our final models, which utilize test\-time search after three rounds of self\-improvement\. On Blocksworld,OCLGenyields significantly shorter plans \(40\.74 vs\. 44\.89\) and achieves 100% optimal solutions \(vs\. 71\.4% for MCTS\)\. On Sokoban, our method produces shorter plans \(123\.60 vs\. 125\.06\) and achieves 94\.7% optimal solutions \(vs\. 81\.8% for MCTS\), while recovering from the completion rate drop observed with the model alone to achieve 99\.8% completion\. Notably, on the problem subsets with known optimal solutions, self\-improvement withOCLGensurpasses the FD\-LAMA\-anytime solver \(601/630 on Blocksworld and 441/489 on Sokoban at a 20\-minute budget; Table[2](https://arxiv.org/html/2606.00618#S5.T2)\)\. These results demonstrate the suitability ofOCLGenfor efficient model self\-improvement\.

Table 6:Results on Blocksworld and Sokoban withOCLGenand MCTS \(tmax=t\_\{\\text\{max\}\}\{=\}10min\) afternloop=3n\_\{\\text\{loop\}\}=3iteration of model self\-improvement on the test sets from Table[1](https://arxiv.org/html/2606.00618#S5.T1)\(1000 samples\)\. Comp\.: Completion rate \(%\); Length: Plan length \(mean±\\pmstd\. error\)\. Optimal: number solved optimally\.
### 5\.5Accuracy of Heuristic Model

Table[7](https://arxiv.org/html/2606.00618#S5.T7)reports mean absolute error \(MAE\) statistics for Blocksworld and Sokoban before and after self\-improvement\. The MAE is computed on the subset of instances with known optimal solutions, using either the*mode*or*k\-th percentile*to derive scalar values from the predicted cost\-to\-go distributions \(k=3k=3andk=10k=10for Blocksworld and Sokoban, respectively\)\. As expected, percentile\-based estimates yield lower MAE than mode\-based estimates for all base models \(nloop=0n\_\{\\text\{loop\}\}=0\), reducing the gap to optimality\. This behavior holds consistently across all four domains \(see Table[15](https://arxiv.org/html/2606.00618#A6.T15)in the Appendix\)\.

Self\-improvement viaOCLGen\(nloop=3n\_\{\\text\{loop\}\}=3\) substantially reduces the model’s MAE – for example, from 4\.07 to 0\.66 on Blocksworld and from 3\.31 to 2\.51 on Sokoban \(percentile method\)\. This confirms thatOCLGennot only discovers better solutions across self\-improvement rounds, but also successfully refines the heuristic model\.

Table 7:MAE \(mean absolute error\)±\\pmstd\. error of heuristic models on the subset of known optimal solutions\. Values are reported for both mode\-based and percentile\-based heuristics\.

## 6Conclusion

We presentOCLGen, a test\-time search algorithm adapting classical OCL search to autoregressive generative planning models\. Key innovations include depth\-partitioned selection, partial rollouts with adaptive expansion, and learned distributional heuristics\. Across four domains,OCLGendelivers near\-optimal plans at high completion rates, outperforming neurosymbolic and classical baselines on the full test set\. Finally, on the known\-optimal subsets, recursive self\-improvement yields a 100% optimality rate on Blocksworld and 94\.7% on Sokoban\.

#### Future Work\.

Our depth selection strategies are currently uninformed, following fixed schedules\. Adaptive strategies that concentrate search effort on the most promising depths may yield further gains\. Characterizing when recursive self\-improvement provably converges to optimal solutions remains an open theoretical question\. Scaling to larger problems and generalizing to object counts or grid sizes outside the training distribution remain open challenges, as does transfer to new domains\. Finally, our method assumes suboptimal initial solutions for pretraining\. ExtendingOCLGento support efficient policy improvement without prior data represents a compelling future direction\.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here\.

## References

- T\. Anthony, Z\. Tian, and D\. Barber \(2017\)Thinking fast and slow with deep learning and tree search\.Advances in neural information processing systems30\.Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Antoniades, A\. Örwall, K\. Zhang, Y\. Xie, A\. Goyal, and W\. Y\. Wang \(2025\)SWE\-search: enhancing software agents with monte carlo tree search and iterative refinement\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=G7sIFXugTX)Cited by:[§1](https://arxiv.org/html/2606.00618#S1.p2.1)\.
- G\. Chen, M\. Liao, C\. Li, and K\. Fan \(2024\)AlphaMath almost zero: process supervision without process\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=VaXnxQ3UKo)Cited by:[§1](https://arxiv.org/html/2606.00618#S1.p2.1),[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px3.p1.1)\.
- R\. Coulom \(2007\)Efficient selectivity and backup operators in monte\-carlo tree search\.InComputers and Games,H\. J\. van den Herik, P\. Ciancarini, and H\. H\. L\. M\. \(\. Donkers \(Eds\.\),Berlin, Heidelberg,pp\. 72–83\.External Links:ISBN 978\-3\-540\-75538\-8Cited by:[Appendix B](https://arxiv.org/html/2606.00618#A2.SS0.SSS0.Px1.p1.10)\.
- J\. Culberson \(1998\)Sokoban is pspace complete\.\.Proceedings of the International Conference on Fun with Algorithms,pp\. 65–76\.Cited by:[4th item](https://arxiv.org/html/2606.00618#S5.I2.i4.p1.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)Bert: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 \(long and short papers\),pp\. 4171–4186\.Cited by:[§5\.1](https://arxiv.org/html/2606.00618#S5.SS1.SSS0.Px2.p1.5)\.
- R\. Eifler and D\. Fišer \(2023\)Labyrinth PDDL domain\.Github\.Note:[https://github\.com/ipc2023\-classical/domain\-labyrinth](https://github.com/ipc2023-classical/domain-labyrinth)Cited by:[Appendix A](https://arxiv.org/html/2606.00618#A1.SS0.SSS0.Px3.p1.2)\.
- M\. Fox and D\. Long \(2003\)PDDL2\.1: an extension to pddl for expressing temporal planning domains\.Journal of Artificial Intelligence Research20,pp\. 61–124\.External Links:ISSN 1076\-9757,[Link](http://dx.doi.org/10.1613/jair.1129),[Document](https://dx.doi.org/10.1613/jair.1129)Cited by:[§3\.1](https://arxiv.org/html/2606.00618#S3.SS1.p2.1)\.
- M\. Fritzsche, E\. Gestrin, and J\. Seipp \(2025\)Symmetry\-aware transformer training for automated planning\.arXiv preprint arXiv:2508\.07743\.Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Gieselmann, H\. von Huelsen, M\. Samson, M\. Meyer, D\. Piotrowski, O\. Radomskyi, J\. Okamoto, T\. Gojayev, M\. Painter, G\. Brown, F\. Pecora, and J\. L\. Wyatt \(2026\)Self\-improvement for fast, high\-quality plan generation\.External Links:2605\.03625,[Link](https://arxiv.org/abs/2605.03625)Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px3.p1.1)\.
- E\. Groshev, M\. Goldstein, A\. Tamar, S\. Srivastava, and P\. Abbeel \(2018\)Learning generalized reactive policies using deep neural networks\.InIntl\. Conf\. on Automated Planning and Scheduling,Vol\.28,pp\. 408–416\.Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px3.p1.1)\.
- X\. Guan, L\. L\. Zhang, Y\. Liu, N\. Shang, Y\. Sun, Y\. Zhu, F\. Yang, and M\. Yang \(2025\)RStar\-math: small llms can master math reasoning with self\-evolved deep thinking\.arXiv preprint arXiv:2501\.04519\.Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px3.p1.1)\.
- P\. E\. Hart, N\. J\. Nilsson, and B\. Raphael \(1968\)A formal basis for the heuristic determination of minimum cost paths\.IEEE Transactions on Systems Science and Cybernetics4\(2\),pp\. 100–107\.External Links:[Document](https://dx.doi.org/10.1109/TSSC.1968.300136)Cited by:[§1](https://arxiv.org/html/2606.00618#S1.p3.1),[§4\.1](https://arxiv.org/html/2606.00618#S4.SS1.p1.9)\.
- M\. Helmert \(2003\)Complexity results for standard benchmark domains in planning\.Artificial Intelligence143\(2\),pp\. 219–262\.External Links:ISSN 0004\-3702,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/S0004-3702%2802%2900364-8),[Link](https://www.sciencedirect.com/science/article/pii/S0004370202003648)Cited by:[Appendix A](https://arxiv.org/html/2606.00618#A1.p1.1),[§1](https://arxiv.org/html/2606.00618#S1.p3.1)\.
- R\. Howey, D\. Long, and M\. Fox \(2004\)VAL: automatic plan validation, continuous effects and mixed initiative planning using pddl\.In16th IEEE Intl\. Conf\. on Tools with Artificial Intelligence,pp\. 294–301\.Cited by:[§4\.5](https://arxiv.org/html/2606.00618#S4.SS5.p1.1)\.
- T\. Hubert, R\. Mehta, L\. Sartran, J\. Schmitt, P\. Buchlovsky, S\. Coward, G\. Fletcher, G\. Holland, E\. Huang, T\. Hubert,et al\.\(2025\)Olympiad\-level formal mathematical reasoning with reinforcement learning\.Nature\.External Links:[Document](https://dx.doi.org/10.1038/s41586-025-09833-y),[Link](https://doi.org/10.1038/s41586-025-09833-y)Cited by:[§1](https://arxiv.org/html/2606.00618#S1.p2.1)\.
- S\. Jabbari Arfaee, S\. Zilles, and R\. C\. Holte \(2011\)Learning heuristic functions for large state spaces\.Artificial Intelligence175\(16\),pp\. 2075–2098\.External Links:ISSN 0004\-3702,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.artint.2011.08.001),[Link](https://www.sciencedirect.com/science/article/pii/S0004370211000877)Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px3.p1.1)\.
- L\. Lehnert, S\. Sukhbaatar, D\. Su, Q\. Zheng, P\. Mcvay, M\. Rabbat, and Y\. Tian \(2024\)Beyond a\*: better planning with transformers via search dynamics bootstrapping\.arXiv preprint arXiv:2402\.14083\.Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px3.p1.1)\.
- D\. McDermott, M\. Ghallab, A\. Howe, C\. Knoblock, A\. Ram, M\. Veloso, D\. Weld, and D\. Wilkins \(1998\)PDDL\-the planning domain definition language\.Technical reportTechnical ReportTR\-98\-003/DCS TR\-1165,Yale Center for Computational Vision and Control\.Cited by:[§3\.1](https://arxiv.org/html/2606.00618#S3.SS1.p2.1)\.
- V\. Pallagani, B\. Muppasani, B\. Srivastava, F\. Rossi, L\. Horesh, K\. Murugesan, A\. Loreggia, F\. Fabiano, R\. Joseph, Y\. Kethepalli,et al\.\(2023\)Plansformer tool: demonstrating generation of symbolic plans using transformers\.\.InIJCAI,Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, I\. Sutskever,et al\.\(2019\)Language models are unsupervised multitask learners\.OpenAI blog1\(8\),pp\. 9\.Cited by:[§3\.2](https://arxiv.org/html/2606.00618#S3.SS2.p1.6),[§5\.1](https://arxiv.org/html/2606.00618#S5.SS1.SSS0.Px2.p1.5)\.
- S\. Richter and M\. Westphal \(2010\)The lama planner: guiding cost\-based anytime planning with landmarks\.Journal of Artificial Intelligence Research39,pp\. 127–177\.Cited by:[6th item](https://arxiv.org/html/2606.00618#S5.I3.i6.p1.1),[§5\.1](https://arxiv.org/html/2606.00618#S5.SS1.SSS0.Px2.p1.5)\.
- C\. D\. Rosin \(2011\)Multi\-armed bandits with episode context\.Annals of Mathematics and Artificial Intelligence61\(3\),pp\. 203–230\.External Links:ISSN 1012\-2443,[Link](https://doi.org/10.1007/s10472-011-9258-6),[Document](https://dx.doi.org/10.1007/s10472-011-9258-6)Cited by:[§1](https://arxiv.org/html/2606.00618#S1.p2.1)\.
- N\. Rossetti, M\. Tummolo, A\. E\. Gerevini, M\. Olivato, L\. Putelli, and I\. Serina \(2024a\)Enhancing gpt\-based planning policies by model\-based plan validation\.InIntl\. Conf\. on Neural\-Symbolic Learning and Reasoning,Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Rossetti, M\. Tummolo, A\. E\. Gerevini, L\. Putelli, I\. Serina, M\. Chiari, and M\. Olivato \(2024b\)Learning general policies for planning through gpt models\.InIntl\. Conf\. on Automated Planning and Scheduling,Vol\.34,pp\. 500–508\.Cited by:[§1](https://arxiv.org/html/2606.00618#S1.p1.1),[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2606.00618#S3.SS2.p1.6),[§4\.5](https://arxiv.org/html/2606.00618#S4.SS5.p1.1)\.
- A\. L\. Samuel \(1959\)Some studies in machine learning using the game of checkers\.IBM Journal of Research and Development3\(3\),pp\. 210–229\.External Links:[Document](https://dx.doi.org/10.1147/rd.33.0210)Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Seipp, Á\. Torralba, and J\. Hoffmann \(2022\)PDDL generators\.Zenodo\.Note:[https://doi\.org/10\.5281/zenodo\.6382173](https://doi.org/10.5281/zenodo.6382173)Cited by:[Appendix A](https://arxiv.org/html/2606.00618#A1.SS0.SSS0.Px1.p1.1),[Appendix A](https://arxiv.org/html/2606.00618#A1.SS0.SSS0.Px4.p1.2)\.
- D\. Silver, A\. Huang, C\. J\. Maddison, A\. Guez, L\. Sifre, G\. Van Den Driessche, J\. Schrittwieser, I\. Antonoglou, V\. Panneershelvam, M\. Lanctot,et al\.\(2016\)Mastering the game of go with deep neural networks and tree search\.Nature529\(7587\),pp\. 484–489\.Cited by:[Appendix B](https://arxiv.org/html/2606.00618#A2.SS0.SSS0.Px1.p1.10),[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px3.p1.1)\.
- D\. Silver, T\. Hubert, J\. Schrittwieser, I\. Antonoglou, M\. Lai, A\. Guez, M\. Lanctot, L\. Sifre, D\. Kumaran, T\. Graepel,et al\.\(2017\)Mastering chess and shogi by self\-play with a general reinforcement learning algorithm\.arXiv preprint arXiv:1712\.01815\.Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Slaney and S\. Thiébaux \(2001\)Blocks world revisited\.Artificial Intelligence125\(1\),pp\. 119–153\.External Links:ISSN 0004\-3702,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/S0004-3702%2800%2900079-5),[Link](https://www.sciencedirect.com/science/article/pii/S0004370200000795)Cited by:[§1](https://arxiv.org/html/2606.00618#S1.p1.1),[1st item](https://arxiv.org/html/2606.00618#S5.I2.i1.p1.1)\.
- N\. Stiennon, L\. Ouyang, J\. Wu, D\. Ziegler, R\. Lowe, C\. Voss, A\. Radford, D\. Amodei, and P\. F\. Christiano \(2020\)Learning to summarize with human feedback\.Advances in neural information processing systems33,pp\. 3008–3021\.Cited by:[§1](https://arxiv.org/html/2606.00618#S1.p2.1)\.
- A\. Taitler, R\. Alford, J\. Espasa, G\. Behnke, D\. Fišer, M\. Gimelfarb, F\. Pommerening, S\. Sanner, E\. Scala, D\. Schreiber, J\. Segovia‐Aguas, and J\. Seipp \(2024\)The 2023 international planning competition\.AI Mag\.45\(2\),pp\. 280–296\.External Links:ISSN 0738\-4602,[Link](https://doi.org/10.1002/aaai.12169),[Document](https://dx.doi.org/10.1002/aaai.12169)Cited by:[Appendix A](https://arxiv.org/html/2606.00618#A1.SS0.SSS0.Px3.p1.2)\.
- M\. Tummolo, N\. Rossetti, A\. E\. Gerevini, M\. Olivato, L\. Putelli, and I\. Serina \(2024\)Integrating classical planners with gpt\-based planning policies\.InIntl\. Conf\. of the Italian Association for Artificial Intelligence,pp\. 315–329\.Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Valenzano and F\. Xie \(2016\)On the completeness of best\-first search variants that use random exploration\.Proceedings of the AAAI Conference on Artificial Intelligence30\(1\)\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/10081),[Document](https://dx.doi.org/10.1609/aaai.v30i1.10081)Cited by:[§1](https://arxiv.org/html/2606.00618#S1.p3.1),[§4\.1](https://arxiv.org/html/2606.00618#S4.SS1.p1.9)\.
- M\. Vallati, L\. Chrpa, M\. Grzes, T\. L\. McCluskey, M\. Roberts, and S\. Sanner \(2015\)The 2014 international planning competition: progress and trends\.Artificial Intelligence Magazine \(AI Magazine\)36\(3\),pp\. 90–98\.Cited by:[§3\.1](https://arxiv.org/html/2606.00618#S3.SS1.p2.1)\.
- H\. Xin, Z\.Z\. Ren, J\. Song, Z\. Shao, W\. Zhao, H\. Wang, B\. Liu, L\. Zhang, X\. Lu, Q\. Du, W\. Gao, H\. Zhang, Q\. Zhu, D\. Yang, Z\. Gou, Z\.F\. Wu, F\. Luo, and C\. Ruan \(2025\)DeepSeek\-prover\-v1\.5: harnessing proof assistant feedback for reinforcement learning and monte\-carlo tree search\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=I4YAIwrsXa)Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\)Tree of thoughts: deliberate problem solving with large language models\.Advances in neural information processing systems36,pp\. 11809–11822\.Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Zhang, S\. Zhoubian, Y\. Yue, Y\. Dong, and J\. Tang \(2024\)ReST\-mcts\*: llm self\-training via process reward guided tree search\.arXiv preprint arXiv:2406\.03816\.Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Zhang, Z\. Chen, Y\. Shen, M\. Ding, J\. B\. Tenenbaum, and C\. Gan \(2023\)Planning with large language models for code generation\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Lr8cOOtYbfL)Cited by:[§2](https://arxiv.org/html/2606.00618#S2.SS0.SSS0.Px2.p1.1)\.

## Appendix ADatasets

We construct dedicated datasets for each of the four planning domains evaluated in this work\. Each dataset comprises 100k training instances, 1k validation instances for model selection, and 1k held\-out test instances\. All problem instances are unique and paired with reference solutions obtained using Fast Downward\(Helmert,[2003](https://arxiv.org/html/2606.00618#bib.bib2)\)in the LAMA\-first configuration with a 20\-minute timeout\. Instances that could not be solved within this time limit were excluded from the training datasets\.

#### Blocksworld

We generate Blocksworld instances containing 3 to 25 blocks using the generator from\(Seippet al\.,[2022](https://arxiv.org/html/2606.00618#bib.bib25)\)\. All generated instances were solvable by FD\-LAMA within the 20\-minute budget\. The distribution of instances follows a logarithmic scaling with respect to the number of blocks, as the limited number of unique configurations for small block counts precludes a uniform distribution\.

#### Logistics

Our Logistics dataset covers a broad range of problem sizes: 1–50 cities with 1–5 locations each, 1–50 packages, 1–10 airplanes, and one truck per city\. Instances are sampled uniformly over all valid combinations of these parameters\. Fast Downward solved 99\.97% of generated instances within the time limit; additional instances were generated to ensure the target dataset sizes\.

#### Labyrinth

Labyrinth was introduced in IPC 2023\(Taitleret al\.,[2024](https://arxiv.org/html/2606.00618#bib.bib27)\)\. We generate a plan dataset with3×33\\times 3and4×44\\times 4grids, which remain challenging while ensuring sufficient training coverage using FD\-LAMA\. Problem instances were generated using the official IPC 2023 generator\(Eifler and Fišer,[2023](https://arxiv.org/html/2606.00618#bib.bib26)\)\.

#### Sokoban

The Sokoban PDDL domain appeared in IPC 2008 and 2011\. We constrain our instances to grid sizes between5×55\\times 5and14×1414\\times 14, with 1–10 boxes and up to 10 walls\. We use the generator from\(Seippet al\.,[2022](https://arxiv.org/html/2606.00618#bib.bib25)\)\.

## Appendix BBaselines

#### Monte Carlo Tree Search\.

We implemented Monte Carlo Tree Search using a PUCT\-style selection policy similar toSilveret al\.\([2016](https://arxiv.org/html/2606.00618#bib.bib39)\)\. Specifically, we maintain separate Q\-values:QoptQ\_\{\\text\{opt\}\}for expected plan length \(normalized by plan lengths of neighboring actions\) following a transition, andQsatQ\_\{\\text\{sat\}\}for the expected probability of finding a solution\. The overall Q\-value is computed as a convex combination ofQoptQ\_\{\\text\{opt\}\}andQsatQ\_\{\\text\{sat\}\}using a mixing coefficientα\\alpha\. To compute the policy priorP\(s,a\)P\(s,a\), we determine action probabilities from token sequences by computing the geometric mean of token probabilities for individual action sequences, then normalize over the set of possible actions at a state\. The selection policy isa∗=arg⁡maxa∈𝒜⁡\[αQsat\(s,a\)\+\(1−α\)Qopt\(s,a\)\+cPUCTP\(s,a\)∑bN\(s,b\)1\+N\(s,a\)\]a^\{\*\}=\\arg\\max\_\{a\\in\\mathcal\{A\}\}\\left\[\\alpha\\,Q\_\{\\text\{sat\}\}\(s,a\)\+\(1\-\\alpha\)\\,Q\_\{\\text\{opt\}\}\(s,a\)\+c\_\{\\text\{PUCT\}\}\\,P\(s,a\)\\,\\frac\{\\sqrt\{\\sum\_\{b\}N\(s,b\)\}\}\{1\+N\(s,a\)\}\\right\]\. For all experiments, we setα=0\.1\\alpha=0\.1andcPUCT=1\.0c\_\{\\text\{PUCT\}\}=1\.0\. We also incorporate progressive widening\(Coulom,[2007](https://arxiv.org/html/2606.00618#bib.bib35)\), which significantly improved performance\. For MCTS with partial rollouts, we use a maximum token limit of 50\. We use the heuristic modelhϕh\_\{\\phi\}to estimate the cost\-to\-go of unsuccessful rollouts\.

#### Anytime A\*\.

We implement a version of Anytime A\* using our learned heuristichϕh\_\{\\phi\}to guide the search\. Similar to our method, we use the lower percentile to obtain heuristic point estimates from distributions\. At every step, we select the node with the lowestffvalue from the open list, wheref=g\+hf=g\+h\. The search does not stop once a goal is found but continues until the runtime limit is exhausted or no unvisited nodes remain in the open list\.

#### Greedy Best\-First Search\.

This method is similar to Anytime A\* but purely uses the learned heuristichϕh\_\{\\phi\}to guide the search \(i\.e\.,f=hf=h\)\. Again, the search continues after finding a goal until the runtime limit is exhausted or no unvisited nodes remain\.

#### Best\-of\-N\.

This method involves repeated sampling from the generative modelπθ\\pi\_\{\\theta\}\. For all experiments, we use a softmax temperature ofTsoftmax=1T\_\{\\text\{softmax\}\}=1and sample a batch of 10 plans\. The generated plans are then validated for syntactic and semantic correctness, and we report the shortest valid plan in the batch\. For the experiments, we terminate once the time limit oftmax=10t\_\{\\text\{max\}\}\{=\}10minutes is reached\. Table[8](https://arxiv.org/html/2606.00618#A2.T8)presents the number of candidate plans per domain generated by Best\-of\-N within this runtime budget\.

Table 8:Number of generated plans \(mean ± std\. dev\.\) with Best\-of\-N for the experiments in Table[1](https://arxiv.org/html/2606.00618#S5.T1)

## Appendix CTraining Details

We train domain\-specific policy and heuristic cost models on 4 compute instances, each equipped with 8 NVIDIA A100 GPUs \(40GB VRAM\)\. Table[9](https://arxiv.org/html/2606.00618#A3.T9)shows the transformer architecture and training hyperparameters for the generative model\. Table[10](https://arxiv.org/html/2606.00618#A3.T10)shows corresponding hyperparameters for the heuristic cost model\. We use the same model architectures across all domains\.

Table 9:Training hyperparameters for the generative model\. Unless otherwise indicated, values are identical across domains \(BW: Blocksworld, Log: Logistics, Lab: Labyrinth, Sok: Sokoban\)\.Table 10:Training hyperparameters for the heuristic model\. Unless otherwise indicated, values are identical across domains \(BW: Blocksworld, Log: Logistics, Lab: Labyrinth, Sok: Sokoban\)\.
## Appendix DTest\-time Inference Hyperparameters

For all methods that employ rollouts, we use softmax temperatureTsoftmax=1T\_\{\\text\{softmax\}\}=1and sample plans in batches of 10\. To further reduce the branching factor for highly complex problem instances, we use a simple heuristic based on the number of grounded actions in a problem instance\. Specifically, if the number of grounded actions exceeds a predefined thresholdτ\|𝒜\|\\tau\_\{\|\\mathcal\{A\}\|\}\(in practice, this occurs only for a small portion of problems in Logistics\), we perform expansions inOCLGenby sampling a batch of actions fromπθ\\pi\_\{\\theta\}with batch size 32; otherwise, we enumerate all successor states using the grounded operator models\.

Table 11:Test\-time hyperparameters for our method\. Unless otherwise indicated, values are identical across domains \(BW: Blocksworld, Log: Logistics, Lab: Labyrinth, Sok: Sokoban\)\.
## Appendix ESelf\-improvement Details

Our self\-improvement pipeline runs fornloop=3n\_\{\\text\{loop\}\}=3iterations on Blocksworld and Sokoban\. Every iteration samples 3000 problems per domain, executing the search method withtmax=3t\_\{\\text\{max\}\}=3minutes \(Blocksworld\) andtmax=5t\_\{\\text\{max\}\}=5minutes \(Sokoban\)\. We collect the successfully optimized plans to finetune both the generative model and the heuristic using standard cross\-entropy loss\. To maintain a consistent training set size, instances unsolved within the time limit default to their original suboptimal training plans\. The respective training configurations are detailed in Tables[12](https://arxiv.org/html/2606.00618#A5.T12)and[13](https://arxiv.org/html/2606.00618#A5.T13)\.

Table 12:Finetuning hyperparameters for the generative model\. Unless otherwise indicated, values are identical across domains\.Table 13:Finetuning hyperparameters for the heuristic model\. Unless otherwise indicated, values are identical across domains\.
## Appendix FAdditional Results

### F\.1Data Augmentation via Action Compilation

In Section[4\.5](https://arxiv.org/html/2606.00618#S4.SS5), we introduce a data augmentation method for training our generative planning model\. Given the plan sequences in𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}, we use the operator effects specified in the PDDL domain definition to compile intermediate states along each plan, generating new state\-plan pairs\. We apply this procedure when training both the generative model and the heuristic model\. During batch generation, we first sample a random offset within the plan length, then compute the corresponding intermediate state by sequentially applying operator effects up to this offset\. A new training sample is formed by pairing this intermediate state with the remaining plan suffix\.

The results in Table[14](https://arxiv.org/html/2606.00618#A6.T14)compare models trained with and without this state compilation augmentation\. The evaluation, performed on 1000 unseen problem instances per domain, uses Best\-of\-N sampling withN=10N=10for both model variants\. Outside of Blocksworld, where the two variants perform comparably, augmentation improves completion rate on Logistics, Labyrinth, and Sokoban, and shortens plans on Labyrinth and Sokoban\. Overall, exposing the model to intermediate states during training enhances its ability to generalize across planning instances\.

Table 14:Performance comparison across problem domains for a generative planning model with and without action compilation\. Comp\.: Completion \(%\); Length: Plan length \(mean±\\pmstd error\)\.
### F\.2Accuracy of the learned cost model

Figures[4](https://arxiv.org/html/2606.00618#A6.F4)–[7](https://arxiv.org/html/2606.00618#A6.F7)show the per\-domain distribution of absolute errors betweenhϕh\_\{\\phi\}’s predictions and the held\-out \(suboptimal\) test labels, characterizing how closely the heuristic fits the data\-generating planner\. Accuracy against the*true optimal*cost\-to\-go is reported separately in Table[15](https://arxiv.org/html/2606.00618#A6.T15)\.

![Refer to caption](https://arxiv.org/html/2606.00618v1/x10.png)Figure 4:Blocksworld \- Distribution of absolute errors of learned heuristic
cost model with respect to test labels \(suboptimal data\)\.![Refer to caption](https://arxiv.org/html/2606.00618v1/x11.png)Figure 5:Logistics \- Distribution of absolute errors of learned heuristic
cost model with respect to test labels \(suboptimal data\)\.![Refer to caption](https://arxiv.org/html/2606.00618v1/x12.png)Figure 6:Labyrinth \- Distribution of absolute errors of learned heuristic
cost model with respect to test labels \(suboptimal data\)\.![Refer to caption](https://arxiv.org/html/2606.00618v1/x13.png)Figure 7:Sokoban \- Distribution of absolute errors of learned heuristic
cost model with respect to test labels \(suboptimal data\)\.
### F\.3Illustrations of Learned Cost Distribution

![Refer to caption](https://arxiv.org/html/2606.00618v1/x14.png)

![Refer to caption](https://arxiv.org/html/2606.00618v1/x15.png)

![Refer to caption](https://arxiv.org/html/2606.00618v1/x16.png)

Figure 8:Blocksworld \- Examples of cost distribution generated by the learned heuristic model\.![Refer to caption](https://arxiv.org/html/2606.00618v1/x17.png)

![Refer to caption](https://arxiv.org/html/2606.00618v1/x18.png)

![Refer to caption](https://arxiv.org/html/2606.00618v1/x19.png)

Figure 9:Logistics \- Examples of cost distribution generated by the learned heuristic model\.![Refer to caption](https://arxiv.org/html/2606.00618v1/x20.png)

![Refer to caption](https://arxiv.org/html/2606.00618v1/x21.png)

![Refer to caption](https://arxiv.org/html/2606.00618v1/x22.png)

Figure 10:Labyrinth \- Examples of cost distribution generated by the learned heuristic model\.![Refer to caption](https://arxiv.org/html/2606.00618v1/x23.png)

![Refer to caption](https://arxiv.org/html/2606.00618v1/x24.png)

![Refer to caption](https://arxiv.org/html/2606.00618v1/x25.png)

Figure 11:Sokoban \- Examples of cost distribution generated by the learned heuristic model\.
### F\.4Heuristic model vs\. true optimal cost\-to\-go

We compare the predictions of our learned heuristic model against the true optimal cost\-to\-go for the subset of solved problems with known optimal solutions\. Table[15](https://arxiv.org/html/2606.00618#A6.T15)reports the mean absolute error \(MAE\) and mean error \(ME;prediction−ground truth\\text\{prediction\}\-\\text\{ground truth\}\) statistics across all domains\. We evaluate both the predicted heuristic value with the highest probability \(Mode\) and the value obtained through percentile estimation \(Perc\.\)\.

Table 15:MAE \(mean absolute error\) and ME \(mean error\) statistics \(±\\pmstd\. error\) with respect to the subset of known optimal data\. Shown for both mode and percentile\-based heuristic estimation across domains\.These results confirm the benefit of our percentile\-based approach, which consistently shifts the predictions of the base heuristic model closer to the true optimal costs\. Moreover, the self\-improvement loop \(nloop=3n\_\{\\text\{loop\}\}=3\) further reduces the estimation error of both configurations, substantially improving overall heuristic accuracy\.

### F\.5Model Generalization to Larger Number of Objects

Sec\.[5\.2](https://arxiv.org/html/2606.00618#S5.SS2)shows thatOCLGengenerates valid plans where FD\-LAMA failed, and outperforms Best\-of\-N sampling\. Generalization beyond trained object counts is limited by the generative model’s token vocabulary, which is a limitation shared by all baselines, not specific to our inference method\.

Yet, we ran an additional experiment: training on Blocksworld with 3–25 blocks while excluding problems with 11, 12, 18, or 19 blocks, then testing on those \(interpolate\) and testing on instances with 26 to 40 blocks \(extrapolate\)\. The model with Best\-of\-N solved 100% of interpolation instances but 0% of extrapolation \(unseen object names\)\. The results on extrapolation are not surprising since the inputs during testing contain unobserved object name tokens \(e\.g\. block31\)\. In order to improve generalization to unseen object names, we retrained the base model while augmenting the data by randomly shuffling object names during the batch generation\. With this new generative model, Best\-of\-N achieved 74\.0% completion on the extrapolation set within a 10min time limit per problem\. Test\-time search withOCLGenfurther improved the completion rate to 79\.9%\.

### F\.6Statistical Analysis

Table[1](https://arxiv.org/html/2606.00618#S5.T1)in Section[5](https://arxiv.org/html/2606.00618#S5)presents the mean plan length and standard errors of all methods across different domains\. Yet, in order to determine statistical significance, simply analyzing these statistics alone is insufficient since we test on the same 1000 instances per domain where problem difficulty dominates the variance\. For example, the standard deviation of plan length is close to 110 on the Logistics test set\. This shared structure means measurements are highly correlated across methods \(ρ\>0\.81\\rho\>0\.81, often\>0\.99\>0\.99\), so unpaired comparisons vastly inflate the standard error and mask consistent improvements\. For example, in Logistics the unpairedSESEof the difference is4\.944\.94, while the pairedSESEis just0\.110\.11— a45×45\\timesreduction\.

We therefore perform Wilcoxon signed\-rank tests on paired per\-problem differencesΔi=LOCLGen,i−Lbaseline,i\\Delta\_\{i\}=L\_\{\\text\{OCLGen\},i\}\-L\_\{\\text\{baseline\},i\}, comparing OCLGen against MCTS \(full rollouts\), the strongest baseline with comparable completion rates:

Table 16:Statistical comparison ofOCLGenvs\. MCTS across domains\.Table[17](https://arxiv.org/html/2606.00618#A6.T17)extends this comparison to a self\-improvement setting, where both methods undergo 3 iterations of self\-improvement using their respective internal search strategies \(OCLGen vs\. MCTS\)\.

Table 17:Statistical comparison ofOCLGenvs\. MCTS \(self\-improvement, 3 iterations\) across domains\.Across all domains and both experimental settings, the Wilcoxon signed\-rank test returns p\-values that are vanishingly small, ranging from5\.4×10−1365\.4\\times 10^\{\-136\}in Blocksworld to3\.2×10−483\.2\\times 10^\{\-48\}in Sokoban \(Table[16](https://arxiv.org/html/2606.00618#A6.T16)\)\. All results comfortably exceed any conventional significance threshold \(e\.g\.,α=0\.05\\alpha=0\.05or evenα=0\.001\\alpha=0\.001\)\. The self\-improvement setting \(Table[17](https://arxiv.org/html/2606.00618#A6.T17)\) confirms this pattern, with p\-values of4\.1×10−894\.1\\times 10^\{\-89\}and2\.6×10−562\.6\\times 10^\{\-56\}for Blocksworld and Sokoban respectively\.

The Cohen’sddvalues provide a far more informative picture of OCLGen’s advantage\. Using the conventional benchmarks \(\|d\|=0\.2\|d\|=0\.2small,\|d\|=0\.5\|d\|=0\.5medium,\|d\|=0\.8\|d\|=0\.8large\), the results reveal a consistent but domain\-dependent advantage for OCLGen\.

In Table[16](https://arxiv.org/html/2606.00618#A6.T16), Blocksworld shows the strongest effect \(d=−1\.08d=\-1\.08\), a large effect by any standard, accompanied by the largest absolute mean difference of−10\.68±0\.31\-10\.68\\pm 0\.31steps\. Logistics follows with a large effect \(d=−0\.81d=\-0\.81,Δ=−2\.92±0\.11\\Delta=\-2\.92\\pm 0\.11\), reflecting a similar advantage\. Labyrinth yields a medium\-to\-large effect \(d=−0\.60d=\-0\.60,Δ=−2\.82±0\.15\\Delta=\-2\.82\\pm 0\.15\), while Sokoban shows a smaller effect \(d=−0\.39d=\-0\.39,Δ=−2\.55±0\.21\\Delta=\-2\.55\\pm 0\.21\), falling in the small\-to\-medium range\.

In Table[17](https://arxiv.org/html/2606.00618#A6.T17), both methods undergo 3 iterations of refinement using their own internal strategy before being compared head\-to\-head\. OCLGen maintains a meaningful advantage over self\-improved MCTS in both tested domains\. Blocksworld yieldsd=−0\.69d=\-0\.69\(medium\-to\-large\) withΔ=−4\.15±0\.19\\Delta=\-4\.15\\pm 0\.19, and Sokoban yieldsd=−0\.51d=\-0\.51\(medium\) withΔ=−2\.08±0\.13\\Delta=\-2\.08\\pm 0\.13\.

Taken together, the results demonstrate that OCLGen consistently and substantially outperforms MCTS in plan length across all tested domains\. The statistical evidence is unambiguous, and the effect sizes range from practically meaningful \(d≈−0\.39d\\approx\-0\.39in Sokoban\) to large \(d≈−1\.08d\\approx\-1\.08in Blocksworld\)\. When both methods are given equal opportunity to self\-improve over 3 iterations, OCLGen retains its advantage with medium\-to\-large effects in both domains\.

## Appendix GPDDL Domain files

### G\.1Blocksworld

Listing 1:Blocksworld domain file\(define\(domainblocksworld\-4ops\)

\(:requirements:strips\)

\(:predicates\(clear?x\)

\(on\-table?x\)

\(arm\-empty\)

\(holding?x\)

\(on?x?y\)\)

\(:actionpickup

:parameters\(?ob\)

:precondition\(and\(clear?ob\)\(on\-table?ob\)\(arm\-empty\)\)

:effect\(and\(holding?ob\)\(not\(clear?ob\)\)\(not\(on\-table?ob\)\)

\(not\(arm\-empty\)\)\)\)

\(:actionputdown

:parameters\(?ob\)

:precondition\(holding?ob\)

:effect\(and\(clear?ob\)\(arm\-empty\)\(on\-table?ob\)

\(not\(holding?ob\)\)\)\)

\(:actionstack

:parameters\(?ob?underob\)

:precondition\(and\(clear?underob\)\(holding?ob\)\)

:effect\(and\(arm\-empty\)\(clear?ob\)\(on?ob?underob\)

\(not\(clear?underob\)\)\(not\(holding?ob\)\)\)\)

\(:actionunstack

:parameters\(?ob?underob\)

:precondition\(and\(on?ob?underob\)\(clear?ob\)\(arm\-empty\)\)

:effect\(and\(holding?ob\)\(clear?underob\)

\(not\(on?ob?underob\)\)\(not\(clear?ob\)\)\(not\(arm\-empty\)\)\)\)\)

### G\.2Logistics

Listing 2:Logistics domain file\(define\(domainlogistics\)

\(:requirements:strips:typing\)

\(:types

packagelocationvehicle\-object

truckairplane\-vehicle

cityairport\-location\)

\(:predicates

\(at?vehicle\-or\-package\-\(eithervehiclepackage\)?location\-location\)

\(in?package\-package?vehicle\-vehicle\)

\(in\-city?loc\-or\-truck\-\(eitherlocationtruck\)?citys\-city\)\)

\(:actionload\-truck

:parameters

\(?obj\-package

?truck\-truck

?loc\-location\)

:precondition

\(and\(at?truck?loc\)

\(at?obj?loc\)\)

:effect

\(and\(not\(at?obj?loc\)\)

\(in?obj?truck\)\)\)

\(:actionload\-airplane

:parameters

\(?obj\-package

?airplane\-airplane

?loc\-airport\)

:precondition

\(and

\(at?obj?loc\)

\(at?airplane?loc\)\)

:effect

\(and\(not\(at?obj?loc\)\)

\(in?obj?airplane\)\)\)

\(:actionunload\-truck

:parameters

\(?obj\-package

?truck\-truck

?loc\-location\)

:precondition

\(and\(at?truck?loc\)

\(in?obj?truck\)\)

:effect

\(and\(not\(in?obj?truck\)\)

\(at?obj?loc\)\)\)

\(:actionunload\-airplane

:parameters

\(?obj\-package

?airplane\-airplane

?loc\-airport\)

:precondition

\(and\(in?obj?airplane\)

\(at?airplane?loc\)\)

:effect

\(and

\(not\(in?obj?airplane\)\)

\(at?obj?loc\)\)\)

\(:actiondrive\-truck

:parameters

\(?truck\-truck

?loc\-from\-location

?loc\-to\-location

?city\-city\)

:precondition

\(and\(at?truck?loc\-from\)

\(in\-city?loc\-from?city\)

\(in\-city?loc\-to?city\)\)

:effect

\(and\(not\(at?truck?loc\-from\)\)

\(at?truck?loc\-to\)\)\)

\(:actionfly\-airplane

:parameters

\(?airplane\-airplane

?loc\-from\-airport

?loc\-to\-airport\)

:precondition

\(at?airplane?loc\-from\)

:effect

\(and\(not\(at?airplane?loc\-from\)\)

\(at?airplane?loc\-to\)\)\)

\)

### G\.3Labyrinth

Listing 3:Labyrinth domain file\(define\(domainlabyrinth\)

\(:requirements:adl:action\-costs:strips:typing:numeric\-fluents\)

\(:types

card\-object

direction\-object

directionV\-direction

directionH\-direction

gridpos\-object

\)

\(:constants

sn\-directionV

we\-directionH

\)

\(:predicates

\(next?p1\-gridpos?p2\-gridpos\)

\(max\-pos?p\-gridpos\)

\(min\-pos?p\-gridpos\)

\(blocked?c\-card?d\-direction\)

\(robot\-at?c\-card\)

\(card\-at?c\-card?x\-gridpos?y\-gridpos\)

\(left\)

\(cards\-moving\)

\(cards\-moving\-west\)

\(cards\-moving\-east\)

\(cards\-moving\-south\)

\(cards\-moving\-north\)

\(next\-moving\-card?c\-card\)

\(new\-headtail\-card?c\-card\)

\)

\(:functions

\(total\-cost\)\-number

\(move\-robot\-cost\)\-number

\(move\-card\)\-number

\)

\(:actionmove\-west

:parameters\(?cfrom\-card?xfrom\-gridpos?yfrom\-gridpos?dfrom\-directionH?cto\-card?xto\-gridpos?yto\-gridpos?dto\-directionH\)

:precondition

\(and

\(not\(cards\-moving\)\)

\(=?dfromw\)

\(robot\-at?cfrom\)

\(card\-at?cfrom?xfrom?yfrom\)

\(card\-at?cto?xto?yto\)

\(next?xfrom?xto\)

\(=?yfrom?yto\)

\(not\(=?dfrom?dto\)\)

\(not\(blocked?cfrom?dfrom\)\)

\(not\(blocked?cto?dto\)\)

\)

:effect

\(and

\(not\(robot\-at?cfrom\)\)

\(robot\-at?cto\)

\(increase\(total\-cost\)\(move\-robot\-cost\)\)

\)

\)

\(:actionmove\-east

:parameters\(?cfrom\-card?xfrom\-gridpos?yfrom\-gridpos?dfrom\-directionH?cto\-card?xto\-gridpos?yto\-gridpos?dto\-directionH\)

:precondition

\(and

\(not\(cards\-moving\)\)

\(=?dfrome\)

\(robot\-at?cfrom\)

\(card\-at?cfrom?xfrom?yfrom\)

\(card\-at?cto?xto?yto\)

\(next?xto?xfrom\)

\(=?yfrom?yto\)

\(not\(=?dfrom?dto\)\)

\(not\(blocked?cfrom?dfrom\)\)

\(not\(blocked?cto?dto\)\)

\)

:effect

\(and

\(not\(robot\-at?cfrom\)\)

\(robot\-at?cto\)

\(increase\(total\-cost\)\(move\-robot\-cost\)\)

\)

\)

\(:actionmove\-north

:parameters\(?cfrom\-card?xfrom\-gridpos?yfrom\-gridpos?dfrom\-directionV?cto\-card?xto\-gridpos?yto\-gridpos?dto\-directionV\)

:precondition

\(and

\(not\(cards\-moving\)\)

\(=?dfromn\)

\(robot\-at?cfrom\)

\(card\-at?cfrom?xfrom?yfrom\)

\(card\-at?cto?xto?yto\)

\(next?yfrom?yto\)

\(=?xfrom?xto\)

\(not\(=?dfrom?dto\)\)

\(not\(blocked?cfrom?dfrom\)\)

\(not\(blocked?cto?dto\)\)

\)

:effect

\(and

\(not\(robot\-at?cfrom\)\)

\(robot\-at?cto\)

\(increase\(total\-cost\)\(move\-robot\-cost\)\)

\)

\)

\(:actionmove\-south

:parameters\(?cfrom\-card?xfrom\-gridpos?yfrom\-gridpos?dfrom\-directionV?cto\-card?xto\-gridpos?yto\-gridpos?dto\-directionV\)

:precondition

\(and

\(not\(cards\-moving\)\)

\(=?dfroms\)

\(robot\-at?cfrom\)

\(card\-at?cfrom?xfrom?yfrom\)

\(card\-at?cto?xto?yto\)

\(next?yto?yfrom\)

\(=?xfrom?xto\)

\(not\(=?dfrom?dto\)\)

\(not\(blocked?cfrom?dfrom\)\)

\(not\(blocked?cto?dto\)\)

\)

:effect

\(and

\(not\(robot\-at?cfrom\)\)

\(robot\-at?cto\)

\(increase\(total\-cost\)\(move\-robot\-cost\)\)

\)

\)

\(:actionstart\-move\-card\-west

:parameters\(?cm\-card?x\-gridpos?y\-gridpos?cnext\-card?nextx\-gridpos\)

:precondition

\(and

\(not\(cards\-moving\)\)

\(not\(cards\-moving\-west\)\)

\(not\(robot\-at?cm\)\)

\(card\-at?cm?x?y\)

\(min\-pos?x\)

\(card\-at?cnext?nextx?y\)

\(next?nextx?x\)

\)

:effect

\(and

\(cards\-moving\)

\(cards\-moving\-west\)

\(not\(card\-at?cm?x?y\)\)

\(new\-headtail\-card?cm\)

\(next\-moving\-card?cnext\)

\(increase\(total\-cost\)\(move\-card\)\)

\)

\)

\(:actionmove\-card\-west

:parameters\(?cm\-card?x\-gridpos?y\-gridpos?cnext\-card?nextx\-gridpos?prevx\-gridpos\)

:precondition

\(and

\(cards\-moving\)

\(cards\-moving\-west\)

\(not\(robot\-at?cm\)\)

\(next\-moving\-card?cm\)

\(card\-at?cm?x?y\)

\(card\-at?cnext?nextx?y\)

\(next?x?prevx\)

\(next?nextx?x\)

\)

:effect

\(and

\(cards\-moving\)

\(cards\-moving\-west\)

\(not\(card\-at?cm?x?y\)\)

\(card\-at?cm?prevx?y\)

\(not\(next\-moving\-card?cm\)\)

\(next\-moving\-card?cnext\)

\(increase\(total\-cost\)\(move\-card\)\)

\)

\)

\(:actionstop\-move\-card\-west

:parameters\(?cm\-card?x\-gridpos?y\-gridpos?prevx\-gridpos?newtc\-card\)

:precondition

\(and

\(cards\-moving\)

\(cards\-moving\-west\)

\(not\(robot\-at?cm\)\)

\(next\-moving\-card?cm\)

\(card\-at?cm?x?y\)

\(next?x?prevx\)

\(max\-pos?x\)

\(new\-headtail\-card?newtc\)

\)

:effect

\(and

\(not\(cards\-moving\)\)

\(not\(cards\-moving\-west\)\)

\(not\(card\-at?cm?x?y\)\)

\(card\-at?cm?prevx?y\)

\(card\-at?newtc?x?y\)

\(not\(new\-headtail\-card?newtc\)\)

\(not\(next\-moving\-card?cm\)\)

\(increase\(total\-cost\)\(move\-card\)\)

\)

\)

\(:actionstart\-move\-card\-east

:parameters\(?cm\-card?x\-gridpos?y\-gridpos?cnext\-card?nextx\-gridpos\)

:precondition

\(and

\(not\(cards\-moving\)\)

\(not\(cards\-moving\-east\)\)

\(not\(robot\-at?cm\)\)

\(card\-at?cm?x?y\)

\(max\-pos?x\)

\(card\-at?cnext?nextx?y\)

\(next?x?nextx\)

\)

:effect

\(and

\(cards\-moving\)

\(cards\-moving\-east\)

\(not\(card\-at?cm?x?y\)\)

\(new\-headtail\-card?cm\)

\(next\-moving\-card?cnext\)

\(increase\(total\-cost\)\(move\-card\)\)

\)

\)

\(:actionmove\-card\-east

:parameters\(?cm\-card?x\-gridpos?y\-gridpos?cnext\-card?nextx\-gridpos?prevx\-gridpos\)

:precondition

\(and

\(cards\-moving\)

\(cards\-moving\-east\)

\(not\(robot\-at?cm\)\)

\(next\-moving\-card?cm\)

\(card\-at?cm?x?y\)

\(card\-at?cnext?nextx?y\)

\(next?prevx?x\)

\(next?x?nextx\)

\)

:effect

\(and

\(cards\-moving\)

\(cards\-moving\-east\)

\(not\(card\-at?cm?x?y\)\)

\(card\-at?cm?prevx?y\)

\(not\(next\-moving\-card?cm\)\)

\(next\-moving\-card?cnext\)

\(increase\(total\-cost\)\(move\-card\)\)

\)

\)

\(:actionstop\-move\-card\-east

:parameters\(?cm\-card?x\-gridpos?y\-gridpos?prevx\-gridpos?newtc\-card\)

:precondition

\(and

\(cards\-moving\)

\(cards\-moving\-east\)

\(not\(robot\-at?cm\)\)

\(next\-moving\-card?cm\)

\(card\-at?cm?x?y\)

\(next?prevx?x\)

\(min\-pos?x\)

\(new\-headtail\-card?newtc\)

\)

:effect

\(and

\(not\(cards\-moving\)\)

\(not\(cards\-moving\-east\)\)

\(not\(card\-at?cm?x?y\)\)

\(card\-at?cm?prevx?y\)

\(card\-at?newtc?x?y\)

\(not\(new\-headtail\-card?newtc\)\)

\(not\(next\-moving\-card?cm\)\)

\(increase\(total\-cost\)\(move\-card\)\)

\)

\)

\(:actionstart\-move\-card\-north

:parameters\(?cm\-card?x\-gridpos?y\-gridpos?cnext\-card?nexty\-gridpos\)

:precondition

\(and

\(not\(cards\-moving\)\)

\(not\(cards\-moving\-north\)\)

\(not\(robot\-at?cm\)\)

\(card\-at?cm?x?y\)

\(min\-pos?y\)

\(card\-at?cnext?x?nexty\)

\(next?nexty?y\)

\)

:effect

\(and

\(cards\-moving\)

\(cards\-moving\-north\)

\(not\(card\-at?cm?x?y\)\)

\(new\-headtail\-card?cm\)

\(next\-moving\-card?cnext\)

\(increase\(total\-cost\)\(move\-card\)\)

\)

\)

\(:actionmove\-card\-north

:parameters\(?cm\-card?x\-gridpos?y\-gridpos?cnext\-card?nexty\-gridpos?prevy\-gridpos\)

:precondition

\(and

\(cards\-moving\)

\(cards\-moving\-north\)

\(not\(robot\-at?cm\)\)

\(next\-moving\-card?cm\)

\(card\-at?cm?x?y\)

\(card\-at?cnext?x?nexty\)

\(next?y?prevy\)

\(next?nexty?y\)

\)

:effect

\(and

\(cards\-moving\)

\(cards\-moving\-north\)

\(not\(card\-at?cm?x?y\)\)

\(card\-at?cm?x?prevy\)

\(not\(next\-moving\-card?cm\)\)

\(next\-moving\-card?cnext\)

\(increase\(total\-cost\)\(move\-card\)\)

\)

\)

\(:actionstop\-move\-card\-north

:parameters\(?cm\-card?x\-gridpos?y\-gridpos?prevy\-gridpos?newtc\-card\)

:precondition

\(and

\(cards\-moving\)

\(cards\-moving\-north\)

\(not\(robot\-at?cm\)\)

\(next\-moving\-card?cm\)

\(card\-at?cm?x?y\)

\(next?y?prevy\)

\(max\-pos?y\)

\(new\-headtail\-card?newtc\)

\)

:effect

\(and

\(not\(cards\-moving\)\)

\(not\(cards\-moving\-north\)\)

\(not\(card\-at?cm?x?y\)\)

\(card\-at?cm?x?prevy\)

\(card\-at?newtc?x?y\)

\(not\(new\-headtail\-card?newtc\)\)

\(not\(next\-moving\-card?cm\)\)

\(increase\(total\-cost\)\(move\-card\)\)

\)

\)

\(:actionstart\-move\-card\-south

:parameters\(?cm\-card?x\-gridpos?y\-gridpos?cnext\-card?nexty\-gridpos\)

:precondition

\(and

\(not\(cards\-moving\)\)

\(not\(cards\-moving\-south\)\)

\(not\(robot\-at?cm\)\)

\(card\-at?cm?x?y\)

\(max\-pos?y\)

\(card\-at?cnext?x?nexty\)

\(next?y?nexty\)

\)

:effect

\(and

\(cards\-moving\)

\(cards\-moving\-south\)

\(not\(card\-at?cm?x?y\)\)

\(new\-headtail\-card?cm\)

\(next\-moving\-card?cnext\)

\(increase\(total\-cost\)\(move\-card\)\)

\)

\)

\(:actionmove\-card\-south

:parameters\(?cm\-card?x\-gridpos?y\-gridpos?cnext\-card?nexty\-gridpos?prevy\-gridpos\)

:precondition

\(and

\(cards\-moving\)

\(cards\-moving\-south\)

\(not\(robot\-at?cm\)\)

\(next\-moving\-card?cm\)

\(card\-at?cm?x?y\)

\(card\-at?cnext?x?nexty\)

\(next?prevy?y\)

\(next?y?nexty\)

\)

:effect

\(and

\(cards\-moving\)

\(cards\-moving\-south\)

\(not\(card\-at?cm?x?y\)\)

\(card\-at?cm?x?prevy\)

\(not\(next\-moving\-card?cm\)\)

\(next\-moving\-card?cnext\)

\(increase\(total\-cost\)\(move\-card\)\)

\)

\)

\(:actionstop\-move\-card\-south

:parameters\(?cm\-card?x\-gridpos?y\-gridpos?prevy\-gridpos?newtc\-card\)

:precondition

\(and

\(cards\-moving\)

\(cards\-moving\-south\)

\(not\(robot\-at?cm\)\)

\(next\-moving\-card?cm\)

\(card\-at?cm?x?y\)

\(next?prevy?y\)

\(min\-pos?y\)

\(new\-headtail\-card?newtc\)

\)

:effect

\(and

\(not\(cards\-moving\)\)

\(not\(cards\-moving\-south\)\)

\(not\(card\-at?cm?x?y\)\)

\(card\-at?cm?x?prevy\)

\(card\-at?newtc?x?y\)

\(not\(new\-headtail\-card?newtc\)\)

\(not\(next\-moving\-card?cm\)\)

\(increase\(total\-cost\)\(move\-card\)\)

\)

\)

\(:actionleave

:parameters\(?c\-card?prow\-gridpos?pcolumn\-gridpos\)

:precondition

\(and

\(not\(cards\-moving\)\)

\(robot\-at?c\)

\(card\-at?c?prow?pcolumn\)

\(max\-pos?prow\)

\(max\-pos?pcolumn\)

\(not\(blocked?cs\)\)

\)

:effect

\(and

\(increase\(total\-cost\)\(move\-card\)\)

\(left\)

\)

\)

\)

### G\.4Sokoban

Listing 4:Sokoban domain file\(define\(domaintyped\-sokoban\)

\(:requirements:typing\)

\(:typesLOCDIR\)

\(:predicates

\(at\-robot?l\-LOC\)

\(has\-box?l\-LOC\)

\(adjacent?l1\-LOC?l2\-LOC?d\-DIR\)

\(clear?l\-LOC\)

\)

\(:actionmove

:parameters\(?from\-LOC?to\-LOC?dir\-DIR\)

:precondition\(and\(clear?to\)\(at\-robot?from\)\(adjacent?from?to?dir\)\)

:effect\(and\(at\-robot?to\)\(not\(at\-robot?from\)\)\)

\)

\(:actionpush

:parameters\(?rloc\-LOC?bloc\-LOC?floc\-LOC?dir\-DIR\)

:precondition\(and\(at\-robot?rloc\)\(has\-box?bloc\)\(clear?floc\)

\(adjacent?rloc?bloc?dir\)\(adjacent?bloc?floc?dir\)\)

:effect\(and\(at\-robot?bloc\)\(has\-box?floc\)\(clear?bloc\)

\(not\(at\-robot?rloc\)\)\(not\(has\-box?bloc\)\)\(not\(clear?floc\)\)\)

\)
Efficient Test-time Inference for Generative Planning Models

Similar Articles

PaT: Planning-after-Trial for Efficient Test-Time Code Generation

A complementary study on PlanGPT: Evaluation with defined Performance Metrics and comparison with a planner

Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents

Property-Guided LLM Program Synthesis for Planning

Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents

Submit Feedback

Similar Articles

PaT: Planning-after-Trial for Efficient Test-Time Code Generation
A complementary study on PlanGPT: Evaluation with defined Performance Metrics and comparison with a planner
Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents
Property-Guided LLM Program Synthesis for Planning
Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents