Property-Guided LLM Program Synthesis for Planning

arXiv cs.AI Papers

Summary

This paper proposes property-guided LLM program synthesis, using counterexample-guided inductive synthesis (CEGIS) to provide concrete feedback when a candidate program fails a formal property, reducing the number of generations and evaluation costs. Applied to PDDL planning domains for synthesizing direct heuristic functions, the method outperforms prior approaches, generating seven times fewer programs and solving more tasks without search.

arXiv:2605.16142v1 Announce Type: new Abstract: LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers no guidance on why a program failed, the system must generate and evaluate many candidates hoping some succeed, increasing LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.
Original Article
View Cached Full Text

Cached at: 05/18/26, 06:35 AM

# Property-Guided LLM Program Synthesis for Planning
Source: [https://arxiv.org/html/2605.16142](https://arxiv.org/html/2605.16142)
André G\. Pereira Federal University of Rio Grande do Sul Brazil &Augusto B\. Corrêa University of Oxford United Kingdom &Jendrik Seipp Linköping University Sweden

###### Abstract

LLMs have shown impressive success in program synthesis, discovering programs that surpass previously known solutions\. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests\. Because such a score offers no guidance on*why*a program failed, the system must generate and evaluate many candidates per problem in the hope that some succeed, increasing both LLM inference and evaluation costs\. We study a different approach: property\-guided LLM program synthesis\. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property\. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed\. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs\. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize*direct*heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor\. A heuristic with this property leads hill\-climbing algorithm directly to a goal state\. A counterexample\-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property\. We evaluate our approach on ten planning domains with an out\-of\-distribution test set\. The synthesized heuristics are effectively*direct*on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates\. Whenever a problem admits a verifiable property, property\-guided LLM program synthesis can reduce synthesis and evaluation cost while improving program quality\.

## 1Introduction

Using large language models \(LLMs\) for*program synthesis*yields programs that surpass previously known solutions on problems ranging from competitive programming\[[36](https://arxiv.org/html/2605.16142#bib.bib36)\]to open mathematical questions\[[50](https://arxiv.org/html/2605.16142#bib.bib50)\]and algorithmic discovery\[[42](https://arxiv.org/html/2605.16142#bib.bib42),[39](https://arxiv.org/html/2605.16142#bib.bib39),[62](https://arxiv.org/html/2605.16142#bib.bib62),[16](https://arxiv.org/html/2605.16142#bib.bib16),[12](https://arxiv.org/html/2605.16142#bib.bib12)\]\. These approaches use simple numeric scores to rank candidate programs, such as the average value of the solution or the number of passed tests\. Because such a score offers no guidance on*why*a program failed, the system must generate and evaluate many candidates per problem in the hope that some succeed, increasing both LLM inference cost and evaluation cost\.

There is an idea from formal methods that replaces this fitness score with something more informative:*counterexample\-guided inductive synthesis*\(CEGIS\)\[[57](https://arxiv.org/html/2605.16142#bib.bib57),[35](https://arxiv.org/html/2605.16142#bib.bib35),[1](https://arxiv.org/html/2605.16142#bib.bib1)\]\. A verifier checks each candidate against a formal specification, and on failure it returns a concrete input on which the candidate is wrong; the synthesizer can then repair or discard the candidate using that witness instead of a single number\. CEGIS underpins much of classical program synthesis, yet it has seen little use with LLMs\. We use an LLM as the synthesizer in a CEGIS\-style loop with a property checker as the verifier\. On every failed candidate, we hand the LLM a concrete counterexample of where and how the program went wrong, which lets us stop evaluation as soon as the property is violated\. This reduces the number of candidates the LLM has to generate and steers it toward stronger programs\.

We evaluate this idea on*classical planning*\[[26](https://arxiv.org/html/2605.16142#bib.bib26)\], a problem known to be challenging for LLMs\[[64](https://arxiv.org/html/2605.16142#bib.bib64),[63](https://arxiv.org/html/2605.16142#bib.bib63),[59](https://arxiv.org/html/2605.16142#bib.bib59)\]\. Given a planning domain, we ask the LLM to synthesize a*heuristic function*as Python code, which is then automatically injected into a planner\. Ideally, such a heuristic would guide hill climbing from the initial state to a goal without ever getting stuck, solving tasks without any combinatorial search\. To capture this requirement formally, we target the*direct*property\[[19](https://arxiv.org/html/2605.16142#bib.bib19)\]: a heuristic is direct if every state reachable from the initial state via strictly improving transitions has a strictly improving successor\. This property is a natural fit for property\-guided synthesis, since any violation immediately yields a concrete counterexample\. Synthesizing a direct heuristic is challenging both in theory\[[32](https://arxiv.org/html/2605.16142#bib.bib32),[19](https://arxiv.org/html/2605.16142#bib.bib19)\]and in practice\[e\.g\.,[24](https://arxiv.org/html/2605.16142#bib.bib24)\]: it requires reasoning about every state on every strictly improving path, and a single state without an improving successor is enough to break the property\. In fact, synthesizing a direct heuristic for a given task is as hard as solving the task itself\.

Our counterexample\-driven repair loop takes a PDDL \(Planning Domain Definition Language\) domain together with a training set of PDDL tasks from this domain as input, and validates the direct property over the training set\. Validation terminates as soon as a counterexample is found\. In this case, we prompt the LLM with the failing task and the counterexample: the state where the property fails, its heuristic value, and the successor states with their heuristic values\. The LLM is asked to provide a new heuristic that is direct for this task\. This loop repeats until the heuristic passes the full training set\. Although validation is confined to the training tasks, we aim for the synthesized heuristic to generalize, remaining direct on the much harder, out\-of\-distribution test tasks\. Figure[1](https://arxiv.org/html/2605.16142#S1.F1)illustrates this process\. To our knowledge, our work is the first to use LLMs to synthesize heuristics via property\-guided repair\.

On the ten planning domains from the Learning Track of the International Planning Competition \(IPC\) 2023\[[60](https://arxiv.org/html/2605.16142#bib.bib60)\], evaluated on an out\-of\-distribution test set, our approach generates seven times fewer programs per domain on average compared to the best prior generation method\[[16](https://arxiv.org/html/2605.16142#bib.bib16)\], requires significantly less computation to evaluate candidates, and solves more tasks via hill climbing without any combinatorial search\. Our repair loop always finds a heuristic that is direct on every training task within the iteration budget, and the synthesized heuristics remain direct on virtually all out\-of\-distribution test tasks, confirming that the property generalizes beyond the training set\.

Initial PromptDomain &Training TaskLLMProgram SynthesisCandidate Heuristic\(Python Code\)ValidatorChecks DirectPropertyCounterexample PromptFailing State and SuccessorsVerifiedDirectHeuristicPassFail \(Property Violated\)RepairFigure 1:Counterexample\-driven repair loop for property\-guided heuristic function synthesis\. Given a planning domain and a task from the training set, the LLM proposes a candidate heuristic function in Python\. The validator evaluates this heuristic on the training set, verifying the*direct*property\. If the property is satisfied, the valid heuristic is output\. If the property is violated, evaluation stops early, and a concrete counterexample is returned\.
## 2Background

h=3h\{=\}3h=2h\{=\}2h=1h\{=\}1h=0h\{=\}0s0s\_\{0\}aabbccddgg\(a\)Descending heuristic: every alive state has at least one strictly improving successor, including stateaawhich will never be expanded by GBFS or HC\.h=3h\{=\}3h=2h\{=\}2h=1h\{=\}1h=0h\{=\}0s0s\_\{0\}aabbccddgg\(b\)Direct heuristic: the same state space with a different heuristic admits a strictly improving path \(s0→b→d→gs\_\{0\}\\\!\\to\\\!b\\\!\\to\\\!d\\\!\\to\\\!g\), but stateaais a plateau \(h​\(a\)=h​\(s0\)h\(a\)=h\(s\_\{0\}\)\)\.
Figure 2:Example descending and direct heuristics for the same state space\. Nodes are placed on horizontal heuristic layers; downward transitions strictly decrease the heuristic valuehh\. States0s\_\{0\}is the initial state andggis the goal state\. Both heuristics are dead\-end avoiding since all states are solvable\. The left heuristic is descending and thus also direct; the right heuristic is direct but not descending\.*Classical planning*asks for a sequence of deterministic actions, a*plan*, that transforms a given initial state into a state satisfying a goal in a fully\-observable, discrete environment with a single agent\. Planning tasks are commonly described in the Planning Domain Definition Language \(PDDL\)\[[40](https://arxiv.org/html/2605.16142#bib.bib40),[30](https://arxiv.org/html/2605.16142#bib.bib30)\], and our work directly uses this representation\.

A PDDL task is defined by a set of*objects*and*predicates*\. Together they form*ground atoms*, i\.e\., predicates instantiated with concrete objects, which describe the current state\.*Actions*have preconditions that must hold for the action to be applicable, and effects that add or remove ground atoms from the state\. The task also specifies an*initial state*s0s\_\{0\}and a*goal*\. PDDL separates the*domain*\(shared actions and predicates\) from the*task*\(specific objects, initial state, and goal\), typically stored in two separate files\.

We consider*satisficing planning*, where any plan is acceptable regardless of its length, and we assume all actions have unit cost\. Most satisficing planners are based on*greedy best\-first search*\(GBFS\)\[[47](https://arxiv.org/html/2605.16142#bib.bib47),[8](https://arxiv.org/html/2605.16142#bib.bib8),[33](https://arxiv.org/html/2605.16142#bib.bib33),[31](https://arxiv.org/html/2605.16142#bib.bib31),[49](https://arxiv.org/html/2605.16142#bib.bib49),[37](https://arxiv.org/html/2605.16142#bib.bib37)\], a state\-space search algorithm guided by a*heuristic function*hhthat maps each statessto a valueh​\(s\)∈ℝ∪\{∞\}h\(s\)\\in\\mathbb\{R\}\\cup\\\{\\infty\\\}estimating the cost to reach a goal fromss\[[46](https://arxiv.org/html/2605.16142#bib.bib46)\]\. Infinite values indicate that the goal is unreachable fromss\. GBFS starts froms0s\_\{0\}and maintains a priority queue of generated states ordered by their heuristic value\. At each step, it*expands*the most promising statess, i\.e\., the one with the smallest heuristic valueh​\(s\)h\(s\), generating its successorssucc​\(s\)\\textit\{succ\}\(s\)and adding them to the open list\. This process continues until a goal state is reached\. GBFS is complete: if a solution exists, it will eventually find one, regardless of the heuristic, as long as all generated states are eventually expanded

In contrast to the global nature of GBFS,*hill climbing*\(HC\) is a simple local\-search algorithm: starting from the initial state, it greedily moves to a successor with a strictly smaller heuristic value and terminates when it reaches a goal state\. Because HC keeps no open list, it is very memory efficient\. However, HC is incomplete: it can fail at*plateaus*\(where no successor strictly improves the heuristic\) or*local minima*\(where every successor has a larger heuristic value\), even when the goal is reachable\. Completeness of HC therefore depends on the heuristic\.

A*descending and dead\-end avoiding*heuristic\[[54](https://arxiv.org/html/2605.16142#bib.bib54)\]eliminates both failure modes by construction\. FollowingSeipp et al\. \[[54](https://arxiv.org/html/2605.16142#bib.bib54)\], we call a state*alive*if it is reachable, solvable, and not a goal\.

###### Definition 1\(Descending and Dead\-End Avoiding Heuristics\[[54](https://arxiv.org/html/2605.16142#bib.bib54),[32](https://arxiv.org/html/2605.16142#bib.bib32)\]\)\.

A heuristichhis*descending*for a planning taskΠ\\Piif every alive statesshas at least one successors′∈succ​\(s\)s^\{\\prime\}\\in\\textit\{succ\}\(s\)withh​\(s′\)<h​\(s\)h\(s^\{\\prime\}\)<h\(s\), and it is*dead\-end avoiding*if every such improving successor is solvable\.

###### Proposition 1\(Seipp et al\. \[[54](https://arxiv.org/html/2605.16142#bib.bib54)\]\)\.

Lethhbe a descending, dead\-end avoiding heuristic for a solvable planning taskΠ\\Pi\. Then hill climbing guided byhhreaches the goal froms0s\_\{0\}\.

The proof is straightforward: every reachable non\-goal state has an improving successor, so heuristic values strictly decrease at each step\. Strict decrease prevents revisiting a state, and since the reachable state space is finite, the search must reach a goal in finitely many steps\.

The DDA property is, however, unnecessarily strong\. It enforces that all alive states have improving successors,*even when these states would never be expanded*by a planner\. For example, the heuristichhin Figure[2\(a\)](https://arxiv.org/html/2605.16142#S2.F2.sf1)is DDA because all alive states have improving successors\. But this includesaa, which is never expanded by GBFS or HC usinghh, becauseh​\(b\)<h​\(a\)h\(b\)<h\(a\), andbbalready has a strictly improving path to the goal \(i\.e\.,hhstrictly decreases at every step\)\. To address this issue,Dold and Helmert \[[19](https://arxiv.org/html/2605.16142#bib.bib19)\]introduce a relaxation that only considers states reachable froms0s\_\{0\}by repeatedly taking strictly improving transitions\. They call this property*wet descending and dead\-end avoiding*\(WDDA\)\. For brevity, we refer to a heuristic with this property as a*direct*heuristic throughout the rest of the paper\.

###### Definition 2\(Direct Heuristic\[[19](https://arxiv.org/html/2605.16142#bib.bib19)\]\)\.

LetS↓S\_\{\\downarrow\}be the set of alive states reachable froms0s\_\{0\}via strictly improving transitions\. A heuristichhis*direct*if everys∈S↓s\\in S\_\{\\downarrow\}has at least one successors′∈succ​\(s\)s^\{\\prime\}\\in\\textit\{succ\}\(s\)withh​\(s′\)<h​\(s\)h\(s^\{\\prime\}\)<h\(s\), and every such improving successor is either a goal state or itself inS↓S\_\{\\downarrow\}\.

For instance, stateaain Figure[2\(b\)](https://arxiv.org/html/2605.16142#S2.F2.sf2)is alive but hash​\(a\)=h​\(s0\)h\(a\)=h\(s\_\{0\}\), soa∉S↓a\\notin S\_\{\\downarrow\}and the direct property places no requirement on it, whereas DDA would still demand an improving successor\. This seemingly minor relaxation can make the property substantially easier to satisfy and to synthesize\[[19](https://arxiv.org/html/2605.16142#bib.bib19)\]\. DDA implies directness, but not the other way round\. HC guided by a direct heuristic also reaches a goal state without ever getting stuck\.

Synthesizing a direct heuristic for a single PDDL task is also theoretically hard\. A direct heuristic exists for a taskΠ\\Piif and only ifΠ\\Piis solvable: any solvable task has a direct heuristic \(e\.g\., the perfect heuristic\), and any direct heuristic guides hill climbing to a goal\. Deciding whether a direct heuristic exists is therefore equivalent to deciding plan existence, which isEXPSPACE\-hard for PDDL planning \(a\.k\.a\., “lifted planning”\)\[[22](https://arxiv.org/html/2605.16142#bib.bib22),[14](https://arxiv.org/html/2605.16142#bib.bib14)\]\. We sidestep this per\-task cost by synthesizing a single*domain\-general*direct heuristic from a small training set, amortizing the synthesis effort over the potentially infinite distribution of tasks in the domain\.

## 3Property\-Guided LLM Heuristic Function Synthesis

We use an LLM to synthesize a domain\-dependent heuristic function, implemented as Python code, for a given PDDL domain\. Our approach uses a single counterexample\-driven repair loop: the LLM generates one candidate, we validate it on a set of training tasks, and if it fails we return a concrete counterexample to the LLM and ask for a new candidate\. This loop repeats until the heuristic passes the full training set or a generation budget is exhausted\. The following subsections discuss the property we target \(Section[3\.1](https://arxiv.org/html/2605.16142#S3.SS1)\), the prompts used in the repair loop \(Section[3\.2](https://arxiv.org/html/2605.16142#S3.SS2)\), and the validation procedure \(Section[3\.3](https://arxiv.org/html/2605.16142#S3.SS3)\)\.

### 3\.1Targeting Direct Heuristics

The DDA property is a strict global requirement: it must hold at every alive state, and a single violating state is enough to fail, even at states that hill climbing would never reach\. We therefore target heuristics with the*direct*property: every alive state reachable froms0s\_\{0\}by strictly improving transitions must have a strictly improving successor\. A direct heuristic still guarantees that hill climbing reaches a goal, but it focuses only on the subgraph that hill climbing can actually traverse\.

Crucially for our method, every direct\-property failure yields a concrete, local counterexample\. The first type of failure occurs at a statesson some strictly improving path froms0s\_\{0\}whose successorss′s^\{\\prime\}all satisfyh​\(s′\)≥h​\(s\)h\(s^\{\\prime\}\)\\geq h\(s\); the counterexample reportsss, its heuristic valueh​\(s\)h\(s\), and the list of successors together with their heuristic values\. In domains with dead ends, a second type of failure can occur when the DFS reaches a non\-goal statesswith no applicable action via a strictly improving transition from a parents′s^\{\\prime\}; the counterexample reportsss, its heuristic valueh​\(s\)h\(s\), and the parent’s heuristic valueh​\(s′\)h\(s^\{\\prime\}\), together with the actionable suggestion thatssshould be assigned a value≥h​\(s′\)\\geq h\(s^\{\\prime\}\)so that hill climbing never enters this dead end\. This information is directly actionable and gives the LLM precise feedback on how to improve the heuristic\. Although validation is confined to the training tasks, we aim for the synthesized heuristic to generalize and remain direct on the much harder, out\-of\-distribution test tasks\.

### 3\.2Counterexample\-Driven Repair Loop

The repair loop maintains a history of all previously generated heuristics and their failure feedback\. It starts with an*initial prompt*that requests the first candidate and uses a*repair prompt*after each validation failure\.

#### Initial Prompt\.

We use the same prompt asCorrêa et al\. \[[16](https://arxiv.org/html/2605.16142#bib.bib16)\]to make a direct comparison easier\. It provides the PDDL domain, the smallest and largest training PDDL tasks from this domain, heuristic Python code examples for two domains disjoint from the evaluated domains, relevant planner code, and a checklist\. Note that the prompt does not instruct the LLM to generate a direct heuristic\.

#### Repair Prompt\.

When a candidate fails validation on a training task, the repair prompt asks the LLM to propose a heuristic that is direct on this task as well, and provides the definition of the direct property\. It also includes:

1. 1\.the PDDL domain file;
2. 2\.the PDDL task on which validation failed;
3. 3\.the failure feedback: the failing state, its heuristic value, and the successor states with their heuristic values;
4. 4\.all previously generated heuristic functions, together with their failure feedback;
5. 5\.a checklist similar to the one in the initial prompt\.

Items 2 and 3 give the LLM the exact nature of the failure\. Item 4 lets the LLM reason about what has already been tried and avoid repeating the same mistakes \(see Appendix[D](https://arxiv.org/html/2605.16142#A4)for an example prompt\)\.

### 3\.3Validation

To check whether a candidate heuristichhsatisfies the direct property on a taskΠ\\Pi, we run a depth\-first search \(DFS\) froms0s\_\{0\}\. At each expanded non\-goal statess, we check whether it has at least one successors′s^\{\\prime\}withh​\(s′\)<h​\(s\)h\(s^\{\\prime\}\)<h\(s\)\. If not, the direct property fails; we recordss, its heuristic value, and all its successor states with their heuristic values as the counterexample\. Otherwise, the DFS pushes onto the stack only the successorss′s^\{\\prime\}withh​\(s′\)<h​\(s\)h\(s^\{\\prime\}\)<h\(s\)\.

We apply this check to each task in the training set\. As soon as a failure is found, we return the counterexample and stop, without checking the remaining tasks\. We consider the heuristic a success if all expanded states have a strictly improving successor across all training tasks\.

Even modestly sized planning tasks can admit exponentially many strictly improving paths to a goal, so validation cannot be expected to finish within a reasonable time on every task\. We therefore impose a per\-task time limit during validation\. If this limit is reached before a counterexample is found, we treat the heuristic as direct on the task and continue with the next task\.

## 4Experimental Results

#### Experimental Setting\.

We run all experiments using Downward Lab\[[55](https://arxiv.org/html/2605.16142#bib.bib55)\]on AMD EPYC 7742 processors running at 2\.25 GHz\. We use Pyperplan\[[3](https://arxiv.org/html/2605.16142#bib.bib3)\]with PyPy 7\.3\.9 as our planning framework for both training and testing\. We generate all heuristics with Gemini 3\.1 Pro\[[25](https://arxiv.org/html/2605.16142#bib.bib25)\], using default API parameters\. We use the ten domains together with their training and test tasks from the IPC 2023 Learning Track\[[60](https://arxiv.org/html/2605.16142#bib.bib60)\]; we summarise each domain in Appendix[A](https://arxiv.org/html/2605.16142#A1)\.111Two domains, Ferry and Satellite, use negative preconditions not supported by Pyperplan\. We convert them to equivalent domains via a transformation that eliminates negative preconditions\.The test set contains 90 tasks per domain \(900 in total\) and is out\-of\-distribution with respect to the training set: test tasks are generally much larger and may differ structurally from training tasks\. Figure[3](https://arxiv.org/html/2605.16142#S4.F3)shows the range of the main domain parameter on training and testing tasks\. During training, each validation run is limited to 30 seconds and 8 GiB per task, and the repair loop runs for at most 10 iterations, i\.e\., 10 LLM calls\. During testing, each run is limited to 5 minutes and 8 GiB\. For our methods, we report averages over three runs\. The source code, prompts, generated heuristics, and experimental data are included as supplementary material; we will additionally release them publicly upon acceptance\.

TrainingTesting

02002004004006006008008001,0001\{,\}000MiconicFloortileFerryChildsnackBlocks\.22954881104292120297423012980110148502002004004006006008008001,0001\{,\}000Transp\.SpannerSokobanSatelliteRovers1413011029914179110148717350Main domain parameter

Figure 3:Size of training and test tasks measured by the main domain parameter\. Each line spans the range of values for the corresponding set; maximum endpoints are annotated with their numeric values\. The parameter counts blocks \(Blocksworld\), children \(Childsnack\), cars \(Ferry\), tiles \(Floortile\), passengers \(Miconic\), rovers \(Rovers\), satellites \(Satellite\), boxes \(Sokoban\), spanners \(Spanner\), and vehicles \(Transport\)\. The test set is out\-of\-distribution: in most domains the testing range extends well beyond the training range, and test tasks may also differ structurally from training tasks\.
#### Scope\.

We evaluate whether property\-guided synthesis reduces inference and evaluation cost while improving heuristic quality; a single representative LLM suffices to answer this question\. For a comparison of different LLMs on heuristic generation for planning, seeCorrêa et al\. \[[16](https://arxiv.org/html/2605.16142#bib.bib16)\]\. We study the LLM as a*program synthesizer*: the LLM is called once per domain to generate a heuristic, which is then used to solve all tasks in that domain, including unseen test tasks\. In*end\-to\-end*planning, by contrast, the LLM is invoked separately for each training and test task, incurring much higher costs and losing all soundness guarantees on the resulting plans\. For end\-to\-end planning performance across LLM families, sizes, and reasoning models, seeCorrêa et al\. \[[16](https://arxiv.org/html/2605.16142#bib.bib16)\]; for frontier LLMs in this setting, seeCorrêa et al\. \[[17](https://arxiv.org/html/2605.16142#bib.bib17)\]\. For analyses regarding data contamination, obfuscated tasks and evaluation on domains absent from LLM training data, seeCorrêa et al\. \[[16](https://arxiv.org/html/2605.16142#bib.bib16)\]andChen et al\. \[[12](https://arxiv.org/html/2605.16142#bib.bib12)\]\.

#### Baselines\.

The*FF*heuristic\[[33](https://arxiv.org/html/2605.16142#bib.bib33)\]is one of the most commonly used heuristics for satisficing planning\[e\.g\.,[10](https://arxiv.org/html/2605.16142#bib.bib10),[15](https://arxiv.org/html/2605.16142#bib.bib15),[27](https://arxiv.org/html/2605.16142#bib.bib27)\]\. It is a delete\-relaxation heuristic computed in polynomial time by solving a relaxed planning problem that ignores*delete effects*; it is not designed to be*direct*\.Corrêa et al\. \[[16](https://arxiv.org/html/2605.16142#bib.bib16)\]introduce the current state\-of\-the\-art method for LLM\-based heuristic function generation for planning: it uses an LLM to independently generate a fixed budget of 25 candidate heuristics per domain and selects the one with the highest GBFS coverage on the training set and runs it with GBFS on the test set; it is also not designed to be direct\. We refer to this method as*sample\-and\-select*throughout \(abbreviated S&S in tables\)\. We re\-run this method with Gemini 3\.1 Pro \(the same model used by our approach\) to ensure a fair comparison\.

#### Inference and Evaluation Costs of Candidate Heuristics\.

Our repair loop is much more efficient than fixed\-budget sample\-and\-select regarding the number of candidates the LLM must produce and the time spent evaluating them\. Across the ten IPC domains and three independent runs per domain \(3030runs in total\), our method generates101101candidate heuristics, on average3\.43\.4per run with standard deviation3\.03\.0\. Eight of the ten domains converge \(i\.e\., generate a heuristic that is direct on the full training set\) within at most55candidates per run, and four converge within22candidates on average\. Sokoban and Floortile are the clear outliers: Sokoban averages99candidates per run with a maximum of1111\(initial attempt plus 10 repairs\), and Floortile averages88with a maximum of99\. By comparison, the sample\-and\-select baseline ofCorrêa et al\. \[[16](https://arxiv.org/html/2605.16142#bib.bib16)\]fixes a budget of2525candidates per domain regardless of difficulty, so the repair loop yields roughly a7\.4×7\.4\\timesreduction in inference cost per run\. Table[3](https://arxiv.org/html/2605.16142#A3.T3)in Appendix[C](https://arxiv.org/html/2605.16142#A3)reports the per\-domain breakdown\.

Evaluation cost shrinks even more dramatically\. In sample\-and\-select\[[16](https://arxiv.org/html/2605.16142#bib.bib16)\], each domain evaluates a fixed pool of candidates with GBFS, so evaluation cost scales with the number of candidates\. In our repair loop, we validate one candidate at a time and stop at the first counterexample, which makes failed candidates cheap to reject\. The average evaluation time per domain is10\.7510\.75minutes, whileCorrêa et al\. \[[16](https://arxiv.org/html/2605.16142#bib.bib16)\]report206\.25206\.25CPU\-hours per domain\. This corresponds to roughly a1\.15×1031\.15\\times 10^\{3\}reduction in evaluation cost\.

#### Direct Heuristics vs\. State of the Art\.

Table 1:Per\-domain coverage \(number of solved tasks\) on the 900 IPC 2023 test tasks\.FFis the classical delete\-relaxation heuristic;S&Sis our re\-run of the sample\-and\-select baseline\[[16](https://arxiv.org/html/2605.16142#bib.bib16)\]with Gemini 3\.1 Pro\. Both baselines are run using GBFS and HC\.Directis our proposed method \(direct heuristics run with HC\)\.Cov\.is the ablation that replaces the direct\-property check with numeric coverage feedback \(see “Ablation” paragraph\), run using GBFS and HC\. All proposed and ablation values are averages over three independent runs\. Bold marks the best value per domain \(ties are co\-bolded\)\. The full table with standard deviations is in Appendix[B](https://arxiv.org/html/2605.16142#A2)\(Table[2](https://arxiv.org/html/2605.16142#A2.T2)\)\.![Refer to caption](https://arxiv.org/html/2605.16142v1/x1.png)

\\phantomcaption

![Refer to caption](https://arxiv.org/html/2605.16142v1/x2.png)

\\phantomcaption

![Refer to caption](https://arxiv.org/html/2605.16142v1/x3.png)

\\phantomcaption

Figure 4:State expansions of direct \+ HC versus each baseline, on test tasks solved by both methods\. Each point is one task, colored by domain\. Points below the diagonal favor direct \+ HC\. Note that restricting to tasks solved by both methods, in general, favors the method that solves fewer tasks\.We compare the direct heuristics to the two baselines: the FF heuristic\[[33](https://arxiv.org/html/2605.16142#bib.bib33)\]and the sample\-and\-select method byCorrêa et al\. \[[16](https://arxiv.org/html/2605.16142#bib.bib16)\]\. Both baselines use GBFS, which may backtrack extensively before finding a solution\. Our method uses HC, which follows a single greedy path and reaches the goal without any search\. Table[1](https://arxiv.org/html/2605.16142#S4.T1)reports results for the baselines under both GBFS and HC\.

The main result is that our direct heuristics with HC solve623\.3623\.3tasks by going directly to the goal, while sample\-and\-select with GBFS solves573573tasks, and FF with GBFS solves only298298\. That is, we solve more tasks than the best prior LLM method, and we do so without any search\. Running the sample\-and\-select heuristics under HC solves only430430tasks, confirming that the gain comes from the property our repair loop trains for, not merely from the choice of search algorithm\. FF under HC collapses to3030solved tasks\. It is not a descending heuristic and offers no guarantee of progress at every state, so HC fails at the first plateau or local minimum it encounters\. Figures[4](https://arxiv.org/html/2605.16142#S4.F4)and[4](https://arxiv.org/html/2605.16142#S4.F4)make the efficiency gap concrete: on tasks solved by both methods, our heuristics with HC expand far fewer states than sample\-and\-select, sometimes by several orders of magnitude\. Together with the7\.4×7\.4\\timesreduction in inference cost and the1\.15×1031\.15\\times 10^\{3\}reduction in evaluation cost reported above, these results show that property\-guided synthesis is generally better than sample\-and\-select: it solves more tasks, reaches the goal with fewer expansions, and costs far less to synthesize\.

#### Generalization of the Direct Property\.

We validate the direct property only on small training tasks, and there is no guarantee that a heuristic that passes this check will remain direct on the much larger, out\-of\-distribution test tasks\. At test time, hill climbing fails as soon as it reaches a statesswith no successors′s^\{\\prime\}satisfyingh​\(s′\)<h​\(s\)h\(s^\{\\prime\}\)<h\(s\), which is exactly a violation of the direct property\. Across the three independent runs over all 900 test tasks \(2,7002\{,\}700evaluations in total\), only a single evaluation stops for this reason \(in Floortile\)\. The remaining evaluations either find a plan or hit the time limit\. The synthesized heuristics are therefore effectively direct on the out\-of\-distribution test set as well\. This is especially striking in Sokoban, aPSPACE\-complete domain\[[18](https://arxiv.org/html/2605.16142#bib.bib18)\], where no test evaluation across any of the three runs ever fails because of a property violation\. The heuristics we generate from small training tasks therefore generalize robustly to much harder test tasks\.

#### Ablation\.

To isolate the contribution of the direct property, we run the same repair loop but replace the property check with a numeric coverage score: the repair prompt reports which training task the heuristic failed to solve within the time limit and how many states were expanded before timing out, and asks for a new heuristic that solves more tasks with GBFS\. We refer to this ablation as*Cov\.*Under GBFS, Cov\. solves572\.7572\.7\(±15\.95\\pm 15\.95\) tasks, roughly matching sample\-and\-select \(573573\)\. Under HC, however, Cov\. degrades sharply to412\.3412\.3tasks, well below our direct heuristics with HC \(623\.3623\.3,±6\.7\\pm 6\.7\)\. Coverage feedback alone is therefore not sufficient to obtain heuristics that guide HC\. Figure[4](https://arxiv.org/html/2605.16142#S4.F4)reinforces this point: on tasks solved by both methods, direct heuristics consistently require fewer expansions\.

#### Qualitative Analysis\.

Inspecting what the repair loop changes in the heuristic reveals why the direct property is a useful synthesis target\. Two domains illustrate the outcomes\.

*Blocksworld\.*Blocksworld has a single arm that rearranges blocks into goal stacks by picking up, putting down, stacking, and unstacking one block at a time\. The initial heuristic counts 2 actions per misplaced block \(one to pick it up, one to put it down\), subtracts 1 if a block is currently held, and adds 2 for each deadlock it detects\. It uses two different deadlock conditions: for blocks on the table, the penalty activates only when the target block is itself misplaced; for a held block, the penalty activates whenever its target is not ready, meaning the target is either misplaced or covered by another block\. Consider the state whereb1b\_\{1\}is onb2b\_\{2\}andb3b\_\{3\}is onb4b\_\{4\}, with the goal of swapping them:b1b\_\{1\}should go onb4b\_\{4\}andb3b\_\{3\}onb2b\_\{2\}, whileb2b\_\{2\}andb4b\_\{4\}are already correctly on the table\. Bothb1b\_\{1\}andb3b\_\{3\}are misplaced, givingh=4h=4\. Sinceb2b\_\{2\}andb4b\_\{4\}are correctly on the table, the first condition does not activate and no penalty is added\. After unstackingb3b\_\{3\}, the arm holdsb3b\_\{3\}and needs to place it onb2b\_\{2\}, butb1b\_\{1\}still coversb2b\_\{2\}, sob2b\_\{2\}is not ready: the second condition activates and adds\+2\+2, givingh=2×2−1\+2=5h=2\\times 2\-1\+2=5\. The same happens when unstackingb1b\_\{1\}\. The inconsistency is that the arm\-empty state and the holding state use different conditions, so the mutual blocking betweenb1b\_\{1\}andb3b\_\{3\}is invisible in the arm\-empty state but penalized in every successor, creating a local minimum\. The final heuristic fixes this by building a dependency graph among misplaced blocks\. It adds an edge fromb3b\_\{3\}tob1b\_\{1\}\(becauseb3b\_\{3\}coversb4b\_\{4\}, the target ofb1b\_\{1\}\) and fromb1b\_\{1\}tob3b\_\{3\}\(becauseb1b\_\{1\}coversb2b\_\{2\}, the target ofb3b\_\{3\}\), detects the cycle, and adds\+2\+2in the arm\-empty state itself, givingh=6h=6\. The successors then giveh=5h=5, restoring descent\. The final heuristic is effectively direct on all test tasks\.

*Satellite\.*Satellite requires a satellite to switch on an instrument, turn to one of its calibration targets, calibrate it, and then point to the image direction before taking the image\. The initial heuristic estimates each missing image goal independently by adding costs for switching on an instrument, calibrating it, turning to the needed directions, and taking the image\. This gives an estimate of goal distance, but it ignores the ordering constraints created by calibration and by the single power slot on each satellite\. An action may temporarily move the satellite away from the image goal direction, and switching from one instrument to another can force recalibration\. The final heuristic fixes this by reasoning separately about each satellite’s remaining work as a short action sequence\. For each satellite, it groups the images that still need to be taken, considers which instrument should achieve them, and counts the actions still needed to finish them in order: switching instruments, calibrating, turning, and taking the images\. It no longer treats the remaining requirements as an unordered set of subgoals, but as a sequence whose order matters\. Because of this, each improving move removes a real remaining step in that sequence, which is why the final heuristic becomes direct\.

## 5Related Work

#### Counterexample\-Guided Inductive Synthesis\.

Counterexample\-guided inductive synthesis \(CEGIS\)\[[57](https://arxiv.org/html/2605.16142#bib.bib57),[35](https://arxiv.org/html/2605.16142#bib.bib35),[1](https://arxiv.org/html/2605.16142#bib.bib1)\]alternates between a synthesizer that proposes candidate programs and a verifier that checks them against a formal specification\. When a candidate fails, the verifier returns a concrete counterexample, and the synthesizer uses it to rule out or repair the candidate\. Our work and that ofOrvalho et al\. \[[43](https://arxiv.org/html/2605.16142#bib.bib43)\]are recent examples of CEGIS\-style counterexample feedback applied to LLM\-based program synthesis\.

#### LLM\-Based Program Synthesis\.

Most LLM\-based program synthesis evaluates candidates with numeric scores such as test\-pass rates or the value of the produced solutions, and uses these scores to rank or select among many candidates\[[36](https://arxiv.org/html/2605.16142#bib.bib36),[39](https://arxiv.org/html/2605.16142#bib.bib39),[62](https://arxiv.org/html/2605.16142#bib.bib62),[12](https://arxiv.org/html/2605.16142#bib.bib12),[50](https://arxiv.org/html/2605.16142#bib.bib50),[42](https://arxiv.org/html/2605.16142#bib.bib42)\]\. In general, these approaches can combine multiple scores and use specific prompt strategies to guide the LLM toward better candidates, but they do not provide concrete feedback on how to improve a candidate\.

#### LLM\-Based Heuristic Generation\.

Our closest precursor isCorrêa et al\. \[[16](https://arxiv.org/html/2605.16142#bib.bib16)\], who sample a fixed budget of heuristic candidates with an LLM and select the one that solves the most training tasks\. Other recent work that generates heuristics with LLMs includesTuisov et al\. \[[62](https://arxiv.org/html/2605.16142#bib.bib62)\], who focus on a different planning formalism, andChen et al\. \[[12](https://arxiv.org/html/2605.16142#bib.bib12)\], who focus on generalized policies\.

#### Descending Heuristics\.

The descending and dead\-end avoiding \(DDA\) property was coined bySeipp et al\. \[[54](https://arxiv.org/html/2605.16142#bib.bib54)\]\.Helmert et al\. \[[32](https://arxiv.org/html/2605.16142#bib.bib32)\]showed that verifying or synthesizing DDA potential heuristics\[[48](https://arxiv.org/html/2605.16142#bib.bib48)\]isPSPACE\-complete for planning tasks encoded in propositional logic, with structural relaxations dropping the complexity to lower levels of the polynomial hierarchy\.Dold and Helmert \[[19](https://arxiv.org/html/2605.16142#bib.bib19)\]introduce the direct \(orig\. “wet DDA”\) variant, which only requires descent along states reachable froms0s\_\{0\}by strictly improving transitions\. These properties are useful to estimate the hardness of planning tasks based on the simplest form of their DDA/direct heuristics\[[54](https://arxiv.org/html/2605.16142#bib.bib54),[19](https://arxiv.org/html/2605.16142#bib.bib19)\]\.

#### Learning Search Control\.

Francès et al\. \[[24](https://arxiv.org/html/2605.16142#bib.bib24)\]synthesize generalized potential heuristics as weighted sums over description\-logic features\[[5](https://arxiv.org/html/2605.16142#bib.bib5)\]and prove DDA on the covered domains; their approach is restricted to domains admitting a DDA heuristic representable in their specific feature class and is therefore not directly comparable to ours\. Several lines of work synthesize or learn search control mechanisms for planning domains\[[9](https://arxiv.org/html/2605.16142#bib.bib9),[20](https://arxiv.org/html/2605.16142#bib.bib20),[21](https://arxiv.org/html/2605.16142#bib.bib21)\]\. There is also a long history of approaches that learn heuristics for planning\[[53](https://arxiv.org/html/2605.16142#bib.bib53),[13](https://arxiv.org/html/2605.16142#bib.bib13),[52](https://arxiv.org/html/2605.16142#bib.bib52),[4](https://arxiv.org/html/2605.16142#bib.bib4)\]\. These include methods that learn for specific tasks\[[23](https://arxiv.org/html/2605.16142#bib.bib23),[45](https://arxiv.org/html/2605.16142#bib.bib45),[6](https://arxiv.org/html/2605.16142#bib.bib6)\]and methods that learn for entire domains\[[56](https://arxiv.org/html/2605.16142#bib.bib56),[58](https://arxiv.org/html/2605.16142#bib.bib58),[11](https://arxiv.org/html/2605.16142#bib.bib11),[29](https://arxiv.org/html/2605.16142#bib.bib29)\]\.

#### LLMs for Planning\.

In addition, there is a large body of recent work that uses LLMs for planning in several settings such as end\-to\-end plan generation\[[7](https://arxiv.org/html/2605.16142#bib.bib7),[59](https://arxiv.org/html/2605.16142#bib.bib59),[34](https://arxiv.org/html/2605.16142#bib.bib34)\], PDDL domain or task construction\[[61](https://arxiv.org/html/2605.16142#bib.bib61),[28](https://arxiv.org/html/2605.16142#bib.bib28),[44](https://arxiv.org/html/2605.16142#bib.bib44),[38](https://arxiv.org/html/2605.16142#bib.bib38)\], and generalized planning\[[51](https://arxiv.org/html/2605.16142#bib.bib51),[41](https://arxiv.org/html/2605.16142#bib.bib41)\]\.

## 6Limitations

Our method depends on the existence of a property that can be efficiently checked and turned into actionable feedback\. In planning, the direct property provides exactly this\. For problems where no such informative property is available, or where the training set is too small to expose meaningful counterexamples, the repair loop loses its main advantage and degenerates to score\-based ranking\.

A second limitation concerns the scope of verification\. We check the direct property only on a small training set under bounded resources, so passing this check does not certify that the heuristic is direct on unseen tasks\. Our experiments show that the property typically generalizes well to much larger out\-of\-distribution tasks, but counterexamples on test tasks remain possible in principle\.

A third limitation is that we synthesize unconstrained Python code\. This flexibility is useful, but it makes the resulting heuristics harder to interpret and verify than structured representations such as fixed feature sets or linear potentials\. A promising direction for future work is to combine our repair loop with strategies that automatically prove the synthesized heuristic to be direct on an entire domain, as has been done in restricted settings\[[2](https://arxiv.org/html/2605.16142#bib.bib2)\], possibly using theorem provers powered by LLMs\.

## 7Conclusions

In this paper, we studied property\-guided LLM program synthesis for planning and showed that a counterexample\-driven repair loop can synthesize direct heuristic functions efficiently\. Across the ten IPC 2023 Learning Track domains, and compared to the previous best heuristic generation method, our method requires about7\.4×7\.4\\timesfewer LLM calls for heuristic generation, reduces candidate evaluation cost by roughly1\.15×1031\.15\\times 10^\{3\}, and solves more out\-of\-distribution test tasks by hill climbing directly to the goal\. The direct property is effective both as a synthesis target and as a feedback mechanism: when violated, it provides concrete counterexamples that guide repair, and when satisfied on the training tasks, it often generalizes to much larger unseen tasks\. More broadly, our results suggest that whenever a synthesis problem admits a checkable property with actionable counterexamples, property\-guided LLM repair can be a practical alternative to score\-based candidate ranking\.

## Acknowledgments

This work was supported by the Wallenberg AI, Autonomous Systems and Software Program \(WASP\) funded by the Knut and Alice Wallenberg Foundation\. André G\. Pereira acknowledges support from FAPERGS with projects 21/2551\-0000741\-9 and 25/2551\-0002590\-7\. This study was financed in part by theCoordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil\(CAPES\) – Finance Code 001\.

## References

- Abate et al\. \[2018\]Alessandro Abate, Cristina David, Pascal Kesseli, Daniel Kroening, and Elizabeth Polgreen\.Counterexample guided inductive synthesis modulo theories\.In*International Conference Computer Aided Verification*, 2018\.
- Abdulaziz et al\. \[2022\]Mohammad Abdulaziz, Florian Pommerening, and Augusto B\. Corrêa\.Mechanically proving guarantees of generalized heuristics: First results and ongoing work\.In*ICAPS Workshop on Heuristics and Search for Domain\-independent Planning*, 2022\.
- Alkhazraji et al\. \[2020\]Yusra Alkhazraji, Matthias Frorath, Markus Grützner, Malte Helmert, Thomas Liebetraut, Robert Mattmüller, Manuela Ortlieb, Jendrik Seipp, Tobias Springenberg, Philip Stahl, and Jan Wülfing\.Pyperplan\.[https://doi\.org/10\.5281/zenodo\.3700819](https://doi.org/10.5281/zenodo.3700819), 2020\.
- Arfaee et al\. \[2011\]Shahab J\. Arfaee, Sandra Zilles, and Robert C\. Holte\.Learning heuristic functions for large state spaces\.*Artificial Intelligence*, 175:2075–2098, 2011\.
- Baader et al\. \[2017\]Franz Baader, Ian Horrocks, Carsten Lutz, and Uli Sattler, editors\.*An Introduction to Description Logic*\.Cambridge University Press, 2017\.
- Bettker et al\. \[2024\]Rafael V Bettker, Pedro P Minini, André G Pereira, and Marcus Ritt\.Understanding sample generation strategies for learning heuristic functions in classical planning\.*Journal of Artificial Intelligence Research*, 80:243–271, 2024\.
- Bohnet et al\. \[2024\]Bernd Bohnet, Azade Nova, Aaron T Parisi, Kevin Swersky, Katayoon Goshvadi, Hanjun Dai, Dale Schuurmans, Noah Fiedel, and Hanie Sedghi\.Exploring and benchmarking the planning capabilities of large language models\.arXiv:2406\.13094 \[cs\.CL\], 2024\.
- Bonet and Geffner \[2001\]Blai Bonet and Héctor Geffner\.Planning as heuristic search\.*Artificial Intelligence*, 129\(1\):5–33, 2001\.
- Bonet et al\. \[2019\]Blai Bonet, Guillem Francès, and Héctor Geffner\.Learning features and abstract actions for computing generalized plans\.In*Proc\. AAAI 2019*, pages 2703–2710, 2019\.
- Büchner et al\. \[2023\]Clemens Büchner, Remo Christen, Augusto B\. Corrêa, Salomé Eriksson, Patrick Ferber, Jendrik Seipp, and Silvan Sievers\.Fast Downward Stone Soup 2023\.In*IPC\-10 Planner Abstracts*, 2023\.
- Chen et al\. \[2024\]Dillon Z\. Chen, Sylvie Thiébaux, and Felipe Trevizan\.Learning domain\-independent heuristics for grounded and lifted planning\.In*Proc\. AAAI 2024*, pages 20078–20086, 2024\.
- Chen et al\. \[2025\]Dillon Z\. Chen, Johannes Zenn, Tristan Cinquin, and Sheila A\. McIlraith\.Language models for generalised pddl planning: Synthesising sound and programmatic policies\.arXiv:2508\.18507 \[cs\.CL\], 2025\.
- Christensen and Korf \[1986\]Jens Christensen and Richard E Korf\.A unified theory of heuristic evaluation functions and its application to learning\.In Tom Kehler and Stanley J\. Rosenschein, editors,*Proc\. AAAI 1986*, pages 148–152, 1986\.
- Corrêa and De Giacomo \[2024\]Augusto B\. Corrêa and Giuseppe De Giacomo\.Lifted planning: Recent advances in planning using first\-order representations\.In*Proc\. IJCAI 2024*, pages 8010–8019, 2024\.
- Corrêa et al\. \[2023\]Augusto B\. Corrêa, Guillem Francès, Markus Hecher, Davide Mario Longo, and Jendrik Seipp\.Scorpion Maidu: Width search in the Scorpion planning system\.In*IPC\-10 Planner Abstracts*, 2023\.
- Corrêa et al\. \[2025\]Augusto B\. Corrêa, André Grahl Pereira, and Jendrik Seipp\.Classical planning with LLM\-generated heuristics: Challenging the state of the art with Python code\.In*Proc\. NeurIPS 2025*, 2025\.
- Corrêa et al\. \[2025\]Augusto B\. Corrêa, André G\. Pereira, and Jendrik Seipp\.The 2025 planning performance of frontier large language models\.arXiv:2511\.09378 \[cs\.CL\], 2025\.
- Culberson \[1997\]Joseph C\. Culberson\.Sokoban is PSPACE\-complete\.Technical Report TR 97\-02, Department of Computing Science, The University of Alberta, Edmonton, Alberta, Canada, 1997\.
- Dold and Helmert \[2024\]Simon Dold and Malte Helmert\.Novelty vs\. potential heuristics: A comparison of hardness measures for satisficing planning\.In*Proc\. AAAI 2024*, pages 20692–20699, 2024\.
- Drexler et al\. \[2022\]Dominik Drexler, Jendrik Seipp, and Hector Geffner\.Learning sketches for decomposing planning problems into subproblems of bounded width\.In*Proc\. ICAPS 2022*, pages 62–70, 2022\.
- Drexler et al\. \[2024\]Dominik Drexler, Jendrik Seipp, and Hector Geffner\.Expressing and exploiting subgoal structure in classical planning using sketches\.*Journal of Artificial Intelligence Research*, 80:171–208, 2024\.
- Erol et al\. \[1995\]Kutluhan Erol, Dana S\. Nau, and V\. S\. Subrahmanian\.Complexity, decidability and undecidability results for domain\-independent planning\.*Artificial Intelligence*, 76\(1–2\):75–88, 1995\.
- Ferber et al\. \[2022\]Patrick Ferber, Florian Geißer, Felipe Trevizan, Malte Helmert, and Jörg Hoffmann\.Neural network heuristic functions for classical planning: Bootstrapping and comparison to other methods\.In*Proc\. ICAPS 2022*, pages 583–587, 2022\.
- Francès et al\. \[2019\]Guillem Francès, Augusto B\. Corrêa, Cedric Geissmann, and Florian Pommerening\.Generalized potential heuristics for classical planning\.In*Proc\. IJCAI 2019*, pages 5554–5561, 2019\.
- Gemini Team Google \[2026\]Gemini Team Google\.Gemini 3\.1 Pro model card\.Google DeepMind, 2026\.
- Ghallab et al\. \[2004\]Malik Ghallab, Dana Nau, and Paolo Traverso\.*Automated Planning: Theory and Practice*\.Morgan Kaufmann, 2004\.
- Gnad et al\. \[2023\]Daniel Gnad, Álvaro Torralba, and Alexander Shleyfman\.DecStar\-2023\.In*IPC\-10 Planner Abstracts*, 2023\.
- Guan et al\. \[2023\]Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati\.Leveraging pre\-trained large language models to construct and utilize world models for model\-based task planning\.In*Proc\. NeurIPS 2023*, pages 79081–79094, 2023\.
- Hao et al\. \[2024\]Mingyu Hao, Felipe Trevizan, Sylvie Thiébaux, Patrick Ferber, and Jörg Hoffmann\.Guiding GBFS through learned pairwise rankings\.In*Proc\. IJCAI 2024*, pages 6724–6732, 2024\.
- Haslum et al\. \[2019\]Patrik Haslum, Nir Lipovetzky, Daniele Magazzeni, and Christian Muise\.*An Introduction to the Planning Domain Definition Language*, volume 13 of*Synthesis Lectures on Artificial Intelligence and Machine Learning*\.Morgan & Claypool, 2019\.
- Helmert \[2006\]Malte Helmert\.The Fast Downward planning system\.*Journal of Artificial Intelligence Research*, 26:191–246, 2006\.
- Helmert et al\. \[2022\]Malte Helmert, Silvan Sievers, Alexander Rovner, and Augusto B\. Corrêa\.On the complexity of heuristic synthesis for satisficing classical planning: Potential heuristics and beyond\.In*Proc\. ICAPS 2022*, pages 124–133, 2022\.
- Hoffmann and Nebel \[2001\]Jörg Hoffmann and Bernhard Nebel\.The FF planning system: Fast plan generation through heuristic search\.*Journal of Artificial Intelligence Research*, 14:253–302, 2001\.
- Huang et al\. \[2024\]Sukai Huang, Trevor Cohn, and Nir Lipovetzky\.Chasing progress, not perfection: Revisiting strategies for end\-to\-end LLM plan generation\.arXiv:2412\.10675 \[cs\.CL\], 2024\.
- Jha et al\. \[2010\]Susmit Jha, Sumit Gulwani, Sanjit A\. Seshia, and Ashish Tiwari\.Oracle\-guided component\-based program synthesis\.In*ACM/IEEE International Conference on Software*, 2010\.
- Li et al\. \[2022\]Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po\-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J\. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals\.Competition\-level code generation with AlphaCode\.arXiv:2203\.07814 \[cs\.PL\], 2022\.
- Lipovetzky and Geffner \[2017\]Nir Lipovetzky and Hector Geffner\.Best\-first width search: Exploration and exploitation in classical planning\.In*Proc\. AAAI 2017*, pages 3590–3596, 2017\.
- Liu et al\. \[2023\]Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone\.LLM\+P: Empowering large language models with optimal planning proficiency\.arXiv:2304\.11477 \[cs\.CL\], 2023\.
- Liu et al\. \[2024\]Fei Liu, Xialiang Tong, Mingxuan Yuan, Xi Lin, Fu Luo, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang\.Evolution of heuristics: Towards efficient automatic algorithm design using large language model\.In*Proc\. ICML 2024*, 2024\.
- McDermott \[2000\]Drew McDermott\.The 1998 AI Planning Systems competition\.*AI Magazine*, 21\(2\):35–55, 2000\.
- Murray et al\. \[2026\]Andrew Murray, Danial Dervovic, Alberto Pozanco, and Michael Cashmore\.GenePlan: Evolving better generalized PDDL plans using large language models\.In*Proc\. ICAPS 2026*, 2026\.
- Novikov et al\. \[2025\]Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po\-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J\. R\. Ruiz, Abbas Mehrabian, M\. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog\.Alphaevolve: A coding agent for scientific and algorithmic discovery\.arXiv:2506\.13131 \[cs\.CL\], 2025\.
- Orvalho et al\. \[2025\]Pedro Orvalho, Mikolás Janota, and Vasco M\. Manquinho\.Counterexample guided program repair using zero\-shot learning and maxsat\-based fault localization\.In*AAAI Conference on Artificial Intelligence*, 2025\.
- Oswald et al\. \[2024\]James Oswald, Kavitha Srinivas, Harsha Kokel, Junkyu Lee, Michael Katz, and Shirin Sohrabi\.Large language models as planning domain generators\.In*Proc\. ICAPS 2024*, pages 423–431, 2024\.
- O’Toole et al\. \[2022\]Stefan O’Toole, Miquel Ramirez, Nir Lipovetzky, and Adrian R\. Pearce\.Sampling from pre\-images to learn heuristic functions for classical planning \(extended abstract\)\.In*Proc\. SoCS 2022*, pages 308–310, 2022\.
- Pearl \[1984\]Judea Pearl\.*Heuristics: Intelligent Search Strategies for Computer Problem Solving*\.Addison\-Wesley, 1984\.
- Pohl \[1969\]Ira Pohl\.First results on the effect of error in heuristic search\.In Bernard Meltzer and Donald Michie, editors,*Machine Intelligence 5*, pages 219–236\. Edinburgh University Press, 1969\.
- Pommerening et al\. \[2015\]Florian Pommerening, Malte Helmert, Gabriele Röger, and Jendrik Seipp\.From non\-negative to general operator cost partitioning\.In*Proc\. AAAI 2015*, pages 3335–3341, 2015\.
- Richter and Westphal \[2010\]Silvia Richter and Matthias Westphal\.The LAMA planner: Guiding cost\-based anytime planning with landmarks\.*Journal of Artificial Intelligence Research*, 39:127–177, 2010\.
- Romera\-Paredes et al\. \[2024\]Bernardino Romera\-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M\. Pawan Kumar, Emilien Dupont, Francisco J\. R\. Ruiz, Jordan S\. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi\.Mathematical discoveries from program search with large language models\.*Nature*, 625\(7995\):468–475, 2024\.
- Rossetti et al\. \[2024\]Nicholas Rossetti, Massimiliano Tummolo, Alfonso Emilio Gerevini, Luca Putelli, Ivan Serina, Mattia Chiari, and Matteo Olivato\.Learning general policies for planning through GPT models\.In*Proc\. ICAPS 2024*, pages 500–508, 2024\.
- Samadi et al\. \[2008\]Mehdi Samadi, Ariel Felner, and Jonathan Schaeffer\.Learning from multiple heuristics\.In*Proc\. AAAI 2008*, pages 357–362, 2008\.
- Samuel \[1959\]Arthur L Samuel\.Some studies in machine learning using the game of checkers\.*IBM Journal of research and development*, 1959\.
- Seipp et al\. \[2016\]Jendrik Seipp, Florian Pommerening, Gabriele Röger, and Malte Helmert\.Correlation complexity of classical planning domains\.In*Proc\. IJCAI 2016*, pages 3242–3250, 2016\.
- Seipp et al\. \[2017\]Jendrik Seipp, Florian Pommerening, Silvan Sievers, and Malte Helmert\.Downward Lab\.[https://doi\.org/10\.5281/zenodo\.790461](https://doi.org/10.5281/zenodo.790461), 2017\.
- Shen et al\. \[2020\]William Shen, Felipe Trevizan, and Sylvie Thiébaux\.Learning domain\-independent planning heuristics with hypergraph networks\.In*Proc\. ICAPS 2020*, pages 574–584, 2020\.
- Solar\-Lezama et al\. \[2005\]Armando Solar\-Lezama, Rodric M\. Rabbah, Rastislav Bodík, and Kemal Ebcioglu\.Programming by sketching for bit\-streaming programs\.In Vivek Sarkar and Mary W\. Hall, editors,*ACM SIGPLAN 2005 Conference on Programming anguage Design and Implementation*, 2005\.
- Sthlberg et al\. \[2022\]Simon Sthlberg, Blai Bonet, and Hector Geffner\.Learning general optimal policies with graph neural networks: Expressive power, transparency, and limits\.In*Proc\. ICAPS 2022*, pages 629–637, 2022\.
- Stechly et al\. \[2024\]Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati\.Chain of thoughtlessness? an analysis of CoT in planning\.In*Proc\. NeurIPS 2024*, pages 29106–29141, 2024\.
- Taitler et al\. \[2024\]Ayal Taitler, Ron Alford, Joan Espasa, Gregor Behnke, Daniel Fišer, Michael Gimelfarb, Florian Pommerening, Scott Sanner, Enrico Scala, Dominik Schreiber, Javier Segovia\-Aguas, and Jendrik Seipp\.The 2023 International Planning Competition\.*AI Magazine*, 45\(2\):280–296, 2024\.doi:10\.1002/aaai\.12169\.
- Tantakoun et al\. \[2025\]Marcus Tantakoun, Christian Muise, and Xiaodan Zhu\.LLMs as planning formalizers: A survey for leveraging large language models to construct automated planning models\.In*Findings of the Association for Computational Linguistics: ACL 2025*, 2025\.
- Tuisov et al\. \[2026\]Alexander Tuisov, Yonatan Vernik, and Alexander Shleyfman\.Successor\-generator planning with LLM\-generated heuristics\.In*Proc\. ICAPS 2026*, 2026\.
- Valmeekam et al\. \[2023a\]Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati\.PlanBench: An extensible benchmark for evaluating large language models on planning and reasoning about change\.In*Proc\. NeurIPS 2023*, pages 38975–38987, 2023a\.
- Valmeekam et al\. \[2023b\]Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati\.On the planning abilities of large language models \- A critical investigation\.In*Proc\. NeurIPS 2023*, pages 75993–76005, 2023b\.

## Appendix ABenchmark Domains

We evaluate on the ten domains of the Learning Track of the IPC 2023\[[60](https://arxiv.org/html/2605.16142#bib.bib60)\]\. We give a brief description of each domain below, based on the PDDL definitions distributed by the competition\.

#### Blocksworld\.

A single arm rearranges distinguishable blocks into a goal configuration of stacks\. The arm holds at most one block at a time and supports four actions: pick a clear block up from the table, put a held block down on the table, stack a held block onto another clear block, and unstack a clear block from the block below it\. Goals specify the desired stacking viaonandon\-tableatoms\.

#### Childsnack\.

An agent must prepare sandwiches in a kitchen and serve them to children seated at tables\. Each child is either gluten\-allergic or not; gluten\-free sandwiches require both gluten\-free bread and gluten\-free content, both of which are scarce\. Sandwiches are placed on trays that move between the kitchen and the tables\. The goal is to have every child served a sandwich compatible with their allergy\.

#### Ferry\.

A single ferry transports cars between locations on a sea\. The ferry carries at most one car at a time and alternates between sailing \(empty or loaded\), boarding, and disembarking\. The goal specifies the target location of each car\.

#### Floortile\.

Robots paint tiles on a rectangular grid in two colors\. Each robot can move onto an adjacent unpainted tile, change the color it currently holds, or paint the tile immediately above or below itself\. A robot cannot step onto a painted tile, so the planner must reason about painting order and about positioning robots so they do not block each other\.

#### Miconic\.

A single elevator picks up passengers at their origin floors and delivers them to their destination floors\. The four actions are board, depart, drive up, and drive down; passenger capacity is unbounded\. The goal is to have every passenger served at their destination\.

#### Rovers\.

A team of rovers performs scientific operations on a network of waypoints and communicates results to a lander\. Operations include sampling soil and rock at the current waypoint, calibrating an on\-board camera against a calibration target, taking images of objectives, and communicating data via line of sight to the lander\. Rovers differ in what they can traverse and which instruments they carry, so plans must allocate tasks across heterogeneous agents\.

#### Satellite\.

A fleet of satellites takes calibrated images of celestial directions in particular imaging modes\. Before an image can be acquired, the relevant instrument must be switched on, calibrated against a calibration target, and the satellite must be pointing at the target direction\. Each satellite has a single power slot, so instruments must be switched on one at a time\.

#### Sokoban\.

An agent on a grid pushes boxes onto designated goal cells\. The agent can move into a clear adjacent cell or, if a box sits in the adjacent cell, push it one cell further in the same direction, provided the destination cell is clear\. Boxes cannot be pulled, so a box pushed against a wall or into a corner in the wrong orientation creates an unrecoverable dead end\.

#### Spanner\.

A man walks along a directed chain of locations to a workshop, where he must tighten a number of loose nuts\. Spanners are scattered along the way; each spanner can be used to tighten exactly one nut and the chain cannot be retraversed\. The man must therefore pick up enough usable spanners before reaching the workshop\.

#### Transport\.

Trucks with bounded discrete capacity move packages along a road network\. Pick\-up and drop actions consume and free one unit of capacity, encoded as a chain of size objects related by a predecessor predicate\. The plan must coordinate truck routes and remaining capacity to deliver every package to its goal location\.

## Appendix BFull Coverage Results with Standard Deviations

Table[2](https://arxiv.org/html/2605.16142#A2.T2)reports the same per\-domain coverage numbers as Table[1](https://arxiv.org/html/2605.16142#S4.T1)in the main text, but additionally includes standard deviations across the three independent runs for theCov\.ablation \(GBFS and HC\) and theDirectproposed method \(HC\)\. The baseline columns \(FFandS&Sunder GBFS and HC\) are reproduced without standard deviations because they are deterministic\.

Table 2:Per\-domain coverage \(number of solved tasks\) on the 900 IPC 2023 test tasks\. This is the full version of Table[1](https://arxiv.org/html/2605.16142#S4.T1)in the main text, including standard deviations across three independent runs for the ablation and proposed method\.FFis the classical delete\-relaxation heuristic;S&Sis our re\-run of the sample\-and\-select baseline\[[16](https://arxiv.org/html/2605.16142#bib.bib16)\]with Gemini 3\.1 Pro\. Both baselines are run using GBFS and HC\.Directis our proposed method \(direct heuristics run with HC\)\.Cov\.is the ablation that replaces the direct\-property check with numeric coverage feedback, using both GBFS and HC\. Bold marks the best value per domain \(ties are co\-bolded\)\.
## Appendix CSynthesis Statistics

Table[3](https://arxiv.org/html/2605.16142#A3.T3)summarises how many candidate direct heuristics the repair loop generates before the property check passes on the full training set\. We report per\-domain statistics over the three independent runs: the minimum, mean \(with standard deviation\), and maximum number of candidates generated in a single run, and the total over the three runs\.

Table 3:Number of candidate direct heuristics generated by the repair loop per synthesis run, summarised across the three independent runs per domain\.nnis the number of runs, and the Min, Mean±\\pmSD, and Max columns are computed over the per\-run candidate counts\.
## Appendix DExample Repair Prompt

Below we show the repair prompt used in the repair loop, instantiated for the Blocksworld domain at iteration 3\. We edit the prompt to highlight its main structural components and match the nomenclature used in the paper\.

<role\>

Youareahighly\-skilledAIplanningresearcherandaproficientPythonprogrammercreatingadomain\-dependentheuristicfunctionforthePDDLdomainblocksworld\.

</role\>

<objective\>

Producearevisedheuristicwhereeverynon\-goalstatereachablebyDepth\-FirstSearch,whichonlyexpandsstrictlyimprovingsuccessors,hasatleastoneimprovingsuccessor,regardlessoftheorderinwhichoperatorsareapplied\.

</objective\>

<evaluation\-method\>

TheheuristicisvalidatedusingaDepth\-FirstSearch\(DFS\)thatenforcesthedirectpropertyateverystep:

\-Onlysuccessorswithh\(s’\)<h\(s\)areaddedtotheDFSstack\.

\-ApropertyviolationoccurswhenDFSexpandsanon\-goalstateandfindsnosuccessorwithastrictlylowerheuristicvalue\.

\.\.\.

</evaluation\-method\>

<success\-criteria\>

1\.AliveState:reachablefromtheinitialstate,solvable,andnotagoalstate\.

2\.ImprovingSuccessor:asuccessors’suchthath\(s’\)<h\(s\)\.

3\.Direct:everyalivestatereachablefromtheinitialstateviaasequenceofstrictlydecreasinghvalueshasatleastoneimprovingsuccessor\.AllstatesthatDFScanreachmustcontainanimprovingsuccessor\.

4\.Dead\-EndAvoiding:noimprovingsuccessorofanalivestatereachablebyDFSisitselfunsolvable\(h=infinity\)\.

</success\-criteria\>

<previous\-iterations\>

Iteration0:

\-PromptType:Init

\-Solved5instances\(Averageexpandednodes:2\.4\)

\-Directvalidatedon5tasks\(lastindex:4\)

\-FailureAnalysis:

FailureType:property

FailingTask:p06\.pddl

PropertyFailureKind:direct

State:\[’\(clearb1\)’,’\(clearb2\)’,’\(holdingb3\)’,’\(on\-tableb1\)’,’\(on\-tableb2\)’\]

HeuristicValue:3

Reason:Statehash=3butnosuccessorhash<3\.

Successors:

1\.action=\(stackb3b2\),h=6,added=\[’\(clearb3\)’,’\(onb3b2\)’,’\(arm\-empty\)’\],deleted=\[’\(holdingb3\)’,’\(clearb2\)’\]

2\.action=\(stackb3b1\),h=6,added=\[’\(clearb3\)’,’\(onb3b1\)’,’\(arm\-empty\)’\],deleted=\[’\(holdingb3\)’,’\(clearb1\)’\]

3\.action=\(putdownb3\),h=4,added=\[’\(clearb3\)’,’\(on\-tableb3\)’,’\(arm\-empty\)’\],deleted=\[’\(holdingb3\)’\]

<previous\-heuristic\-code\>

\.\.\.

</previous\-heuristic\-code\>

Iteration1:

\-PromptType:Property

\-MainIdea:\.\.\.

\-Solved16instances\(Averageexpandednodes:7\.0\)

\-Directvalidatedon16tasks\(lastindex:15\)

\-FailureAnalysis:

FailureType:property

FailingTask:p17\.pddl

PropertyFailureKind:direct

State:\[’\(arm\-empty\)’,’\(clearb1\)’,’\(clearb3\)’,’\(clearb4\)’,’\(onb2b5\)’,’\(onb3b2\)’,’\(on\-tableb1\)’,’\(on\-tableb4\)’,’\(on\-tableb5\)’\]

HeuristicValue:10

Reason:Statehash=10butnosuccessorhash<10\.

Successors:

1\.action=\(unstackb3b2\),h=11,added=\[’\(holdingb3\)’,’\(clearb2\)’\],deleted=\[’\(clearb3\)’,’\(onb3b2\)’,’\(arm\-empty\)’\]

2\.action=\(pickupb4\),h=11,added=\[’\(holdingb4\)’\],deleted=\[’\(on\-tableb4\)’,’\(clearb4\)’,’\(arm\-empty\)’\]

3\.action=\(pickupb1\),h=11,added=\[’\(holdingb1\)’\],deleted=\[’\(clearb1\)’,’\(on\-tableb1\)’,’\(arm\-empty\)’\]

<previous\-heuristic\-code\>

\.\.\.

</previous\-heuristic\-code\>

Iteration2:

\-PromptType:Property

\-MainIdea:\.\.\.

\-Solved17instances\(Averageexpandednodes:7\.8\)

\-Directvalidatedon17tasks\(lastindex:16\)

\-FailureAnalysis:

FailureType:property

FailingTask:p18\.pddl

PropertyFailureKind:direct

State:\[’\(arm\-empty\)’,’\(clearb1\)’,’\(clearb3\)’,’\(clearb5\)’,’\(onb2b4\)’,’\(onb5b2\)’,’\(on\-tableb1\)’,’\(on\-tableb3\)’,’\(on\-tableb4\)’\]

HeuristicValue:8

Reason:Statehash=8butnosuccessorhash<8\.

Successors:

1\.action=\(unstackb5b2\),h=9,added=\[’\(holdingb5\)’,’\(clearb2\)’\],deleted=\[’\(onb5b2\)’,’\(clearb5\)’,’\(arm\-empty\)’\]

2\.action=\(pickupb1\),h=9,added=\[’\(holdingb1\)’\],deleted=\[’\(clearb1\)’,’\(on\-tableb1\)’,’\(arm\-empty\)’\]

3\.action=\(pickupb3\),h=9,added=\[’\(holdingb3\)’\],deleted=\[’\(clearb3\)’,’\(on\-tableb3\)’,’\(arm\-empty\)’\]

<previous\-heuristic\-code\>

\.\.\.

</previous\-heuristic\-code\>

</previous\-iterations\>

<current\-status\>

DomainProgress:solving99instancesofincreasingdifficulty\.

ValidationTimeLimit:30secondspertask\.

CurrentResult:Theheuristicwasdirecton73instances,butfailedonp74\.pddl\.

</current\-status\>

<failure\-analysis\>

FailureType:property

PropertyFailureKind:direct

State:\[’\(arm\-empty\)’,’\(clearb10\)’,’\(clearb11\)’,’\(clearb15\)’,’\(clearb17\)’,’\(clearb2\)’,’\(clearb21\)’,’\(clearb22\)’,’\(clearb3\)’,’\(clearb4\)’,’\(clearb5\)’,’\(clearb6\)’,’\(clearb8\)’,’\(clearb9\)’,’\(onb1b14\)’,’\(onb12b18\)’,’\(onb14b13\)’,’\(onb15b12\)’,’\(onb18b7\)’,’\(onb21b1\)’,’\(onb6b20\)’,’\(onb7b19\)’,’\(onb9b16\)’,’\(on\-tableb10\)’,’\(on\-tableb11\)’,’\(on\-tableb13\)’,’\(on\-tableb16\)’,’\(on\-tableb17\)’,’\(on\-tableb19\)’,’\(on\-tableb2\)’,’\(on\-tableb20\)’,’\(on\-tableb22\)’,’\(on\-tableb3\)’,’\(on\-tableb4\)’,’\(on\-tableb5\)’,’\(on\-tableb8\)’\]

HeuristicValue:38

Reason:Statehash=38butnosuccessorhash<38\.

SuccessorAnalysis:

1\.action=\(unstackb15b12\),h=39,added=\[’\(holdingb15\)’,’\(clearb12\)’\],deleted=\[’\(clearb15\)’,’\(onb15b12\)’,’\(arm\-empty\)’\]

2\.action=\(unstackb9b16\),h=39,added=\[’\(holdingb9\)’,’\(clearb16\)’\],deleted=\[’\(onb9b16\)’,’\(clearb9\)’,’\(arm\-empty\)’\]

3\.action=\(pickupb11\),h=39,added=\[’\(holdingb11\)’\],deleted=\[’\(on\-tableb11\)’,’\(clearb11\)’,’\(arm\-empty\)’\]

4\.action=\(pickupb8\),h=39,added=\[’\(holdingb8\)’\],deleted=\[’\(arm\-empty\)’,’\(clearb8\)’,’\(on\-tableb8\)’\]

5\.action=\(pickupb3\),h=39,added=\[’\(holdingb3\)’\],deleted=\[’\(clearb3\)’,’\(on\-tableb3\)’,’\(arm\-empty\)’\]

6\.action=\(pickupb4\),h=39,added=\[’\(holdingb4\)’\],deleted=\[’\(on\-tableb4\)’,’\(clearb4\)’,’\(arm\-empty\)’\]

7\.action=\(pickupb17\),h=39,added=\[’\(holdingb17\)’\],deleted=\[’\(clearb17\)’,’\(on\-tableb17\)’,’\(arm\-empty\)’\]

8\.action=\(unstackb6b20\),h=39,added=\[’\(holdingb6\)’,’\(clearb20\)’\],deleted=\[’\(clearb6\)’,’\(onb6b20\)’,’\(arm\-empty\)’\]

9\.action=\(pickupb10\),h=39,added=\[’\(holdingb10\)’\],deleted=\[’\(on\-tableb10\)’,’\(clearb10\)’,’\(arm\-empty\)’\]

10\.action=\(pickupb22\),h=39,added=\[’\(holdingb22\)’\],deleted=\[’\(on\-tableb22\)’,’\(clearb22\)’,’\(arm\-empty\)’\]

11\.action=\(pickupb5\),h=39,added=\[’\(holdingb5\)’\],deleted=\[’\(on\-tableb5\)’,’\(clearb5\)’,’\(arm\-empty\)’\]

12\.action=\(unstackb21b1\),h=39,added=\[’\(clearb1\)’,’\(holdingb21\)’\],deleted=\[’\(clearb21\)’,’\(onb21b1\)’,’\(arm\-empty\)’\]

13\.action=\(pickupb2\),h=39,added=\[’\(holdingb2\)’\],deleted=\[’\(clearb2\)’,’\(on\-tableb2\)’,’\(arm\-empty\)’\]

</failure\-analysis\>

ThisisthecurrentPythoncodeoftheheuristicfunction:

<heuristic\-code\-file\>

\.\.\.

</heuristic\-code\-file\>

ThisisthePDDLdomainfileoftheblocksworlddomain:

<domain\-file\>

\.\.\.

</domain\-file\>

ThisisthefirstPDDLinstancefileoftheblocksworlddomainwheretheheuristicisfailingthedirectproperty:

<task\-file\>

\.\.\.

</task\-file\>

Hereisachecklisttohelpyouwithyourcode:

<checklist\>

1\)Allusedmodulesareimported\.

2\)Theinformationfromstaticfactsisextractedintosuitabledatastructuresintheconstructor\.

3\)Usethefailureanalysisanditerationhistorytopreservewhatworkedandfixwhatfailed\.

4\)ThenameoftheheuristicshouldbeBlocksworldHeuristic\.

</checklist\>

<final\-instruction\>

Returnanewheuristicfunctioncodethatsatisfiesthedirectpropertyintheinstancein<task\-file\>aswell\.

OutputyouranswerinthefollowingXMLformat:

<generated\-main\-idea\>

Ashortparagraphdescribingonlytheideaofthenewheuristicyouarereturning\.Donotincludeanyotherinformation\.

</generated\-main\-idea\>

<generated\-heuristic\-code\>

ProvidethePythoncodeofthedomain\-dependentheuristicfortheblocksworld\.

</generated\-heuristic\-code\>

</final\-instruction\>

Similar Articles

Hint-Guided Diversified Policy Optimization for LLM Reasoning

arXiv cs.CL

This paper introduces Hint-Guided Diversified Policy Optimization (HDPO), a two-stage RL framework that encourages LLMs to first generate multiple candidate solution outlines (hints) and then select the most reliable one for detailed reasoning, improving reasoning diversity and reliability.