Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning

arXiv cs.AI 05/18/26, 04:00 AM Papers
research planning reinforcement-learning symbolic-reasoning imitation-learning embodied-ai long-horizon
Summary
Proposes BISON, a system combining learned low-level neural policies with high-level symbolic planning for long-horizon embodied tasks, showing strong generalization and efficiency.
arXiv:2605.15975v1 Announce Type: new Abstract: We tackle the challenge of building embodied AI agents that can reliably solve long-horizon planning problems. Imitation learning from demonstrations has shown itself to be effective in training robots to solve a diversity of complex tasks requiring fine motor control and manipulation over low-level (LL), continuous environments. Yet, it remains a difficult endeavour to generate long-horizon plans from imitation learning alone. In contrast, high-level (HL), symbolic abstractions facilitate efficient and interpretable long-horizon planning. We propose to combine the strengths of LL imitation learning for manipulation and control, and HL symbolic abstractions for long-horizon planning. We realise this idea via \emph{bilevel policies} of the form $(\pi^{\mathrm{hl}}, \pi^{\mathrm{ll}})$, consisting of a neural policy $\pi^{\mathrm{ll}}$ learned from LL demonstrations, and an HL symbolic policy $\pi^{\mathrm{hl}}$ that is constructed from symbolic abstractions of the LL demonstrations combined with inductive generalisation. We implement these ideas in the BISON system. Experiments on extended MetaWorld benchmarks demonstrate that BISON generalises to long horizons and problems with greater numbers of objects than those solved by VLA and end-to-end methods, and is more time and memory efficient in training and inference. Notably, when ignoring LL execution, BISON's HL policies can solve HL problems with 10,000 relevant objects in under a minute. Project page: https://dillonzchen.github.io/bison
Original Article
View Cached Full Text
Cached at: 05/18/26, 06:35 AM
# Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning
Source: [https://arxiv.org/html/2605.15975](https://arxiv.org/html/2605.15975)
Dillon Z\. Chen1,2,3,4Till Hofmann5Toryn Q\. Klassen1,2Sheila A\. McIlraith1,2

1Vector Institute2University of Toronto 3LAAS\-CNRS4University of Toulouse5RWTH Aachen University

###### Abstract

We tackle the challenge of building embodied AI agents that can reliably solve long\-horizon planning problems\. Imitation learning from demonstrations has shown itself to be effective in training robots to solve a diversity of complex tasks requiring fine motor control and manipulation over low\-level \(LL\), continuous environments\. Yet, it remains a difficult endeavour to generate long\-horizon plans from imitation learning alone\. In contrast, high\-level \(HL\), symbolic abstractions facilitate efficient and interpretable long\-horizon planning\. We propose to combine the strengths of LL imitation learning for manipulation and control, and HL symbolic abstractions for long\-horizon planning\. We realise this idea via*bilevel policies*of the form\(πhl,πll\)\(\\pi^\{\\mathrm\{hl\}\},\\pi^\{\\mathrm\{ll\}\}\), consisting of a neural policyπll\\pi^\{\\mathrm\{ll\}\}learned from LL demonstrations, and an HL symbolic policyπhl\\pi^\{\\mathrm\{hl\}\}that is constructed from symbolic abstractions of the LL demonstrations combined with inductive generalisation\. We implement these ideas in the BISON system\. Experiments on extended MetaWorld benchmarks demonstrate that BISON generalises to long horizons and problems with greater numbers of objects than those solved by VLA and end\-to\-end methods, and is more time and memory efficient in training and inference\. Notably, when ignoring LL execution, BISON’s HL policies can solve HL problems with 10,000 relevant objects in under a minute\.

Project page:[https://dillonzchen\.github\.io/bison](https://dillonzchen.github.io/bison)

Execution•Abstraction Language:𝒟\\mathcal\{D\}•Labelling Function:ℒ\\mathcal\{L\}•LL Demonstrations with HL goals:𝕋\\mathbb\{T\}Bilevel PolicyEnvironment![Refer to caption](https://arxiv.org/html/2605.15975v1/figures/mujoco.png)State Abstr\.shl\\mathit\{s\}^\{\\mathrm\{hl\}\}Observation𝐬ll\+ghl\\mathbf\{s\}^\{\\mathrm\{ll\}\}\+\\mathit\{g\}^\{\\mathrm\{hl\}\}HL Policyπhl\\pi^\{\\mathrm\{hl\}\}LL Policyπll\\pi^\{\\mathrm\{ll\}\}Temporal Abstr\.ahl\\mathit\{a\}^\{\\mathrm\{hl\}\}Action𝐚ll\\mathbf\{a\}^\{\\mathrm\{ll\}\}ℒ\\mathcal\{L\}InputsLearningHL Demos\{⟨ghl,a0hl,\.\.\.,amhl⟩\}\\\{\\langle\\mathit\{g\}^\{\\mathrm\{hl\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{0\},\\mathinner\{\.\\mkern\-1\.0mu\.\\mkern\-1\.0mu\.\},\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{m\}\\rangle\\\}LL Demos w/ HL Goals\{⟨ghl,𝐬0ll,𝐚0ll,\.\.\.,𝐬mll,𝐚mll⟩\}\\\{\\\!\\langle\\mathit\{g\}^\{\\mathrm\{hl\}\},\\mathbf\{s\}^\{\\mathrm\{ll\}\}\_\{0\},\\mathbf\{a\}^\{\\mathrm\{ll\}\}\_\{0\},\\mathinner\{\.\\mkern\-1\.0mu\.\\mkern\-1\.0mu\.\},\\mathbf\{s\}^\{\\mathrm\{ll\}\}\_\{m\},\\mathbf\{a\}^\{\\mathrm\{ll\}\}\_\{m\}\\rangle\\\!\\\}HL Policyπhl\\pi^\{\\mathrm\{hl\}\}LL Policyπll\\pi^\{\\mathrm\{ll\}\}ℒ\\mathcal\{L\}Figure 1:Top Left– inputs for learning and executing bilevel policies: a domain theory𝒟\\mathcal\{D\}, a labelling functionℒ\\mathcal\{L\}that maps observations to state abstractions, and LL demos with HL goals\.Bottom Left– bilevel policy learning: LL demos induce HL demos viaℒ\\mathcal\{L\}, and LL/HL policies are separately learned from LL/HL demos\.Right– bilevel policy execution: state abstractionsshl\\mathit\{s\}^\{\\mathrm\{hl\}\}are computed from observations𝐬ll\\mathbf\{s\}^\{\\mathrm\{ll\}\}viaℒ\\mathcal\{L\}to propose HL actionsahl\\mathit\{a\}^\{\\mathrm\{hl\}\}which in turn help propose LL actions𝐚ll\\mathbf\{a\}^\{\\mathrm\{ll\}\}\.## 1Introduction

Long\-horizon planning remains a longstanding challenge to building embodied AI agents\. We present a framework that combines the strengths of symbolic\- and neural\-based learning and reasoning to produce goal\-conditioned policies that address long\-horizon planning and acting in continuous state and action spaces with varying numbers of objects\.

Scaling up reinforcement learning in simulation and imitation learning from demonstrations has produced robot policies capable of diverse motor control and manipulation tasks\(Ahnet al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib110); Hoffmannet al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib138); Weiet al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib137); Huanget al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib71); Driesset al\.,[2023](https://arxiv.org/html/2605.15975#bib.bib112); Huet al\.,[2023](https://arxiv.org/html/2605.15975#bib.bib155); Zitkovichet al\.,[2023](https://arxiv.org/html/2605.15975#bib.bib167); Kimet al\.,[2024](https://arxiv.org/html/2605.15975#bib.bib133); Intelligenceet al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib111); Team and DeepMind,[2025](https://arxiv.org/html/2605.15975#bib.bib158); Intelligenceet al\.,[2026](https://arxiv.org/html/2605.15975#bib.bib1); Zhanget al\.,[2026](https://arxiv.org/html/2605.15975#bib.bib173)\)\. However, scaling alone currently appears insufficient for efficient long\-horizon planning and acting\(Dziriet al\.,[2023](https://arxiv.org/html/2605.15975#bib.bib135); Kambhampatiet al\.,[2024](https://arxiv.org/html/2605.15975#bib.bib136); Linet al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib134); Parket al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib129)\)\. In contrast, symbolic planners \(e\.g\.,\(Helmert,[2006](https://arxiv.org/html/2605.15975#bib.bib115); Corrêaet al\.,[2020](https://arxiv.org/html/2605.15975#bib.bib182); Scalaet al\.,[2020](https://arxiv.org/html/2605.15975#bib.bib180); Seippet al\.,[2020](https://arxiv.org/html/2605.15975#bib.bib183); Bonassiet al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib179); Specket al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib181)\)\) over symbolic world models have proven to be highly effective long\-horizon planners, but do not easily extend to settings with varying sensory modalities and hard\-to\-model low\-level dynamics\. More recently, a suite of bilevel planning systems has emerged that combine symbolic and neural methods \(e\.g\.,\(Srivastavaet al\.,[2014](https://arxiv.org/html/2605.15975#bib.bib22); Konidariset al\.,[2018](https://arxiv.org/html/2605.15975#bib.bib24); Garrettet al\.,[2021](https://arxiv.org/html/2605.15975#bib.bib21); Shenet al\.,[2026](https://arxiv.org/html/2605.15975#bib.bib40)\)\)\. Unfortunately, they reflect a number of the inherent shortcomings of symbolic planning as they can struggle with open\-world planning and partial observability\.

In this work, as depicted in[Figure˜1](https://arxiv.org/html/2605.15975#S0.F1), we present a novel approach to long\-horizon planning, combining the benefits of low\-level \(LL\) imitation learning with the affordances of high\-level \(HL\), symbolic abstraction, reasoning, and learning\. Rather than generating plans from a symbolic domain theory, our approach extracts and inductively generalises symbolic policies from abstractions of our LL demonstrations, and learns to realise those symbolic policies at the LL by means of behaviour cloning\.

We realise our approach via the creation of*bilevel policies*of the form\(πhl,πll\)\(\\pi^\{\\mathrm\{hl\}\},\\pi^\{\\mathrm\{ll\}\}\), consisting of a neural policyπll\\pi^\{\\mathrm\{ll\}\}, learned from multiple LL demonstrations, and an HL symbolic policyπhl\\pi^\{\\mathrm\{hl\}\}that is constructed from a symbolic abstraction of the LL demonstrations\. We apply goal regression\(Waldinger,[1977](https://arxiv.org/html/2605.15975#bib.bib3); Lozano\-Perezet al\.,[1984](https://arxiv.org/html/2605.15975#bib.bib4); Reiter,[1991](https://arxiv.org/html/2605.15975#bib.bib5); Xuet al\.,[2019a](https://arxiv.org/html/2605.15975#bib.bib94)\)—a pre\-image rewriting technique—to the abstracted symbolic demonstrations to extract a set of*Condition*↦\\mapsto*Action*rules\. We inductively generalise the resulting rules to produce compact and expressive HL policies consisting of first\-order, condition\-action rules that can operate in open\-world environments\(Reiter,[2001](https://arxiv.org/html/2605.15975#bib.bib6); Sanner and Boutilier,[2009](https://arxiv.org/html/2605.15975#bib.bib93); Liuet al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib95)\)\. Our LL policiesπll\\pi^\{\\mathrm\{ll\}\}are represented by a graph neural network\(Scarselliet al\.,[2009](https://arxiv.org/html/2605.15975#bib.bib141); Kipf and Welling,[2017](https://arxiv.org/html/2605.15975#bib.bib139); Gilmeret al\.,[2017](https://arxiv.org/html/2605.15975#bib.bib140)\)conditioned on actions returned byπhl\\pi^\{\\mathrm\{hl\}\}\.

We implement our approach in the BISON system\. We compare BISON to 8 baselines spanning VLA, end\-to\-end, and symbolic planning methods on 8 environments extending the MetaWorld benchmarks\(Yuet al\.,[2019](https://arxiv.org/html/2605.15975#bib.bib34); McLeanet al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib35)\)\. Experiments show that BISON solves tasks with longer horizons and more objects than those solved by end\-to\-end and VLA baselines, and handles more complex environment dynamics than symbolic planning baselines\. Our contributions include:

- •We introduce BISON, an approach that learns*bilevel policies*\(πhl,πll\)\(\\pi^\{\\mathrm\{hl\}\},\\pi^\{\\mathrm\{ll\}\}\)from LL demonstrations paired with HL goals, given a domain theory𝒟\\mathcal\{D\}and labelling functionℒ\\mathcal\{L\}\.
- •We apply goal regression and inductive generalisation to learn compact, expressive HL policiesπhl\\pi^\{\\mathrm\{hl\}\}that can generalise to arbitrary numbers of objects, and can operate in open\-world and partially observable environments, and we learn LL policies via a compact graph neural networkπll\\pi^\{\\mathrm\{ll\}\}with fewer than 33,000 parameters\.
- •We conduct 21,600 episodes’ worth of experiments across various baselines and environments extending the MetaWorld benchmark\. Results demonstrate that BISON achieves superior efficiency, generalisation, and long\-horizon planning capabilities compared to baselines\. When ignoring LL execution, BISON’s HL policies can solve problems with at least 10,000 objects in under a minute\.

## 2Related Work

#### Planning with Abstractions and Hierarchical Reinforcement Learning

The idea of leveraging abstractions to facilitate efficient long\-horizon planning is a longstanding one\(Sacerdoti,[1974](https://arxiv.org/html/2605.15975#bib.bib146); Tate,[1977](https://arxiv.org/html/2605.15975#bib.bib149); Giunchiglia and Walsh,[1992](https://arxiv.org/html/2605.15975#bib.bib160); Bacchus and Yang,[1994](https://arxiv.org/html/2605.15975#bib.bib87); Boutilier and Dearden,[1994](https://arxiv.org/html/2605.15975#bib.bib166); Erolet al\.,[1996](https://arxiv.org/html/2605.15975#bib.bib148); Dean and Givan,[1997](https://arxiv.org/html/2605.15975#bib.bib164); Nauet al\.,[2003](https://arxiv.org/html/2605.15975#bib.bib147); Marthiet al\.,[2008](https://arxiv.org/html/2605.15975#bib.bib145); Konidaris,[2019](https://arxiv.org/html/2605.15975#bib.bib132); Bercheret al\.,[2019](https://arxiv.org/html/2605.15975#bib.bib150)\)\. Reinforcement learning \(RL\) approaches have also looked at learning and leveraging hierarchies and abstractions\(Dayan and Hinton,[1992](https://arxiv.org/html/2605.15975#bib.bib76); Hauskrechtet al\.,[1998](https://arxiv.org/html/2605.15975#bib.bib79); Suttonet al\.,[1999](https://arxiv.org/html/2605.15975#bib.bib78); Dietterich,[2000](https://arxiv.org/html/2605.15975#bib.bib77); Liet al\.,[2006](https://arxiv.org/html/2605.15975#bib.bib118); Konidariset al\.,[2018](https://arxiv.org/html/2605.15975#bib.bib24); Abelet al\.,[2016](https://arxiv.org/html/2605.15975#bib.bib117),[2018](https://arxiv.org/html/2605.15975#bib.bib116); Leet al\.,[2018](https://arxiv.org/html/2605.15975#bib.bib169)\)to improve sample efficiency\. There has also been recent interest in leveraging formal languages to represent such abstractions as well as to specify goals for RL agents\(Liet al\.,[2017](https://arxiv.org/html/2605.15975#bib.bib126); Toro Icarteet al\.,[2018](https://arxiv.org/html/2605.15975#bib.bib131); Camachoet al\.,[2019](https://arxiv.org/html/2605.15975#bib.bib82); Bozkurtet al\.,[2020](https://arxiv.org/html/2605.15975#bib.bib121); Illaneset al\.,[2020](https://arxiv.org/html/2605.15975#bib.bib33); Vaezipooret al\.,[2021](https://arxiv.org/html/2605.15975#bib.bib81); Toro Icarteet al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib80); Voloshinet al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib168); Qiuet al\.,[2023](https://arxiv.org/html/2605.15975#bib.bib125); Yalcinkayaet al\.,[2024](https://arxiv.org/html/2605.15975#bib.bib120); Jackermeier and Abate,[2025](https://arxiv.org/html/2605.15975#bib.bib119); Liet al\.,[2025a](https://arxiv.org/html/2605.15975#bib.bib130)\)\. Our work is inspired by the use of HL abstractions but differs in the problem setting: we learn policies entirely from demonstrations over both the abstraction and underlying environment that generalise to longer horizon problems than RL and exploration methods\.

#### Task and Motion Planning

Task and motion planning \(TAMP\) is a common framework for solving embodied AI that integrates HL planning and LL execution\. TAMP methods can be categorised\(Garrettet al\.,[2021](https://arxiv.org/html/2605.15975#bib.bib21); Zhaoet al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib101)\)into interleaved search\-then\-sample\(Srivastavaet al\.,[2014](https://arxiv.org/html/2605.15975#bib.bib22); Shahet al\.,[2020](https://arxiv.org/html/2605.15975#bib.bib23); Mendez\-Mendezet al\.,[2023](https://arxiv.org/html/2605.15975#bib.bib107); Shenet al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib170)\)or hybrid constraint satisfaction and optimisation\(Lozano\-Pérez and Kaelbling,[2014](https://arxiv.org/html/2605.15975#bib.bib50); Toussaint,[2015](https://arxiv.org/html/2605.15975#bib.bib47); Dantamet al\.,[2018](https://arxiv.org/html/2605.15975#bib.bib48)\)approaches\. Later works employ learning in various manners: predicting HL action feasibility\(Wellset al\.,[2019](https://arxiv.org/html/2605.15975#bib.bib88); Driesset al\.,[2020b](https://arxiv.org/html/2605.15975#bib.bib89); Xuet al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib96); Bouhsainet al\.,[2023](https://arxiv.org/html/2605.15975#bib.bib86); Yanget al\.,[2023](https://arxiv.org/html/2605.15975#bib.bib156); Bouhsainet al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib91)\), learning problem transformations, heuristics or policies for guiding search\(Chitniset al\.,[2016](https://arxiv.org/html/2605.15975#bib.bib49); Kimet al\.,[2019](https://arxiv.org/html/2605.15975#bib.bib92); Silveret al\.,[2021a](https://arxiv.org/html/2605.15975#bib.bib172); Curtiset al\.,[2022b](https://arxiv.org/html/2605.15975#bib.bib29); Khodeiret al\.,[2023](https://arxiv.org/html/2605.15975#bib.bib90); Mandlekaret al\.,[2023](https://arxiv.org/html/2605.15975#bib.bib100); Mendez\-Mendezet al\.,[2023](https://arxiv.org/html/2605.15975#bib.bib107); Cieslaret al\.,[2024](https://arxiv.org/html/2605.15975#bib.bib142); Duet al\.,[2026](https://arxiv.org/html/2605.15975#bib.bib171)\), and learning policies that aim to imitate bilevel planners\(Driesset al\.,[2020a](https://arxiv.org/html/2605.15975#bib.bib109); McDonald and Hadfield\-Menell,[2021](https://arxiv.org/html/2605.15975#bib.bib99); Zhuet al\.,[2021](https://arxiv.org/html/2605.15975#bib.bib98); Linet al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib97)\)\. Our work falls in the latter category whereby we learn policies that solve bilevel planning problems without relying on search\. Our work differentiates itself within this category through learning bilevel policies over both the LL and HL spaces for facilitating efficient, interpretable and long\-horizon planning in contrast to purely end\-to\-end methods\. Furthermore, some TAMP search methods struggle with open\-world planning and partial observability, owing to their reliance on symbolic planners, while our approach can support these problem settings\.

## 3Problem Statement

We are interested in learning policies over continuous environments that address HL goals from \(1\) LL demonstrations paired with HL goals, \(2\) an abstraction language𝒟\\mathcal\{D\}, and \(3\) a labelling functionℒ\\mathcal\{L\}that maps LL observations to HL abstractions, as depicted in[Figure˜1](https://arxiv.org/html/2605.15975#S0.F1), that can generalise to unseen states and goals in a zero\-shot manner\. We follow a basic*bilevel planning*paradigm \(e\.g\.,\(Srivastavaet al\.,[2014](https://arxiv.org/html/2605.15975#bib.bib22); Garrettet al\.,[2021](https://arxiv.org/html/2605.15975#bib.bib21)\)\) in which a planning problem, consisting of continuous LL states and actions, admits an HL model\-based abstraction using symbolic and relational states and actions\. We then list assumptions applied in our planning representations, and corresponding related work that validates such assumptions\. A summary of definitions, notations and typography semantics is provided in[Appendix˜A](https://arxiv.org/html/2605.15975#A1)\.

#### Bilevel Planning

A*planning problem*is a tuple𝐏=⟨𝐒,𝐀,𝐬0,𝐠⟩\\mathbf\{P\}=\\langle\\mathbf\{S\},\\mathbf\{A\},\\mathbf\{s\}\_\{0\},\\mathbf\{g\}\\ranglewhere𝐒\\mathbf\{S\}is a set of states,𝐀\\mathbf\{A\}is a set of actions,𝐬0\\mathbf\{s\}\_\{0\}is the initial state and𝐠⊆𝐒\\mathbf\{g\}\\subseteq\\mathbf\{S\}is a set of goal states\. An action𝐚∈𝐀\\mathbf\{a\}\\in\\mathbf\{A\}is a function𝐚:𝐒→Δ\(𝐒∪\{⊥\}\)\\mathbf\{a\}:\\mathbf\{S\}\\to\\Delta\(\\mathbf\{S\}\\cup\\left\\\{\\bot\\right\\\}\)that maps to a distribution of possible next states or⊥\\botindicating the action is not applicable\. A policy is a distribution over actionsπ\(𝐚∣𝐬,∗\)\\pi\(\\mathbf\{a\}\\mid\\mathbf\{s\},\\ast\)conditioned on a state and other arguments∗\\ast\. A policy is a*solution*for a planning problem if it achieves a goal state from the initial state𝐬0\\mathbf\{s\}\_\{0\}with an expected probability of 1\. A*bilevel planning problem*is a tupleℙ=⟨𝐏ll,𝐏hl,ℒ⟩\\mathbb\{P\}=\\langle\\mathbf\{P\}^\{\\mathrm\{ll\}\},\\mathbf\{P\}^\{\\mathrm\{hl\}\},\\mathcal\{L\}\\rangleconsisting of an HL and LL planning problem and a*labelling function*ℒ:𝐒ll→𝐒hl\\mathcal\{L\}:\\mathbf\{S\}^\{\\mathrm\{ll\}\}\\to\\mathbf\{S\}^\{\\mathrm\{hl\}\}such thatghl=\{ℒ\(𝐬ll\)∣𝐬ll∈𝐠ll\}\\mathit\{g\}^\{\\mathrm\{hl\}\}=\\\{\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\)\\mid\\mathbf\{s\}^\{\\mathrm\{ll\}\}\\in\\mathbf\{g\}^\{\\mathrm\{ll\}\}\\\}\. We next depict conditions onℒ\\mathcal\{L\}for ensuring useful HL abstractions of LL problems, followed by representations of HL and LL planning problems for our work\.

#### Downward Refinement Property

We next introduce a property of our bilevel planning paradigm that establishes when the HL abstraction is useful for constructing bilevel planning solutions\. The downward refinement property \(DRP\)\(Bacchus and Yang,[1994](https://arxiv.org/html/2605.15975#bib.bib87)\)states that any solution for an HL problem can be refined into a solution for the LL problem\. We extend this notion to the general nondeterministic setting\. Given a state𝐬\\mathbf\{s\}and policyπ\\pi, we denote𝐞𝐱𝐞𝐜\(𝐬,π\)\\mathbf\{exec\}\(\\mathbf\{s\},\\pi\)to be the set of all possible successor states achievable by applying some action with non\-zero support inπ\(⋅∣𝐬\)\\pi\(\\cdot\\mid\\mathbf\{s\}\)\.

###### Definition 1\(Nondeterministic Downward Refinement Property\)\.

A bilevel planning problem⟨𝐏ll,𝐏hl,ℒ⟩\\langle\\mathbf\{P\}^\{\\mathrm\{ll\}\},\\mathbf\{P\}^\{\\mathrm\{hl\}\},\\mathcal\{L\}\\rangleexhibits*the nondeterministic downward property \(NDRP\)*if for everyπhl\\pi^\{\\mathrm\{hl\}\}that solves𝐏hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}, there existsπll\\pi^\{\\mathrm\{ll\}\}that solves𝐏ll\\mathbf\{P\}^\{\\mathrm\{ll\}\}where for every𝐬ll\\mathbf\{s\}^\{\\mathrm\{ll\}\}reachable byπll\\pi^\{\\mathrm\{ll\}\}in𝐏ll\\mathbf\{P\}^\{\\mathrm\{ll\}\},

∀𝐬∈𝐞𝐱𝐞𝐜\(𝐬ll,πll\),ℒ\(𝐬\)∈𝐞𝐱𝐞𝐜\(ℒ\(𝐬ll\),πhl\)∪\{ℒ\(𝐬ll\)\}\.\\displaystyle\\forall\\mathbf\{s\}\\in\\mathbf\{exec\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\pi^\{\\mathrm\{ll\}\}\),\\mathcal\{L\}\(\\mathbf\{s\}\)\\in\\mathbf\{exec\}\(\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\),\\pi^\{\\mathrm\{hl\}\}\)\\cup\\\{\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\)\\\}\.\(1\)

Intuitively, the LL policy can be viewed as analogous to a set of skills or options\(Konidaris and Barto,[2009](https://arxiv.org/html/2605.15975#bib.bib154)\)for implementing and refining HL actions\. NDRP states that every step determined byπll\\pi^\{\\mathrm\{ll\}\}either stays within the same HL abstract state or advances to a valid HL successor underπhl\\pi^\{\\mathrm\{hl\}\}\. In this way, a solution for the HL problem can be used as guidance to build a solution for the underlying LL problem\.

#### LL Representation: Object\-Centric, Ego\-Centric Viewpoints

We represent states in our LL problems𝐏ll\\mathbf\{P\}^\{\\mathrm\{ll\}\}in an object\- and ego\-centric manner\. This means that states are represented as tuples𝐬ll=⟨𝐱e,\{𝐱o\}o∈𝐎⟩\\mathbf\{s\}^\{\\mathrm\{ll\}\}=\\langle\\mathbf\{x\}\_\{e\},\\left\\\{\\mathbf\{x\}\_\{o\}\\right\\\}\_\{o\\in\\mathbf\{O\}\}\\ranglewhere𝐱e∈ℝn\\mathbf\{x\}\_\{e\}\\in\\mathbb\{R\}^\{n\}is a vector representing the agent’s state \(e\.g\., location of a robot\),𝐎\\mathbf\{O\}is a set of objects visible to the agent, and\{𝐱o\}o∈𝐎\\left\\\{\\mathbf\{x\}\_\{o\}\\right\\\}\_\{o\\in\\mathbf\{O\}\}where𝐱o∈ℝm\\mathbf\{x\}\_\{o\}\\in\\mathbb\{R\}^\{m\}is a set of vectors representing the states \(e\.g\. position\) of all other objects in the world relative to the agent\. LL actions are black box functions that determine transitions between LL states with unknown probabilities\. The object\-centric representation is commonly used for embodied AI planning\(Silveret al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib11),[2023](https://arxiv.org/html/2605.15975#bib.bib9); Kumaret al\.,[2024](https://arxiv.org/html/2605.15975#bib.bib8); Lianget al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib10); Shenet al\.,[2026](https://arxiv.org/html/2605.15975#bib.bib40)\)for which there is much work for learning or extracting such representations from raw sensory data\(Burgesset al\.,[2019](https://arxiv.org/html/2605.15975#bib.bib13); Anandet al\.,[2019](https://arxiv.org/html/2605.15975#bib.bib16); Kipfet al\.,[2020](https://arxiv.org/html/2605.15975#bib.bib14); Zadaianchuket al\.,[2021](https://arxiv.org/html/2605.15975#bib.bib12); Jameset al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib20); Yuanet al\.,[2023](https://arxiv.org/html/2605.15975#bib.bib39); Haramatiet al\.,[2024](https://arxiv.org/html/2605.15975#bib.bib15); Wenet al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib38); Parket al\.,[2026](https://arxiv.org/html/2605.15975#bib.bib161)\)\.

#### HL Representation: Relational State and Action Abstractions

We represent our HL problems𝐏hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}via state and temporal abstractions of the world in a relational form, inspired by the STRIPS\(Fikes and Nilsson,[1971](https://arxiv.org/html/2605.15975#bib.bib17)\)and PDDL languages\(McDermottet al\.,[1998](https://arxiv.org/html/2605.15975#bib.bib19); Ghallabet al\.,[2004](https://arxiv.org/html/2605.15975#bib.bib69); Geffner and Bonet,[2013](https://arxiv.org/html/2605.15975#bib.bib186); Haslumet al\.,[2019](https://arxiv.org/html/2605.15975#bib.bib18)\)from symbolic planning\. Specifically, LL states can be abstracted into HL states represented as symbolic databases and LL actions can be temporally abstracted into HL actions via an action schema language\. We define a planning*domain*as a tuple𝒟=⟨𝒫,𝒜⟩\\mathcal\{D\}=\\langle\\mathcal\{P\},\\mathcal\{A\}\\rangleconsisting of a set of predicates𝒫\\mathcal\{P\}and action schemata𝒜\\mathcal\{A\}\.

A*predicate*is a symbolppwith a set of argument terms, denotedp\(x1,…,xnp\)p\(x\_\{1\},\\ldots,x\_\{n\_\{p\}\}\)wherenp∈ℕ0n\_\{p\}\\in\\mathbb\{N\}\_\{0\}depends onppand denotes the arity ofpp\. A fact is a predicate whose argument terms are all instantiated with objects denotedp\(o1,…,onp\)p\(o\_\{1\},\\ldots,o\_\{n\_\{p\}\}\), and an HL state is a set of facts\. We concern ourselves with*open\-world planning*where facts that are not present in the state are assumed unknown\(Green,[1969](https://arxiv.org/html/2605.15975#bib.bib187)\), in contrast to the more restrictive closed\-world planning exhibited by PDDL where facts not present in the state are assumed false\(Fikes and Nilsson,[1971](https://arxiv.org/html/2605.15975#bib.bib17)\)\. Our labelling functionℒ\\mathcal\{L\}maps from LL to HL states with predicates from𝒟\\mathcal\{D\}\. We assume that all LL goal states can be abstracted into a single HL*goal condition*ghl\\mathit\{g\}^\{\\mathrm\{hl\}\}defined as a set of facts\. An HL stateshl\\mathit\{s\}^\{\\mathrm\{hl\}\}is a goal state ifghl⊆shl\\mathit\{g\}^\{\\mathrm\{hl\}\}\\subseteq\\mathit\{s\}^\{\\mathrm\{hl\}\}\.

An*action schema*is a temporally\-abstracted, nondeterministic action that can be parameterised by objects, given by a tuple⟨𝑣𝑎𝑟\(a\),𝑝𝑟𝑒\(a\),\{𝑎𝑑𝑑i\(a\),𝑑𝑒𝑙i\(a\)\}i=1n⟩\\langle\\mathit\{var\}\(a\),\\mathit\{pre\}\(a\),\\\{\\mathit\{add\}\_\{i\}\(a\),\\mathit\{del\}\_\{i\}\(a\)\\\}\_\{i=1\}^\{n\}\\ranglewhere𝑣𝑎𝑟\(a\)\\mathit\{var\}\(a\)is a set of parameter variables, and the preconditions𝑝𝑟𝑒\(a\)\\mathit\{pre\}\(a\)and nondeterministic add𝑎𝑑𝑑i\(a\)\\mathit\{add\}\_\{i\}\(a\)and delete𝑑𝑒𝑙i\(a\)\\mathit\{del\}\_\{i\}\(a\)effects are sets of predicates with arguments instantiated with variables from𝑣𝑎𝑟\(a\)\\mathit\{var\}\(a\)\. An HL actionahl\\mathit\{a\}^\{\\mathrm\{hl\}\}is an action schemaaawhere each variable is instantiated with an object, denotedahl=a\(o1,…,ona\)\\mathit\{a\}^\{\\mathrm\{hl\}\}=a\(o\_\{1\},\\ldots,o\_\{n\_\{a\}\}\), and is applicable in an HL stateshl\\mathit\{s\}^\{\\mathrm\{hl\}\}if𝑝𝑟𝑒\(ahl\)⊆shl\\mathit\{pre\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\)\\subseteq\\mathit\{s\}^\{\\mathrm\{hl\}\}in which case we define its successor states

𝑠𝑢𝑐𝑐\(shl,ahl\)=\{\(shl∖𝑑𝑒𝑙i\(ahl\)\)∪𝑎𝑑𝑑i\(ahl\)\}i=1n\.\\displaystyle\\mathit\{succ\}\(\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\}\)=\\\{\(\\mathit\{s\}^\{\\mathrm\{hl\}\}\\setminus\\mathit\{del\}\_\{i\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\)\)\\cup\\mathit\{add\}\_\{i\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\)\\\}\_\{i=1\}^\{n\}\.\(2\)
###### Example 1\(Pick and Place Domain\)\.

We can model a simple pick and place domain𝒟=⟨𝒫,𝒜⟩\\mathcal\{D\}=\\left<\\mathcal\{P\},\\mathcal\{A\}\\right\>where a robot can transport items between locations with𝒫=\{𝑟𝐴𝑡\(x\),𝑎𝑡\(x,y\),𝑓𝑟𝑒𝑒\(\),ℎ𝑜𝑙𝑑\(x\)\}\\mathcal\{P\}=\\\{\\mathit\{rAt\}\(\\mathit\{x\}\),\\mathit\{at\}\(\\mathit\{x\},\\mathit\{y\}\),\\allowbreak\\mathit\{free\}\(\),\\allowbreak\\mathit\{hold\}\(\\mathit\{x\}\)\\\}and𝒜\\mathcal\{A\}containing the three deterministic action schemata:

𝑝𝑖𝑐𝑘\(o,l\):𝑝𝑟𝑒:\{𝑟𝐴𝑡\(l\),𝑎𝑡\(o,l\),𝑓𝑟𝑒𝑒\(\)\}𝑎𝑑𝑑:\{ℎ𝑜𝑙𝑑\(o\)\}𝑑𝑒𝑙:\{𝑎𝑡\(o,l\),𝑓𝑟𝑒𝑒\(\)\}\\begin\{aligned\} \\mathit\{pick\(o,l\)\}:\\mathit\{pre\}&:\\\{\\mathit\{rAt\}\(\\mathit\{l\}\),\\mathit\{at\}\(\\mathit\{o\},\\mathit\{l\}\),\\mathit\{free\}\(\)\\\}\\\\ \\mathit\{add\}&:\\\{\\mathit\{hold\}\(\\mathit\{o\}\)\\\}\\\\ \\mathit\{del\}&:\\\{\\mathit\{at\}\(\\mathit\{o\},\\mathit\{l\}\),\\mathit\{free\}\(\)\\\}\\end\{aligned\}𝑚𝑜𝑣𝑒\(l1,l2\):𝑝𝑟𝑒:\{𝑟𝐴𝑡\(l1\)\}𝑎𝑑𝑑:\{𝑟𝐴𝑡\(l2\)\}𝑑𝑒𝑙:\{𝑟𝐴𝑡\(l1\)\}\\begin\{aligned\} \\mathit\{move\(l1,l2\)\}:\\mathit\{pre\}&:\\\{\\mathit\{rAt\}\(\\mathit\{l1\}\)\\\}\\\\ \\mathit\{add\}&:\\\{\\mathit\{rAt\}\(\\mathit\{l2\}\)\\\}\\\\ \\mathit\{del\}&:\\\{\\mathit\{rAt\}\(\\mathit\{l1\}\)\\\}\\end\{aligned\}𝑝𝑙𝑎𝑐𝑒\(o,l\):𝑝𝑟𝑒:\{𝑟𝐴𝑡\(l\),ℎ𝑜𝑙𝑑\(o\)\}𝑎𝑑𝑑:\{𝑎𝑡\(o,l\),𝑓𝑟𝑒𝑒\(\)\}𝑑𝑒𝑙:\{ℎ𝑜𝑙𝑑\(o\)\}\\begin\{aligned\} \\mathit\{place\(o,l\)\}:\\mathit\{pre\}&:\\\{\\mathit\{rAt\}\(\\mathit\{l\}\),\\mathit\{hold\}\(\\mathit\{o\}\)\\\}\\\\ \\mathit\{add\}&:\\\{\\mathit\{at\}\(\\mathit\{o\},\\mathit\{l\}\),\\mathit\{free\}\(\)\\\}\\\\ \\mathit\{del\}&:\\\{\\mathit\{hold\}\(\\mathit\{o\}\)\\\}\\end\{aligned\}

#### Where do𝒟\\mathcal\{D\}andℒ\\mathcal\{L\}come from?

There is significant work in generating HL state and temporal abstractions in this form from raw sensory data \(e\.g\.,\(Konidariset al\.,[2014](https://arxiv.org/html/2605.15975#bib.bib25),[2018](https://arxiv.org/html/2605.15975#bib.bib24); Asai and Fukunaga,[2018](https://arxiv.org/html/2605.15975#bib.bib41); Asaiet al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib42); Chitniset al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib66); Silveret al\.,[2021b](https://arxiv.org/html/2605.15975#bib.bib28); Jameset al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib20); Silveret al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib11),[2023](https://arxiv.org/html/2605.15975#bib.bib9); Shahet al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib27); Xiet al\.,[2024](https://arxiv.org/html/2605.15975#bib.bib43); Huanget al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib153); Liet al\.,[2025b](https://arxiv.org/html/2605.15975#bib.bib32)\)\), whose progress has been propelled\(Liuet al\.,[2023](https://arxiv.org/html/2605.15975#bib.bib75); Hanet al\.,[2024](https://arxiv.org/html/2605.15975#bib.bib26); Athalyeet al\.,[2026](https://arxiv.org/html/2605.15975#bib.bib30); Ahmedet al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib175); Lianget al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib10); Liet al\.,[2025b](https://arxiv.org/html/2605.15975#bib.bib32); Yanget al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib46); Shenet al\.,[2026](https://arxiv.org/html/2605.15975#bib.bib40); Lianget al\.,[2026](https://arxiv.org/html/2605.15975#bib.bib31)\)by vision language model advances\(Radfordet al\.,[2021](https://arxiv.org/html/2605.15975#bib.bib45); Team,[2023](https://arxiv.org/html/2605.15975#bib.bib44)\)\. The result is an HL domain description𝒟\\mathcal\{D\}along with a labelling functionℒ\\mathcal\{L\}that maps LL states to HL states\. Specifically,𝒟\\mathcal\{D\}may be manually generated by a human and that𝒟\\mathcal\{D\}andℒ\\mathcal\{L\}can be generated via a VLA or learned\.

#### Problem Statement Summary

We can now define the problem: given an HL domain𝒟\\mathcal\{D\}that describes the abstract actions available to the agent, a labelling functionℒ\\mathcal\{L\}that maps LL states to HL states, and a finite dataset𝕋=\{⟨gihl,𝐬0ll,𝐚0ll,\.\.\.,𝐬mill,𝐚mill⟩\}i=1n\\mathbb\{T\}=\\\{\\langle\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{i\},\\mathbf\{s\}^\{\\mathrm\{ll\}\}\_\{0\},\\mathbf\{a\}^\{\\mathrm\{ll\}\}\_\{0\},\\mathinner\{\.\\mkern\-1\.0mu\.\\mkern\-1\.0mu\.\},\\mathbf\{s\}^\{\\mathrm\{ll\}\}\_\{m\_\{i\}\},\\mathbf\{a\}^\{\\mathrm\{ll\}\}\_\{m\_\{i\}\}\\rangle\\\}\_\{i=1\}^\{n\}consisting of LL state\-action demonstrations along with HL goals, our objective is to learn a goal\-conditioned policyπ\(𝐚ll∣𝐬ll,ghl\)\\pi\(\\mathbf\{a\}^\{\\mathrm\{ll\}\}\\mid\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)conditioned on LL states and HL goals that chooses low\-level actions to achieve the goal\.

## 4Learning Bilevel Policies for Bilevel Planning

We learn bilevel policies from the given HL\-goal labelled LL demonstrations𝕋\\mathbb\{T\}, using the HL domain𝒟\\mathcal\{D\}and labelling functionℒ\\mathcal\{L\}\. A*bilevel policy*\(BP\) consists of a pair of policies\(πhl,πll\)\(\\pi^\{\\mathrm\{hl\}\},\\pi^\{\\mathrm\{ll\}\}\)of the formsπhl\(ahl∣shl,ghl\)\\pi^\{\\mathrm\{hl\}\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\\mid\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)andπll\(𝐚ll∣𝐬ll,ahl,ghl\)\\pi^\{\\mathrm\{ll\}\}\(\\mathbf\{a\}^\{\\mathrm\{ll\}\}\\mid\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\), respectively\. Both policies are goal\-conditioned and can be reused across different problems\. Together with the labelling functionℒ\\mathcal\{L\}, the pair of policies can be composed into a single policy operating on the LL environment conditioned on HL goals:

π\(𝐚ll∣𝐬ll,ghl\)=∑ahlπll\(𝐚ll∣𝐬ll,ahl,ghl\)⋅πhl\(ahl∣ℒ\(𝐬ll\),ghl\)\.\\displaystyle\\textstyle\\pi\(\\mathbf\{a\}^\{\\mathrm\{ll\}\}\\mid\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)=\\sum\_\{\\mathit\{a\}^\{\\mathrm\{hl\}\}\}\\pi^\{\\mathrm\{ll\}\}\(\\mathbf\{a\}^\{\\mathrm\{ll\}\}\\mid\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)\\cdot\\pi^\{\\mathrm\{hl\}\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\\mid\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\),\\mathit\{g\}^\{\\mathrm\{hl\}\}\)\.\(3\)In the BP framework, the HL and LL policies are learned separately, with the main idea being that HL policiesπhl\\pi^\{\\mathrm\{hl\}\}capture HL decision making, while LL policiesπll\\pi^\{\\mathrm\{ll\}\}provide LL realisations of the HL actions prescribed byπhl\\pi^\{\\mathrm\{hl\}\}\. Whileπll\\pi^\{\\mathrm\{ll\}\}is learned directly from𝕋\\mathbb\{T\}via imitation learning, the HL policy is learned by first constructing an abstract datasetThlT^\{\\mathrm\{hl\}\}by applyingℒ\\mathcal\{L\}to𝕋\\mathbb\{T\}, and then inductively learning a symbolic policyπhl\\pi^\{\\mathrm\{hl\}\}fromThlT^\{\\mathrm\{hl\}\}\. We realise the HL policyπhl\\pi^\{\\mathrm\{hl\}\}as a set of first\-order, condition\-action rules \([Section˜4\.1](https://arxiv.org/html/2605.15975#S4.SS1)\) and the LL policyπll\\pi^\{\\mathrm\{ll\}\}as a graph neural network \([Section˜4\.2](https://arxiv.org/html/2605.15975#S4.SS2)\)\.

Step 1Step 2Step 3![Refer to caption](https://arxiv.org/html/2605.15975v1/figures/mujoco.png)![Refer to caption](https://arxiv.org/html/2605.15975v1/figures/mujoco.png)![Refer to caption](https://arxiv.org/html/2605.15975v1/figures/mujoco.png)𝕋\\mathbb\{T\}Thl=\{⟨gihl,a0hl,\.\.\.,amihl⟩\}i=1nT^\{\\mathrm\{hl\}\}=\\\{\\langle\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{i\},\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{0\},\\mathinner\{\.\\mkern\-1\.0mu\.\\mkern\-1\.0mu\.\},\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{m\_\{i\}\}\\rangle\\\}\_\{i=1\}^\{n\}Thl=\{⟨𝑎𝑡\(obj1,loc2\)¯,…,𝑚𝑜𝑣𝑒\(loc1,loc2\),𝑝𝑙𝑎𝑐𝑒\(obj1,loc2\)⟩\}\\begin\{aligned\} T^\{\\mathrm\{hl\}\}=\\\{\\langle&\\underline\{\\mathit\{at\}\(\\mathit\{obj1\},\\mathit\{loc2\}\)\},\\ldots,\\\\ &\\mathit\{move\}\(\\mathit\{loc1\},\\mathit\{loc2\}\),\\mathit\{place\}\(\\mathit\{obj1\},\\mathit\{loc2\}\)\\rangle\\\}\\end\{aligned\}\.\.\.,𝑟𝐴𝑡\(loc2\),𝑎𝑡\(obj1,loc2\)¯↦𝑝𝑙𝑎𝑐𝑒\(obj1,loc2\)\.\.\.,𝑟𝐴𝑡\(loc1\),𝑎𝑡\(obj1,loc2\)¯↦𝑚𝑜𝑣𝑒\(loc1,loc2\)…\\begin\{aligned\} \\mathinner\{\.\\mkern\-1\.0mu\.\\mkern\-1\.0mu\.\},\\mathit\{rAt\}\(\\mathit\{loc2\}\),\\underline\{\\mathit\{at\}\(\\mathit\{obj1\},\\mathit\{loc2\}\)\}&\\mapsto\\mathit\{place\}\(\\mathit\{obj1\},\\mathit\{loc2\}\)\\\\\[\-2\.0pt\] \\mathinner\{\.\\mkern\-1\.0mu\.\\mkern\-1\.0mu\.\},\\mathit\{rAt\}\(\\mathit\{loc1\}\),\\underline\{\\mathit\{at\}\(\\mathit\{obj1\},\\mathit\{loc2\}\)\}&\\mapsto\\mathit\{move\}\(\\mathit\{loc1\},\\mathit\{loc2\}\)\\\\\[\-2\.0pt\] &\\ldots\\end\{aligned\}\.\.\.,𝑟𝐴𝑡\(loc2\),𝑎𝑡\(obj1,loc2\)¯↦𝑝𝑙𝑎𝑐𝑒\(obj1,loc2\)\.\.\.,𝑟𝐴𝑡\(loc1\),𝑎𝑡\(obj1,loc2\)¯↦𝑚𝑜𝑣𝑒\(loc1,loc2\)…\\begin\{aligned\} \\mathinner\{\.\\mkern\-1\.0mu\.\\mkern\-1\.0mu\.\},\\mathit\{rAt\}\(\\mathit\{loc2\}\),\\underline\{\\mathit\{at\}\(\\mathit\{obj1\},\\mathit\{loc2\}\)\}&\\mapsto\\mathit\{place\}\(\\mathit\{obj1\},\\mathit\{loc2\}\)\\\\\[\-2\.0pt\] \\mathinner\{\.\\mkern\-1\.0mu\.\\mkern\-1\.0mu\.\},\\mathit\{rAt\}\(\\mathit\{loc1\}\),\\underline\{\\mathit\{at\}\(\\mathit\{obj1\},\\mathit\{loc2\}\)\}&\\mapsto\\mathit\{move\}\(\\mathit\{loc1\},\\mathit\{loc2\}\)\\\\\[\-2\.0pt\] &\\ldots\\end\{aligned\}1:∃x,l\.\.\.\.,𝑟𝐴𝑡\(l\),𝑎𝑡\(x,l\)¯↦𝑝𝑙𝑎𝑐𝑒\(x,l\)2:∃x,l,l1\.\.\.\.,𝑟𝐴𝑡\(l1\),𝑎𝑡\(x,l\)¯↦𝑚𝑜𝑣𝑒\(l1,l\)…\\begin\{aligned\} 1:\\exists x,l\.\\;\\mathinner\{\.\\mkern\-1\.0mu\.\\mkern\-1\.0mu\.\},\\mathit\{rAt\}\(l\),\\underline\{\\mathit\{at\}\(x,l\)\}&\\mapsto\\mathit\{place\}\(x,l\)\\\\\[\-2\.0pt\] 2:\\exists x,l,l\_\{1\}\.\\;\\mathinner\{\.\\mkern\-1\.0mu\.\\mkern\-1\.0mu\.\},\\mathit\{rAt\}\(l\_\{1\}\),\\underline\{\\mathit\{at\}\(x,l\)\}&\\mapsto\\mathit\{move\}\(l\_\{1\},l\)\\\\\[\-2\.0pt\] &\\ldots\\end\{aligned\}LL Demonstrations with HL GoalsHL Traces with HL GoalsHL Traces with HL GoalsGrounded Condition\-Action RulesGrounded Condition\-Action RulesFirst\-Order, Condition\-Action Rulesℒ\\mathcal\{L\}Figure 2:The HL policyπhl\(ahl∣shl,ghl\)\\pi^\{\\mathrm\{hl\}\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\\mid\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)learning process\.Step 1: we use the labelling functionℒ\\mathcal\{L\}to construct HL traces from the LL demonstrations paired with HL goals\.Step 2: we utilise goal regression to extract condition\-action rules from the HL traces and goals \(underlined\)\.Step 3: we inductively generalise the rules by replacing objects with variables to produce symbolic policies\.### 4\.1Learning HL Policies via Goal Regression and Inductive Generalisation

We construct HL policiesπhl\(ahl∣shl,ghl\)\\pi^\{\\mathrm\{hl\}\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\\mid\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)from LL demonstrations paired with HL goals in 3 steps illustrated in[Figure˜2](https://arxiv.org/html/2605.15975#S4.F2): \(1\) we first generate HL traces that are abstractions of our LL demonstrations; \(2\) we then utilise goal regression to extract a set of condition\-action rules from such HL traces; and \(3\) we finally utilise inductive generalisation to produce interpretable symbolic policies in the form of first\-order, condition\-action rules\. Steps 2 and 3 are achieved in milliseconds\. In what follows, we first define the form of our HL policies and a means of grounding them for execution with respect to a particular problem instance\. We then elaborate on the 3 steps required to generate HL policies\.

#### HL policy representation

Our HL policiesπhl\\pi^\{\\mathrm\{hl\}\}consist of sets of first\-order, condition\-action rulesrrwith associated priority values related to goal proximity\. These first\-order, condition\-action rules are tuples of the form⟨𝑣𝑎𝑙\(r\),𝑣𝑎𝑟\(r\),𝑠𝐶𝑜𝑛𝑑\(r\),𝑔𝐶𝑜𝑛𝑑\(r\),𝑎𝑐𝑡𝑖𝑜𝑛\(r\)⟩\\langle\\mathit\{val\}\(r\),\\allowbreak\\mathit\{var\}\(r\),\\allowbreak\\mathit\{sCond\}\(r\),\\allowbreak\\mathit\{gCond\}\(r\),\\allowbreak\\mathit\{action\}\(r\)\\ranglewhere𝑣𝑎𝑙\(r\)\\mathit\{val\}\(r\)denotes the rule priority,𝑣𝑎𝑟\(r\)\\mathit\{var\}\(r\)denotes a set of variables,𝑠𝐶𝑜𝑛𝑑\(r\)\\mathit\{sCond\}\(r\)and𝑔𝐶𝑜𝑛𝑑\(r\)\\mathit\{gCond\}\(r\)are first\-order conjunctive formulas over predicates in𝒫\\mathcal\{P\}representing the state and goal conditions, and𝑎𝑐𝑡𝑖𝑜𝑛\(r\)\\mathit\{action\}\(r\)is an action schema denoting the action to take if the conditions hold\.[Example˜2](https://arxiv.org/html/2605.15975#Thmexample2)illustrates such an HL policy\.

###### Example 2\.

A first\-order, condition\-action policy for the pick and place domain in[Example˜1](https://arxiv.org/html/2605.15975#Thmexample1), where goal conditions𝑔𝐶𝑜𝑛𝑑\\mathit\{gCond\}are indicated by underlining:

1:⏞𝑣𝑎𝑙\(r\)∃x,l⏞𝑣𝑎𝑟\(r\)\.ℎ𝑜𝑙𝑑\(x\),𝑟𝐴𝑡\(l\)⏞𝑠𝐶𝑜𝑛𝑑\(r\),𝑎𝑡\(x,l\)¯⏞𝑔𝐶𝑜𝑛𝑑\(r\)\\displaystyle\\overbrace\{\\\!1\\\!:\\\!\}^\{\\mathit\{val\}\(r\)\}\\overbrace\{\\exists x,l\}^\{\\mathit\{var\}\(r\)\}\.\\\!\\;\\overbrace\{\\mathit\{hold\}\(x\),\\mathit\{rAt\}\(l\)\}^\{\\mathit\{sCond\}\(r\)\},\\overbrace\{\\underline\{\\mathit\{at\}\(x,l\)\}\}^\{\\mathit\{gCond\}\(r\)\}↦𝑝𝑙𝑎𝑐𝑒\(x,l\)⏞𝑎𝑐𝑡𝑖𝑜𝑛\(r\)\\displaystyle\\\!\\mapsto\\\!\\overbrace\{\\mathit\{place\}\(x,l\)\}^\{\\mathit\{action\}\(r\)\}3:∃x,l,l1\.𝑎𝑡\(x,l1\),𝑓𝑟𝑒𝑒\(\),𝑟𝐴𝑡\(l1\),𝑎𝑡\(x,l\)¯\\displaystyle 3\\\!:\\\!\\exists x,l,l\_\{1\}\.\\\!\\;\\mathit\{at\}\(x,l\_\{1\}\),\\mathit\{free\}\(\),\\mathit\{rAt\}\(l\_\{1\}\),\\underline\{\\mathit\{at\}\(x,l\)\}↦𝑝𝑖𝑐𝑘\(x,l1\)\\displaystyle\\\!\\mapsto\\\!\\mathit\{pick\}\(x,l\_\{1\}\)2:∃x,l,l1\.ℎ𝑜𝑙𝑑\(x\),𝑟𝐴𝑡\(l1\),𝑎𝑡\(x,l\)¯\\displaystyle 2\\\!:\\\!\\exists x,l,l\_\{1\}\.\\\!\\;\\mathit\{hold\}\(x\),\\mathit\{rAt\}\(l\_\{1\}\),\\underline\{\\mathit\{at\}\(x,l\)\}↦𝑚𝑜𝑣𝑒\(l1,l\)\\displaystyle\\\!\\mapsto\\\!\\mathit\{move\}\(l\_\{1\},l\)4:∃x,l,l1,l2\.𝑎𝑡\(x,l1\),𝑓𝑟𝑒𝑒\(\),𝑟𝐴𝑡\(l2\),𝑎𝑡\(x,l\)¯\\displaystyle 4\\\!:\\\!\\exists x,l,l\_\{1\},l\_\{2\}\.\\\!\\;\\mathit\{at\}\(x,l\_\{1\}\),\\mathit\{free\}\(\),\\mathit\{rAt\}\(l\_\{2\}\),\\underline\{\\mathit\{at\}\(x,l\)\}↦𝑚𝑜𝑣𝑒\(l2,l1\)\\displaystyle\\\!\\mapsto\\\!\\mathit\{move\}\(l\_\{2\},l\_\{1\}\)

In order to select the next action according to an HL policyπhl\\pi^\{\\mathrm\{hl\}\}, variables within the rules must be replaced with appropriate objects\. Given a rulerrofπhl\\pi^\{\\mathrm\{hl\}\}, a*grounding*ofrris an assignment of the variables𝑣𝑎𝑟\(r\)\\mathit\{var\}\(r\)in𝑠𝐶𝑜𝑛𝑑\(r\)\\mathit\{sCond\}\(r\),𝑔𝐶𝑜𝑛𝑑\(r\)\\mathit\{gCond\}\(r\), and𝑎𝑐𝑡𝑖𝑜𝑛\(r\)\\mathit\{action\}\(r\)with objects from𝐎\\mathbf\{O\}\. A ground ruler∈𝑔𝑟𝑜𝑢𝑛𝑑\(πhl\)r\\in\\mathit\{ground\}\(\\pi^\{\\mathrm\{hl\}\}\)is*applicable*in an HL stateshl\\mathit\{s\}^\{\\mathrm\{hl\}\}with goalghl\\mathit\{g\}^\{\\mathrm\{hl\}\}if𝑠𝐶𝑜𝑛𝑑\(r\)⊆shl\\mathit\{sCond\}\(r\)\\subseteq\\mathit\{s\}^\{\\mathrm\{hl\}\}\(i\.e\., the state condition is satisfied by the state\) and𝑔𝐶𝑜𝑛𝑑\(r\)⊆ghl∖shl\\mathit\{gCond\}\(r\)\\subseteq\\mathit\{g\}^\{\\mathrm\{hl\}\}\\setminus\\mathit\{s\}^\{\\mathrm\{hl\}\}\(i\.e\., the goal condition is satisfied by the set of unachieved goal facts\)\. A policyπhl\\pi^\{\\mathrm\{hl\}\}is executed by finding, via database query algorithms, a ground rulerrwith lowest𝑣𝑎𝑙\(r\)\\mathit\{val\}\(r\)value and ties broken arbitrarily that is applicable inshl\\mathit\{s\}^\{\\mathrm\{hl\}\}andghl\\mathit\{g\}^\{\\mathrm\{hl\}\}\. Implicitly, this defines a probability distributionπhl\(ahl∣shl,ghl\)\\pi^\{\\mathrm\{hl\}\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\\mid\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)over the grounded HL actions, whereπhl\(ahl∣shl,ghl\)=1\\pi^\{\\mathrm\{hl\}\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\\mid\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)=1ifahl\\mathit\{a\}^\{\\mathrm\{hl\}\}is selected byπhl\\pi^\{\\mathrm\{hl\}\}in stateshl\\mathit\{s\}^\{\\mathrm\{hl\}\}given goalghl\\mathit\{g\}^\{\\mathrm\{hl\}\}, and0otherwise\.

#### Step 1: construct HL traces

From an LL trace⟨gihl,𝐬0ll,𝐚0ll,\.\.\.,𝐬mll,𝐚mll⟩∈𝕋\\langle\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{i\},\\mathbf\{s\}^\{\\mathrm\{ll\}\}\_\{0\},\\mathbf\{a\}^\{\\mathrm\{ll\}\}\_\{0\},\\mathinner\{\.\\mkern\-1\.0mu\.\\mkern\-1\.0mu\.\},\\mathbf\{s\}^\{\\mathrm\{ll\}\}\_\{m\},\\mathbf\{a\}^\{\\mathrm\{ll\}\}\_\{m\}\\rangle\\in\\mathbb\{T\}, the domain𝒟\\mathcal\{D\}, and the labelling functionℒ\\mathcal\{L\}, we can extract an HL trace⟨gihl,a0hl,\.\.\.,anhl⟩\\langle\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{i\},\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{0\},\\mathinner\{\.\\mkern\-1\.0mu\.\\mkern\-1\.0mu\.\},\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{n\}\\rangleas follows: starting with the HL abstractions0hl=ℒ\(𝐬0ll\)\\mathit\{s\}^\{\\mathrm\{hl\}\}\_\{0\}=\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\_\{0\}\)of the initial LL state𝐬0ll\\mathbf\{s\}^\{\\mathrm\{ll\}\}\_\{0\}, we iteratively determine the LL state where the HL abstraction changes, i\.e\., whereℒ\(𝐬jll\)≠ℒ\(𝐬j−1ll\)\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\_\{j\}\)\\neq\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\_\{j\-1\}\), which defines the next HL statesihl=ℒ\(𝐬jll\)\\mathit\{s\}^\{\\mathrm\{hl\}\}\_\{i\}=\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\_\{j\}\)\. We then find an applicable HL actionaihl\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{i\}which causes the change fromsi−1hl\\mathit\{s\}^\{\\mathrm\{hl\}\}\_\{i\-1\}tosihl\\mathit\{s\}^\{\\mathrm\{hl\}\}\_\{i\}, i\.e\., anaihl\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{i\}such that𝑝𝑟𝑒\(aihl\)⊆si−1hl\\mathit\{pre\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{i\}\)\\subseteq\\mathit\{s\}^\{\\mathrm\{hl\}\}\_\{i\-1\}and for some effectjj, we havesihl=\(si−1hl∪𝑎𝑑𝑑j\(aihl\)\)∖𝑑𝑒𝑙j\(aihl\)\\mathit\{s\}^\{\\mathrm\{hl\}\}\_\{i\}=\(\\mathit\{s\}^\{\\mathrm\{hl\}\}\_\{i\-1\}\\cup\\mathit\{add\}\_\{j\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{i\}\)\)\\setminus\\mathit\{del\}\_\{j\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{i\}\)\. By applying this extraction for each LL trace from𝕋\\mathbb\{T\}, we obtain a HL datasetThl=\{⟨gihl,a0hl,\.\.\.,amihl⟩\}i=1nT^\{\\mathrm\{hl\}\}=\\\{\\langle\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{i\},\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{0\},\\mathinner\{\.\\mkern\-1\.0mu\.\\mkern\-1\.0mu\.\},\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{m\_\{i\}\}\\rangle\\\}\_\{i=1\}^\{n\}\.

#### Step 2: extract rules

Our method converts HL action traces into ground condition\-action rules via*goal regression*\(Fikeset al\.,[1972](https://arxiv.org/html/2605.15975#bib.bib2); Waldinger,[1977](https://arxiv.org/html/2605.15975#bib.bib3); Lozano\-Perezet al\.,[1984](https://arxiv.org/html/2605.15975#bib.bib4); Reiter,[1991](https://arxiv.org/html/2605.15975#bib.bib5),[2001](https://arxiv.org/html/2605.15975#bib.bib6); Fritz and McIlraith,[2007](https://arxiv.org/html/2605.15975#bib.bib104)\)\. Goal regression computes the minimal condition under which an action achieves a goal; when applied recursively, it determines under which condition an action trace will lead to a goal state\. Formally, a set of factsghl\\mathit\{g\}^\{\\mathrm\{hl\}\}is regressable over an HL actionahl\\mathit\{a\}^\{\\mathrm\{hl\}\}if for every nondeterministic outcomeiiofahl\\mathit\{a\}^\{\\mathrm\{hl\}\}we have𝑑𝑒𝑙i\(ahl\)∩ghl=∅\\mathit\{del\}\_\{i\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\)\\cap\\mathit\{g\}^\{\\mathrm\{hl\}\}=\\emptyset, in which case we define

𝑟𝑒𝑔𝑟\(ghl,ahl\)=\{\(ghl∖𝑎𝑑𝑑i\(ahl\)\)∪𝑝𝑟𝑒\(ahl\)∣i=1,…,n\},\\displaystyle\\mathit\{regr\}\(\\mathit\{g\}^\{\\mathrm\{hl\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\}\)=\\\{\(\\mathit\{g\}^\{\\mathrm\{hl\}\}\\setminus\\mathit\{add\}\_\{i\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\)\)\\cup\\mathit\{pre\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\)\\mid i=1,\\ldots,n\\\},\(4\)and𝑟𝑒𝑔𝑟\(ghl,ahl\)=∅\\mathit\{regr\}\(\\mathit\{g\}^\{\\mathrm\{hl\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\}\)=\\emptysetotherwise\. Notably, goal regression is also a powerful tool for enabling open\-world and generalised planning\(Gretton and Thiébaux,[2004](https://arxiv.org/html/2605.15975#bib.bib62); Sanner and Boutilier,[2009](https://arxiv.org/html/2605.15975#bib.bib93); Liuet al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib95)\)\.

#### Step 3: inductive generalisation

Finally, we*lift*the \(ground\) condition\-action rules to obtain first\-order rules by replacing ground objects with fresh variables\. The resulting rules are abstracted from specific problem instances and thus generalise to arbitrary numbers of objects\. Given an HL actionahl\\mathit\{a\}^\{\\mathrm\{hl\}\}and sets of factsshl\\mathit\{s\}^\{\\mathrm\{hl\}\}andghl\\mathit\{g\}^\{\\mathrm\{hl\}\}, leto1,…,oqo\_\{1\},\\ldots,o\_\{q\}be the union of all objects from the action and facts\. We then define the set of variables𝑣𝑎𝑟=\{v1,…,vq\}\\mathit\{var\}=\\left\\\{v\_\{1\},\\ldots,v\_\{q\}\\right\\\}and mappingg:𝐎→𝑣𝑎𝑟g:\\mathbf\{O\}\\to\\mathit\{var\}defined byg\(oi\)=vig\(o\_\{i\}\)=v\_\{i\}\. Then we define the lifting operator as

𝑙𝑖𝑓𝑡\(ahl,shl,ghl\)=⟨𝑣𝑎𝑟,𝑠𝐶𝑜𝑛𝑑,𝑔𝐶𝑜𝑛𝑑,𝑎𝑐𝑡𝑖𝑜𝑛⟩\\displaystyle\\mathit\{lift\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\},\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)=\\langle\\mathit\{var\},\\mathit\{sCond\},\\mathit\{gCond\},\\mathit\{action\}\\rangle\(5\)where𝑠𝐶𝑜𝑛𝑑\\mathit\{sCond\}and𝑔𝐶𝑜𝑛𝑑\\mathit\{gCond\}are the conjunctions of the facts inshl\\mathit\{s\}^\{\\mathrm\{hl\}\}andghl\\mathit\{g\}^\{\\mathrm\{hl\}\}respectively withggapplied to their objects, and𝑎𝑐𝑡𝑖𝑜𝑛\\mathit\{action\}is the action schema ofahl\\mathit\{a\}^\{\\mathrm\{hl\}\}withggapplied on its objects\.

Input:LL demonstrations with HL goals𝕋\\mathbb\{T\}\.

Output:HL policy

πhl\\pi^\{\\mathrm\{hl\}\}\.

//Step 1: Construct HL Traces

1

Thl←𝑐𝑜𝑛𝑠𝑡𝑟𝑢𝑐𝑡𝐹𝑟𝑜𝑚\(𝕋\)T^\{\\mathrm\{hl\}\}\\leftarrow\\mathit\{constructFrom\}\(\\mathbb\{T\}\)
2

πhl←∅\\pi^\{\\mathrm\{hl\}\}\\leftarrow\\emptyset
3for*⟨ghl,a0hl,…,amhl⟩∈𝕋hl\\langle\\mathit\{g\}^\{\\mathrm\{hl\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{0\},\\ldots,\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{m\}\\rangle\\in\\mathbb\{T\}^\{\\mathrm\{hl\}\}*

4

S←\{ghl\}S\\leftarrow\\\{\\mathit\{g\}^\{\\mathrm\{hl\}\}\\\}
5

6for*j=m,…,0j=m,\\ldots,0*

7

Snext←∅S^\{\\textit\{next\}\}\\leftarrow\\emptyset
8for*shl∈S\\mathit\{s\}^\{\\mathrm\{hl\}\}\\in S*

//Step 2: Extract Rules

9

S′←𝑟𝑒𝑔𝑟\(shl,ajhl\)S^\{\\prime\}\\leftarrow\\mathit\{regr\}\(\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{j\}\)
10for*shl′∈S′\{\\mathit\{s\}^\{\\mathrm\{hl\}\}\}^\{\\prime\}\\in S^\{\\prime\}*

11

ghl′←ghl∩𝑠𝑢𝑐𝑐\(shl′,⟨ajhl′,…,amhl′⟩\)\{\\mathit\{g\}^\{\\mathrm\{hl\}\}\}^\{\\prime\}\\leftarrow\\mathit\{g\}^\{\\mathrm\{hl\}\}\\cap\\mathit\{succ\}\(\{\\mathit\{s\}^\{\\mathrm\{hl\}\}\}^\{\\prime\},\\langle\{\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{j\}\}^\{\\prime\},\\ldots,\{\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{m\}\}^\{\\prime\}\\rangle\)
//Step 3: Inductive Generalisation

12

r←𝑙𝑖𝑓𝑡\(ajhl,shl′,ghl\)r\\leftarrow\\mathit\{lift\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{j\},\{\\mathit\{s\}^\{\\mathrm\{hl\}\}\}^\{\\prime\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)
13

πhl←πhl∪\{⟨r,m−j⟩\}\\pi^\{\\mathrm\{hl\}\}\\leftarrow\\pi^\{\\mathrm\{hl\}\}\\cup\\\{\\langle r,m\-j\\rangle\\\}
14

15

Snext←Snext∪S′S^\{\\textit\{next\}\}\\leftarrow S^\{\\textit\{next\}\}\\cup S^\{\\prime\}
16

17

S←SnextS\\leftarrow S^\{\\textit\{next\}\}
18

19return

πhl\\pi^\{\\mathrm\{hl\}\}

Algorithm 1HL policy learning
#### Putting the steps together

[Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)describes the procedure of learning an HL policyπhl\\pi^\{\\mathrm\{hl\}\}from LL demonstrations𝕋\\mathbb\{T\}via goal regression\. It begins by initialising an empty policy \([Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)\) and then iterates over the HL trajectories inThlT^\{\\mathrm\{hl\}\}\([Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)\) constructed from the LL demonstrations \([Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)\)\. For each trajectory, it initialisesSSto be the singleton set containing the goal to be regressed \([Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)\)\. Then it iterates over the HL actions in reverse order \([Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)\), and at each step regresses the current goals inSS\([Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)\) over the current HL actionajhl\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{j\}to obtain the next set of subgoalsS′S^\{\\prime\}\([Algorithms˜1](https://arxiv.org/html/2605.15975#algorithm1),[1](https://arxiv.org/html/2605.15975#algorithm1),[1](https://arxiv.org/html/2605.15975#algorithm1)and[1](https://arxiv.org/html/2605.15975#algorithm1)\)\. The actions and regressed goals in each stepjjare lifted into first\-order rulesrr\([Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)\) alongsideghl′\{\\mathit\{g\}^\{\\mathrm\{hl\}\}\}^\{\\prime\}, the subset ofghl\\mathit\{g\}^\{\\mathrm\{hl\}\}that is achieved by the suffix of determinised actions from the trajectory \([Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)\)\. The rules are added to the policyπhl\\pi^\{\\mathrm\{hl\}\}with prioritym−jm\-j\(related to goal proximity\) \([Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)\) and the resulting policyπhl\\pi^\{\\mathrm\{hl\}\}is returned in[Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)\.

#### Generalisation to arbitrarily many objects

The following theorem establishes the conditions under which[Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)returns policies that can generalise to problems with arbitrarily many objects and hence arbitrarily long horizons\. The main conditions are that the problems \(a\) exhibit goals whose components can be achieved independently in any order, as is the case for a large proportion of real\-world problems\(Simon,[1956](https://arxiv.org/html/2605.15975#bib.bib185); Korf,[1987](https://arxiv.org/html/2605.15975#bib.bib184)\), and \(b\) the goal components can be achieved with bounded size HL policies\. These conditions are formally defined in[Appendix˜B](https://arxiv.org/html/2605.15975#A2)alongside the proof of the theorem\. Under these conditions, the proof follows by representing an infinite setSSof arbitrarily large HL problems as a finite collection of subproblems equivalent up to object renaming\. Each subproblem class exhibits a finite policy and we can compose all such policies into a single, finite policy that can solve every problem inSS\.

###### Theorem 1\.

Let𝒟=⟨𝒫,𝒜⟩\\mathcal\{D\}=\\langle\\mathcal\{P\},\\mathcal\{A\}\\ranglebe an HL domain,ℒ\\mathcal\{L\}a labelling function, andC∈ℕC\\in\\mathbb\{N\}\. There exists a finite dataset𝕋\\mathbb\{T\}such that the HL policy learned from𝕋\\mathbb\{T\}via[Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)solves any HL planning problem𝐏hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}conforming to𝒟\\mathcal\{D\}and satisfyingCC\-bounded goal independence\.

### 4\.2Learning LL Policies via Graph Neural Networks and Imitation Learning

We next realise our LL policiesπll\(𝐚ll∣𝐬ll,ahl,ghl\)\\pi^\{\\mathrm\{ll\}\}\(\\mathbf\{a\}^\{\\mathrm\{ll\}\}\\mid\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)with a graph neural network \(GNN\) architecture\.

#### LL policy representation

We describe how to construct a GNN model from𝐬ll\\mathbf\{s\}^\{\\mathrm\{ll\}\},ahl\\mathit\{a\}^\{\\mathrm\{hl\}\}andghl\\mathit\{g\}^\{\\mathrm\{hl\}\}, and a domain𝒟=⟨𝒫,𝒜⟩\\mathcal\{D\}=\\langle\\mathcal\{P\},\\mathcal\{A\}\\ranglethat generates an embedding for predicting the LL action𝐚ll\\mathbf\{a\}^\{\\mathrm\{ll\}\}\. Given vectors𝐡1∈ℝn,𝐡2∈ℝm\\mathbf\{h\}\_\{1\}\\in\\mathbb\{R\}^\{n\},\\mathbf\{h\}\_\{2\}\\in\\mathbb\{R\}^\{m\}, we denote their concatenation as𝐡1∥𝐡2∈ℝn\+m\\mathbf\{h\}\_\{1\}\\operatorname\*\{\{\\\|\}\}\\mathbf\{h\}\_\{2\}\\in\\mathbb\{R\}^\{n\+m\}\. Giveni,n∈Ni,n\\in N, we denote the one\-hot encoding𝐞in∈ℝn\\mathbf\{e\}\_\{i\}^\{n\}\\in\\mathbb\{R\}^\{n\}with all 0’s except 1 in theii\-th element\. We assume an enumeration of predicates and schemata, and letMMbe the maximum arity of the schemata \(i\.e\.,maxa∈𝒜⁡\|𝑣𝑎𝑟\(a\)\|\\max\_\{a\\in\\mathcal\{A\}\}\\left\|\\mathit\{var\}\(a\)\\right\|\)\.

\(a\)Initial Embeddings\(b\)Message Passing\(c\)Readout𝐡pick\(0\)\\mathbf\{h\}\_\{\\textit\{pick\}\}^\{\(0\)\}𝐡obj\(0\)\\mathbf\{h\}\_\{\\textit\{obj\}\}^\{\(0\)\}𝐡loc\(0\)\\mathbf\{h\}\_\{\\textit\{loc\}\}^\{\(0\)\}𝐡global\(0\)\\mathbf\{h\}\_\{\\textit\{global\}\}^\{\(0\)\}𝐡pick\\mathbf\{h\}^\{\\textit\{pick\}\}𝐡obj\\mathbf\{h\}^\{\\textit\{obj\}\}𝐡loc\\mathbf\{h\}^\{\\textit\{loc\}\}𝐡global\\mathbf\{h\}\_\{\\textit\{global\}\}𝐬ll\\mathbf\{s\}^\{\\mathrm\{ll\}\}ahl\\mathit\{a\}^\{\\mathrm\{hl\}\}ghl\\mathit\{g\}^\{\\mathrm\{hl\}\}…\\ldots𝐡pick\(l\)\\mathbf\{h\}\_\{\\textit\{pick\}\}^\{\(l\)\}𝐡obj\(l\)\\mathbf\{h\}\_\{\\textit\{obj\}\}^\{\(l\)\}𝐡loc\(l\)\\mathbf\{h\}\_\{\\textit\{loc\}\}^\{\(l\)\}𝐡global\(l\)\\mathbf\{h\}\_\{\\textit\{global\}\}^\{\(l\)\}𝐡pick\(l\+1\)\\mathbf\{h\}\_\{\\textit\{pick\}\}^\{\(l\+1\)\}𝐡obj\(l\+1\)\\mathbf\{h\}\_\{\\textit\{obj\}\}^\{\(l\+1\)\}𝐡loc\(l\+1\)\\mathbf\{h\}\_\{\\textit\{loc\}\}^\{\(l\+1\)\}𝐡global\(l\+1\)\\mathbf\{h\}\_\{\\textit\{global\}\}^\{\(l\+1\)\}…\\ldots𝐡pick\(L\)\\mathbf\{h\}\_\{\\textit\{pick\}\}^\{\(L\)\}𝐡obj\(L\)\\mathbf\{h\}\_\{\\textit\{obj\}\}^\{\(L\)\}𝐡loc\(L\)\\mathbf\{h\}\_\{\\textit\{loc\}\}^\{\(L\)\}𝐡global\(L\)\\mathbf\{h\}\_\{\\textit\{global\}\}^\{\(L\)\}𝐚ll\\mathbf\{a\}^\{\\mathrm\{ll\}\}Figure 3:The LL policyπll\(𝐚ll∣𝐬ll,ahl,ghl\)\\pi^\{\\mathrm\{ll\}\}\(\\mathbf\{a\}^\{\\mathrm\{ll\}\}\\mid\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)represented by a GNN\. In this example, the input action isahl=𝑝𝑖𝑐𝑘\(obj,loc\)\\mathit\{a\}^\{\\mathrm\{hl\}\}=\\mathit\{pick\}\(\\textit\{obj\},\\textit\{loc\}\), and the resulting output is𝐚ll\\mathbf\{a\}^\{\\mathrm\{ll\}\}\. Solid lines represent graph edges, and dashed lines represent how information is passed\.Bold fontindicates Euclidean vectors\.Letahl=a\(o1,…,on\)\\mathit\{a\}^\{\\mathrm\{hl\}\}=a\(o\_\{1\},\\ldots,o\_\{n\}\),𝐬ll=⟨𝐱e,\{𝐱o\}o∈𝐎⟩\\mathbf\{s\}^\{\\mathrm\{ll\}\}=\\langle\\mathbf\{x\}\_\{e\},\\left\\\{\\mathbf\{x\}\_\{o\}\\right\\\}\_\{o\\in\\mathbf\{O\}\}\\rangleandshl=ℒ\(𝐬ll\)\\mathit\{s\}^\{\\mathrm\{hl\}\}=\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\)\. The GNN inputs are initially encoded as a set of embeddings𝐡𝑔𝑙𝑜𝑏𝑎𝑙,𝐡a,𝐡o1,…,𝐡on\\mathbf\{h\}\_\{\\mathit\{global\}\},\\mathbf\{h\}\_\{a\},\\mathbf\{h\}\_\{o\_\{1\}\},\\ldots,\\mathbf\{h\}\_\{o\_\{n\}\}representing a global node, an action node, and object nodes, respectively, where𝐡𝑔𝑙𝑜𝑏𝑎𝑙=𝐱e∥∑p\(\)∈shl𝐞p\|𝒫\|∥∑p\(\)∈ghl𝐞p\|𝒫\|\\mathbf\{h\}\_\{\\mathit\{global\}\}=\\mathbf\{x\}\_\{e\}\\operatorname\*\{\{\\\|\}\}\\sum\_\{p\(\)\\in\\mathit\{s\}^\{\\mathrm\{hl\}\}\}\\mathbf\{e\}\_\{p\}^\{\\left\|\\mathcal\{P\}\\right\|\}\\operatorname\*\{\{\\\|\}\}\\sum\_\{p\(\)\\in\\mathit\{g\}^\{\\mathrm\{hl\}\}\}\\mathbf\{e\}\_\{p\}^\{\\left\|\\mathcal\{P\}\\right\|\}encodes the agent’s state and nullary facts that are true in the HL state and goal,𝐡a=𝐞a\|𝒜\|\\mathbf\{h\}\_\{a\}=\\mathbf\{e\}\_\{a\}^\{\\left\|\\mathcal\{A\}\\right\|\}encodes the action schema, and𝐡oi=𝐱oi∥∑p\(oi\)∈shl𝐞p\|𝒫\|∥∑p\(oi\)∈ghl𝐞p\|𝒫\|∥𝐞iM\\mathbf\{h\}\_\{o\_\{i\}\}=\\mathbf\{x\}\_\{o\_\{i\}\}\\operatorname\*\{\{\\\|\}\}\\sum\_\{p\(o\_\{i\}\)\\in\\mathit\{s\}^\{\\mathrm\{hl\}\}\}\\mathbf\{e\}\_\{p\}^\{\\left\|\\mathcal\{P\}\\right\|\}\\operatorname\*\{\{\\\|\}\}\\sum\_\{p\(o\_\{i\}\)\\in\\mathit\{g\}^\{\\mathrm\{hl\}\}\}\\mathbf\{e\}\_\{p\}^\{\\left\|\\mathcal\{P\}\\right\|\}\\operatorname\*\{\{\\\|\}\}\\mathbf\{e\}\_\{i\}^\{M\}encodes an objectoio\_\{i\}’s state, unary facts instantiated withoio\_\{i\}in the HL state and goal, and its position in the actionahl\\mathit\{a\}^\{\\mathrm\{hl\}\}\.

The GNN algorithm then consists of the following three operations, as illustrated in[Figure˜3](https://arxiv.org/html/2605.15975#S4.F3)\. The𝐖\\mathbf\{W\}symbols denote learnable weights\.

1. \(a\)The initial encodings are embedded:𝐡𝑔𝑙𝑜𝑏𝑎𝑙\(0\)=𝐖g\(0\)𝐡𝑔𝑙𝑜𝑏𝑎𝑙\\mathbf\{h\}\_\{\\mathit\{global\}\}^\{\(0\)\}=\\mathbf\{W\}\_\{g\}^\{\(0\)\}\\mathbf\{h\}\_\{\\mathit\{global\}\},𝐡a\(0\)=𝐖a\(0\)𝐡a\\mathbf\{h\}\_\{a\}^\{\(0\)\}=\\mathbf\{W\}\_\{a\}^\{\(0\)\}\\mathbf\{h\}\_\{a\},𝐡oi\(0\)=𝐖o\(0\)𝐡oi\\mathbf\{h\}\_\{o\_\{i\}\}^\{\(0\)\}=\\mathbf\{W\}\_\{o\}^\{\(0\)\}\\mathbf\{h\}\_\{o\_\{i\}\}\.
2. \(b\)Next,LLrounds of message passing are performed\. Object embeddings are aggregated via element\-wise max𝐡o\(l\)=maxi=1,…,n⁡𝐡oi\(l\)\\mathbf\{h\}\_\{o\}^\{\(l\)\}=\\max\_\{i=1,\\ldots,n\}\\mathbf\{h\}\_\{o\_\{i\}\}^\{\(l\)\}and node embeddings are updated by𝐡𝑔𝑙𝑜𝑏𝑎𝑙\(l\+1\)=σ\(𝐖g\(l\)\(𝐡𝑔𝑙𝑜𝑏𝑎𝑙\(l\)\+𝐡a\(l\)\+𝐡o\(l\)\)\)\\mathbf\{h\}\_\{\\mathit\{global\}\}^\{\(l\+1\)\}=\\sigma\(\\mathbf\{W\}\_\{g\}^\{\(l\)\}\(\\mathbf\{h\}\_\{\\mathit\{global\}\}^\{\(l\)\}\+\\mathbf\{h\}\_\{a\}^\{\(l\)\}\+\\mathbf\{h\}\_\{o\}^\{\(l\)\}\)\),𝐡a\(l\+1\)=σ\(𝐖a\(l\)\(𝐡𝑔𝑙𝑜𝑏𝑎𝑙\(l\+1\)\+𝐡a\(l\)\+𝐡o\(l\)\)\)\\mathbf\{h\}\_\{a\}^\{\(l\+1\)\}=\\sigma\(\\mathbf\{W\}\_\{a\}^\{\(l\)\}\(\\mathbf\{h\}\_\{\\mathit\{global\}\}^\{\(l\+1\)\}\+\\mathbf\{h\}\_\{a\}^\{\(l\)\}\+\\mathbf\{h\}\_\{o\}^\{\(l\)\}\)\), and𝐡oi\(l\+1\)=σ\(𝐖o\(l\)\(𝐡𝑔𝑙𝑜𝑏𝑎𝑙\(l\+1\)\+𝐡a\(l\)\+𝐡oi\(l\)\)\)\\mathbf\{h\}\_\{o\_\{i\}\}^\{\(l\+1\)\}=\\sigma\(\\mathbf\{W\}\_\{o\}^\{\(l\)\}\(\\mathbf\{h\}\_\{\\mathit\{global\}\}^\{\(l\+1\)\}\+\\mathbf\{h\}\_\{a\}^\{\(l\)\}\+\\mathbf\{h\}\_\{o\_\{i\}\}^\{\(l\)\}\)\)for nonlinearityσ\\sigma\.
3. \(c\)Finally, a feed\-forward readout layer is then applied to the final embeddings𝐡𝑔𝑙𝑜𝑏𝑎𝑙\(L\)\+𝐡a\(L\)\+𝐡o\(L\)\\mathbf\{h\}\_\{\\mathit\{global\}\}^\{\(L\)\}\+\\mathbf\{h\}\_\{a\}^\{\(L\)\}\+\\mathbf\{h\}\_\{o\}^\{\(L\)\}to predict the LL action𝐚ll\\mathbf\{a\}^\{\\mathrm\{ll\}\}\.

#### Learning method

Following the HL learning method, we can compute a single dataset consisting of tuples⟨𝐬ll,ahl,ghl,𝐚ll⟩\\langle\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\},\\mathbf\{a\}^\{\\mathrm\{ll\}\}\\ranglefrom the LL demonstrations corresponding to inputs and outputs ofπll\\pi^\{\\mathrm\{ll\}\}\. We optimise the GNN to minimise the mean squared error \(MSE\) between the predicted LL actions and the ground truth LL actions in the dataset𝕋\\mathbb\{T\}\. In place of MSE optimisation, we could alternatively learn a generative diffusion policy\(Hoet al\.,[2020](https://arxiv.org/html/2605.15975#bib.bib177); Chiet al\.,[2023](https://arxiv.org/html/2605.15975#bib.bib178)\)\. We leave exploration of this alternative to future work\.

## 5Experiments

We implement our approach from[Section˜4](https://arxiv.org/html/2605.15975#S4)in a system called BISON and conduct experiments on extended MetaWorld\(Yuet al\.,[2019](https://arxiv.org/html/2605.15975#bib.bib34); McLeanet al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib35)\)benchmarks\. We compare our approach to various VLA, end\-to\-end, and planning systems to demonstrate the efficacy of BISON for learning bilevel policies that can generalise to longer horizons and more objects than seen in its training demonstrations\.

#### Setup

We evaluate 7 imitation learning architectures and 2 VLA baselines across 8 MuJoCo robot simulation environments\. For each approach and environment, we train 3 models on 200 goal\-achieving trajectories from problems with 3 objects\. We train the GNN LL policy in[Section˜4\.2](https://arxiv.org/html/2605.15975#S4.SS2)with hyperparameters specified in[Section˜C\.1](https://arxiv.org/html/2605.15975#A3.SS1)\. Each trained policy is evaluated on problems with 1 to 10 objects from each environment with 10 episodes for each object and at most 2048nnsteps for each episode wherennis the number of objects\. As a result, we conduct up to9×8×3×10×10=21,6009\\times 8\\times 3\\times 10\\times 10=21,600episodes’ worth of evaluations\. Training is performed with an NVIDIA L40 GPU \(48GB memory\) while evaluation is performed on Intel Xeon Gold 6338 CPUs \(2\.00 GHz\)\.

Table 1:Environments and their uncertainty attributes\.UncertaintyBlocksS

BlocksN

ColourS

ColourN

FactoryS

FactoryN

GachaS

GachaN

Exogenous✓✓✓✓Endogenous✓✓✓✓State✓✓✓✓
#### Environments

We evaluate the aforementioned methods on 8 environments that compositionally extend the tasks from the MetaWorld benchmark suite\(Yuet al\.,[2019](https://arxiv.org/html/2605.15975#bib.bib34); McLeanet al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib35)\)\. There are 4 stationary environments \(BlocksS, FactoryS, ColourS, GachaS\) each with a corresponding variant \(BlocksN, FactoryN, ColourN, GachaN\) where objects may randomly teleport around throughout the episode\. Additional environment semantics and visualisations are provided in[Section˜C\.3](https://arxiv.org/html/2605.15975#A3.SS3)\. Each environment induces a problem distribution with different initial states and goals, parameterised by the number of objects\.[Table˜1](https://arxiv.org/html/2605.15975#S5.T1)summarises the environments and their uncertainty attributes:exogenoustransition uncertainty that is not modelled in the HL abstraction,endogenoustransition uncertainty that is modelled in the HL abstraction, andstateuncertainty and partial observability of object attributes and locations\.

#### Methods

We evaluate and compare against the following methods to demonstrate the robustness of our approach to various forms of uncertainty\. We introduce a mix of baselines that correspond to end\-to\-end, planning, and replanning approaches, with more details in[Section˜C\.2](https://arxiv.org/html/2605.15975#A3.SS2)\.

- •Oracle: a hand\-coded policy that serves as the expert and upper bound for the experiments
- •SmolVLA: the SmolVLA open\-source 0\.45B parameter VLA model\(Shukoret al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib127)\)which was shown to outperform state\-of\-the\-art VLAs with billions of parameters\.
- •SmolVLAMW\{\}^\{\\textrm\{MW\}\}: SmolVLA fine\-tuned on 2,500 episodes of the original MetaWorld benchmark suite
- •PureNN: a GNN policy operating over LL observations with no access to HL facts and actions
- •PddlNN: a GNN policy operating over LL observations and HL facts but no access to HL actions
- •DetPlan: the GNN policy from[Section˜4\.2](https://arxiv.org/html/2605.15975#S4.SS2)with a planner\(Helmert,[2006](https://arxiv.org/html/2605.15975#bib.bib115)\)in place of an HL policy
- •NdtPlan: the GNN policy from[Section˜4\.2](https://arxiv.org/html/2605.15975#S4.SS2)with a nondeterministic planner\(Muiseet al\.,[2024](https://arxiv.org/html/2605.15975#bib.bib114)\)for generating a problem\-specific HL policy
- •DetReplan: same as DetPlan but replan \(recompute a new plan\) if no HL action is returned due to uncertainty\(Pettersson,[2005](https://arxiv.org/html/2605.15975#bib.bib106); Yoonet al\.,[2007](https://arxiv.org/html/2605.15975#bib.bib144); Littleet al\.,[2007](https://arxiv.org/html/2605.15975#bib.bib103); Fritz and McIlraith,[2007](https://arxiv.org/html/2605.15975#bib.bib104); Curtiset al\.,[2022a](https://arxiv.org/html/2605.15975#bib.bib102),[2024](https://arxiv.org/html/2605.15975#bib.bib143)\)
- •NdtReplan: same as NdtPlan but replan \(compute a new policy\) if no HL action is returned
- •BISON\(new\): the bilevel policy for bilevel planning approach using the problem\-agnostic HL policy from[Section˜4\.1](https://arxiv.org/html/2605.15975#S4.SS1)and the GNN LL policy from[Section˜4\.2](https://arxiv.org/html/2605.15975#S4.SS2)

#### \(Q1\) How do VLAs perform on the benchmarks?

From[Table˜2](https://arxiv.org/html/2605.15975#S5.T2)we observe that pretrained VLA models perform poorly on the benchmarks and fail to solve any task\. We observed that this is consistent with existing evaluations of VLAs for the original MetaWorld benchmarks\(Shukoret al\.,[2025](https://arxiv.org/html/2605.15975#bib.bib127)\), namely how SmolVLA and other VLA approaches achieve at most 45% success rate onhardMetaWorld problems\(Seoet al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib128)\), which includes then=1n=1case of Blocks\. Furthermore, we observed that SmolVLAMW\{\}^\{\\textrm\{MW\}\}can only solve Blocks forn=1n=1when the block and goal locations are in the same area as in the training set, but not outside\. However, we note that VLAs are operating on image and text inputs of the environment, while all other methods assumed processed representations of the inputs\.

#### \(Q2\) Does BISON generalise to longer horizons than seen in training?

From[Figure˜4](https://arxiv.org/html/2605.15975#S5.F4), we observe that BISON overall is able to generalise to environments beyond the number of objects that it is trained on, and often generalises better than the end\-to\-end GNN baselines PureNN and PddlNN whose performance often drops on environments beyond 6 objects\. This can be attributed to the expressivity limits of GNNs\(Morriset al\.,[2019](https://arxiv.org/html/2605.15975#bib.bib122); Xuet al\.,[2019b](https://arxiv.org/html/2605.15975#bib.bib123)\)that prohibit them from extrapolating to longer\-horizon HL planning problems as empirically and theoretically shown in previous works\(Ståhlberget al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib59); Silveret al\.,[2022](https://arxiv.org/html/2605.15975#bib.bib11),[2023](https://arxiv.org/html/2605.15975#bib.bib9)\)\. The best performing seeds often match the upper bound performance of the oracle\. From[Table˜2](https://arxiv.org/html/2605.15975#S5.T2), BISON matches or outperforms all other baselines with two exceptions\. Failures often occur in the covariate shift in the LL execution of HL actions, which suggests that further improvements can be gained from additional post\-training, e\.g\., via RL or DAgger\(Rosset al\.,[2011](https://arxiv.org/html/2605.15975#bib.bib124)\)\.

Table 2:Average success rate \(↑\\uparrow\) and standard deviation of various methods across number of objects and environments\. The top 2 scores per column are highlighted\.TypeMethodBlocksSBlocksNFactorySFactoryNColourSColourNGachaSGachaNTotalVLASmolVLA0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}SmolVLAMW\{\}^\{\\textrm\{MW\}\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}GNNPureNN0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}PddlNN0\.33±0\.40\.33\_\{\\pm 0\.4\}0\.20±0\.30\.20\_\{\\pm 0\.3\}0\.18±0\.30\.18\_\{\\pm 0\.3\}0\.07±0\.20\.07\_\{\\pm 0\.2\}0\.12±0\.20\.12\_\{\\pm 0\.2\}0\.36±0\.30\.36\_\{\\pm 0\.3\}0\.51±0\.30\.51\_\{\\pm 0\.3\}0\.60±0\.40\.60\_\{\\pm 0\.4\}0\.30±0\.20\.30\_\{\\pm 0\.2\}PlanningDetPlan0\.90±0\.10\.90\_\{\\pm 0\.1\}0\.34±0\.30\.34\_\{\\pm 0\.3\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.15±0\.30\.15\_\{\\pm 0\.3\}NdtPlan0\.99±0\.0\\mathbf\{0\.99\}\_\{\\pm 0\.0\}0\.39±0\.40\.39\_\{\\pm 0\.4\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.61±0\.2\\mathbf\{0\.61\}\_\{\\pm 0\.2\}0\.01±0\.00\.01\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.25±0\.40\.25\_\{\\pm 0\.4\}DetReplan0\.98±0\.00\.98\_\{\\pm 0\.0\}0\.97±0\.1\\mathbf\{0\.97\}\_\{\\pm 0\.1\}0\.94±0\.10\.94\_\{\\pm 0\.1\}0\.59±0\.30\.59\_\{\\pm 0\.3\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.44±0\.40\.44\_\{\\pm 0\.4\}NdtReplan0\.99±0\.0\\mathbf\{0\.99\}\_\{\\pm 0\.0\}0\.97±0\.0\\mathbf\{0\.97\}\_\{\\pm 0\.0\}0\.94±0\.10\.94\_\{\\pm 0\.1\}0\.51±0\.30\.51\_\{\\pm 0\.3\}0\.26±0\.30\.26\_\{\\pm 0\.3\}0\.17±0\.30\.17\_\{\\pm 0\.3\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.00±0\.00\.00\_\{\\pm 0\.0\}0\.48±0\.40\.48\_\{\\pm 0\.4\}\(new\)BISON0\.99±0\.0\\mathbf\{0\.99\}\_\{\\pm 0\.0\}0\.95±0\.10\.95\_\{\\pm 0\.1\}0\.97±0\.0\\mathbf\{0\.97\}\_\{\\pm 0\.0\}0\.62±0\.3\\mathbf\{0\.62\}\_\{\\pm 0\.3\}0\.54±0\.20\.54\_\{\\pm 0\.2\}0\.99±0\.0\\mathbf\{0\.99\}\_\{\\pm 0\.0\}0\.65±0\.3\\mathbf\{0\.65\}\_\{\\pm 0\.3\}0\.64±0\.3\\mathbf\{0\.64\}\_\{\\pm 0\.3\}0\.79±0\.2\\mathbf\{0\.79\}\_\{\\pm 0\.2\}![Refer to caption](https://arxiv.org/html/2605.15975v1/x1.png)Figure 4:Median \(line\) and range \(shaded\) of success rate \(↑\\uparrow\) of methods across number of objects and environments\. Results for VLA baselines are omitted as they do not complete any tasks\.
#### \(Q3\) Is BISON robust to uncertainty and open\-world environments?

As we have seen in the previous question and in[Table˜2](https://arxiv.org/html/2605.15975#S5.T2), BISON generally performs best overall, with the exception of two environments where variance in LL policy execution causes failures\. Next, we observe that the next best performing approach is NdtReplan which computes a new HL policyπhl\\pi^\{\\mathrm\{hl\}\}per environment episode to be used in the bilevel policy execution framework, and recomputesπhl\\pi^\{\\mathrm\{hl\}\}if it fails to return an action\. However, we observe that even with replanning, NdtReplan is less robust to the more complex Colour environments, and furthermore cannot support the open\-world Gacha environment\.

![Refer to caption](https://arxiv.org/html/2605.15975v1/x2.png)Figure 5:Success rate \(↑\\uparrow\) vs\. time \(↓\\downarrow\) for training and inference with 10 objects\.
#### \(Q4\) Is BISON more efficient than \(re\)planning and end\-to\-end approaches?

From[Figure˜5](https://arxiv.org/html/2605.15975#S5.F5), we observe that the end\-to\-end PureNN and PddlNN approaches take longer to train and execute as they encode more information in the LL GNN policy architecture than the one from[Section˜4\.2](https://arxiv.org/html/2605.15975#S4.SS2): PureNN encodes all LL information and PddlNN encodes all LL and HL information, while BISON’s LL policy only encodes the HL action and corresponding object information\. Single\-shot planners \(DetPlan, NdtPlan\) are comparable to BISON as planning costs are amortised over the episode, but achieve poor performance\. Replanning methods \(DetReplan, NdtReplan\) take at least as long as BISON due to replanning upon encountering unexpected uncertainty\.

![Refer to caption](https://arxiv.org/html/2605.15975v1/x3.png)Figure 6:HL solving time \(↓\\downarrow\) vs\. number of objects\.
#### \(Q5\) Do learned HL policies generalise to arbitrary numbers of objects?

We answer this question by inspecting the learned policies and evaluating their performance on HL planning problems over numbers of objects\. Learned HL policies in BISON are symbolic and hence can be interpreted manually or by an LLM to generalise over arbitrary numbers of objects, as displayed in[Section˜C\.4](https://arxiv.org/html/2605.15975#A3.SS4)\. We also perform experiments on corresponding PDDL BlocksSplanning problems ranging from 3 to 10,000 blocks\. We compare BISON’s solving time to a state\-of\-the\-art PDDL planner, LAMA\(Richter and Westphal,[2010](https://arxiv.org/html/2605.15975#bib.bib176)\), with a 100s timeout\. We observe from[Figure˜6](https://arxiv.org/html/2605.15975#S5.F6)that BISON solves problems with 10,000 blocks in a few seconds while LAMA struggles with a few hundred objects\.

## 6Discussion, Limitations, and Conclusion

We introduced BISON, a system to learn bilevel policies\(πhl,πll\)\(\\pi^\{\\mathrm\{hl\}\},\\pi^\{\\mathrm\{ll\}\}\)from LL demonstrations paired with HL goals that generalise to long\-horizon problems\. We learnπhl\\pi^\{\\mathrm\{hl\}\}policies from abstracted symbolic traces of the LL demonstrations using goal regression and inductive generalisation to produce first\-order, condition\-action rules that can generalise to arbitrary numbers of objects, and realise ourπll\\pi^\{\\mathrm\{ll\}\}policies via graph neural networks with fewer than 33k parameters for executing HL actions returned byπhl\\pi^\{\\mathrm\{hl\}\}in the LL action space\. The novelty of our approach is at least two\-fold: \(1\) in contrast to existing bilevel planning approaches that search for HL plans over symbolic abstractions before refining them in the LL environment, BISON inductively generalises HL policies from LL demonstrations without the need for search, and \(2\) the encoding of our policies as first\-order, condition\-action rules is novel in the context of bilevel planning and facilitates open\-world planning and generalisation\. Through this novel insight and approach, we can learn compact policies that demonstrate strong generalisation capabilities and efficiency compared to state\-of\-the\-art VLA, end\-to\-end, and planning baselines on extended MetaWorld benchmarks\.

#### Limitations

Our approach inherits the limitations of bilevel planning, in particular, the assumption of a learned or given HL abstraction that provides useful guidance for LL planning \(there is much research on how to find good abstractions as discussed in[Section˜3](https://arxiv.org/html/2605.15975#S3)\)\. Another limitation of our approach is that we have no guarantees of optimality and generalisation of the LL policies\.

## Acknowledgements

We gratefully acknowledge funding from the Natural Sciences and Engineering Research Council of Canada \(NSERC\) and the Canada CIFAR AI Chairs Program\. Till Hofmann was funded by the Federal Ministry of Education and Research \(BMBF\) and the Ministry of Culture and Science of the German State of North Rhine\-Westphalia \(MKW\) under the Excellence Strategy of the Federal Government and the Länder\. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute\. We thank Frieda Rong and William Shen for helpful discussions\.

## References

- State abstractions for lifelong reinforcement learning\.InICML,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- D\. Abel, D\. E\. Hershkowitz, and M\. L\. Littman \(2016\)Near optimal behavior via approximate state abstraction\.InICML,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Ahmed, J\. B\. Tenenbaum, C\. Bates, and S\. J\. Gershman \(2025\)Synthesizing world models for bilevel planning\.Trans\. Mach\. Learn\. Res\.2025\.External Links:[Link](https://openreview.net/forum?id=m9V4JHLJrD)Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- M\. Ahn, A\. Brohan, N\. Brown, Y\. Chebotar, O\. Cortes, B\. David, C\. Finn, K\. Gopalakrishnan, K\. Hausman, A\. Herzog, D\. Ho, J\. Hsu, J\. Ibarz, B\. Ichter, A\. Irpan, E\. Jang, R\. J\. Ruano, K\. Jeffrey, S\. Jesmonth, N\. J\. Joshi, R\. Julian, D\. Kalashnikov, Y\. Kuang, K\. Lee, S\. Levine, Y\. Lu, L\. Luu, C\. Parada, P\. Pastor, J\. Quiambao, K\. Rao, J\. Rettinghouse, D\. Reyes, P\. Sermanet, N\. Sievers, C\. Tan, A\. Toshev, V\. Vanhoucke, F\. Xia, T\. Xiao, P\. Xu, S\. Xu, and M\. Yan \(2022\)Do as I can, not as I say: grounding language in robotic affordances\.InCoRL,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- A\. Anand, E\. Racah, S\. Ozair, Y\. Bengio, M\. Côté, and R\. D\. Hjelm \(2019\)Unsupervised state representation learning in Atari\.InNeurIPS,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px3.p1.6)\.
- M\. Asai and A\. Fukunaga \(2018\)Classical planning in deep latent space: bridging the subsymbolic\-symbolic boundary\.InAAAI,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- M\. Asai, H\. Kajino, A\. Fukunaga, and C\. Muise \(2022\)Classical planning in deep latent space\.J\. Artif\. Intell\. Res\.74,pp\. 1599–1686\.External Links:[Document](https://dx.doi.org/10.1613/JAIR.1.13768),[Link](https://doi.org/10.1613/jair.1.13768)Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- A\. Athalye, N\. Kumar, T\. Silver, Y\. Liang, J\. Wang, T\. Lozano\-Pérez, and L\. P\. Kaelbling \(2026\)From pixels to predicates: learning symbolic world models via pretrained VLMs\.IEEE Robotics and Automation Letters11\(4\),pp\. 4002–4009\.External Links:[Document](https://dx.doi.org/10.1109/LRA.2026.3662533)Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- F\. Bacchus and Q\. Yang \(1994\)Downward refinement and the efficiency of hierarchical problem solving\.Artif\. Intell\.71\(1\),pp\. 43–100\.External Links:[Document](https://dx.doi.org/10.1016/0004-3702%2894%2990062-0),[Link](https://doi.org/10.1016/0004-3702(94)90062-0)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px2.p1.4)\.
- P\. Bercher, R\. Alford, and D\. Höller \(2019\)A survey on hierarchical planning \- one abstract idea, many concrete realizations\.InIJCAI,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Bonassi, G\. D\. Giacomo, M\. Favorito, F\. Fuggitti, A\. E\. Gerevini, and E\. Scala \(2025\)Planning for temporally extended goals in pure\-past linear temporal logic\.Artif\. Intell\.348,pp\. 104409\.External Links:[Document](https://dx.doi.org/10.1016/J.ARTINT.2025.104409),[Link](https://doi.org/10.1016/j.artint.2025.104409)Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- S\. A\. Bouhsain, R\. Alami, and T\. Siméon \(2023\)Learning to predict action feasibility for task and motion planning in 3d environments\.InICRA,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- S\. A\. Bouhsain, R\. Alami, and T\. Siméon \(2025\)Learning geometric reasoning networks for robot task and motion planning\.\.InICLR,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Boutilier and R\. Dearden \(1994\)Using abstractions for decision\-theoretic planning with time constraints\.InAAAI,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- A\. K\. Bozkurt, Y\. Wang, M\. M\. Zavlanos, and M\. Pajic \(2020\)Control synthesis from linear temporal logic specifications using model\-free reinforcement learning\.InICRA,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- C\. P\. Burgess, L\. Matthey, N\. Watters, R\. Kabra, I\. Higgins, M\. M\. Botvinick, and A\. Lerchner \(2019\)MONet: unsupervised scene decomposition and representation\.CoRRabs/1901\.11390\.External Links:1901\.11390,[Link](http://arxiv.org/abs/1901.11390)Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px3.p1.6)\.
- A\. Camacho, R\. Toro Icarte, T\. Q\. Klassen, R\. A\. Valenzano, and S\. A\. McIlraith \(2019\)LTL and beyond: formal languages for reward function specification in reinforcement learning\.InIJCAI,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Chi, S\. Feng, Y\. Du, Z\. Xu, E\. Cousineau, B\. Burchfiel, and S\. Song \(2023\)Diffusion policy: visuomotor policy learning via action diffusion\.InRSS,Cited by:[§4\.2](https://arxiv.org/html/2605.15975#S4.SS2.SSS0.Px2.p1.3)\.
- R\. Chitnis, D\. Hadfield\-Menell, A\. Gupta, S\. Srivastava, E\. Groshev, C\. Lin, and P\. Abbeel \(2016\)Guided search for task and motion plans using learned heuristics\.InICRA,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- R\. Chitnis, T\. Silver, J\. B\. Tenenbaum, T\. Lozano\-Pérez, and L\. P\. Kaelbling \(2022\)Learning neuro\-symbolic relational transition models for bilevel planning\.InIROS,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- B\. Cieslar, L\. P\. Kaelbling, T\. Lozano\-Pérez, and J\. Mendez\-Mendez \(2024\)Learning long\-horizon action dependencies in sampling\-based bilevel planning\.InCoRL,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- A\. B\. Corrêa, F\. Pommerening, M\. Helmert, and G\. Francès \(2020\)Lifted successor generation using query optimization techniques\.InICAPS,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- A\. Curtis, X\. Fang, L\. P\. Kaelbling, T\. Lozano\-Pérez, and C\. R\. Garrett \(2022a\)Long\-horizon manipulation of unknown objects via task and motion planning with estimated affordances\.InICRA,Cited by:[8th item](https://arxiv.org/html/2605.15975#S5.I1.i8.p1.1)\.
- A\. Curtis, G\. Matheos, N\. Gothoskar, V\. Mansinghka, J\. B\. Tenenbaum, T\. Lozano\-Pérez, and L\. P\. Kaelbling \(2024\)Partially observable task and motion planning with uncertainty and risk awareness\.InRSS,Cited by:[8th item](https://arxiv.org/html/2605.15975#S5.I1.i8.p1.1)\.
- A\. Curtis, T\. Silver, J\. B\. Tenenbaum, T\. Lozano\-Pérez, and L\. P\. Kaelbling \(2022b\)Discovering state and action abstractions for generalized task and motion planning\.InAAAI,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- N\. T\. Dantam, Z\. K\. Kingston, S\. Chaudhuri, and L\. E\. Kavraki \(2018\)An incremental constraint\-based framework for task and motion planning\.Int\. J\. Robotics Res\.37\(10\)\.External Links:[Document](https://dx.doi.org/10.1177/0278364918761570),[Link](https://doi.org/10.1177/0278364918761570)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- P\. Dayan and G\. E\. Hinton \(1992\)Feudal reinforcement learning\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- T\. L\. Dean and R\. Givan \(1997\)Model minimization in Markov decision processes\.InAAAI,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- T\. G\. Dietterich \(2000\)Hierarchical reinforcement learning with the MAXQ value function decomposition\.J\. Artif\. Intell\. Res\.13,pp\. 227–303\.External Links:[Document](https://dx.doi.org/10.1613/JAIR.639),[Link](https://doi.org/10.1613/jair.639)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- D\. Driess, J\. Ha, and M\. Toussaint \(2020a\)Deep visual reasoning: learning to predict action sequences for task and motion planning from an initial scene image\.InRSS,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Driess, O\. S\. Oguz, J\. Ha, and M\. Toussaint \(2020b\)Deep visual heuristics: learning feasibility of mixed\-integer programs for manipulation planning\.InICRA,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Driess, F\. Xia, M\. S\. M\. Sajjadi, C\. Lynch, A\. Chowdhery, B\. Ichter, A\. Wahid, J\. Tompson, Q\. Vuong, T\. Yu, W\. Huang, Y\. Chebotar, P\. Sermanet, D\. Duckworth, S\. Levine, V\. Vanhoucke, K\. Hausman, M\. Toussaint, K\. Greff, A\. Zeng, I\. Mordatch, and P\. Florence \(2023\)PaLM\-e: an embodied multimodal language model\.InICML,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- Q\. Du, B\. Li, Y\. Du, S\. Su, T\. Fu, Z\. Zhan, Z\. Zhao, and C\. Wang \(2026\)Fast task planning with neuro\-symbolic relaxation\.IEEE Robotics Autom\. Lett\.11\(3\),pp\. 3684–3691\.External Links:[Document](https://dx.doi.org/10.1109/LRA.2026.3662556),[Link](https://doi.org/10.1109/LRA.2026.3662556)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Dziri, X\. Lu, M\. Sclar, X\. L\. Li, L\. Jiang, B\. Y\. Lin, S\. Welleck, P\. West, C\. Bhagavatula, R\. L\. Bras, J\. D\. Hwang, S\. Sanyal, X\. Ren, A\. Ettinger, Z\. Harchaoui, and Y\. Choi \(2023\)Faith and fate: limits of transformers on compositionality\.InNeurIPS,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- K\. Erol, J\. A\. Hendler, and D\. S\. Nau \(1996\)Complexity results for HTN planning\.Ann\. Math\. Artif\. Intell\.18\(1\),pp\. 69–93\.External Links:[Document](https://dx.doi.org/10.1007/BF02136175),[Link](https://doi.org/10.1007/BF02136175)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Fikes, P\. E\. Hart, and N\. J\. Nilsson \(1972\)Learning and executing generalized robot plans\.Artif\. Intell\.3\(1\-3\),pp\. 251–288\.Cited by:[§4\.1](https://arxiv.org/html/2605.15975#S4.SS1.SSS0.Px3.p1.5)\.
- R\. Fikes and N\. J\. Nilsson \(1971\)STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving\.Artif\. Intell\.2\(3/4\),pp\. 189–208\.Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px4.p1.4),[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px4.p2.11)\.
- C\. Fritz and S\. A\. McIlraith \(2007\)Monitoring plan optimality during execution\.InICAPS,Cited by:[§4\.1](https://arxiv.org/html/2605.15975#S4.SS1.SSS0.Px3.p1.5),[8th item](https://arxiv.org/html/2605.15975#S5.I1.i8.p1.1)\.
- C\. R\. Garrett, R\. Chitnis, R\. M\. Holladay, B\. Kim, T\. Silver, L\. P\. Kaelbling, and T\. Lozano\-Pérez \(2021\)Integrated task and motion planning\.Annu\. Rev\. Control\. Robotics Auton\. Syst\.4,pp\. 265–293\.External Links:[Document](https://dx.doi.org/10.1146/ANNUREV-CONTROL-091420-084139),[Link](https://doi.org/10.1146/annurev-control-091420-084139)Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1),[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2605.15975#S3.p1.2)\.
- H\. Geffner and B\. Bonet \(2013\)A concise introduction to models and methods for automated planning\.Synthesis Lectures on Artificial Intelligence and Machine Learning,Morgan & Claypool Publishers\.External Links:ISBN 9781608459698,[Document](https://dx.doi.org/10.2200/S00513ED1V01Y201306AIM022),[Link](https://doi.org/10.2200/S00513ED1V01Y201306AIM022)Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px4.p1.4)\.
- M\. Ghallab, D\. S\. Nau, and P\. Traverso \(2004\)Automated planning \- theory and practice\.Elsevier\.External Links:ISBN 978\-1\-55860\-856\-6Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px4.p1.4)\.
- J\. Gilmer, S\. S\. Schoenholz, P\. F\. Riley, O\. Vinyals, and G\. E\. Dahl \(2017\)Neural message passing for quantum chemistry\.InICML,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p4.6)\.
- F\. Giunchiglia and T\. Walsh \(1992\)A theory of abstraction\.Artif\. Intell\.57,pp\. 323–389\.External Links:[Link](https://api.semanticscholar.org/CorpusID:18410341)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- C\. C\. Green \(1969\)Application of theorem proving to problem solving\.InIJCAI,D\. E\. Walker and L\. M\. Norton \(Eds\.\),pp\. 219–240\.External Links:[Link](http://ijcai.org/Proceedings/69/Papers/023.pdf)Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px4.p2.11)\.
- C\. Gretton and S\. Thiébaux \(2004\)Exploiting first\-order regression in inductive policy selection\.InUAI,Cited by:[§4\.1](https://arxiv.org/html/2605.15975#S4.SS1.SSS0.Px3.p1.6)\.
- M\. Han, Y\. Zhu, S\. Zhu, Y\. N\. Wu, and Y\. Zhu \(2024\)INTERPRET: interactive predicate learning from language feedback for generalizable task planning\.InRSS,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- D\. Haramati, T\. Daniel, and A\. Tamar \(2024\)Entity\-centric reinforcement learning for object manipulation from pixels\.InICLR,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px3.p1.6)\.
- P\. Haslum, N\. Lipovetzky, D\. Magazzeni, and C\. Muise \(2019\)An Introduction to the Planning Domain Definition Language\.Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px4.p1.4)\.
- M\. Hauskrecht, N\. Meuleau, L\. P\. Kaelbling, T\. L\. Dean, and C\. Boutilier \(1998\)Hierarchical solution of Markov decision processes using macro\-actions\.InUAI,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Helmert \(2006\)The fast downward planning system\.J\. Artif\. Intell\. Res\.26,pp\. 191–246\.External Links:[Document](https://dx.doi.org/10.1613/JAIR.1705),[Link](https://doi.org/10.1613/jair.1705)Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1),[6th item](https://arxiv.org/html/2605.15975#S5.I1.i6.p1.1)\.
- J\. Ho, A\. Jain, and P\. Abbeel \(2020\)Denoising diffusion probabilistic models\.InNeurIPS,Cited by:[§4\.2](https://arxiv.org/html/2605.15975#S4.SS2.SSS0.Px2.p1.3)\.
- J\. Hoffmann, S\. Borgeaud, A\. Mensch, E\. Buchatskaya, T\. Cai, E\. Rutherford, D\. Casas, L\. A\. Hendricks, J\. Welbl, A\. Clark,et al\.\(2022\)Training compute\-optimal large language models\.InNeurIPS,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- Y\. Hu, F\. Lin, T\. Zhang, L\. Yi, and Y\. Gao \(2023\)Look before you leap: unveiling the power of GPT\-4V in robotic vision\-language planning\.CoRRabs/2311\.17842\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2311.17842),2311\.17842,[Link](https://doi.org/10.48550/arXiv.2311.17842)Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- J\. Huang, A\. Tao, R\. Marco, M\. Bogdanovic, J\. Kelly, and F\. Shkurti \(2025\)Automated planning domain inference for task and motion planning\.InICRA,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- W\. Huang, F\. Xia, T\. Xiao, H\. Chan, J\. Liang, P\. Florence, A\. Zeng, J\. Tompson, I\. Mordatch, Y\. Chebotar, P\. Sermanet, T\. Jackson, N\. Brown, L\. Luu, S\. Levine, K\. Hausman, and B\. Ichter \(2022\)Inner monologue: embodied reasoning through planning with language models\.InCoRL,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- L\. Illanes, X\. Yan, R\. Toro Icarte, and S\. A\. McIlraith \(2020\)Symbolic plans as high\-level instructions for reinforcement learning\.InICAPS,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Intelligence, B\. Ai, A\. Amin, R\. Aniceto, A\. Balakrishna, G\. Balke, K\. Black, G\. Bokinsky, S\. Cao, T\. Charbonnier, V\. Choudhary, F\. Collins, K\. Conley, G\. Connors, J\. Darpinian, K\. Dhabalia, M\. Dhaka, J\. DiCarlo, D\. Driess, M\. Equi, A\. Esmail, Y\. Fang, C\. Finn, C\. Glossop, T\. Godden, I\. Goryachev, L\. Groom, H\. Habeeb, H\. Hancock, K\. Hausman, G\. Hussein, V\. Hwang, B\. Ichter, C\. Jacobsen, S\. Jakubczak, R\. Jen, T\. Jones, G\. Kammerer, B\. Katz, L\. Ke, M\. Khadikov, C\. Kuchi, M\. Lamb, D\. LeBlanc, B\. LeCount, S\. Levine, X\. Li, A\. Li\-Bell, V\. Lialin, Z\. Liang, W\. Lim, Y\. Lu, E\. Luo, V\. Mano, N\. Marwaha, A\. Mongush, L\. Murphy, S\. Nair, T\. Patterson, K\. Pertsch, A\. Z\. Ren, G\. Schelske, C\. Sharma, B\. Shi, L\. X\. Shi, L\. Smith, J\. T\. Springenberg, K\. Stachowicz, W\. Stoeckle, J\. Tang, J\. Tanner, S\. Tekeste, M\. Torne, K\. Vedder, Q\. Vuong, A\. Walling, H\. Wang, J\. Wang, X\. Wang, C\. Whalen, S\. Whitmore, B\. Williams, C\. Xu, S\. Yoo, L\. Yu, W\. Zhang, Z\. Zhang, and U\. Zhilinsky \(2026\)π0\.7\{\\pi\}\_\{0\.7\}: A steerable generalist robotic foundation model with emergent capabilities\.CoRRabs/2604\.15483\.Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- P\. Intelligence, K\. Black, N\. Brown, J\. Darpinian, K\. Dhabalia, D\. Driess, A\. Esmail, M\. Equi, C\. Finn, N\. Fusai, M\. Y\. Galliker, D\. Ghosh, L\. Groom, K\. Hausman, B\. Ichter, S\. Jakubczak, T\. Jones, L\. Ke, D\. LeBlanc, S\. Levine, A\. Li\-Bell, M\. Mothukuri, S\. Nair, K\. Pertsch, A\. Z\. Ren, L\. X\. Shi, L\. Smith, J\. T\. Springenberg, K\. Stachowicz, J\. Tanner, Q\. Vuong, H\. Walke, A\. Walling, H\. Wang, L\. Yu, and U\. Zhilinsky \(2025\)π\\pi0\.5: a vision\-language\-action model with open\-world generalization\.CoRRabs/2504\.16054\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2504.16054),2504\.16054,[Link](https://doi.org/10.48550/arXiv.2504.16054)Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- M\. Jackermeier and A\. Abate \(2025\)DeepLTL: learning to efficiently satisfy complex LTL specifications for multi\-task RL\.InICLR,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- S\. James, B\. Rosman, and G\. Konidaris \(2022\)Autonomous learning of object\-centric abstractions for high\-level planning\.InICLR,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px3.p1.6),[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- S\. Kambhampati, K\. Valmeekam, L\. Guan, M\. Verma, K\. Stechly, S\. Bhambri, L\. Saldyt, and A\. Murthy \(2024\)Position: llms can’t plan, but can help planning in llm\-modulo frameworks\.InICML,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- M\. Khodeir, B\. Agro, and F\. Shkurti \(2023\)Learning to search in task and motion planning with streams\.IEEE Robotics Autom\. Lett\.8\(4\),pp\. 1983–1990\.External Links:[Document](https://dx.doi.org/10.1109/LRA.2023.3242201),[Link](https://doi.org/10.1109/LRA.2023.3242201)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Kim, Z\. Wang, L\. P\. Kaelbling, and T\. Lozano\-Pérez \(2019\)Learning to guide task and motion planning using score\-space representation\.Int\. J\. Robotics Res\.38\(7\)\.External Links:[Document](https://dx.doi.org/10.1177/0278364919848837),[Link](https://doi.org/10.1177/0278364919848837)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- M\. J\. Kim, K\. Pertsch, S\. Karamcheti, T\. Xiao, A\. Balakrishna, S\. Nair, R\. Rafailov, E\. P\. Foster, P\. R\. Sanketi, Q\. Vuong, T\. Kollar, B\. Burchfiel, R\. Tedrake, D\. Sadigh, S\. Levine, P\. Liang, and C\. Finn \(2024\)OpenVLA: an open\-source vision\-language\-action model\.InCoRL,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- D\. P\. Kingma and J\. Ba \(2015\)Adam: A method for stochastic optimization\.InICLR,Cited by:[§C\.1](https://arxiv.org/html/2605.15975#A3.SS1.p1.1)\.
- T\. N\. Kipf, E\. van der Pol, and M\. Welling \(2020\)Contrastive learning of structured world models\.InICLR,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px3.p1.6)\.
- T\. N\. Kipf and M\. Welling \(2017\)Semi\-supervised classification with graph convolutional networks\.InICLR,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p4.6)\.
- G\. D\. Konidaris and A\. G\. Barto \(2009\)Efficient skill learning using abstraction selection\.InIJCAI,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px2.p2.2)\.
- G\. D\. Konidaris, L\. P\. Kaelbling, and T\. Lozano\-Pérez \(2014\)Constructing symbolic representations for high\-level planning\.InAAAI,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- G\. D\. Konidaris, L\. P\. Kaelbling, and T\. Lozano\-Pérez \(2018\)From skills to symbols: learning symbolic representations for abstract high\-level planning\.J\. Artif\. Intell\. Res\.61,pp\. 215–289\.External Links:[Document](https://dx.doi.org/10.1613/JAIR.5575),[Link](https://doi.org/10.1613/jair.5575)Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1),[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- G\. Konidaris \(2019\)On the necessity of abstraction\.Current Opinion in Behavioral Sciences29,pp\. 1–7\.Note:Artificial IntelligenceExternal Links:ISSN 2352\-1546,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.cobeha.2018.11.005),[Link](https://www.sciencedirect.com/science/article/pii/S2352154618302080)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- R\. E\. Korf \(1987\)Planning as search: A quantitative approach\.Artif\. Intell\.33\(1\),pp\. 65–88\.External Links:[Document](https://dx.doi.org/10.1016/0004-3702%2887%2990051-8),[Link](https://doi.org/10.1016/0004-3702(87)90051-8)Cited by:[§4\.1](https://arxiv.org/html/2605.15975#S4.SS1.SSS0.Px6.p1.2)\.
- N\. Kumar, T\. Silver, W\. McClinton, L\. Zhao, S\. Proulx, T\. Lozano\-Pérez, L\. P\. Kaelbling, and J\. L\. Barry \(2024\)Practice makes perfect: planning to learning skill parameter policies\.InRSS,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px3.p1.6)\.
- H\. M\. Le, N\. Jiang, A\. Agarwal, M\. Dudík, Y\. Yue, and H\. D\. III \(2018\)Hierarchical imitation and reinforcement learning\.InICML,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- A\. C\. Li, T\. Q\. Klassen, A\. Wang, P\. A\. Alamdari, and S\. A\. McIlraith \(2025a\)Ground\-compose\-reinforce: grounding language inagentic behaviours using limited data\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Li, T\. Silver, S\. A\. Scherer, and A\. G\. Gray \(2025b\)Bilevel learning for bilevel planning\.InRSS,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- L\. Li, T\. J\. Walsh, and M\. L\. Littman \(2006\)Towards a unified theory of state abstraction for mdps\.InInternational Symposium on Artificial Intelligence and Mathematics,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- X\. Li, C\. I\. Vasile, and C\. Belta \(2017\)Reinforcement learning with temporal logic rewards\.InIROS,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Liang, N\. Kumar, H\. Tang, A\. Weller, J\. B\. Tenenbaum, T\. Silver, J\. F\. Henriques, and K\. Ellis \(2025\)VisualPredicator: learning abstract world models with neuro\-symbolic predicates for robot planning\.InICLR,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px3.p1.6),[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- Y\. Liang, D\. Nguyen, C\. Yang, T\. Li, J\. B\. Tenenbaum, C\. E\. Rasmussen, A\. Weller, Z\. Tavares, T\. Silver, and K\. Ellis \(2026\)ExoPredicator: learning abstract models of dynamic worlds for robot planning\.InICLR,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- B\. Y\. Lin, R\. L\. Bras, K\. Richardson, A\. Sabharwal, R\. Poovendran, P\. Clark, and Y\. Choi \(2025\)ZebraLogic: on the scaling limits of llms for logical reasoning\.InICML,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- Y\. Lin, A\. S\. Wang, E\. Undersander, and A\. Rai \(2022\)Efficient and interpretable robot manipulation with graph neural networks\.IEEE Robotics Autom\. Lett\.7\(2\),pp\. 2740–2747\.External Links:[Document](https://dx.doi.org/10.1109/LRA.2022.3143518),[Link](https://doi.org/10.1109/LRA.2022.3143518)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- I\. Little, S\. Thiebaux,et al\.\(2007\)Probabilistic planning vs\. replanning\.InICAPS,Cited by:[8th item](https://arxiv.org/html/2605.15975#S5.I1.i8.p1.1)\.
- B\. Liu, Y\. Jiang, X\. Zhang, Q\. Liu, S\. Zhang, J\. Biswas, and P\. Stone \(2023\)LLM\+P: empowering large language models with optimal planning proficiency\.CoRRabs/2304\.11477\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2304.11477),2304\.11477,[Link](https://doi.org/10.48550/arXiv.2304.11477)Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- X\. Liu, A\. Pesaranghader, H\. Li, P\. Sukcharoenchaikul, J\. Kim, T\. Sadhu, H\. Jeon, and S\. Sanner \(2025\)Open\-world planning via lifted regression with llm\-inferred affordances for embodied agents\.InACL,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p4.6),[§4\.1](https://arxiv.org/html/2605.15975#S4.SS1.SSS0.Px3.p1.6)\.
- I\. Loshchilov and F\. Hutter \(2017\)SGDR: stochastic gradient descent with warm restarts\.InICLR,Cited by:[§C\.1](https://arxiv.org/html/2605.15975#A3.SS1.p1.1)\.
- T\. Lozano\-Pérez and L\. P\. Kaelbling \(2014\)A constraint\-based method for solving sequential manipulation planning problems\.InIROS,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Lozano\-Perez, M\. T\. Mason, and R\. H\. Taylor \(1984\)Automatic synthesis of fine\-motion strategies for robots\.Int\. J\. Robot\. Res\.3\(1\),pp\. 3–24\.Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p4.6),[§4\.1](https://arxiv.org/html/2605.15975#S4.SS1.SSS0.Px3.p1.5)\.
- A\. Mandlekar, C\. R\. Garrett, D\. Xu, and D\. Fox \(2023\)Human\-in\-the\-loop task and motion planning for imitation learning\.InCoRL,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Marthi, S\. Russell, and J\. A\. Wolfe \(2008\)Angelic hierarchical planning: optimal and online algorithms\.InICAPS,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- D\. McDermott, M\. Ghallab, A\. E\. Howe, C\. A\. Knoblock, A\. Ram, M\. M\. Veloso, D\. S\. Weld, and D\. E\. Wilkins \(1998\)PDDL\-the Planning Domain Definition Language\.Technical reportCited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px4.p1.4)\.
- M\. J\. McDonald and D\. Hadfield\-Menell \(2021\)Guided imitation of task and motion planning\.InCoRL,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- R\. McLean, E\. Chatzaroulas, L\. McCutcheon, F\. Röder, T\. Yu, Z\. He, K\. R\. Zentner, R\. Julian, J\. K\. Terry, I\. Woungang, N\. Farsad, and P\. S\. Castro \(2025\)Meta\-world\+: an improved, standardized, RL benchmark\.InNeurIPS,Cited by:[§C\.3](https://arxiv.org/html/2605.15975#A3.SS3.p1.1),[§1](https://arxiv.org/html/2605.15975#S1.p5.1),[§5](https://arxiv.org/html/2605.15975#S5.SS0.SSS0.Px2.p1.8),[§5](https://arxiv.org/html/2605.15975#S5.p1.1)\.
- J\. Mendez\-Mendez, L\. P\. Kaelbling, and T\. Lozano\-Pérez \(2023\)Embodied lifelong learning for task and motion planning\.InCoRL,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Morris, M\. Ritzert, M\. Fey, W\. L\. Hamilton, J\. E\. Lenssen, G\. Rattan, and M\. Grohe \(2019\)Weisfeiler and leman go neural: higher\-order graph neural networks\.InAAAI,Cited by:[§5](https://arxiv.org/html/2605.15975#S5.SS0.SSS0.Px5.p1.1)\.
- C\. Muise, S\. A\. McIlraith, and J\. C\. Beck \(2024\)PRP rebooted: advancing the state of the art in FOND planning\.InAAAI,Cited by:[7th item](https://arxiv.org/html/2605.15975#S5.I1.i7.p1.1)\.
- D\. S\. Nau, T\. Au, O\. Ilghami, U\. Kuter, J\. W\. Murdock, D\. Wu, and F\. Yaman \(2003\)SHOP2: an HTN planning system\.J\. Artif\. Intell\. Res\.20,pp\. 379–404\.External Links:[Document](https://dx.doi.org/10.1613/JAIR.1141),[Link](https://doi.org/10.1613/jair.1141)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Park, K\. Frans, D\. Mann, B\. Eysenbach, A\. Kumar, and S\. Levine \(2025\)Horizon reduction makes rl scalable\.InNeurIPS,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- S\. Park, Y\. Kim, S\. Ryu, B\. G\. Yoo, S\. Chung, J\. Park, W\. Ahn, and M\. Lim \(2026\)Robust vision\-language\-action models via object\-centric learning and distance\-based chunk alignment\.Applied Sciences16\(7\)\.External Links:ISSN 2076\-3417,[Document](https://dx.doi.org/10.3390/app16073376),[Link](https://www.mdpi.com/2076-3417/16/7/3376)Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px3.p1.6)\.
- O\. Pettersson \(2005\)Execution monitoring in robotics: a survey\.Robotics and Autonomous Systems53\(2\),pp\. 73–88\.External Links:ISSN 0921\-8890,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.robot.2005.09.004),[Link](https://www.sciencedirect.com/science/article/pii/S092188900500134X)Cited by:[8th item](https://arxiv.org/html/2605.15975#S5.I1.i8.p1.1)\.
- W\. Qiu, W\. Mao, and H\. Zhu \(2023\)Instructing goal\-conditioned reinforcement learning agents with temporal logic objectives\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, G\. Krueger, and I\. Sutskever \(2021\)Learning transferable visual models from natural language supervision\.InICML,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- R\. Reiter \(1991\)The frame problem in the situation calculus: a simple solution \(sometimes\) and a completeness result for goal regression\.InArtificial and Mathematical Theory of Computation,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p4.6),[§4\.1](https://arxiv.org/html/2605.15975#S4.SS1.SSS0.Px3.p1.5)\.
- R\. Reiter \(2001\)Knowledge in action: logical foundations for specifying and implementing dynamical systems\.MIT Press\.Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p4.6),[§4\.1](https://arxiv.org/html/2605.15975#S4.SS1.SSS0.Px3.p1.5)\.
- S\. Richter and M\. Westphal \(2010\)The LAMA planner: guiding cost\-based anytime planning with landmarks\.J\. Artif\. Intell\. Res\.39,pp\. 127–177\.External Links:[Document](https://dx.doi.org/10.1613/JAIR.2972),[Link](https://doi.org/10.1613/jair.2972)Cited by:[§5](https://arxiv.org/html/2605.15975#S5.SS0.SSS0.Px8.p1.1)\.
- S\. Ross, G\. J\. Gordon, and D\. Bagnell \(2011\)A reduction of imitation learning and structured prediction to no\-regret online learning\.InAISTATS,Cited by:[§5](https://arxiv.org/html/2605.15975#S5.SS0.SSS0.Px5.p1.1)\.
- E\. D\. Sacerdoti \(1974\)Planning in a hierarchy of abstraction spaces\.Artif\. Intell\.5\(2\),pp\. 115–135\.External Links:[Document](https://dx.doi.org/10.1016/0004-3702%2874%2990026-5),[Link](https://doi.org/10.1016/0004-3702(74)90026-5)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Sanner and C\. Boutilier \(2009\)Practical solution techniques for first\-order mdps\.Artif\. Intell\.173\(5\-6\),pp\. 748–788\.External Links:[Document](https://dx.doi.org/10.1016/J.ARTINT.2008.11.003),[Link](https://doi.org/10.1016/j.artint.2008.11.003)Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p4.6),[§4\.1](https://arxiv.org/html/2605.15975#S4.SS1.SSS0.Px3.p1.6)\.
- E\. Scala, P\. Haslum, S\. Thiébaux, and M\. Ramírez \(2020\)Subgoaling techniques for satisficing and optimal numeric planning\.J\. Artif\. Intell\. Res\.68,pp\. 691–752\.External Links:[Document](https://dx.doi.org/10.1613/JAIR.1.11875),[Link](https://doi.org/10.1613/jair.1.11875)Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- F\. Scarselli, M\. Gori, A\. C\. Tsoi, M\. Hagenbuchner, and G\. Monfardini \(2009\)The graph neural network model\.IEEE Trans\. Neural Networks20\(1\),pp\. 61–80\.External Links:[Document](https://dx.doi.org/10.1109/TNN.2008.2005605),[Link](https://doi.org/10.1109/TNN.2008.2005605)Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p4.6)\.
- J\. Seipp, T\. Keller, and M\. Helmert \(2020\)Saturated cost partitioning for optimal classical planning\.J\. Artif\. Intell\. Res\.67,pp\. 129–167\.External Links:[Document](https://dx.doi.org/10.1613/JAIR.1.11673),[Link](https://doi.org/10.1613/jair.1.11673)Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- Y\. Seo, D\. Hafner, H\. Liu, F\. Liu, S\. James, K\. Lee, and P\. Abbeel \(2022\)Masked world models for visual control\.InCoRL,Cited by:[§5](https://arxiv.org/html/2605.15975#S5.SS0.SSS0.Px4.p1.3)\.
- N\. Shah, J\. Nagpal, P\. Verma, and S\. Srivastava \(2025\)From real world to logic and back: learning generalizable relational concepts for long horizon robot planning\.InCoRL,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- N\. Shah, D\. K\. Vasudevan, K\. Kumar, P\. Kamojjhala, and S\. Srivastava \(2020\)Anytime integrated task and motion policies for stochastic environments\.InICRA,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Shen, C\. R\. Garrett, A\. Goyal, T\. Hermans, and F\. Ramos \(2025\)Differentiable gpu\-parallelized task and motion planning\.InRSS,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Shen, N\. Kumar, S\. Chintalapudi, J\. Wang, C\. Watson, E\. Hu, J\. Cao, D\. Jayaraman, L\. P\. Kaelbling, and T\. Lozano\-Pérez \(2026\)TiPToP: a modular open\-vocabulary planning system for robotic manipulation\.CoRRabs/2603\.09971\.Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1),[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px3.p1.6),[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- M\. Shukor, D\. Aubakirova, F\. Capuano, P\. Kooijmans, S\. Palma, A\. Zouitine, M\. Aractingi, C\. Pascal, M\. Russi, A\. Marafioti, S\. Alibert, M\. Cord, T\. Wolf, and R\. Cadène \(2025\)SmolVLA: A vision\-language\-action model for affordable and efficient robotics\.CoRRabs/2506\.01844\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2506.01844),2506\.01844,[Link](https://doi.org/10.48550/arXiv.2506.01844)Cited by:[§C\.2](https://arxiv.org/html/2605.15975#A3.SS2.SSS0.Px1.p1.1),[2nd item](https://arxiv.org/html/2605.15975#S5.I1.i2.p1.1),[§5](https://arxiv.org/html/2605.15975#S5.SS0.SSS0.Px4.p1.3)\.
- T\. Silver, A\. Athalye, J\. B\. Tenenbaum, T\. Lozano\-Pérez, and L\. P\. Kaelbling \(2022\)Learning neuro\-symbolic skills for bilevel planning\.InCoRL,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px3.p1.6),[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5),[§5](https://arxiv.org/html/2605.15975#S5.SS0.SSS0.Px5.p1.1)\.
- T\. Silver, R\. Chitnis, A\. Curtis, J\. B\. Tenenbaum, T\. Lozano\-Pérez, and L\. P\. Kaelbling \(2021a\)Planning with learned object importance in large problem instances using graph neural networks\.InAAAI,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Silver, R\. Chitnis, N\. Kumar, W\. McClinton, T\. Lozano\-Pérez, L\. P\. Kaelbling, and J\. B\. Tenenbaum \(2023\)Predicate invention for bilevel planning\.InAAAI,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px3.p1.6),[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5),[§5](https://arxiv.org/html/2605.15975#S5.SS0.SSS0.Px5.p1.1)\.
- T\. Silver, R\. Chitnis, J\. B\. Tenenbaum, L\. P\. Kaelbling, and T\. Lozano\-Pérez \(2021b\)Learning symbolic operators for task and motion planning\.InIROS,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- H\. A\. Simon \(1956\)Rational choice and the structure of the environment\.Psychol\. Rev\.63\(2\),pp\. 129\.Cited by:[§4\.1](https://arxiv.org/html/2605.15975#S4.SS1.SSS0.Px6.p1.2)\.
- D\. Speck, J\. Seipp, and Á\. Torralba \(2025\)Symbolic search for cost\-optimal planning with expressive model extensions\.J\. Artif\. Intell\. Res\.82\.External Links:[Document](https://dx.doi.org/10.1613/JAIR.1.16869),[Link](https://doi.org/10.1613/jair.1.16869)Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- S\. Srivastava, E\. Fang, L\. Riano, R\. Chitnis, S\. Russell, and P\. Abbeel \(2014\)Combined task and motion planning through an extensible planner\-independent interface layer\.InICRA,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1),[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2605.15975#S3.p1.2)\.
- S\. Ståhlberg, B\. Bonet, and H\. Geffner \(2022\)Learning general optimal policies with graph neural networks: expressive power, transparency, and limits\.InICAPS,Cited by:[§5](https://arxiv.org/html/2605.15975#S5.SS0.SSS0.Px5.p1.1)\.
- R\. S\. Sutton, D\. Precup, and S\. Singh \(1999\)Between MDPs and semi\-MDPs: A framework for temporal abstraction in reinforcement learning\.Artif\. Intell\.112\(1\-2\),pp\. 181–211\.External Links:[Document](https://dx.doi.org/10.1016/S0004-3702%2899%2900052-1),[Link](https://doi.org/10.1016/S0004-3702(99)00052-1)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Tate \(1977\)Generating project networks\.InIJCAI,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- G\. Team \(2023\)Gemini: a family of highly capable multimodal models\.CoRR\.External Links:2312\.11805,[Link](https://arxiv.org/abs/2312.11805)Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- S\. Team and G\. DeepMind \(2025\)SIMA 2: A generalist embodied agent for virtual worlds\.CoRRabs/2512\.04797\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2512.04797),2512\.04797,[Link](https://doi.org/10.48550/arXiv.2512.04797)Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- R\. Toro Icarte, T\. Q\. Klassen, R\. A\. Valenzano, and S\. A\. McIlraith \(2018\)Teaching multiple tasks to an RL agent using LTL\.InAAMAS,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Toro Icarte, T\. Q\. Klassen, R\. A\. Valenzano, and S\. A\. McIlraith \(2022\)Reward machines: exploiting reward function structure in reinforcement learning\.J\. Artif\. Intell\. Res\.73,pp\. 173–208\.External Links:[Document](https://dx.doi.org/10.1613/JAIR.1.12440),[Link](https://doi.org/10.1613/jair.1.12440)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Toussaint \(2015\)Logic\-geometric programming: an optimization\-based approach to combined task and motion planning\.InIJCAI,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- P\. Vaezipoor, A\. C\. Li, R\. Toro Icarte, and S\. A\. McIlraith \(2021\)LTL2action: generalizing LTL instructions for multi\-task RL\.InICML,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Voloshin, H\. M\. Le, S\. Chaudhuri, and Y\. Yue \(2022\)Policy optimization with linear temporal logic constraints\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- R\. J\. Waldinger \(1977\)Achieving several goals simultaneously\.Machine Intelligence8,pp\. 94–136\.Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p4.6),[§4\.1](https://arxiv.org/html/2605.15975#S4.SS1.SSS0.Px3.p1.5)\.
- J\. Wei, Y\. Tay, R\. Bommasani, C\. Raffel, B\. Zoph, S\. Borgeaud, D\. Yogatama, M\. Bosma, D\. Zhou, D\. Metzler, E\. H\. Chi, T\. Hashimoto, O\. Vinyals, P\. Liang, J\. Dean, and W\. Fedus \(2022\)Emergent abilities of large language models\.Trans\. Mach\. Learn\. Res\.2022\.External Links:[Link](https://openreview.net/forum?id=yzkSU5zdwD)Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- A\. M\. Wells, N\. T\. Dantam, A\. Shrivastava, and L\. E\. Kavraki \(2019\)Learning feasibility for task and motion planning in tabletop environments\.IEEE Robotics Autom\. Lett\.4\(2\),pp\. 1255–1262\.External Links:[Document](https://dx.doi.org/10.1109/LRA.2019.2894861),[Link](https://doi.org/10.1109/LRA.2019.2894861)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Wen, M\. Trepte, J\. Aribido, J\. Kautz, O\. Gallo, and S\. Birchfield \(2025\)FoundationStereo: zero\-shot stereo matching\.InCVPR,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px3.p1.6)\.
- K\. Xi, S\. Gould, and S\. Thiébaux \(2024\)Neuro\-symbolic learning of lifted action models from visual traces\.InICAPS,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- D\. Xu, R\. Martín\-Martín, D\. Huang, Y\. Zhu, S\. Savarese, and L\. Fei\-Fei \(2019a\)Regression planning networks\.InNeurIPS,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p4.6)\.
- K\. Xu, W\. Hu, J\. Leskovec, and S\. Jegelka \(2019b\)How powerful are graph neural networks?\.InICLR,Cited by:[§5](https://arxiv.org/html/2605.15975#S5.SS0.SSS0.Px5.p1.1)\.
- L\. Xu, T\. Ren, G\. Chalvatzaki, and J\. Peters \(2022\)Accelerating integrated task and motion planning with neural feasibility checking\.CoRRabs/2203\.10568\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2203.10568),2203\.10568,[Link](https://doi.org/10.48550/arXiv.2203.10568)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Yalcinkaya, N\. Lauffer, M\. Vazquez\-Chanlatte, and S\. A\. Seshia \(2024\)Compositional automata embeddings for goal\-conditioned reinforcement learning\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Yang, C\. R\. Garrett, T\. Lozano\-Pérez, L\. P\. Kaelbling, and D\. Fox \(2023\)Sequence\-based plan feasibility prediction for efficient task and motion planning\.InRSS,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Yang, B\. Hedegaard, A\. Jaafar, Y\. Wei, S\. Thompson, S\. S\. Raman, H\. Fu, S\. Tellex, G\. D\. Konidaris, D\. Paulius, and N\. Shah \(2025\)SkillWrapper: generative predicate invention for skill abstraction\.CoRRabs/2511\.18203\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2511.18203),2511\.18203,[Link](https://doi.org/10.48550/arXiv.2511.18203)Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px5.p1.5)\.
- S\. W\. Yoon, A\. Fern, and R\. Givan \(2007\)FF\-Replan: A baseline for probabilistic planning\.InICAPS,Cited by:[8th item](https://arxiv.org/html/2605.15975#S5.I1.i8.p1.1)\.
- T\. Yu, D\. Quillen, Z\. He, R\. Julian, K\. Hausman, C\. Finn, and S\. Levine \(2019\)Meta\-world: A benchmark and evaluation for multi\-task and meta reinforcement learning\.InCoRL,Cited by:[§C\.3](https://arxiv.org/html/2605.15975#A3.SS3.p1.1),[§1](https://arxiv.org/html/2605.15975#S1.p5.1),[§5](https://arxiv.org/html/2605.15975#S5.SS0.SSS0.Px2.p1.8),[§5](https://arxiv.org/html/2605.15975#S5.p1.1)\.
- W\. Yuan, A\. Murali, A\. Mousavian, and D\. Fox \(2023\)M2T2: multi\-task masked transformer for object\-centric pick and place\.InCoRL,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px3.p1.6)\.
- A\. Zadaianchuk, M\. Seitzer, and G\. Martius \(2021\)Self\-supervised visual reinforcement learning with object\-centric representations\.InICLR,Cited by:[§3](https://arxiv.org/html/2605.15975#S3.SS0.SSS0.Px3.p1.6)\.
- W\. Zhang, B\. Terver, A\. Zholus, S\. Chitnis, H\. Sutaria, M\. Assran, R\. Balestriero, A\. Bar, A\. Bardes, Y\. LeCun, and N\. Ballas \(2026\)Hierarchical planning with latent world models\.CoRRabs/2604\.03208\.Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.
- Z\. Zhao, S\. Cheng, Y\. Ding, Z\. Zhou, S\. Zhang, D\. Xu, and Y\. Zhao \(2025\)A survey of optimization\-based task and motion planning: from classical to learning approaches\.IEEE/ASME Transactions on Mechatronics30\(4\),pp\. 2799–2825\.External Links:[Document](https://dx.doi.org/10.1109/TMECH.2024.3452509)Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Zhu, J\. Tremblay, S\. Birchfield, and Y\. Zhu \(2021\)Hierarchical planning for long\-horizon manipulation with geometric and symbolic scene graphs\.InICRA,Cited by:[§2](https://arxiv.org/html/2605.15975#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Zitkovich, T\. Yu, S\. Xu, P\. Xu, T\. Xiao, F\. Xia, J\. Wu, P\. Wohlhart, S\. Welker, A\. Wahid, Q\. Vuong, V\. Vanhoucke, H\. T\. Tran, R\. Soricut, A\. Singh, J\. Singh, P\. Sermanet, P\. R\. Sanketi, G\. Salazar, M\. S\. Ryoo, K\. Reymann, K\. Rao, K\. Pertsch, I\. Mordatch, H\. Michalewski, Y\. Lu, S\. Levine, L\. Lee, T\. E\. Lee, I\. Leal, Y\. Kuang, D\. Kalashnikov, R\. Julian, N\. J\. Joshi, A\. Irpan, B\. Ichter, J\. Hsu, A\. Herzog, K\. Hausman, K\. Gopalakrishnan, C\. Fu, P\. Florence, C\. Finn, K\. A\. Dubey, D\. Driess, T\. Ding, K\. M\. Choromanski, X\. Chen, Y\. Chebotar, J\. Carbajal, N\. Brown, A\. Brohan, M\. G\. Arenas, and K\. Han \(2023\)RT\-2: vision\-language\-action models transfer web knowledge to robotic control\.InCoRL,Cited by:[§1](https://arxiv.org/html/2605.15975#S1.p2.1)\.

Appendix

###### Contents

1. [1Introduction](https://arxiv.org/html/2605.15975#S1)
2. [2Related Work](https://arxiv.org/html/2605.15975#S2)
3. [3Problem Statement](https://arxiv.org/html/2605.15975#S3)
4. [4Learning Bilevel Policies for Bilevel Planning](https://arxiv.org/html/2605.15975#S4)1. [4\.1Learning HL Policies via Goal Regression and Inductive Generalisation](https://arxiv.org/html/2605.15975#S4.SS1) 2. [4\.2Learning LL Policies via Graph Neural Networks and Imitation Learning](https://arxiv.org/html/2605.15975#S4.SS2)
5. [5Experiments](https://arxiv.org/html/2605.15975#S5)
6. [6Discussion, Limitations, and Conclusion](https://arxiv.org/html/2605.15975#S6)
7. [References](https://arxiv.org/html/2605.15975#bib)
8. [ASummary of Definitions and Notation](https://arxiv.org/html/2605.15975#A1)
9. [BTheory](https://arxiv.org/html/2605.15975#A2)1. [B\.1Goal Independence](https://arxiv.org/html/2605.15975#A2.SS1) 2. [B\.2Object\-Renaming Equivalence of HL Problems](https://arxiv.org/html/2605.15975#A2.SS2) 3. [B\.3Generalisation Result](https://arxiv.org/html/2605.15975#A2.SS3)
10. [CAdditional Experimental Details](https://arxiv.org/html/2605.15975#A3)1. [C\.1GNN Hyperparameters](https://arxiv.org/html/2605.15975#A3.SS1) 2. [C\.2Baselines](https://arxiv.org/html/2605.15975#A3.SS2) 3. [C\.3Benchmarks](https://arxiv.org/html/2605.15975#A3.SS3) 4. [C\.4Visualising Learned HL Policies](https://arxiv.org/html/2605.15975#A3.SS4)

## Appendix ASummary of Definitions and Notation

We use𝐁\\mathbf\{B\}old𝐅\\mathbf\{F\}ace font to distinguish low\-level \(LL\) components from high\-level \(HL\) components denoted via𝒞\\mathcal\{C\}alligraphic andI\\mathit\{I\}talics font\.[Table˜3](https://arxiv.org/html/2605.15975#A1.T3)summarises the definitions and notation used in the paper\.

Table 3:Summary of definitions and notation\.TermDescriptionNotationproblema tuple of states, actions, initial state and goal states𝐏=⟨𝐒,𝐀,𝐬0,𝐠⟩\\mathbf\{P\}=\\langle\\mathbf\{S\},\\mathbf\{A\},\\mathbf\{s\}\_\{0\},\\mathbf\{g\}\\ranglebilevel problema LL and HL problem and a labelling functionℙ=⟨𝐏ll,𝐏hl,ℒ⟩\\mathbb\{P\}=\\langle\\mathbf\{P\}^\{\\mathrm\{ll\}\},\\mathbf\{P\}^\{\\mathrm\{hl\}\},\\mathcal\{L\}\\ranglepolicya probability distribution over actions conditioned on a state and other argumentsπ\(𝐚∣𝐬,∗\)\\pi\(\\mathbf\{a\}\\mid\\mathbf\{s\},\\ast\)LL statelow\-level object\-centric, ego\-centric state representation𝐬ll=\(𝐱e,\{𝐱o\}o∈𝐎\)\\mathbf\{s\}^\{\\mathrm\{ll\}\}=\(\\mathbf\{x\}\_\{e\},\\left\\\{\\mathbf\{x\}\_\{o\}\\right\\\}\_\{o\\in\\mathbf\{O\}\}\),
𝐱e∈ℝn,𝐱o∈ℝm\\mathbf\{x\}\_\{e\}\\in\\mathbb\{R\}^\{n\},\\mathbf\{x\}\_\{o\}\\in\\mathbb\{R\}^\{m\}LL actionlow\-level action𝐚ll∈ℝd\\mathbf\{a\}^\{\\mathrm\{ll\}\}\\in\\mathbb\{R\}^\{d\}LL policylow\-level policyπll\(𝐚ll∣𝐬ll,ahl,ghl\)\\pi^\{\\mathrm\{ll\}\}\(\\mathbf\{a\}^\{\\mathrm\{ll\}\}\\mid\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)HL statehigh\-level state consisting of a set of factsshl\\mathit\{s\}^\{\\mathrm\{hl\}\}HL actionhigh\-level action induced by an action schemaahl=a\(o1,…,ona\)\\mathit\{a\}^\{\\mathrm\{hl\}\}=a\(o\_\{1\},\\ldots,o\_\{n\_\{a\}\}\)HL policyhigh\-level policyπhl\(ahl∣shl,ghl\)\\pi^\{\\mathrm\{hl\}\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\\mid\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)domaina set of predicates and action schemata𝒟=⟨𝒫,𝒜⟩\\mathcal\{D\}=\\langle\\mathcal\{P\},\\mathcal\{A\}\\ranglelabelling functiona function mapping LL states to HL statesℒ\\mathcal\{L\}predicatesymbol with argument termsp\(x1,…,xnp\)p\(x\_\{1\},\\ldots,x\_\{n\_\{p\}\}\)factpredicate with all arguments instantiated with objectsp\(o1,…,onp\)p\(o\_\{1\},\\ldots,o\_\{n\_\{p\}\}\)action schematemporally abstracted action with argument termsa\(x1,…,xna\)a\(x\_\{1\},\\ldots,x\_\{n\_\{a\}\}\)datasetLL demonstrations paired with HL goals𝕋=\{⟨gihl,𝐬0ll,𝐚0ll,\.\.\.,𝐬mill,𝐚mill⟩\}i=1n\\mathbb\{T\}=\\\{\\langle\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{i\},\\mathbf\{s\}^\{\\mathrm\{ll\}\}\_\{0\},\\mathbf\{a\}^\{\\mathrm\{ll\}\}\_\{0\},\\mathinner\{\.\\mkern\-1\.0mu\.\\mkern\-1\.0mu\.\},\\mathbf\{s\}^\{\\mathrm\{ll\}\}\_\{m\_\{i\}\},\\mathbf\{a\}^\{\\mathrm\{ll\}\}\_\{m\_\{i\}\}\\rangle\\\}\_\{i=1\}^\{n\}HL datasetHL action traces paired with HL goalsThl=\{⟨gihl,a0hl,\.\.\.,amihl⟩\}i=1nT^\{\\mathrm\{hl\}\}=\\\{\\langle\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{i\},\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{0\},\\mathinner\{\.\\mkern\-1\.0mu\.\\mkern\-1\.0mu\.\},\\mathit\{a\}^\{\\mathrm\{hl\}\}\_\{m\_\{i\}\}\\rangle\\\}\_\{i=1\}^\{n\}successionthe successors of a state under an action𝑠𝑢𝑐𝑐\(𝐬,𝐚\)\\mathit\{succ\}\(\\mathbf\{s\},\\mathbf\{a\}\)regressionthe regressors of a goal under an action𝑟𝑒𝑔𝑟\(𝐠,𝐚\)\\mathit\{regr\}\(\\mathbf\{g\},\\mathbf\{a\}\)
## Appendix BTheory

We provide additional context and the proof of[Theorem˜1](https://arxiv.org/html/2605.15975#Thmtheorem1)\. The argument requires two ingredients: \(1\) the HL component ofℙ\\mathbb\{P\}admits a solution that proceeds by sequentially achieving singleton goal facts \([Section˜B\.1](https://arxiv.org/html/2605.15975#A2.SS1)\), and \(2\)𝕋\\mathbb\{T\}exhibits enough HL abstracted demonstrationsThlT^\{\\mathrm\{hl\}\}to extract HL rules covering every relevant singleton\-goal subproblem \([Section˜B\.2](https://arxiv.org/html/2605.15975#A2.SS2)\)\.

### B\.1Goal Independence

We assume HL problems are represented in the relational form from[Section˜3](https://arxiv.org/html/2605.15975#S3)\. Given an HL problem𝐏hl=⟨𝐒hl,𝐀hl,s0hl,ghl⟩\\mathbf\{P\}^\{\\mathrm\{hl\}\}=\\langle\\mathbf\{S\}^\{\\mathrm\{hl\}\},\\mathbf\{A\}^\{\\mathrm\{hl\}\},\\mathit\{s\}^\{\\mathrm\{hl\}\}\_\{0\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\\rangle, an HL stateshl∈𝐒hl\\mathit\{s\}^\{\\mathrm\{hl\}\}\\in\\mathbf\{S\}^\{\\mathrm\{hl\}\}, and a set of factsghl′\{\\mathit\{g\}^\{\\mathrm\{hl\}\}\}^\{\\prime\}, let𝐏shl,ghl′hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{\\mathit\{s\}^\{\\mathrm\{hl\}\},\{\\mathit\{g\}^\{\\mathrm\{hl\}\}\}^\{\\prime\}\}denote the same HL problem with initial state replaced byshl\\mathit\{s\}^\{\\mathrm\{hl\}\}and goal replaced byghl′\{\\mathit\{g\}^\{\\mathrm\{hl\}\}\}^\{\\prime\}, i\.e\.𝐏shl,ghl′hl=⟨𝐒hl,𝐀hl,shl,ghl′⟩\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{\\mathit\{s\}^\{\\mathrm\{hl\}\},\{\\mathit\{g\}^\{\\mathrm\{hl\}\}\}^\{\\prime\}\}=\\langle\\mathbf\{S\}^\{\\mathrm\{hl\}\},\\mathbf\{A\}^\{\\mathrm\{hl\}\},\\mathit\{s\}^\{\\mathrm\{hl\}\},\{\\mathit\{g\}^\{\\mathrm\{hl\}\}\}^\{\\prime\}\\rangle\. Letinit\(𝐏hl\)init\(\\mathbf\{P\}^\{\\mathrm\{hl\}\}\)andgoal\(𝐏hl\)goal\(\\mathbf\{P\}^\{\\mathrm\{hl\}\}\)denote the initial state and goal of𝐏hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}, respectively\.

###### Definition 2\(Goal Independence\)\.

An HL problem𝐏hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}exhibits*goal independence \(GI\)*if for any orderingg1hl,…,gnhl\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{1\},\\ldots,\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{n\}of the facts ingoal\(𝐏hl\)goal\(\\mathbf\{P\}^\{\\mathrm\{hl\}\}\), the following procedure constructs a solution for𝐏hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}: setπhl←∅\\pi^\{\\mathrm\{hl\}\}\\leftarrow\\emptysetandS←\{init\(𝐏hl\)\}S\\leftarrow\\\{init\(\\mathbf\{P\}^\{\\mathrm\{hl\}\}\)\\\}and for eachi=1,…,ni=1,\\ldots,n, \(a\) extendπhl\\pi^\{\\mathrm\{hl\}\}with solution policies for𝐏shl,\{gihl\}hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{\\mathit\{s\}^\{\\mathrm\{hl\}\},\\\{\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{i\}\\\}\}for allshl∈S\\mathit\{s\}^\{\\mathrm\{hl\}\}\\in Swhich do not delete any factgjhl\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{j\}fori≠ji\\not=j, and \(b\) add all states in the support of anπihl\\pi^\{\\mathrm\{hl\}\}\_\{i\}\-trajectory ending in a state containinggihl\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{i\}toSSthat are not yet covered byπhl\\pi^\{\\mathrm\{hl\}\}\. We further say that𝐏hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}exhibits*CC\-bounded goal independence \(GIC\\text\{GI\}\_\{C\}\)*forC∈ℕC\\in\\mathbb\{N\}if everyπihl\\pi^\{\\mathrm\{hl\}\}\_\{i\}in step \(a\) has a preimage of size at mostCC\.

A bilevel planning problemℙ=⟨𝐏ll,𝐏hl,ℒ⟩\\mathbb\{P\}=\\langle\\mathbf\{P\}^\{\\mathrm\{ll\}\},\\mathbf\{P\}^\{\\mathrm\{hl\}\},\\mathcal\{L\}\\rangleexhibits \(resp\.CC\-bounded\) goal independence whenever its HL component𝐏hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}does\. The boundCCinGIC\\text\{GI\}\_\{C\}ensures each singleton\-goal subproblem admits bounded solutions which in turn bounds the size of the rule database extracted by goal regression in[Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)\.

### B\.2Object\-Renaming Equivalence of HL Problems

Goal regression on a single demonstration produces lifted rules that, by virtue of being parameterised by free variables, generalise across object\-renaming\. We make this precise via an equivalence relation on HL problems\.

###### Definition 3\(Object\-Renaming Equivalence\)\.

Two HL problems𝐏1hl,𝐏2hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{1\},\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{2\}over the same domain𝒟\\mathcal\{D\}are*equivalent*, written𝐏1hl∼𝐏2hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{1\}\\sim\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{2\}, if there exists a bijectionf:𝐎1→𝐎2f:\\mathbf\{O\}\_\{1\}\\to\\mathbf\{O\}\_\{2\}such that the lifted map

F\(shl\):=\{p\(f\(o1\),…,f\(on\)\)∣p\(o1,…,on\)∈shl\}\\displaystyle F\(\\mathit\{s\}^\{\\mathrm\{hl\}\}\):=\\\{p\(f\(o\_\{1\}\),\\ldots,f\(o\_\{n\}\)\)\\mid p\(o\_\{1\},\\ldots,o\_\{n\}\)\\in\\mathit\{s\}^\{\\mathrm\{hl\}\}\\\}\(6\)satisfiesF\(init\(𝐏1hl\)\)=init\(𝐏2hl\)F\(init\(\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{1\}\)\)=init\(\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{2\}\)andF\(goal\(𝐏1hl\)\)=goal\(𝐏2hl\)F\(goal\(\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{1\}\)\)=goal\(\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{2\}\)\.

###### Proposition 1\.

The relation∼\\simis an equivalence relation\. Moreover, suppose𝐏1hl∼𝐏2hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{1\}\\sim\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{2\}via a bijectionff, and extendFFto ground actions byF\(a\(o1,…,on\)\):=a\(f\(o1\),…,f\(on\)\)F\(a\(o\_\{1\},\\ldots,o\_\{n\}\)\):=a\(f\(o\_\{1\}\),\\ldots,f\(o\_\{n\}\)\)\. Given any HL policyπ1hl\\pi^\{\\mathrm\{hl\}\}\_\{1\}on𝐏1hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{1\}, defineF∗π1hlF\_\{\*\}\\pi^\{\\mathrm\{hl\}\}\_\{1\}on𝐏2hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{2\}by

\(F∗π1hl\)\(F\(ahl\)∣F\(shl\),F\(ghl\)\):=π1hl\(ahl∣shl,ghl\)\.\\displaystyle\(F\_\{\*\}\\pi^\{\\mathrm\{hl\}\}\_\{1\}\)\(F\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\)\\mid F\(\\mathit\{s\}^\{\\mathrm\{hl\}\}\),F\(\\mathit\{g\}^\{\\mathrm\{hl\}\}\)\):=\\pi^\{\\mathrm\{hl\}\}\_\{1\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\\mid\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)\.\(7\)Thenσ=⟨s0hl,s1hl,…⟩\\sigma=\\langle\\mathit\{s\}^\{\\mathrm\{hl\}\}\_\{0\},\\mathit\{s\}^\{\\mathrm\{hl\}\}\_\{1\},\\ldots\\rangleis aπ1hl\\pi^\{\\mathrm\{hl\}\}\_\{1\}\-trajectory in𝐏1hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{1\}if and only ifF\(σ\):=⟨F\(s0hl\),F\(s1hl\),…⟩F\(\\sigma\):=\\langle F\(\\mathit\{s\}^\{\\mathrm\{hl\}\}\_\{0\}\),F\(\\mathit\{s\}^\{\\mathrm\{hl\}\}\_\{1\}\),\\ldots\\rangleis an\(F∗π1hl\)\(F\_\{\*\}\\pi^\{\\mathrm\{hl\}\}\_\{1\}\)\-trajectory in𝐏2hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{2\}, and consequentlyπ1hl\\pi^\{\\mathrm\{hl\}\}\_\{1\}is a solution for𝐏1hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{1\}if and only ifF∗π1hlF\_\{\*\}\\pi^\{\\mathrm\{hl\}\}\_\{1\}is a solution for𝐏2hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{2\}\. In particular, a sequence of HL actionsa0\(o01,…,o0n0\),…,am−1\(om−11,…,om−1nm−1\)a\_\{0\}\(o\_\{0\}^\{1\},\\ldots,o\_\{0\}^\{n\_\{0\}\}\),\\ldots,a\_\{m\-1\}\(o\_\{m\-1\}^\{1\},\\ldots,o\_\{m\-1\}^\{n\_\{m\-1\}\}\)reaches a goal state in𝐏1hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{1\}if and only if its image underFFreaches a goal state in𝐏2hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{2\}\.

###### Proof\.

Reflexivity, symmetry, and transitivity follow from the analogous properties of bijective functions\. For the policy statement, action applicability \(𝑝𝑟𝑒\(ahl\)⊆shl\\mathit\{pre\}\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\)\\subseteq\\mathit\{s\}^\{\\mathrm\{hl\}\}\) and the successor function are determined syntactically by the names of predicates and schemata, whichffleaves unchanged; henceFFcommutes with𝑠𝑢𝑐𝑐\\mathit\{succ\}andahl\\mathit\{a\}^\{\\mathrm\{hl\}\}is applicable inshl\\mathit\{s\}^\{\\mathrm\{hl\}\}iffF\(ahl\)F\(\\mathit\{a\}^\{\\mathrm\{hl\}\}\)is applicable inF\(shl\)F\(\\mathit\{s\}^\{\\mathrm\{hl\}\}\)\. By constructionF∗π1hlF\_\{\*\}\\pi^\{\\mathrm\{hl\}\}\_\{1\}assigns identical support, soπ1hl\\pi^\{\\mathrm\{hl\}\}\_\{1\}\- and\(F∗π1hl\)\(F\_\{\*\}\\pi^\{\\mathrm\{hl\}\}\_\{1\}\)\-trajectories are in bijection viaFF, and goal states correspond byF\(goal\(𝐏1hl\)\)=goal\(𝐏2hl\)F\(goal\(\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{1\}\)\)=goal\(\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{2\}\)\. The action\-sequence claim is the special case whereπ1hl\\pi^\{\\mathrm\{hl\}\}\_\{1\}is the deterministic open\-loop policy that prescribesai\(oi1,…,oini\)a\_\{i\}\(o\_\{i\}^\{1\},\\ldots,o\_\{i\}^\{n\_\{i\}\}\)at stepii\. A consequence of[Proposition˜1](https://arxiv.org/html/2605.15975#Thmproposition1)is that lifted rules extracted from one HL problem transfer verbatim to all problems in the same equivalence class\. ∎

### B\.3Generalisation Result

We can now state the formal version of the informal theorem in[Section˜4](https://arxiv.org/html/2605.15975#S4)\. The proof relies on a finite covering argument: underGIC\\text\{GI\}\_\{C\}, the relevant singleton\-goal subproblems form a finite collection up to∼\\sim, so a finite dataset suffices for the HL rule database to generalise to every test problem\.

###### Theorem 1\.

Let𝒟=⟨𝒫,𝒜⟩\\mathcal\{D\}=\\langle\\mathcal\{P\},\\mathcal\{A\}\\ranglebe an HL domain,ℒ\\mathcal\{L\}a labelling function, andC∈ℕC\\in\\mathbb\{N\}\. There exists a finite dataset𝕋\\mathbb\{T\}such that the HL policy learned from𝕋\\mathbb\{T\}via[Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)solves any HL planning problem𝐏hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}conforming to𝒟\\mathcal\{D\}and satisfyingCC\-bounded goal independence \(GIC\\text\{GI\}\_\{C\}\)\.

###### Proof\.

LetN=maxa∈𝒜⁡\|𝑣𝑎𝑟\(a\)\|N=\\max\_\{a\\in\\mathcal\{A\}\}\\left\|\\mathit\{var\}\(a\)\\right\|andM=maxp∈𝒫⁡arity\(p\)M=\\max\_\{p\\in\\mathcal\{P\}\}\\mathrm\{arity\}\(p\)denote the maximum schema and predicate arities, respectively\.

Up to∼\\sim, the set of singleton\-goal HL problems on𝒟\\mathcal\{D\}is finite\. A singleton goal is a single factp\(o1,…,onp\)p\(o\_\{1\},\\ldots,o\_\{n\_\{p\}\}\)which, modulo bijective renaming of objects, is determined by a choice ofp∈𝒫p\\in\\mathcal\{P\}together with an arrangement of its at mostMMargument positions among at mostMMdistinct objects\. There are therefore at most\|𝒫\|⋅MM\\left\|\\mathcal\{P\}\\right\|\\cdot M^\{M\}singleton goals up to∼\\sim\. By theGIC\\text\{GI\}\_\{C\}assumption, every singleton\-goal subproblem admits a solution policy with sizeCCand hence cycle\-less trajectories with size less thanCC\. A length\-kkHL action sequence together with its singleton goal mentions at mostkN\+MkN\+Mdistinct objects, since each of thekkactions has at mostNNparameter positions and the goal contributes at mostMMfurther positions\. By[Proposition˜1](https://arxiv.org/html/2605.15975#Thmproposition1)it suffices to count action sequences up to∼\\sim: each of thekkaction positions contributes at most\|𝒜\|\\left\|\\mathcal\{A\}\\right\|schema choices and at most\(kN\+M\)N\\left\(kN\+M\\right\)^\{N\}parameter\-tuple choices\. Hence the number of inequivalent rules extracted by regressing singleton\-goal subproblems through length\-at\-most\-CCaction sequences is bounded by

n≤\|𝒫\|⋅MM⋅∑k=0C\(\|𝒜\|⋅\(kN\+M\)N\)k\.\\displaystyle n\\;\\leq\\;\\left\|\\mathcal\{P\}\\right\|\\cdot M^\{M\}\\cdot\\sum\_\{k=0\}^\{C\}\\bigl\(\\left\|\\mathcal\{A\}\\right\|\\cdot\\left\(kN\+M\\right\)^\{N\}\\bigr\)^\{k\}\.\(8\)Choose𝕋\\mathbb\{T\}to contain, for each equivalence class of singleton\-goal subproblem encountered as a step \(a\) instance in[Definition˜2](https://arxiv.org/html/2605.15975#Thmdefinition2), at least one demonstration whose HL projection realises a solution policy of length at mostCCfor a representative of that class\. Such a𝕋\\mathbb\{T\}is finite by[Equation˜8](https://arxiv.org/html/2605.15975#A2.E8)\.

Now fix any HL problem𝐏hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}satisfying the theorem assumptions and letg1hl,…,gnhl\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{1\},\\ldots,\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{n\}be any goal ordering\. Letπhl\\pi^\{\\mathrm\{hl\}\}denote the HL policy synthesised from𝕋\\mathbb\{T\}by[Algorithm˜1](https://arxiv.org/html/2605.15975#algorithm1)\. By construction,πhl\\pi^\{\\mathrm\{hl\}\}is the union of lifted rules obtained by goal regression of the demonstrations in𝕋\\mathbb\{T\}\. By Step 1 and[Proposition˜1](https://arxiv.org/html/2605.15975#Thmproposition1), for every HL stateshl\\mathit\{s\}^\{\\mathrm\{hl\}\}reached during execution and every singleton subgoalgihl\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{i\}not yet achieved inshl\\mathit\{s\}^\{\\mathrm\{hl\}\}, there exists a grounded ruler∈𝑔𝑟𝑜𝑢𝑛𝑑\(πhl\)r\\in\\mathit\{ground\}\(\\pi^\{\\mathrm\{hl\}\}\)that is applicable in stateshl\\mathit\{s\}^\{\\mathrm\{hl\}\}with goalgihl\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{i\}\. Executing𝑎𝑐𝑡𝑖𝑜𝑛\(r\)\\mathit\{action\}\(r\)thus prescribes a valid HL action that progresses𝐏shl,\{gihl\}hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}\_\{\\mathit\{s\}^\{\\mathrm\{hl\}\},\\\{\\mathit\{g\}^\{\\mathrm\{hl\}\}\_\{i\}\\\}\}and reduces the value of the next applicable rule in the HL policy\. Iterating the construction over thenngoal facts yields a solution for𝐏hl\\mathbf\{P\}^\{\\mathrm\{hl\}\}by definition of GI\. ∎

The bound[Equation˜8](https://arxiv.org/html/2605.15975#A2.E8)is exponential in the horizonCCand the maximum schema arityNN, mirroring the worst\-case complexity of generalised planning over relational domains\. In practice, the number of demonstrations required is far smaller, since most singleton\-goal subproblems share regression\-extracted rules that lift to many equivalence classes simultaneously, and our experiments in[Section˜5](https://arxiv.org/html/2605.15975#S5)confirm that a modest training set already suffices forπhl\\pi^\{\\mathrm\{hl\}\}to generalise to arbitrary numbers of objects\.

## Appendix CAdditional Experimental Details

### C\.1GNN Hyperparameters

We train all GNN models using the Adam optimiser\[[65](https://arxiv.org/html/2605.15975#bib.bib36)\]for 200 iterations with an initial learning rate of10−310^\{\-3\}, batch size of 128, and a cosine annealing learning rate scheduler\[[86](https://arxiv.org/html/2605.15975#bib.bib37)\]\. Each GNN has a hidden and output dimension of 64 and 2 message passing layers\.

### C\.2Baselines

We provide more details on the baselines of our experiments\.

#### SmolVLA

#### SmolVLAMW\{\}^\{\\textrm\{MW\}\}

SmolVLAMW\{\}^\{\\textrm\{MW\}\}is SmolVLA that is finetuned on 2500 episodes of the original MetaWorld benchmark suite \(50 episodes from each of the 50 tasks\), accessed from[https://huggingface\.co/jadechoghari/smolvla\_metaworld](https://huggingface.co/jadechoghari/smolvla_metaworld)

#### PureNN

PureNN is a special case of PddlNN in which all HL information is not encoded by assuming thatshl=ghl=∅\\mathit\{s\}^\{\\mathrm\{hl\}\}=\\mathit\{g\}^\{\\mathrm\{hl\}\}=\\emptysetand by not encoding an HL action node\.

#### PddlNN

PddlNN is a similar GNN architecture to the one in[Section˜4\.2](https://arxiv.org/html/2605.15975#S4.SS2)which encodes all HL state information including all facts and all other objects but no HL action\. We can view PddlNN as a purely end\-to\-end approach for handling bilevel planning that aims to learn the HL policy in place of using a separately learned, symbolic HL policy\. We sketch the heterogeneous GNN architecture for PddlNN as follows\.

The inputs are𝐬ll,ghl\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}andshl=ℒ\(𝐬ll\)\\mathit\{s\}^\{\\mathrm\{hl\}\}=\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\)and encoded as a graph with a global node with initial embedding𝐡𝑔𝑙𝑜𝑏𝑎𝑙\\mathbf\{h\}\_\{\\mathit\{global\}\}, no action node, object nodes with embeddings\{𝐡o∣o∈𝐎\}\\\{\\mathbf\{h\}\_\{o\}\\mid o\\in\\mathbf\{O\}\\\}and fact nodes with embeddings\{𝐡f∣f∈shl∪ghl,arity\(f\)≥2\}\\\{\\mathbf\{h\}\_\{f\}\\mid f\\in\\mathit\{s\}^\{\\mathrm\{hl\}\}\\cup\\mathit\{g\}^\{\\mathrm\{hl\}\},\\mathrm\{arity\}\(f\)\\geq 2\\\}\. The initial embeddings are now given by

- •𝐡𝑔𝑙𝑜𝑏𝑎𝑙=𝐱e∥∑p\(\)∈shl𝐞p\|𝒫\|∥∑p\(\)∈ghl𝐞p\|𝒫\|\\mathbf\{h\}\_\{\\mathit\{global\}\}=\\mathbf\{x\}\_\{e\}\\operatorname\*\{\{\\\|\}\}\\sum\_\{p\(\)\\in\\mathit\{s\}^\{\\mathrm\{hl\}\}\}\\mathbf\{e\}\_\{p\}^\{\\left\|\\mathcal\{P\}\\right\|\}\\operatorname\*\{\{\\\|\}\}\\sum\_\{p\(\)\\in\\mathit\{g\}^\{\\mathrm\{hl\}\}\}\\mathbf\{e\}\_\{p\}^\{\\left\|\\mathcal\{P\}\\right\|\}
- •𝐡o=𝐱o∥∑p\(o\)∈shl𝐞p\|𝒫\|∥∑p\(o\)∈ghl𝐞p\|𝒫\|\\mathbf\{h\}\_\{o\}=\\mathbf\{x\}\_\{o\}\\operatorname\*\{\{\\\|\}\}\\sum\_\{p\(o\)\\in\\mathit\{s\}^\{\\mathrm\{hl\}\}\}\\mathbf\{e\}\_\{p\}^\{\\left\|\\mathcal\{P\}\\right\|\}\\operatorname\*\{\{\\\|\}\}\\sum\_\{p\(o\)\\in\\mathit\{g\}^\{\\mathrm\{hl\}\}\}\\mathbf\{e\}\_\{p\}^\{\\left\|\\mathcal\{P\}\\right\|\}and
- •𝐡f=\[f∈shl,f∈ghl\]∥𝐞p\|𝒫\|\\mathbf\{h\}\_\{f\}=\[f\\in\\mathit\{s\}^\{\\mathrm\{hl\}\},f\\in\\mathit\{g\}^\{\\mathrm\{hl\}\}\]\\operatorname\*\{\{\\\|\}\}\\mathbf\{e\}\_\{p\}^\{\\left\|\\mathcal\{P\}\\right\|\}whereppis the predicate offf\.

The GNN algorithm for PddlNN then consists of the following three operations\. All𝐖\\mathbf\{W\}symbols denote learnable weights\.

1. \(a\)The initial encodings are embedded𝐡g\(0\)=𝐖g\(0\)𝐡g\\mathbf\{h\}\_\{g\}^\{\(0\)\}=\\mathbf\{W\}\_\{g\}^\{\(0\)\}\\mathbf\{h\}\_\{g\},𝐡o\(0\)=𝐖o\(0\)𝐡o\\mathbf\{h\}\_\{o\}^\{\(0\)\}=\\mathbf\{W\}\_\{o\}^\{\(0\)\}\\mathbf\{h\}\_\{o\}, and𝐡f\(0\)=𝐖f\(0\)𝐡f\\mathbf\{h\}\_\{f\}^\{\(0\)\}=\\mathbf\{W\}\_\{f\}^\{\(0\)\}\\mathbf\{h\}\_\{f\}\.
2. \(b\)Next,LLrounds of heterogeneous message passing111[https://pytorch\-geometric\.readthedocs\.io/en/latest/tutorial/heterogeneous\.html](https://pytorch-geometric.readthedocs.io/en/latest/tutorial/heterogeneous.html)are performed\. Let⟨U,e,V⟩\\langle U,e,V\\rangledenote a directed edge type from node typeUUto node typeVVvia relationee\. We overload the notationsUUand⟨U,e,V⟩\\left<U,e,V\\right\>to also mean a set that contains the nodes of node typeUU, and pairs of nodes⟨u,v⟩\\langle u,v\\ranglewhich have an edge of type⟨U,e,V⟩\\langle U,e,V\\rangle, respectively\. For each fact nodef=p\(o1,…,om\)∈shl∪ghlf=p\(o\_\{1\},\\ldots,o\_\{m\}\)\\in\\mathit\{s\}^\{\\mathrm\{hl\}\}\\cup\\mathit\{g\}^\{\\mathrm\{hl\}\}withm≥2m\\geq 2, we include bidirectional edge types⟨F,fctj,O⟩\\langle F,\\mathrm\{fct\}\_\{j\},O\\rangleand⟨O,fctj−1,F⟩\\langle O,\\mathrm\{fct\}\_\{j\}^\{\-1\},F\\ranglefor eachj=1,…,mj=1,\\ldots,m\. Node embeddings are updated per edge type by aggregating neighbour messages and the global node to give the updated embedding 𝐡v\(l\+1\)=σ\(𝐖v𝐡v\(l\)\+𝐖𝑔𝑙𝑜𝑏𝑎𝑙𝐡𝑔𝑙𝑜𝑏𝑎𝑙\(l\)\+∑⟨U,e,V⟩max⟨u,v⟩∈⟨U,e,V⟩⁡𝐖⟨U,e,V⟩\(l\)𝐡u\(l\)\)\\displaystyle\\mathbf\{h\}\_\{v\}^\{\(l\+1\)\}=\\sigma\\\!\\left\(\\mathbf\{W\}\_\{v\}\\mathbf\{h\}\_\{v\}^\{\(l\)\}\+\\mathbf\{W\}\_\{\\mathit\{global\}\}\\mathbf\{h\}\_\{\\mathit\{global\}\}^\{\(l\)\}\+\\sum\_\{\\langle U,e,V\\rangle\}\\max\_\{\\langle u,v\\rangle\\in\\langle U,e,V\\rangle\}\\mathbf\{W\}\_\{\\left<U,e,V\\right\>\}^\{\(l\)\}\\mathbf\{h\}\_\{u\}^\{\(l\)\}\\right\)for nonlinearityσ\\sigma\. We then update the global node 𝐡𝑔𝑙𝑜𝑏𝑎𝑙\(l\+1\)=σ\(∑Umaxu∈U⁡𝐖U𝐡u\(l\+1\)\)\.\\displaystyle\\mathbf\{h\}\_\{\\mathit\{global\}\}^\{\(l\+1\)\}=\\sigma\\\!\\left\(\\sum\_\{U\}\\max\_\{u\\in U\}\\mathbf\{W\}\_\{U\}\\mathbf\{h\}\_\{u\}^\{\(l\+1\)\}\\right\)\.
3. \(c\)Finally, a feed\-forward readout layer is applied to the sum of all final embeddings to predict the LL action𝐚ll\\mathbf\{a\}^\{\\mathrm\{ll\}\}\.

#### DetPlan

[Algorithm˜2](https://arxiv.org/html/2605.15975#algorithm2)illustrates the DetPlan baseline in more detail\. It firstly computes an HL plana1,…,ana\_\{1\},\\ldots,a\_\{n\}\([Algorithm˜2](https://arxiv.org/html/2605.15975#algorithm2)\) and aims to naively and faithfully follow it throughout the simulation\. If the precondition of the current HL action, which is tracked by the indexii, is satisfied in the current HL state \([Algorithm˜2](https://arxiv.org/html/2605.15975#algorithm2)\), then we keep it\. Otherwise we check if the next action exists and its precondition is satisfied \([Algorithm˜2](https://arxiv.org/html/2605.15975#algorithm2)\)\. Otherwise, we determine that the HL plan is broken and terminate with failure as no HL action can be computed \([Algorithm˜2](https://arxiv.org/html/2605.15975#algorithm2)\)\.

Input:Problem𝐏\\mathbf\{P\}domain𝒟\\mathcal\{D\}, labelling functionℒ\\mathcal\{L\}, HL goalghl\\mathit\{g\}^\{\\mathrm\{hl\}\}, and LL policyπll\\pi^\{\\mathrm\{ll\}\}\.

1

𝐬ll←𝐬0\\mathbf\{s\}^\{\\mathrm\{ll\}\}\\leftarrow\\mathbf\{s\}\_\{0\}
2

shl←ℒ\(𝐬ll\)\\mathit\{s\}^\{\\mathrm\{hl\}\}\\leftarrow\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\)
3

a1,…,an←findPlan\(shl,ghl\)a\_\{1\},\\ldots,a\_\{n\}\\leftarrow\\mathrm\{findPlan\}\(\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)
4

i←1i\\leftarrow 1
5while*shl∉ghl\\mathit\{s\}^\{\\mathrm\{hl\}\}\\notin\\mathit\{g\}^\{\\mathrm\{hl\}\}andi≤ni\\leq n*

6

shl←ℒ\(𝐬ll\)\\mathit\{s\}^\{\\mathrm\{hl\}\}\\leftarrow\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\)
7if*𝑝𝑟𝑒\(ai\)⊆shl\\mathit\{pre\}\(a\_\{i\}\)\\subseteq\\mathit\{s\}^\{\\mathrm\{hl\}\}*then

8

ahl←ai\\mathit\{a\}^\{\\mathrm\{hl\}\}\\leftarrow a\_\{i\}
9

10else if*𝑝𝑟𝑒\(ai\)⊈shl\\mathit\{pre\}\(a\_\{i\}\)\\not\\subseteq\\mathit\{s\}^\{\\mathrm\{hl\}\}andi\+1≤ni\+1\\leq nand𝑝𝑟𝑒\(ai\+1\)⊆shl\\mathit\{pre\}\(a\_\{i\+1\}\)\\subseteq\\mathit\{s\}^\{\\mathrm\{hl\}\}*then

11

i←i\+1i\\leftarrow i\+1
12

ahl←ai\\mathit\{a\}^\{\\mathrm\{hl\}\}\\leftarrow a\_\{i\}
13

14else

15return*Failure*

16

17

𝐚ll←πll\(𝐬ll,ahl,ghl\)\\mathbf\{a\}^\{\\mathrm\{ll\}\}\\leftarrow\\pi^\{\\mathrm\{ll\}\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)
18

𝐬ll←𝐞𝐱𝐞𝐜\(𝐬ll,𝐚ll\)\\mathbf\{s\}^\{\\mathrm\{ll\}\}\\leftarrow\\mathbf\{exec\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathbf\{a\}^\{\\mathrm\{ll\}\}\)
19

Algorithm 2DetPlan Baseline
#### DetReplan

[Algorithm˜3](https://arxiv.org/html/2605.15975#algorithm3)illustrates the DetReplan baseline in more detail\. It is the same as[Algorithm˜2](https://arxiv.org/html/2605.15975#algorithm2)except in how it handles the failure case \([Algorithm˜3](https://arxiv.org/html/2605.15975#algorithm3)\) by replanning\. It computes a new HL plan and resets the indexiiand length of the plannn\.

Input:Problem𝐏\\mathbf\{P\}domain𝒟\\mathcal\{D\}, labelling functionℒ\\mathcal\{L\}, HL goalghl\\mathit\{g\}^\{\\mathrm\{hl\}\}, and LL policyπll\\pi^\{\\mathrm\{ll\}\}\.

1

𝐬ll←𝐬0\\mathbf\{s\}^\{\\mathrm\{ll\}\}\\leftarrow\\mathbf\{s\}\_\{0\}
2

shl←ℒ\(𝐬ll\)\\mathit\{s\}^\{\\mathrm\{hl\}\}\\leftarrow\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\)
3

a1,…,an←findPlan\(shl,ghl\)a\_\{1\},\\ldots,a\_\{n\}\\leftarrow\\mathrm\{findPlan\}\(\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)
4

i←1i\\leftarrow 1
5while*shl∉ghl\\mathit\{s\}^\{\\mathrm\{hl\}\}\\notin\\mathit\{g\}^\{\\mathrm\{hl\}\}andi≤ni\\leq n*

6

shl←ℒ\(𝐬ll\)\\mathit\{s\}^\{\\mathrm\{hl\}\}\\leftarrow\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\)
7if*𝑝𝑟𝑒\(ai\)⊆shl\\mathit\{pre\}\(a\_\{i\}\)\\subseteq\\mathit\{s\}^\{\\mathrm\{hl\}\}*then

8

ahl←ai\\mathit\{a\}^\{\\mathrm\{hl\}\}\\leftarrow a\_\{i\}
9

10else if*𝑝𝑟𝑒\(ai\)⊈shl\\mathit\{pre\}\(a\_\{i\}\)\\not\\subseteq\\mathit\{s\}^\{\\mathrm\{hl\}\}andi\+1≤ni\+1\\leq nand𝑝𝑟𝑒\(ai\+1\)⊆shl\\mathit\{pre\}\(a\_\{i\+1\}\)\\subseteq\\mathit\{s\}^\{\\mathrm\{hl\}\}*then

11

i←i\+1i\\leftarrow i\+1
12

ahl←ai\\mathit\{a\}^\{\\mathrm\{hl\}\}\\leftarrow a\_\{i\}
13

14else

15

a1,…,an←findPlan\(shl,ghl\)a\_\{1\},\\ldots,a\_\{n\}\\leftarrow\\mathrm\{findPlan\}\(\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)
16

i←1i\\leftarrow 1
17continue

18

19

𝐚ll←πll\(𝐬ll,ahl,ghl\)\\mathbf\{a\}^\{\\mathrm\{ll\}\}\\leftarrow\\pi^\{\\mathrm\{ll\}\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)
20

𝐬ll←𝐞𝐱𝐞𝐜\(𝐬ll,𝐚ll\)\\mathbf\{s\}^\{\\mathrm\{ll\}\}\\leftarrow\\mathbf\{exec\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathbf\{a\}^\{\\mathrm\{ll\}\}\)
21

Algorithm 3DetReplan Baseline
#### NdtPlan

Input:Problem𝐏\\mathbf\{P\}domain𝒟\\mathcal\{D\}, labelling functionℒ\\mathcal\{L\}, HL goalghl\\mathit\{g\}^\{\\mathrm\{hl\}\}, and LL policyπll\\pi^\{\\mathrm\{ll\}\}\.

1

𝐬ll←𝐬0\\mathbf\{s\}^\{\\mathrm\{ll\}\}\\leftarrow\\mathbf\{s\}\_\{0\}
2

shl←ℒ\(𝐬ll\)\\mathit\{s\}^\{\\mathrm\{hl\}\}\\leftarrow\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\)
3

πhl←findPolicy\(shl,ghl\)\\pi^\{\\mathrm\{hl\}\}\\leftarrow\\mathrm\{findPolicy\}\(\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)
4while*shl∉ghl\\mathit\{s\}^\{\\mathrm\{hl\}\}\\notin\\mathit\{g\}^\{\\mathrm\{hl\}\}*

5

shl←ℒ\(𝐬ll\)\\mathit\{s\}^\{\\mathrm\{hl\}\}\\leftarrow\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\)
6

ahl←πhl\(shl,ghl\)\\mathit\{a\}^\{\\mathrm\{hl\}\}\\leftarrow\\pi^\{\\mathrm\{hl\}\}\(\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)
7if*ahl=⊥\\mathit\{a\}^\{\\mathrm\{hl\}\}=\\bot*then

8return*Failure*

9

10

𝐚ll←πll\(𝐬ll,ahl,ghl\)\\mathbf\{a\}^\{\\mathrm\{ll\}\}\\leftarrow\\pi^\{\\mathrm\{ll\}\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)
11

𝐬ll←𝐞𝐱𝐞𝐜\(𝐬ll,𝐚ll\)\\mathbf\{s\}^\{\\mathrm\{ll\}\}\\leftarrow\\mathbf\{exec\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathbf\{a\}^\{\\mathrm\{ll\}\}\)
12

Algorithm 4NdtPlan Baseline[Algorithm˜4](https://arxiv.org/html/2605.15975#algorithm4)illustrates the NdtPlan baseline in more detail\. It firstly aims to compute an HL policy specific to the problem and tries to execute it\. The major difference is that the HL policy may return nothing for a given state, in which case we terminate with failure\.

#### NdtReplan

[Algorithm˜5](https://arxiv.org/html/2605.15975#algorithm5)illustrates the NdtReplan baseline in more detail\. It is the same as[Algorithm˜4](https://arxiv.org/html/2605.15975#algorithm4)except it now aims to replan by recomputing an HL policy for the failure case \([Algorithm˜3](https://arxiv.org/html/2605.15975#algorithm3)\)\.

Input:Problem𝐏\\mathbf\{P\}domain𝒟\\mathcal\{D\}, labelling functionℒ\\mathcal\{L\}, HL goalghl\\mathit\{g\}^\{\\mathrm\{hl\}\}, and LL policyπll\\pi^\{\\mathrm\{ll\}\}\.

1

𝐬ll←𝐬0\\mathbf\{s\}^\{\\mathrm\{ll\}\}\\leftarrow\\mathbf\{s\}\_\{0\}
2

shl←ℒ\(𝐬ll\)\\mathit\{s\}^\{\\mathrm\{hl\}\}\\leftarrow\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\)
3

πhl←findPolicy\(shl,ghl\)\\pi^\{\\mathrm\{hl\}\}\\leftarrow\\mathrm\{findPolicy\}\(\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)
4while*shl∉ghl\\mathit\{s\}^\{\\mathrm\{hl\}\}\\notin\\mathit\{g\}^\{\\mathrm\{hl\}\}*

5

shl←ℒ\(𝐬ll\)\\mathit\{s\}^\{\\mathrm\{hl\}\}\\leftarrow\\mathcal\{L\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\}\)
6

ahl←πhl\(shl,ghl\)\\mathit\{a\}^\{\\mathrm\{hl\}\}\\leftarrow\\pi^\{\\mathrm\{hl\}\}\(\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)
7if*ahl=⊥\\mathit\{a\}^\{\\mathrm\{hl\}\}=\\bot*then

8

πhl←findPolicy\(shl,ghl\)\\pi^\{\\mathrm\{hl\}\}\\leftarrow\\mathrm\{findPolicy\}\(\\mathit\{s\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)
9continue

10

11

𝐚ll←πll\(𝐬ll,ahl,ghl\)\\mathbf\{a\}^\{\\mathrm\{ll\}\}\\leftarrow\\pi^\{\\mathrm\{ll\}\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathit\{a\}^\{\\mathrm\{hl\}\},\\mathit\{g\}^\{\\mathrm\{hl\}\}\)
12

𝐬ll←𝐞𝐱𝐞𝐜\(𝐬ll,𝐚ll\)\\mathbf\{s\}^\{\\mathrm\{ll\}\}\\leftarrow\\mathbf\{exec\}\(\\mathbf\{s\}^\{\\mathrm\{ll\}\},\\mathbf\{a\}^\{\\mathrm\{ll\}\}\)
13

Algorithm 5NdtReplan Baseline

### C\.3Benchmarks

The benchmarks compositionally extend the MetaWorld\[[147](https://arxiv.org/html/2605.15975#bib.bib34),[93](https://arxiv.org/html/2605.15975#bib.bib35)\]benchmarks which consist of 3D MuJoCo simulations of a robot arm\. The following are descriptions of the environment tasks and also serve as the natural language inputs to the VLA baseline\. The environments are illustrated in[Figure˜7](https://arxiv.org/html/2605.15975#A3.F7)\.

![Refer to caption](https://arxiv.org/html/2605.15975v1/figures/env-1-c.png)\(a\)Blocks
![Refer to caption](https://arxiv.org/html/2605.15975v1/figures/env-2-c.png)\(b\)Factory
![Refer to caption](https://arxiv.org/html/2605.15975v1/figures/env-3-c.png)\(c\)Colour
![Refer to caption](https://arxiv.org/html/2605.15975v1/figures/env-4-c.png)\(d\)Gacha

Figure 7:Visualisations of benchmark environments\.#### BlocksS

Place thennblocks to goal locations of the same colour\.

#### BlocksN

Place thennblocks to goal locations of the same colour\. Blocks that are currently at their goal location may teleport to an unoccupied location at any time\.

#### FactoryS

Place thennblocks to goal locations of the same colour\. Every time each of the original blocks is placed, a new block is introduced with a new goal location\.

#### FactoryN

Place thennblocks to goal locations of the same colour\. Every time each of the original blocks is placed, a new block is introduced with a new goal location\. Blocks that are currently at their goal location may also teleport to an unoccupied location at any time\.

#### ColourS

Place thennbrown blocks onto the pink platform, which assigns blocks a random colour when activated\. The goal is to colour each of the brown blocks and then place them on a tray with the same colour\.

#### ColourN

Place thennbrown blocks onto the pink platform, which assigns blocks a random colour when activated\. The goal is to colour each of the brown blocks and then place them on a tray with the same colour\. Blocks that are currently at their goal location may also teleport to an unoccupied location at any time\.

#### GachaS

This environment consists ofnntarget colours, a gallery of trays with different colours, and agachabox which when its lid is closed and activated produces a block of a random colour\. The goal is to match each target colour by placing a block of that colour on its corresponding tray\.

#### GachaN

This environment consists ofnntarget colours, a gallery of trays with different colours, and agachabox which when its lid is closed and activated produces a block of a random colour\. The goal is to match each target colour by placing a block of that colour on its corresponding tray\. Blocks that are currently at their goal location may teleport back into the gacha box whenever it is unoccupied\.

### C\.4Visualising Learned HL Policies

We display the learned HL policies for each of the environments which can generalise to arbitrary numbers of objects\. We refer to[Example˜2](https://arxiv.org/html/2605.15975#Thmexample2)for interpreting the form of the rules\. We automatically generate LLM interpretations of the policies via Claude Sonnet 4\.6\.

#### BlocksSand BlocksN

1:∃x,l\.ℎ𝑜𝑙𝑑𝑖𝑛𝑔\(x\),𝑐𝑙𝑒𝑎𝑟\(l\),𝑎𝑡\(x,l\)¯\\displaystyle 1:\\exists x,l\.\\;\\mathit\{holding\}\(x\),\\mathit\{clear\}\(l\),\\underline\{\\mathit\{at\}\(x,l\)\}↦𝑝𝑙𝑎𝑐𝑒\(x,l\)\\displaystyle\\mapsto\\mathit\{place\}\(x,l\)2:∃x,l\.𝑐𝑙𝑒𝑎𝑟\(x\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑎𝑡\(x,l\)¯\\displaystyle 2:\\exists x,l\.\\;\\mathit\{clear\}\(x\),\\mathit\{gripperFree\}\(\),\\underline\{\\mathit\{at\}\(x,l\)\}↦𝑝𝑖𝑐𝑘\(x\)\\displaystyle\\mapsto\\mathit\{pick\}\(x\)
LLM\-interpretation: The policy has two rules, applied by priority\. Rule 1 says: if the gripper is holding blockxxand its goal locationllis clear, placexxatll\. Rule 2 says: if blockxxis clear \(nothing on top\), the gripper is free, andxxhas a goal location, pick upxx\. Together these encode a simple pick\-and\-place loop: pick the next block with an unmet goal, then immediately place it\. The policy also handles BlocksNreactively, since the goal predicate𝑎𝑡𝒢\(x,l\)\\mathit\{at\}^\{\\mathcal\{G\}\}\(x,l\)remains true for any block teleported away from its goal, so it gets re\-picked on the next step\.

#### FactorySand FactoryN

1:∃x,l\.ℎ𝑜𝑙𝑑𝑖𝑛𝑔\(x\),𝑐𝑙𝑒𝑎𝑟\(l\),𝑎𝑡\(x,l\)¯\\displaystyle 1:\\exists x,l\.\\;\\mathit\{holding\}\(x\),\\mathit\{clear\}\(l\),\\underline\{\\mathit\{at\}\(x,l\)\}↦𝑝𝑙𝑎𝑐𝑒\(x,l\)\\displaystyle\\mapsto\\mathit\{place\}\(x,l\)2:∃x,l\.𝑐𝑙𝑒𝑎𝑟\(x\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑎𝑡\(x,l\)¯\\displaystyle 2:\\exists x,l\.\\;\\mathit\{clear\}\(x\),\\mathit\{gripperFree\}\(\),\\underline\{\\mathit\{at\}\(x,l\)\}↦𝑝𝑖𝑐𝑘\(x\)\\displaystyle\\mapsto\\mathit\{pick\}\(x\)
LLM\-interpretation: The rules are identical to those of BlocksS\. This is notable: even though new blocks are introduced dynamically when existing ones are placed, the policy generalises automatically because its rules are existentially quantified over objects\. Any newly introduced blockxxwith goal locationllimmediately satisfies Rule 2 and gets picked, then placed by Rule 1, with no change to the policy itself\.

#### ColourSand ColourN

1:∃c,x,t\.𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(x,c\),𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(t,c\),ℎ𝑜𝑙𝑑𝑖𝑛𝑔\(x\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(x\)¯\\displaystyle 1:\\exists c,x,t\.\\;\\mathit\{colourOf\}\(x,c\),\\mathit\{colourOf\}\(t,c\),\\mathit\{holding\}\(x\),\\underline\{\\mathit\{achievedGoal\}\(x\)\}↦𝑝𝑙𝑎𝑐𝑒𝐺𝑜𝑎𝑙\(x,t,c\)\\displaystyle\\mapsto\\mathit\{placeGoal\}\(x,t,c\)2:∃c,x,t\.𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(x,c\),𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(t,c\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑜𝑛\(x,t\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(x\)¯\\displaystyle 2:\\exists c,x,t\.\\;\\mathit\{colourOf\}\(x,c\),\\mathit\{colourOf\}\(t,c\),\\mathit\{gripperFree\}\(\),\\mathit\{on\}\(x,t\),\\underline\{\\mathit\{achievedGoal\}\(x\)\}↦𝑝𝑖𝑐𝑘\(x\)\\displaystyle\\mapsto\\mathit\{pick\}\(x\)2:∃c,x,t\.𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(x,c\),𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(t,c\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(x\)¯\\displaystyle 2:\\exists c,x,t\.\\;\\mathit\{colourOf\}\(x,c\),\\mathit\{colourOf\}\(t,c\),\\mathit\{gripperFree\}\(\),\\underline\{\\mathit\{achievedGoal\}\(x\)\}↦𝑝𝑖𝑐𝑘\(x\)\\displaystyle\\mapsto\\mathit\{pick\}\(x\)3:∃b,c,x,t,t′\.𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(t,c\),𝑐𝑜𝑙𝑜𝑢𝑟𝑒𝑟\(t′\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑜𝑛\(x,t′\),𝑢𝑛𝑘𝑛𝑜𝑤𝑛\(x\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(x\)¯\\displaystyle 3:\\exists b,c,x,t,t^\{\\prime\}\.\\;\\mathit\{colourOf\}\(t,c\),\\mathit\{colourer\}\(t^\{\\prime\}\),\\mathit\{gripperFree\}\(\),\\mathit\{on\}\(x,t^\{\\prime\}\),\\mathit\{unknown\}\(x\),\\underline\{\\mathit\{achievedGoal\}\(x\)\}↦𝑜𝑝𝑒𝑛\(x,b,t′\)\\displaystyle\\mapsto\\mathit\{open\}\(x,b,t^\{\\prime\}\)4:∃c,x,t,t′\.𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(t,c\),𝑐𝑜𝑙𝑜𝑢𝑟𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑐𝑜𝑙𝑜𝑢𝑟𝑒𝑟\(t′\),ℎ𝑜𝑙𝑑𝑖𝑛𝑔\(x\),𝑢𝑛𝑘𝑛𝑜𝑤𝑛\(x\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(x\)¯\\displaystyle 4:\\exists c,x,t,t^\{\\prime\}\.\\;\\mathit\{colourOf\}\(t,c\),\\mathit\{colourerFree\}\(\),\\mathit\{colourer\}\(t^\{\\prime\}\),\\mathit\{holding\}\(x\),\\mathit\{unknown\}\(x\),\\underline\{\\mathit\{achievedGoal\}\(x\)\}↦𝑝𝑙𝑎𝑐𝑒\(x,t′\)\\displaystyle\\mapsto\\mathit\{place\}\(x,t^\{\\prime\}\)5:∃c,x,t,t′\.𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(t,c\),𝑐𝑜𝑙𝑜𝑢𝑟𝑒𝑟\(t′\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑜𝑛\(x,t′\),𝑢𝑛𝑘𝑛𝑜𝑤𝑛\(x\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(x\)¯\\displaystyle 5:\\exists c,x,t,t^\{\\prime\}\.\\;\\mathit\{colourOf\}\(t,c\),\\mathit\{colourer\}\(t^\{\\prime\}\),\\mathit\{gripperFree\}\(\),\\mathit\{on\}\(x,t^\{\\prime\}\),\\mathit\{unknown\}\(x\),\\underline\{\\mathit\{achievedGoal\}\(x\)\}↦𝑝𝑖𝑐𝑘\(x\)\\displaystyle\\mapsto\\mathit\{pick\}\(x\)5:∃c,x,t,t′\.𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(t,c\),𝑐𝑜𝑙𝑜𝑢𝑟𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑐𝑜𝑙𝑜𝑢𝑟𝑒𝑟\(t′\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑢𝑛𝑘𝑛𝑜𝑤𝑛\(x\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(x\)¯\\displaystyle 5:\\exists c,x,t,t^\{\\prime\}\.\\;\\mathit\{colourOf\}\(t,c\),\\mathit\{colourerFree\}\(\),\\mathit\{colourer\}\(t^\{\\prime\}\),\\mathit\{gripperFree\}\(\),\\mathit\{unknown\}\(x\),\\underline\{\\mathit\{achievedGoal\}\(x\)\}↦𝑝𝑖𝑐𝑘\(x\)\\displaystyle\\mapsto\\mathit\{pick\}\(x\)
LLM\-interpretation: The task requires colouring brown \(unknown\) blocks on a pink platform before placing them on matching trays, and the policy decomposes the workflow into five priority levels\. Rule 1: if holding a blockxxwhose colourccmatches a trayttof the same colour, placexxontt\. Rule 2 \(two variants\): if a coloured blockxxalready matches a tray colour, pick it up; the two variants differ only in whetherxxis currently resting on a surface\. Rule 3: if an unknown blockxxis sitting on the colourert′t^\{\\prime\}, activate the colourer \(𝑜𝑝𝑒𝑛\\mathit\{open\}\) to assignxxa colour\. Rule 4: if the gripper is holding an unknown block and the colourer is free, place the block on the colourer\. Rule 5 \(two variants\): if a block has no colour yet, pick it up so it can be brought to the colourer\. The priority ordering encodes the full pipeline: fetch unknown block, place it on the colourer, activate the colourer, then pick up the now\-coloured block and place it on the matching tray\. The same policy handles ColourNreactively, as in BlocksS\.

#### GachaSand GachaN

1:∃c,x,t\.𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(x,c\),ℎ𝑜𝑙𝑑𝑖𝑛𝑔\(x\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t,c\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(c\)¯\\displaystyle 1:\\exists c,x,t\.\\;\\mathit\{colourOf\}\(x,c\),\\mathit\{holding\}\(x\),\\mathit\{trayColour\}\(t,c\),\\underline\{\\mathit\{achievedGoal\}\(c\)\}↦𝑝𝑙𝑎𝑐𝑒𝐺𝑜𝑎𝑙\(x,t,c\)\\displaystyle\\mapsto\\mathit\{placeGoal\}\(x,t,c\)2:∃c,x,d,t\.𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(x,c\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑖𝑛\(x,d\),𝑜𝑝𝑒𝑛𝑒𝑑\(d\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t,c\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(c\)¯\\displaystyle 2:\\exists c,x,d,t\.\\;\\mathit\{colourOf\}\(x,c\),\\mathit\{gripperFree\}\(\),\\mathit\{in\}\(x,d\),\\mathit\{opened\}\(d\),\\mathit\{trayColour\}\(t,c\),\\underline\{\\mathit\{achievedGoal\}\(c\)\}↦𝑝𝑖𝑐𝑘\(x\)\\displaystyle\\mapsto\\mathit\{pick\}\(x\)3:∃c,x,d,t\.𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(x,c\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑖𝑛\(x,d\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t,c\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(c\)¯\\displaystyle 3:\\exists c,x,d,t\.\\;\\mathit\{colourOf\}\(x,c\),\\mathit\{gripperFree\}\(\),\\mathit\{in\}\(x,d\),\\mathit\{trayColour\}\(t,c\),\\underline\{\\mathit\{achievedGoal\}\(c\)\}↦𝑜𝑝𝑒𝑛\(d\)\\displaystyle\\mapsto\\mathit\{open\}\(d\)4:∃b,c,d,t\.𝑐𝑙𝑒𝑎𝑟\(d\),𝑐𝑙𝑜𝑠𝑒𝑑\(d\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t,c\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(c\)¯\\displaystyle 4:\\exists b,c,d,t\.\\;\\mathit\{clear\}\(d\),\\mathit\{closed\}\(d\),\\mathit\{gripperFree\}\(\),\\mathit\{trayColour\}\(t,c\),\\underline\{\\mathit\{achievedGoal\}\(c\)\}↦𝑟𝑜𝑙𝑙\(d,b\)\\displaystyle\\mapsto\\mathit\{roll\}\(d,b\)5:∃c,d,t\.𝑐𝑙𝑒𝑎𝑟\(d\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑜𝑝𝑒𝑛𝑒𝑑\(d\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t,c\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(c\)¯\\displaystyle 5:\\exists c,d,t\.\\;\\mathit\{clear\}\(d\),\\mathit\{gripperFree\}\(\),\\mathit\{opened\}\(d\),\\mathit\{trayColour\}\(t,c\),\\underline\{\\mathit\{achievedGoal\}\(c\)\}↦𝑐𝑙𝑜𝑠𝑒\(d\)\\displaystyle\\mapsto\\mathit\{close\}\(d\)6:∃c,c′,x,d,t,t′\.𝑐𝑙𝑒𝑎𝑟\(d\),𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(x,c\),ℎ𝑜𝑙𝑑𝑖𝑛𝑔\(x\),𝑜𝑝𝑒𝑛𝑒𝑑\(d\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t,c\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t′,c′\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(c′\)¯\\displaystyle 6:\\exists c,c^\{\\prime\},x,d,t,t^\{\\prime\}\.\\;\\mathit\{clear\}\(d\),\\mathit\{colourOf\}\(x,c\),\\mathit\{holding\}\(x\),\\mathit\{opened\}\(d\),\\mathit\{trayColour\}\(t,c\),\\mathit\{trayColour\}\(t^\{\\prime\},c^\{\\prime\}\),\\underline\{\\mathit\{achievedGoal\}\(c^\{\\prime\}\)\}↦𝑝𝑙𝑎𝑐𝑒𝐺𝑜𝑎𝑙\(x,t,c\)\\displaystyle\\mapsto\\mathit\{placeGoal\}\(x,t,c\)7:∃c,c′,x,d,t,t′\.𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(x,c\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑖𝑛\(x,d\),𝑜𝑝𝑒𝑛𝑒𝑑\(d\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t,c\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t′,c′\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(c′\)¯\\displaystyle 7:\\exists c,c^\{\\prime\},x,d,t,t^\{\\prime\}\.\\;\\mathit\{colourOf\}\(x,c\),\\mathit\{gripperFree\}\(\),\\mathit\{in\}\(x,d\),\\mathit\{opened\}\(d\),\\mathit\{trayColour\}\(t,c\),\\mathit\{trayColour\}\(t^\{\\prime\},c^\{\\prime\}\),\\underline\{\\mathit\{achievedGoal\}\(c^\{\\prime\}\)\}↦𝑝𝑖𝑐𝑘\(x\)\\displaystyle\\mapsto\\mathit\{pick\}\(x\)7:∃c,c′,x,d,t,t′\.𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(x,c\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑖𝑛\(x,d\),𝑜𝑝𝑒𝑛𝑒𝑑\(d\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t,c′\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t′,c\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(c′\)¯\\displaystyle 7:\\exists c,c^\{\\prime\},x,d,t,t^\{\\prime\}\.\\;\\mathit\{colourOf\}\(x,c\),\\mathit\{gripperFree\}\(\),\\mathit\{in\}\(x,d\),\\mathit\{opened\}\(d\),\\mathit\{trayColour\}\(t,c^\{\\prime\}\),\\mathit\{trayColour\}\(t^\{\\prime\},c\),\\underline\{\\mathit\{achievedGoal\}\(c^\{\\prime\}\)\}↦𝑝𝑖𝑐𝑘\(x\)\\displaystyle\\mapsto\\mathit\{pick\}\(x\)8:∃c,c′,x,d,t,t′\.𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(x,c\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑖𝑛\(x,d\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t,c′\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t′,c\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(c′\)¯\\displaystyle 8:\\exists c,c^\{\\prime\},x,d,t,t^\{\\prime\}\.\\;\\mathit\{colourOf\}\(x,c\),\\mathit\{gripperFree\}\(\),\\mathit\{in\}\(x,d\),\\mathit\{trayColour\}\(t,c^\{\\prime\}\),\\mathit\{trayColour\}\(t^\{\\prime\},c\),\\underline\{\\mathit\{achievedGoal\}\(c^\{\\prime\}\)\}↦𝑜𝑝𝑒𝑛\(d\)\\displaystyle\\mapsto\\mathit\{open\}\(d\)8:∃c,c′,x,d,t,t′\.𝑐𝑜𝑙𝑜𝑢𝑟𝑂𝑓\(x,c\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑖𝑛\(x,d\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t,c\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t′,c′\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(c′\)¯\\displaystyle 8:\\exists c,c^\{\\prime\},x,d,t,t^\{\\prime\}\.\\;\\mathit\{colourOf\}\(x,c\),\\mathit\{gripperFree\}\(\),\\mathit\{in\}\(x,d\),\\mathit\{trayColour\}\(t,c\),\\mathit\{trayColour\}\(t^\{\\prime\},c^\{\\prime\}\),\\underline\{\\mathit\{achievedGoal\}\(c^\{\\prime\}\)\}↦𝑜𝑝𝑒𝑛\(d\)\\displaystyle\\mapsto\\mathit\{open\}\(d\)9:∃b,c,c′,d,t,t′\.𝑐𝑙𝑒𝑎𝑟\(d\),𝑐𝑙𝑜𝑠𝑒𝑑\(d\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t,c\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t′,c′\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(c\)¯\\displaystyle 9:\\exists b,c,c^\{\\prime\},d,t,t^\{\\prime\}\.\\;\\mathit\{clear\}\(d\),\\mathit\{closed\}\(d\),\\mathit\{gripperFree\}\(\),\\mathit\{trayColour\}\(t,c\),\\mathit\{trayColour\}\(t^\{\\prime\},c^\{\\prime\}\),\\underline\{\\mathit\{achievedGoal\}\(c\)\}↦𝑟𝑜𝑙𝑙\(d,b\)\\displaystyle\\mapsto\\mathit\{roll\}\(d,b\)9:∃b,c,c′,d,t,t′\.𝑐𝑙𝑒𝑎𝑟\(d\),𝑐𝑙𝑜𝑠𝑒𝑑\(d\),𝑔𝑟𝑖𝑝𝑝𝑒𝑟𝐹𝑟𝑒𝑒\(\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t,c\),𝑡𝑟𝑎𝑦𝐶𝑜𝑙𝑜𝑢𝑟\(t′,c′\),𝑎𝑐ℎ𝑖𝑒𝑣𝑒𝑑𝐺𝑜𝑎𝑙\(c′\)¯\\displaystyle 9:\\exists b,c,c^\{\\prime\},d,t,t^\{\\prime\}\.\\;\\mathit\{clear\}\(d\),\\mathit\{closed\}\(d\),\\mathit\{gripperFree\}\(\),\\mathit\{trayColour\}\(t,c\),\\mathit\{trayColour\}\(t^\{\\prime\},c^\{\\prime\}\),\\underline\{\\mathit\{achievedGoal\}\(c^\{\\prime\}\)\}↦𝑟𝑜𝑙𝑙\(d,b\)\\displaystyle\\mapsto\\mathit\{roll\}\(d,b\)
LLM\-interpretation: The task requires producing blocks of specific colours from a gacha box \(close lid, roll, open lid, then extract block\) and placing them on matching trays\. The policy has nine priority levels\. Rule 1: if holding a blockxxof the needed colourcc, place it on the matching traytt\. Rule 2: if the gachaddis open and contains a block of the needed colour, pick it\. Rule 3: if the gacha contains a block of the needed colour but is closed, open it\. Rule 4: if the gacha is closed and empty, roll it to produce a new block of random colour\. Rule 5: if the gacha is open but does not contain a useful block, close it so that it can be rolled again\. Rules 6–8 cover*opportunistic*actions when the currently pursued goal colour isc′c^\{\\prime\}but a block of a different needed colourccbecomes available: place it on its matching tray \(Rule 6\), pick it from an open gacha \(Rule 7, two variants differing in the binding ofccandc′c^\{\\prime\}to the trays\), or open the gacha to retrieve it \(Rule 8, two variants\)\. Rule 9 \(two variants\) rolls the gacha when at least two unachieved goal colours remain and the gacha is empty\. The policy is more complex than BlocksSbecause it must reason about both the stochastic production mechanism \(rolls produce random colours\) and opportunity costs across multiple simultaneous colour goals; Rules 6–9 in particular ensure that lucky rolls are never wasted, even when they do not match the colour the agent is currently chasing\.
Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning

Similar Articles

Milestone-Guided Policy Learning for Long-Horizon Language Agents

RoboTALES: Learning Reasoning-Guided Robot Policies via Task-Aligned Simulated Futures

Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents

PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents

Submit Feedback

Similar Articles

Milestone-Guided Policy Learning for Long-Horizon Language Agents
RoboTALES: Learning Reasoning-Guided Robot Policies via Task-Aligned Simulated Futures
Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents
PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models
BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents