Neuro-Inspired Inverse Learning for Planning and Control

arXiv cs.AI Papers

Summary

This paper introduces a neuro-inspired framework called Inverter that uses Inverse Learning (IL) for fast and efficient planning and control, achieving significant improvements on D4RL benchmarks and quantum gate synthesis with orders of magnitude less inference computation.

arXiv:2605.24152v1 Announce Type: new Abstract: We present a neuro-inspired framework for embodied planning and control. Building on three principles that enable fast and highly effective goal-directed behavior in the mammalian brain - paired forward/inverse internal models, open-loop multi-step motor commands, and sequential, hierarchical organization of action - our Inverter framework uses learned components, trained end-to-end through Inverse Learning (IL) and supplemented where natural by analytic or algorithmic modules; we formalize IL and delineate it from supervised, reinforcement, and imitation learning. IL bridges Reinforcement Learning (RL)-style amortization, which runs in a single forward pass but emits only one action at a time, and Optimal Control (OC)-style sequence planning over whole trajectories, but with iterative test-time computation. Single Inverters or hierarchical n=2 Inverter stacks match or improve on offline-RL and diffusion-planner baselines on all 3 maze2d and 6 antmaze D4RL variants by an average of +24.2% (range -1.9% to +78.2%), at one-to-two orders of magnitude less inference compute time. Distinctively, optimizing through the Forward model (FoM) over the entire T-step action sequence - rather than per step - lets Inverters produce smooth, goal-coherent, trajectory-wide structure and reach control policies closer to the analytic optimum than the policy underlying the training data itself. We also identify a failure mode of IL: FoM hacking under narrow training-data coverage, which we mitigate by using random training data with broader coverage. As an application example, a Pulse Inverter synthesizes arbitrary single-qubit quantum gates with fidelity matching the standard iterative numerical baseline (GRAPE), at more than 1000x lower per-gate compute time. In summary, we conclude that IL enables a versatile class of world-interfaces, especially for latency- and resource-critical embodied AI.

# Neuro-Inspired Inverse Learning for Planning and Control
Source: [https://arxiv.org/html/2605.24152](https://arxiv.org/html/2605.24152)
Tonio Ball, NeuroMentum AI and IMBIT, University of Freiburg, Germany\. Correspondence: tonio\.ball@neuromentum\.ai

\(May 2026\)

###### Abstract

We present a neuro\-inspired framework for embodied planning and control\. Building on three principles that enable fast and highly effective goal\-directed behavior in the mammalian brain – paired forward/inverse internal models, open\-loop multi\-step motor commands, and sequential, hierarchical organization of action – our *Inverter* framework uses learned components, trained end\-to\-end through *[Inverse Learning](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) \([IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22)\)* and supplemented where natural by analytic or algorithmic modules; we formalize [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) and delineate it from supervised, reinforcement, and imitation learning\. [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) bridges [Reinforcement Learning](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) \([RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59)\)\-style amortization, which runs in a single forward pass but emits only one action at a time, and [Optimal Control](https://arxiv.org/html/2605.24152#A1.SS1.48.48.48) \([OC](https://arxiv.org/html/2605.24152#A1.SS1.48.48.48)\)\-style sequence planning over whole trajectories, but with iterative test\-time computation\. Single Inverters or hierarchical $n = 2$ Inverter stacks match or improve on offline\-[RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) and diffusion\-planner baselines on all 3 maze2d and 6 antmaze [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) variants by an average of $+24.2\%$ \(range $-1.9\%$ to $+78.2\%$\), at one\-to\-two orders of magnitude less inference compute time\.
Distinctively, optimizing through the [Forward model](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) \([FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\) over the entire $T$\-step action sequence – rather than per step – lets Inverters produce smooth, goal\-coherent, trajectory\-wide structure and reach control policies closer to the analytic optimum than the policy underlying the training data itself\. We also identify a failure mode of [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22): [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) hacking under narrow training\-data coverage, which we mitigate by using *random* training data with broader coverage\. As an application example, a Pulse Inverter synthesizes arbitrary single\-qubit quantum gates with fidelity matching the standard iterative numerical baseline \([GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21)\), at more than $1000\times$ lower per\-gate compute time\. In summary, we conclude that [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) enables a versatile class of world\-interfaces, especially for latency\- and resource\-critical embodied AI\.

## 1 Introduction

Humans are able to generate explicit plans on multiple timescales even before we start acting on them\. Imagine, for example, planning a trip to Paris\. Even before we lift a finger, we might already think about booking a train and finding a hotel\. We also might already know how we will approach each of these steps — for example, how to navigate the booking site of our choice to find a suitable accommodation\. And when we reach for our laptop, eager to begin, our hand may sweep along a smooth, pre\-shaped trajectory — a*“ballistic”*movement planned as a whole\. Human goal\-directed behavior is thus fundamentally organized as a hierarchy of planning and control across widely separated timescales, often laid out, in some form,*before*any action is taken\.

Such multi\-timescale, pre\-optimized plans over whole sequences of actions are not the central focus of dominant learning\-based paradigms for continuous control such as [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59), which is designed to emit a single reactive action at a time, or [OC](https://arxiv.org/html/2605.24152#A1.SS1.48.48.48), which aims to optimize over whole trajectories but typically iteratively at runtime \(a property also shared by related frameworks such as active inference\)\. Here we present a paradigm for learning planning and control whose principles are inspired by the functional organization of the mammalian brain, and whose conceptual center is the fast and effective *pre\-generation* of such hierarchical, multi\-timescale plans and action sequences for goal\-directed, embodied behavior\. We start by considering the brain as an inversion machine\.

![Refer to caption](https://arxiv.org/html/2605.24152v1/x1.png)

Figure 1: The Inverter planning and control framework\. \(A\) Schematic with two paired Inverters at different abstraction levels\. The *Level 1 Inverter \(Control\)* $g_{\phi}(s_{k-1}, c_k) \to a^{(k)}_{1:T}$ is trained by [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) through a [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) and produces each $T$\-step action chunk in a single feedforward pass per chunk: no inner optimization loop, no autoregressive decoding, no iterative denoising\. The *Level 2 Inverter \(Planning\)* $h_{\psi}(s_0, G) \to c_{1:K}$ emits a sequence of $K$ subgoals, one per chunk\. The framework extends naturally to deeper hierarchies \($n \geq 3$, as discussed in the Outlook\)\. The chunks tile the time axis and ultimately drive the agent to the goal $G$\. \(B\) Illustrative single\-shot example\. From the start state \($\bullet$\), the Inverter emits a full action sequence in one feedforward pass; the green arrows are the planned per\-step actions; the curve is the trajectory that emerges from executing those actions through the environment dynamics, with instantaneous speed color\-coded\. Mirroring the optimal\-feedback\-control account of neurobiological motor control \[[94](https://arxiv.org/html/2605.24152#bib.bib27)\], our framework thus rejects the notion of a “desired trajectory”: the properties of the observed motion emerge from the action sequence\.

#### The brain as an inversion machine\.

The dominant Bayesian\-brain view has conceptualized the brain as an inference machine \[[58](https://arxiv.org/html/2605.24152#bib.bib3),[28](https://arxiv.org/html/2605.24152#bib.bib4)\]\. We conceptualize the brain first and foremost as an *inversion* machine: goal\-directed behavior poses the inverse problem *“given a desired outcome, which actions realize it?”*; solving this inverse problem minimizes the discrepancy between desired and actual outcomes\. The computational core is inversion of a model of how the world responds to action, a problem admitting a rich landscape of implementations: iterative or amortized \(a learned direct inverse mapping in one feedforward pass\); probabilistic or deterministic; over fully or partially observable, single\- or multi\-agent dynamics; with continuous, discrete, or hybrid state and action spaces; and realized through closed\-form, algorithmic, or learned components\. We presume the brain flexibly uses whichever solution fits the task, timescale, and resources \[[32](https://arxiv.org/html/2605.24152#bib.bib5),[66](https://arxiv.org/html/2605.24152#bib.bib94),[33](https://arxiv.org/html/2605.24152#bib.bib95)\]\. Here we focus on the amortized, deterministic, single\-agent, and fully observable direct mapping variant, following the classical motor control concepts of Jordan & Rumelhart’s distal\-teacher framework \[[50](https://arxiv.org/html/2605.24152#bib.bib6)\] and Kawato’s internal\-model theory \[[53](https://arxiv.org/html/2605.24152#bib.bib7),[98](https://arxiv.org/html/2605.24152#bib.bib8)\]\. The combination of three essential principles organizes our approach:

- *Paired [Forward models](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) \([FoMs](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\) and [Inverse models](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) \([IMs](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23)\)*, with the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) providing the instructive error signal \[[50](https://arxiv.org/html/2605.24152#bib.bib6),[53](https://arxiv.org/html/2605.24152#bib.bib7),[98](https://arxiv.org/html/2605.24152#bib.bib8)\]\. \(Footnote 1: Abbreviations used throughout the paper are listed in App\. [A\.1](https://arxiv.org/html/2605.24152#A1.SS1)\.\)
- *Open\-loop multi\-step motor commands*, ballistically executable in pre\-planned chunks too fast for sensory correction \[[101](https://arxiv.org/html/2605.24152#bib.bib107),[23](https://arxiv.org/html/2605.24152#bib.bib9)\]\.
- *Sequential, hierarchical organization of action*, segmenting behaviors into sequential sub\-plans and nesting them across levels of timescales and abstraction, with higher\-order areas issuing subgoals to lower\-level loops \[[35](https://arxiv.org/html/2605.24152#bib.bib10),[24](https://arxiv.org/html/2605.24152#bib.bib11),[7](https://arxiv.org/html/2605.24152#bib.bib12),[44](https://arxiv.org/html/2605.24152#bib.bib106)\]\.

Table 1: The five Inverters in this paper\. Each level inverts a different forward process and emits its multi\-step output sequence in one feedforward pass; four are neural Inverters trained by [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22), one is a simple algorithmic [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) Path Inverter\.

#### The Inverter planning and control framework\.

We organize the three principles above into a planning/control framework \(Fig\. [1](https://arxiv.org/html/2605.24152#S1.F1)\) consisting of a hierarchy of Inverters at $n = 1, 2, 3, \ldots$ abstraction levels, all sharing the same building block \(an inverse\-learning network through a [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\)\. In this paper we focus on the two levels we evaluate empirically \(Tab\. [1](https://arxiv.org/html/2605.24152#S1.T1)\):

- *Level 1 Inverter \(Control\)*: emits sequences of actions – continuous and physical in our experiments \(motor torques, control pulses\), or discrete signals \(e\.g\., API calls\) in other domains\.
- *Level 2 Inverter \(Planning\)*: emits sequences of subgoals consumed by Level 1\.
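
The call structure of the two levels can be sketched in a few lines of Python. This is only an illustration of the interface, not the paper's implementation: `level2_plan`, `level1_chunk`, and `env_step` are hypothetical names, the hand-written rules stand in for trained networks, and the environment is an assumed one-dimensional integrator.

```python
def level2_plan(s0, goal, K):
    """Stand-in for h_psi(s0, G) -> c_{1:K}: K evenly spaced subgoals (hypothetical rule)."""
    return [s0 + (goal - s0) * k / K for k in range(1, K + 1)]


def level1_chunk(s, c, T):
    """Stand-in for g_phi(s_{k-1}, c_k) -> a_{1:T}: a whole T-step chunk from one call."""
    return [(c - s) / T] * T


def env_step(s, a):
    """Toy integrator dynamics s' = s + a (an assumption made for this sketch)."""
    return s + a


def run_episode(s0, goal, K=4, T=8):
    subgoals = level2_plan(s0, goal, K)      # one Level 2 call per episode
    s, executed = s0, []
    for c in subgoals:
        chunk = level1_chunk(s, c, T)        # one Level 1 call per chunk
        for a in chunk:                      # open loop: no re-planning inside a chunk
            s = env_step(s, a)
            executed.append(a)
    return s, executed
```

Calling `run_episode(0.0, 2.0)` executes `K * T` environment steps while querying the two stand-in models only `1 + K` times, which is the property the chunked, hierarchical design is meant to exploit.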

The same recursive shape extends naturally to $n \geq 3$ Inverters, each specifying how the level immediately below generates its output – for instance by selecting a subgoal\-emission strategy, by composing sequences\-of\-sequences of subgoals, or by other context\-dependent forms; we do not evaluate $n \geq 3$ here \(see Outlook\)\. Neural architectures used within the framework can vary by domain: all maze [FoMs](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) and Inverters in this paper are transformers, while the quantum Pulse Inverter is a [Multilayer perceptron](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35) \([MLP](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35)\), which suits its compact target without a temporal context\. Where an analytic or algorithmic solution is natural and more useful, we substitute it for the [Neural network](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) \([NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46)\):

- *Lindblad channel as [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)* \(Pulse Inverter, Sec\. [4\.5](https://arxiv.org/html/2605.24152#S4.SS5)\): the noisy\-transmon dynamics are governed by a known master equation, so a learned [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) would only approximate physics we already have in closed form\.
- *[Breadth\-first search](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) \([BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5)\)\-based Path Inverter on the offline\-data occupancy grid* \(maze2d\-medium/large, Sec\. [4\.2](https://arxiv.org/html/2605.24152#S4.SS2)\): waypoint routing through long corridors is a discrete shortest\-path subproblem that a simple algorithmic [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) already solves on the data’s support, so a learned Level 2 Inverter would add complexity without benefit\.
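
To make the second substitution concrete, here is a minimal breadth-first search over an occupancy grid. It is a sketch under stated assumptions rather than the paper's Path Inverter: this section does not say how the grid is built or how waypoints are selected, so the function name `bfs_waypoints`, the 4-connected neighbourhood, and the convention that `1` marks a cell covered by offline data are all illustrative choices.

```python
from collections import deque


def bfs_waypoints(grid, start, goal):
    """Shortest 4-connected path over cells marked 1, as a list of (row, col); None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}                   # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:          # walk the parent links back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 1 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None
```

On a U-shaped grid such as `[[1, 1, 1], [1, 0, 1], [1, 0, 1]]`, a query from `(2, 0)` to `(2, 2)` is routed around the uncovered middle column. Because the search only expands covered cells, the returned waypoints stay on the support of the data by construction.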

This strategy corresponds to the task\-, timescale\-, and resource\-dependent flexibility we presume in the brain \[[32](https://arxiv.org/html/2605.24152#bib.bib5),[66](https://arxiv.org/html/2605.24152#bib.bib94),[33](https://arxiv.org/html/2605.24152#bib.bib95)\]\. Interestingly, in practice, this implementation approach gave rise to neurosymbolic patterns across several of our Inverters, which we revisit as a fourth, emergent organizing principle in the Discussion \(Sec\. [5](https://arxiv.org/html/2605.24152#S5)\)\. Next, we formalize the [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) paradigm including training rules\.

## 2 Neuro\-Inspired Inverse Learning

As a *paradigm*, here we formalize [Inverse Learning](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22): training a network by differentiating a task objective through a learned [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) \(as a form of self\-supervised learning through a trained [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\), delineated from supervised, reinforcement, and imitation learning \(Table [2](https://arxiv.org/html/2605.24152#S2.T2)\)\. At the *algorithm* level we then formalize the concrete training rule \(Eq\. [2](https://arxiv.org/html/2605.24152#S2.E2)\) that this paradigm yields when applied to the Inverter’s role specification\. Acting to achieve a goal admits a direct formulation as action optimization: given state $s_0$ and a task conditioning $c$ \(the framework’s per\-chunk subgoal $c_k$ at Level 1, the overall goal $G$ at Level 2\), find

$$a_{1:T}^{*} = \arg\min_{a_{1:T}} \; \mathcal{J}(s_{0:T},\, a_{1:T},\, c) \quad \text{s.t.} \quad s_{1:T} = f(s_0,\, a_{1:T}), \tag{1}$$

a discrete\-time Bolza problem of classical [OC](https://arxiv.org/html/2605.24152#A1.SS1.48.48.48) \[[77](https://arxiv.org/html/2605.24152#bib.bib28),[12](https://arxiv.org/html/2605.24152#bib.bib29)\] parameterized by $(s_0, c)$, in chunked form \(equivalent to the per\-step recursion $s_t = f(s_{t-1}, a_t)$\), with $\mathcal{J}$ any differentiable combination of terminal cost, running reward, and action regularizer\. Each term in $\mathcal{J}$ may be a closed\-form function of the predicted state\-action trajectory \(e\.g\., an analytic terminal cost, a support\-region indicator, an action regularizer\), an additional learned reward model – a separate differentiable critic, or dedicated reward heads of the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) itself – or any sum of such terms\. Closed\-form and learned components *compose freely within a single $\mathcal{J}$*; the only requirement is differentiability in $a_{1:T}$\. This compositional freedom lets a single Inverter target arbitrary differentiable multi\-axis specifications\. Pontryagin’s costate equation – backpropagation through $f$, predating its neural use – solves Eq\. \([1](https://arxiv.org/html/2605.24152#S2.E1)\) iteratively per $(s_0, c)$ query, with no learned amortization across queries\.
In contrast, the value\-recursive Bellman equation $V^{*}(s) = \max_{a} [r(s,a) + \gamma V^{*}(f(s,a))]$ \[[10](https://arxiv.org/html/2605.24152#bib.bib30)\] does not produce action sequences, but one reactive action at one step at a time\.
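
The per-query, iterative route can be made concrete on a toy problem. The sketch below assumes a scalar linear system and a quadratic objective (terminal cost plus action regularizer); the constants `A`, `B`, `LAM`, the step size, and all function names are illustrative assumptions, not values from the paper. One backward sweep yields the gradient for every action, and plain gradient descent then solves a single `(s0, c)` query.

```python
A, B, LAM = 0.9, 0.5, 0.01   # toy dynamics s_t = A*s_{t-1} + B*a_t and regulariser weight


def rollout(s0, actions):
    """Per-step recursion s_t = f(s_{t-1}, a_t); returns [s_0, ..., s_T]."""
    states = [s0]
    for u in actions:
        states.append(A * states[-1] + B * u)
    return states


def cost(s0, actions, c):
    """Terminal cost (s_T - c)^2 plus a quadratic action regulariser."""
    return (rollout(s0, actions)[-1] - c) ** 2 + LAM * sum(u * u for u in actions)


def costate_gradient(s0, actions, c):
    """dJ/da_t for every t from one backward (costate) sweep through f."""
    costate = 2.0 * (rollout(s0, actions)[-1] - c)   # dJ/ds_T
    grad = [0.0] * len(actions)
    for t in range(len(actions) - 1, -1, -1):
        grad[t] = 2.0 * LAM * actions[t] + B * costate
        costate *= A                                  # move dJ/ds one step back in time
    return grad


def optimize_per_query(s0, c, T=10, iters=200, lr=0.2):
    """Iterative test-time optimisation for ONE (s0, c) query; nothing is reused across queries."""
    actions = [0.0] * T
    for _ in range(iters):
        grad = costate_gradient(s0, actions, c)
        actions = [u - lr * g for u, g in zip(actions, grad)]
    return actions
```

Every new query pays for the full `iters` loop again. That repeated test-time cost is what amortization aims to remove.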

#### Amortizing the planner\.

[IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) amortizes classical per\-query trajectory optimization into a two\-component *learned solver*: \(1\) a [Forward model](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) \([FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\) $f_{\theta}$, the learned approximation of $f$ in Eq\. [1](https://arxiv.org/html/2605.24152#S2.E1), trained as a chunked sequence model with chunk length $L \leq T$ and stitched across chunks when $L < T$ \(per\-task $L$ in Apps\. [A\.2](https://arxiv.org/html/2605.24152#A1.SS2),[A\.9](https://arxiv.org/html/2605.24152#A1.SS9)\), and \(2\) an [Inverse model](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) \([IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23)\) $g_{\phi}(s_0, c) \to a_{1:T}$ that emits the full $T$\-step action sequence in one feedforward pass\. In the work presented here, the two are trained sequentially: first $f_{\theta}$ *supervised* on offline $(s_t, a_t, s_{t+1})$ transitions \(or supplied analytically when the physics is known, e\.g\., the Lindbladian in Sec\. [4\.5](https://arxiv.org/html/2605.24152#S4.SS5)\); then, with $f_{\theta}$ frozen, $g_{\phi}$ is trained by backpropagating a Bolza objective through it \(joint $(f_{\theta}, g_{\phi})$ training, as well as joint training across hierarchy levels, is equally natural and is discussed in Sec\. [4\.4](https://arxiv.org/html/2605.24152#S4.SS4)\):

$$\min_{\phi} \; \mathbb{E}_{(s_0, c)} \left[ \mathcal{J}\Big( f_{\theta}^{(1:T)}\big(s_0,\, g_{\phi}(s_0, c)\big),\; g_{\phi}(s_0, c),\; c \Big) \right] \tag{2}$$

Jordan & Rumelhart \[[50](https://arxiv.org/html/2605.24152#bib.bib6)\] introduced this training pattern for single\-step distal control; [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) extends it to multi\-step, $H > 1$ level hierarchical planning and control through learned forward and inverse models for embodied control \(Secs\. [4\.1](https://arxiv.org/html/2605.24152#S4.SS1)–[4\.4](https://arxiv.org/html/2605.24152#S4.SS4)\)\. At $T = 1$ and $H = 1$ the Bolza objective collapses to a terminal cost and the recipe reduces to a non\-hierarchical Jordan–Rumelhart\-style distal teacher; the framework’s richness lives at $T > 1$ and $H > 1$, where running cost and global trajectory structure enter $\mathcal{J}$, sequence\-level optimization over the whole chunk becomes possible, hierarchy allows solving more complex tasks, and chunked open\-loop emission yields per\-episode inference compute time reductions \(see Sec\. [5](https://arxiv.org/html/2605.24152#S5) on the empirical consequences\)\.
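
The two-stage recipe behind Eq. (2) can be illustrated end to end on a toy problem. This is a sketch under loud assumptions, not the paper's setup: the environment is a scalar linear system, the FoM is fitted by closed-form least squares instead of a transformer, the Inverter is one linear read-out per time step, and `fit_fom`, `LinearInverter`, `train`, and all constants are hypothetical.

```python
import random

TRUE_A, TRUE_B = 0.9, 0.5   # toy environment s' = A*s + B*a, unknown to the learner
LAM, T = 0.01, 10           # action-regulariser weight and sequence length (illustrative)


def fit_fom(transitions):
    """Stage 1: least-squares fit of a linear FoM s' = a*s + b*u from (s, u, s') tuples."""
    ss = sum(s * s for s, _, _ in transitions)
    su = sum(s * u for s, u, _ in transitions)
    uu = sum(u * u for _, u, _ in transitions)
    sy = sum(s * y for s, _, y in transitions)
    uy = sum(u * y for _, u, y in transitions)
    det = ss * uu - su * su
    return (sy * uu - uy * su) / det, (ss * uy - su * sy) / det


def rollout(fom, s0, actions):
    """Unroll the frozen FoM over the whole action sequence; returns s_T."""
    a, b = fom
    s = s0
    for u in actions:
        s = a * s + b * u
    return s


def action_gradient(fom, s0, actions, c):
    """dJ/da_t for J = (s_T - c)^2 + LAM * sum_t a_t^2, by a reverse pass through the FoM."""
    a, b = fom
    costate = 2.0 * (rollout(fom, s0, actions) - c)   # dJ/ds_T
    grad = [0.0] * len(actions)
    for t in range(len(actions) - 1, -1, -1):
        grad[t] = 2.0 * LAM * actions[t] + b * costate
        costate *= a
    return grad


class LinearInverter:
    """Toy g_phi(s0, c) -> a_{1:T}: one linear read-out per time step."""

    def __init__(self):
        self.w_s, self.w_c, self.bias = [0.0] * T, [0.0] * T, [0.0] * T

    def __call__(self, s0, c):
        return [self.w_s[t] * s0 + self.w_c[t] * c + self.bias[t] for t in range(T)]

    def update(self, fom, s0, c, lr):
        grad = action_gradient(fom, s0, self(s0, c), c)   # gradient w.r.t. emitted actions
        for t in range(T):                                # chain rule into the parameters
            self.w_s[t] -= lr * grad[t] * s0
            self.w_c[t] -= lr * grad[t] * c
            self.bias[t] -= lr * grad[t]


def train(steps=5000, lr=0.05, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(200):                                  # offline random transitions
        s, u = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append((s, u, TRUE_A * s + TRUE_B * u))
    fom = fit_fom(data)                                   # Stage 1, frozen afterwards
    inverter = LinearInverter()
    for _ in range(steps):                                # Stage 2: stochastic descent on Eq. (2)
        inverter.update(fom, rng.uniform(-1, 1), rng.uniform(-1, 1), lr)
    return fom, inverter
```

After training, `inverter(s0, c)` returns the whole `T`-step sequence in a single call, with no optimization loop at query time. No action labels are used anywhere: the only training signal for the Inverter is the objective differentiated through the frozen FoM.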

We refer to the $T > 1$ regime of this paradigm as *[Inverse Sequence Learning](https://arxiv.org/html/2605.24152#A1.SS1.26.26.26) \([ISL](https://arxiv.org/html/2605.24152#A1.SS1.26.26.26)\)*\. All experiments and architectures in this paper operate in the [ISL](https://arxiv.org/html/2605.24152#A1.SS1.26.26.26) regime; we retain the broader “[IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22)” term for both the $T = 1$ and $T > 1$ cases\. The practical advantage of the ISL regime is the gradient structure: where [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) attributes a scalar reward across all $d_a \times T$ action dimensions through estimators like [Policy Gradient](https://arxiv.org/html/2605.24152#A1.SS1.52.52.52) \([PG](https://arxiv.org/html/2605.24152#A1.SS1.52.52.52)\) and [Temporal Difference](https://arxiv.org/html/2605.24152#A1.SS1.68.68.68) \([TD](https://arxiv.org/html/2605.24152#A1.SS1.68.68.68)\), [ISL](https://arxiv.org/html/2605.24152#A1.SS1.26.26.26) backpropagates through $f_{\theta}$ to deliver an exact gradient $\partial \mathcal{J} / \partial a_{t,i}$ at every action dimension $i$ and every timestep $t$ across the whole $T$\-step sequence\.
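
The difference in gradient structure can be seen on a toy example. The sketch below is not one of the paper's experiments: it assumes scalar linear dynamics and a quadratic objective, and it uses a basic score-function estimator with Gaussian exploration noise and a baseline as one simple stand-in for the policy-gradient family. Both routines target the same quantity, but one reads it off a single backward pass while the other has to average a scalar signal over many sampled rollouts.

```python
import random

A, B, LAM = 0.9, 0.5, 0.01   # toy differentiable forward model and regulariser weight


def cost(s0, actions, c):
    """Scalar objective: terminal cost plus quadratic action regulariser."""
    s = s0
    for u in actions:
        s = A * s + B * u
    return (s - c) ** 2 + LAM * sum(u * u for u in actions)


def exact_gradient(s0, actions, c):
    """dJ/da_t for every t from one backward pass through the forward model."""
    s = s0
    for u in actions:
        s = A * s + B * u
    costate = 2.0 * (s - c)
    grad = [0.0] * len(actions)
    for t in range(len(actions) - 1, -1, -1):
        grad[t] = 2.0 * LAM * actions[t] + B * costate
        costate *= A
    return grad


def score_function_gradient(s0, actions, c, n_samples, sigma=0.1, seed=0):
    """Monte Carlo estimate of the same gradient from scalar costs of perturbed rollouts."""
    rng = random.Random(seed)
    baseline = cost(s0, actions, c)
    estimate = [0.0] * len(actions)
    for _ in range(n_samples):
        noise = [rng.gauss(0.0, sigma) for _ in actions]
        perturbed = [u + e for u, e in zip(actions, noise)]
        advantage = cost(s0, perturbed, c) - baseline     # one scalar per rollout
        for t, e in enumerate(noise):
            estimate[t] += advantage * e / (sigma ** 2) / n_samples
    return estimate
```

With a handful of samples the estimate scatters widely around the exact gradient and only tightens as `n_samples` grows, whereas `exact_gradient` needs one rollout and one backward sweep regardless of the sequence length.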

Table [2](https://arxiv.org/html/2605.24152#S2.T2) delineates [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) from supervised, reinforcement, and imitation learning, indexed by the training signal each paradigm relies on\. Inverters combine [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59)’s training\-time amortization with the multi\-step, sequence\-level scope of optimal control, without iterative deployment\-time optimization\. Unlike imitation learning, they learn predictable task structure \(e\.g\., physics\) through a [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) and invert it, rather than cloning behavior\.

Table 2: Inverse learning delineated from the three major established learning paradigms in their typical form: supervised, reinforcement, and imitation learning, of which [Behavior Cloning](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3) \([BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3)\) is the supervised special case\. The defining axis – and the key requirement of each paradigm – is the nature of the training signal\.

## 3 Related work

#### [Deep Learning](https://arxiv.org/html/2605.24152#A1.SS1.14.14.14) \([DL](https://arxiv.org/html/2605.24152#A1.SS1.14.14.14)\) for inverse problems\.

[IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) belongs to the broader research field of amortizing inverse problems with neural networks \[[5](https://arxiv.org/html/2605.24152#bib.bib23),[75](https://arxiv.org/html/2605.24152#bib.bib24)\], in which an [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) is trained to map measurements \(or, more generally, conditioning\) to a solution of a forward operator equation; [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) sits specifically in the *self\-supervised / measurement\-only* branch of that taxonomy, with backpropagation through the forward operator providing the training signal \(no ground\-truth solution labels\)\. This tradition is most developed in imaging and physics: linear inverse problems \([CT](https://arxiv.org/html/2605.24152#A1.SS1.12.12.12)/[MRI](https://arxiv.org/html/2605.24152#A1.SS1.42.42.42) reconstruction, compressed sensing, super\-resolution\), unrolled iterative solvers \([LISTA](https://arxiv.org/html/2605.24152#A1.SS1.30.30.30)\-style\) \[[36](https://arxiv.org/html/2605.24152#bib.bib25)\], and more recently non\-amortized diffusion\-prior posterior sampling \[[17](https://arxiv.org/html/2605.24152#bib.bib26)\]\. [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) specializes this pattern to embodied planning and control: the \(itself learned\) “forward operator” is, e\.g\., a dynamics model unrolled in time, the “solution” is a multi\-step action sequence rather than a static signal, and the objective is a Bolza functional \(terminal cost \+ running cost \+ regularizer\)\. In this view, [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) is the planning\-and\-control instance of the broader [DL](https://arxiv.org/html/2605.24152#A1.SS1.14.14.14)\-for\-inverse\-problems agenda, extending it from analytical to *learned* forward operators and from static signals to time\-extended action sequences\.

#### Amortized optimal control\.

[IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) is an instance of *amortized optimal control* that allows training a network to emit a multistep [OC](https://arxiv.org/html/2605.24152#A1.SS1.48.48.48) solution in a single feedforward pass\. Related instances differ in either output shape or in whether deployment\-time iteration is avoided: explicit [Model Predictive Control](https://arxiv.org/html/2605.24152#A1.SS1.39.39.39) \([MPC](https://arxiv.org/html/2605.24152#A1.SS1.39.39.39)\) \[[11](https://arxiv.org/html/2605.24152#bib.bib56)\] offline\-precomputes [MPC](https://arxiv.org/html/2605.24152#A1.SS1.39.39.39) into a piecewise\-affine state\-feedback function for linear\-quadratic problems \(with modern [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46)\-approximate variants for nonlinear settings\), but emits a single action per call in receding\-horizon fashion; guided\-policy\-search / coupled trajopt distillation into a policy \[[65](https://arxiv.org/html/2605.24152#bib.bib36),[71](https://arxiv.org/html/2605.24152#bib.bib39),[14](https://arxiv.org/html/2605.24152#bib.bib38)\] supervises policy training with the trajectories of a separate, non\-differentiable offline [OC](https://arxiv.org/html/2605.24152#A1.SS1.48.48.48) solver; value\-gradient backprop through learned dynamics \[[45](https://arxiv.org/html/2605.24152#bib.bib40)\] differentiates the value through model rollouts but emits one action at a time; differentiable optimization and [MPC](https://arxiv.org/html/2605.24152#A1.SS1.39.39.39) layers \[[2](https://arxiv.org/html/2605.24152#bib.bib37),[1](https://arxiv.org/html/2605.24152#bib.bib41)\] embed a constrained [QP](https://arxiv.org/html/2605.24152#A1.SS1.56.56.56) solver as a differentiable layer and so retain test\-time iterative optimization\.

#### Differentiable world\-model control\.

Backpropagating a control objective through a learned world model \(the agent\-in\-environment composite, not just body kinematics\) has a long lineage \[[84](https://arxiv.org/html/2605.24152#bib.bib31),[50](https://arxiv.org/html/2605.24152#bib.bib6),[63](https://arxiv.org/html/2605.24152#bib.bib49)\] – Schmidhuber even explicitly proposed using the learned model to plan multi\-step action sequences via simulated gradient descent, while flagging its high inference cost; concrete algorithmic variants – [PILCO](https://arxiv.org/html/2605.24152#A1.SS1.53.53.53) \[[22](https://arxiv.org/html/2605.24152#bib.bib32)\], Dreamer \[[39](https://arxiv.org/html/2605.24152#bib.bib33)\], Universal Planning Networks \[[89](https://arxiv.org/html/2605.24152#bib.bib34)\], [TD\-MPC2](https://arxiv.org/html/2605.24152#A1.SS1.69.69.69) \[[41](https://arxiv.org/html/2605.24152#bib.bib35)\] – retain test\-time iteration, value bootstrapping, or one\-step outputs\. [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) differs in output shape and query cost, as it allows inverse networks to emit the full $T$\-step action sequence in one feedforward pass\.

#### Trajectory\-level sequence models and iterative planners\.

A parallel line treats the action sequence as the object: autoregressive sequence modeling \[[48](https://arxiv.org/html/2605.24152#bib.bib42)\], return\-conditioned autoregressive policies \[[15](https://arxiv.org/html/2605.24152#bib.bib43)\], iterative trajectory denoising \[[47](https://arxiv.org/html/2605.24152#bib.bib44)\], online tree search over a learned model \[[86](https://arxiv.org/html/2605.24152#bib.bib45)\], test\-time\-iterative latent planners \[[88](https://arxiv.org/html/2605.24152#bib.bib47),[8](https://arxiv.org/html/2605.24152#bib.bib48)\], and – outside [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) – learned step\-by\-step motion planners such as [MPNet](https://arxiv.org/html/2605.24152#A1.SS1.40.40.40) \[[80](https://arxiv.org/html/2605.24152#bib.bib53)\], which autoregressively emits next configurations from an imitation\-trained network; broader hierarchical\-planning agendas with learned predictive world models \[[63](https://arxiv.org/html/2605.24152#bib.bib49)\] share these commitments\. [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) collapses both step\-by\-step autoregression and test\-time iterative optimization into a single feedforward emission of the $T$\-step action chunk, folding trajectory optimization into training\. Action Chunking Transformers \[[102](https://arxiv.org/html/2605.24152#bib.bib51)\] and Diffusion Policy \[[16](https://arxiv.org/html/2605.24152#bib.bib52)\] share our chunked output shape but are supervised on expert demonstrations, restricting data and objective to imitation\.

#### Hierarchical control and hybrid symbolic\-continuous planning\.

Hierarchical [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) builds on the options framework \[[90](https://arxiv.org/html/2605.24152#bib.bib54)\], typically mixing level\-specific mechanisms \(e\.g\., high\-level policy gradient over low\-level actor\-critic\); hierarchical latent world models have been used for visual humanoid control but still rely on test\-time planning \[[42](https://arxiv.org/html/2605.24152#bib.bib55)\]\. The amortization perspective extends beyond purely continuous problems: hybrid symbolic\-continuous control is traditionally handled by iterative task\-and\-motion planning \[[31](https://arxiv.org/html/2605.24152#bib.bib19)\], mixed\-integer [MPC](https://arxiv.org/html/2605.24152#A1.SS1.39.39.39) over hybrid systems \[[69](https://arxiv.org/html/2605.24152#bib.bib20)\], or signal\-temporal\-logic [MPC](https://arxiv.org/html/2605.24152#A1.SS1.39.39.39) compiled to [MILP](https://arxiv.org/html/2605.24152#A1.SS1.34.34.34) \[[81](https://arxiv.org/html/2605.24152#bib.bib21)\]\. Our AntMan Game Inverter \(Sec\. [4\.4](https://arxiv.org/html/2605.24152#S4.SS4)\) is in this hybrid regime but emits its symbolic plan in a single feedforward pass through a differentiable [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19), positioning [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) as an amortized counterpart in the neurosymbolic [OC](https://arxiv.org/html/2605.24152#A1.SS1.48.48.48) space currently dominated by iterative methods\.

#### [Reinforcement Learning](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) of generative models via frozen learned critics\.

A recent line of work in generative modeling performs [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) of pretrained samplers against a learned reward model: DRaFT \[[18](https://arxiv.org/html/2605.24152#bib.bib89)\] and ReFL \[[99](https://arxiv.org/html/2605.24152#bib.bib91)\] for image diffusion; DRAKES \[[96](https://arxiv.org/html/2605.24152#bib.bib92)\] for discrete diffusion via soft\-token embeddings; and Adjoint Matching \[[25](https://arxiv.org/html/2605.24152#bib.bib93)\], which casts fine\-tuning as stochastic optimal control over the denoising trajectory\. Following the classical Jordan & Rumelhart distal\-teacher pattern \[[50](https://arxiv.org/html/2605.24152#bib.bib6)\], these methods replace the variance\-prone gradient estimators of classical [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) with an *exact* gradient obtained by backpropagating through a frozen, differentiable critic \(Gumbel\-relaxed or straight\-through for discrete cases\)\. The objective, however, remains single\-step \($T = 1$\) reward maximization on the generator’s terminal output \(one image / molecule / text sample\), so we regard these as [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) methods rather than [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) in the sense used here\.

#### [Inverse Reinforcement Learning](https://arxiv.org/html/2605.24152#A1.SS1.25.25.25)\([IRL](https://arxiv.org/html/2605.24152#A1.SS1.25.25.25)\)

addresses a different inverse problem: recovering the reward function from expert demonstrations \[[73](https://arxiv.org/html/2605.24152#bib.bib22)\]\. [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) in the sense of this paper instead amortizes the goal $\to$ action inversion through a forward model, in the Jordan & Rumelhart distal\-teacher lineage \[[50](https://arxiv.org/html/2605.24152#bib.bib6)\] extended to $T{>}1$\. The two paradigms are complementary and could in principle compose – e\.g\., an [IRL](https://arxiv.org/html/2605.24152#A1.SS1.25.25.25)\-recovered reward used as the terminal cost in [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22)’s Bolza objective\.

## 4 Planning and control with Inverters: properties, performance, failure modes

We organize the experiments in three steps of increasing structural complexity: a single Level 1 Inverter \(maze2d\-umaze\-v1; Sec\. [4\.1](https://arxiv.org/html/2605.24152#S4.SS1)\), a Level 1 Inverter coupled with a \(simple\) algorithmic Path Inverter at Level 2 \(larger maze2d layouts and all six antmaze\-v2 variants; Secs\. [4\.2](https://arxiv.org/html/2605.24152#S4.SS2)–[4\.3](https://arxiv.org/html/2605.24152#S4.SS3)\), and finally the full $n{=}2$ hierarchical setup with two paired learned Inverters \(AntMan on antmaze\-large\-diverse\-v2; Sec\. [4\.4](https://arxiv.org/html/2605.24152#S4.SS4)\)\.

### 4\.1 Single Motor Inverter \(maze2d\-umaze\): best [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) at less inference compute time

![Refer to caption](https://arxiv.org/html/2605.24152v1/x2.png)Figure 2: Trajectory comparison on maze2d\-umaze\-v1\. Top\-left: training data heatmap \(blue to red: low to high density\) showing the U\-shaped corridor coverage\. Remaining panels: 100 evaluation trajectories per method, ordered by [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) score\. $\bigstar$ = goal; red $\times$ = timeout \(episode did not reach the goal\)\. The Inverter \(ours\) produces smooth, direct paths that are not restricted to the angular data support\. Baselines trained with [CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) \[[93](https://arxiv.org/html/2605.24152#bib.bib57)\]\.

We evaluate maze2d\-umaze\-v1 with a frozen causal Transformer [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) paired with a causal Transformer [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) that emits a $128$\-step action sequence from the current state and goal in one forward pass \(architectures, dataset, baselines: App\. [A\.2](https://arxiv.org/html/2605.24152#A1.SS2)\)\.

Performance and inference compute time: Table [3](https://arxiv.org/html/2605.24152#S4.T3) summarizes [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) score and per\-episode compute time across all three maze2d variants\. Throughout, “inference compute time” refers to per\-episode wall time on a single device at batch 1; in a launch\-overhead\-limited small\-model regime this is the deployment\-relevant metric, distinct from FLOPs \(App\. [A\.3](https://arxiv.org/html/2605.24152#A1.SS3)\)\. On umaze\-v1 a one\-shot $K{=}128$ plan delivers $\mathbf{161.6 \pm 2.2}$ in just $3$ [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) forward passes \(one to reach the goal, two to stay at the goal until the end of the fixed 300 time step evaluation window\) and $\mathbf{11.4}$ ms total per episode – $\mathbf{37\times}$ less than Diffuser, $\mathbf{47\times}$ less than [SAC\-N](https://arxiv.org/html/2605.24152#A1.SS1.62.62.62), and an estimated nearly three orders of magnitude less than DecisionLLM \[[68](https://arxiv.org/html/2605.24152#bib.bib58)\]\. The advantage is *not* a faster single forward pass \(the Inverter transformer sits at the same $\sim 2$ ms\-per\-pass floor as the baseline [MLPs](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35)\) but the emission of a full $128$\-step action sequence per forward pass rather than one action at a time, which reduces per\-episode passes by $30$–$100\times$ \(App\. [A\.3](https://arxiv.org/html/2605.24152#A1.SS3)\)\.
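
The pass count quoted above can be checked with a short calculation. This is a minimal sketch, assuming purely open-loop execution in which each network call emits a full chunk and no early replanning occurs; the helper name `forward_passes` is hypothetical and not taken from the paper.

```python
import math


def forward_passes(episode_steps: int, chunk_len: int) -> int:
    """Number of network calls needed when each call emits `chunk_len` actions."""
    if episode_steps <= 0 or chunk_len <= 0:
        raise ValueError("episode_steps and chunk_len must be positive")
    return math.ceil(episode_steps / chunk_len)


EPISODE_STEPS = 300  # fixed evaluation window quoted in the text
K = 128              # one-shot chunk length quoted in the text

chunked = forward_passes(EPISODE_STEPS, K)    # chunked, open-loop execution
stepwise = forward_passes(EPISODE_STEPS, 1)   # one action per call
reduction = stepwise / chunked                # ratio of call counts
```

Under these assumptions the chunked count is 3 and the step-wise count is 300, which sits at the upper end of the $30$–$100\times$ range stated in the text. Wall time also includes algorithmic overhead, so the ratio of call counts is only an upper bound on the speedup.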

Table 3: Maze2d summary: Motor Inverter vs\. best [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) baseline per maze and fastest PyTorch baseline \([BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3)\-10%\)\. Top: [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) score \(mean $\pm$ std over 100 episodes with 4 seeds\)\. Bottom: [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) forward passes and total wall time per episode \([NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) passes plus algorithmic overhead, single A40 GPU, PyTorch, batch 1, CUDA\-synced\)\. Best\-per\-column bolded; the Inverter row reports the $K{=}128$ one\-shot configuration on umaze \(Tab\. [5](https://arxiv.org/html/2605.24152#A1.T5), App\. [A\.4](https://arxiv.org/html/2605.24152#A1.SS4)\) and $K{=}16$ on medium/large \(Tabs\. [7](https://arxiv.org/html/2605.24152#A1.T7), [8](https://arxiv.org/html/2605.24152#A1.T8)\)\. Full per\-method tables \(10\+ baselines, including JAX\+JIT [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) and Diffuser\): App\. [A\.4](https://arxiv.org/html/2605.24152#A1.SS4)\.

#### Smooth, coherent trajectories\.

Figure[2](https://arxiv.org/html/2605.24152#S4.F2)shows that the motor Inverter produces smooth, direct paths through the maze, while every baseline yields visibly noisier, more angular, or more constrained motion\. Figure[3](https://arxiv.org/html/2605.24152#S4.F3)confirms this quantitatively via per\-sequence displacement directions and curvature variance: the Inverter achieves the lowest peak curvature and, by a large margin, the lowest curvature variance, demonstrating the by\-design ability of Inverters for more holistic sequence\-level optimization\. The[IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23)achieves this not by mimicking any individual training trajectory, but by learning to invert the forward dynamics globally: gradients flow through the frozen[FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)across the entire horizon, enabling the network to discover action sequences whose*integrated*trajectory has favorable geometric properties that local, step\-wise optimization cannot guarantee\.

![Refer to caption](https://arxiv.org/html/2605.24152v1/x3.png)Figure 3: Directional movement spectrum and sequence\-level optimization\. Panels A–F show polar histograms of $T{=}16$\-step chunk\-displacement directions for the training data and a representative subset of methods; angles indicate $0^{\circ} = +x$, $90^{\circ} = +y$ in raw maze data coordinates, based on each method’s per\-episode rollout until the first step where the agent enters a goal\-ball of radius $0.5$, or times out\. Panels G and H show median peak curvature $\kappa_{\max}$ and median curvature variance \(log scale\) over the same trajectory data, respectively\. Whiskers: $\pm$ [Standard Error of the Mean](https://arxiv.org/html/2605.24152#A1.SS1.63.63.63) \([SEM](https://arxiv.org/html/2605.24152#A1.SS1.63.63.63)\)\. The Inverter produces the lowest peak curvature and lowest curvature variance \(Panels G, H\)\. Similar to Diffuser, and unlike the other baseline methods, the Inverter produces all of the major movement directions required to solve the task \(up, down, and leftwards\), but no movements with a dominating rightward component, which are not needed to reach the goal \(Panels A–E\)\. These observations together quantitatively confirm the *sequence\-level optimization* that the smooth trajectories of Fig\. [2](https://arxiv.org/html/2605.24152#S4.F2) suggested qualitatively: optimizing over the *entire* action sequence – rather than per step – yields smooth trajectories that reach the goal faster\.

![Refer to caption](https://arxiv.org/html/2605.24152v1/x4.png)Figure 4: Action\-space structure and control optimality\. Per\-step action scatter plots $(a_x, a_y)$ for the training data and a representative subset of methods \(Panels A–F\)\. Note that [BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3)\-10% \[[60](https://arxiv.org/html/2605.24152#bib.bib70)\] lands *below* the training data on action saturation \(Panel H\), even though it is trained to imitate it: a unimodal Gaussian policy head contracts the action distribution toward the interior – the multimodal\-action [BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3) failure mode \[[27](https://arxiv.org/html/2605.24152#bib.bib77), [16](https://arxiv.org/html/2605.24152#bib.bib52)\]\. Panel H: $\bigstar$ at $(1,1)$ marks the analytically optimal action saturation – pure bang\-bang, derived from Pontryagin’s maximum principle for the damped point\-mass minimum\-time problem \(App\. [A\.5](https://arxiv.org/html/2605.24152#A1.SS5)\)\. Panel G: average speed to goal, using the same trajectory data as in Figure [2](https://arxiv.org/html/2605.24152#S4.F2)\. The Inverter shows both joint signatures of the analytic time\-optimal control: high action saturation, beyond the training data \(Panel H\), and high average speed to goal \(Panel G\)\. Note that [CQL](https://arxiv.org/html/2605.24152#A1.SS1.11.11.11) outputs highly saturated actions but, missing the action switching times, fails the task at a negative [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) score, showing that action saturation alone is a necessary but not sufficient condition for approaching control optimality\.

#### Approaching the analytic optimum outside the data support\.

Under viscous damping with a bounded action box, the minimum\-time control problem on maze2d\-umaze\-v1 is approximately bang\-bang \(App\. [A\.5](https://arxiv.org/html/2605.24152#A1.SS5)\): the time\-optimal policy saturates each actuator to $\pm u_{\max}$ – an optimum that lies *outside* the training data’s action\-space support\. Figure [4](https://arxiv.org/html/2605.24152#S4.F4) reveals this structure directly: every baseline populates the interior with unsaturated actions, while the Inverter concentrates mass at the four edges and corners\. Its action saturation substantially exceeds that of the training data – a necessary structural component of bang\-bang time\-optimal control\. This is not sufficient on its own \(a misdirected bang\-bang policy can saturate without speed gain, as shown by [CQL](https://arxiv.org/html/2605.24152#A1.SS1.11.11.11)\), but the Inverter *also* reaches the highest average speed and consequently the highest [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) score \(Table [3](https://arxiv.org/html/2605.24152#S4.T3)\); together, these form an empirical signature of approach to the analytic optimum\. This illustrates that the Inverter exploits learned physics through gradients flowing through the frozen [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) rather than cloning dataset actions, which leaves it structurally free to leave the data’s support and to find better solutions beyond it\.
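
To make the notion of action saturation concrete, the sketch below computes a per-axis saturation fraction for a batch of two-dimensional actions. The paper's exact definition is given in App\. A\.5 and is not reproduced here; the threshold-based definition, the 5% tolerance, and the function name `saturation_per_axis` are assumptions made for illustration only.

```python
import numpy as np


def saturation_per_axis(actions, u_max=1.0, tol=0.05):
    """Fraction of steps, per action axis, with |a| within `tol` of the bound.

    Assumed definition (not the paper's): a component counts as saturated
    when |a| >= (1 - tol) * u_max.
    """
    actions = np.asarray(actions, dtype=float)
    saturated = np.abs(actions) >= (1.0 - tol) * u_max
    return saturated.mean(axis=0)


# Pure bang-bang actions sit on the corners of the action box.
bang_bang = np.array([[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]])
# Mostly interior actions, as a cloned Gaussian policy might produce.
interior = np.array([[0.2, -0.1], [0.5, 0.3], [1.0, 0.0]])

sat_bb = saturation_per_axis(bang_bang)
sat_in = saturation_per_axis(interior)
```

Under this assumed definition a pure bang-bang policy scores $(1, 1)$, matching the location of the star described in the Figure 4 caption, while interior-heavy action sets score close to zero on each axis.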

### 4\.2 Motor Inverter \+ simple algorithmic Path Inverter on maze2d\-medium/large: confirming best [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) at substantially less compute time

On maze2d\-medium\-v1 and maze2d\-large\-v1, the dataset’s longest corridor paths exceed the Level 1 Inverter’s $128$\-step horizon, so a single feedforward plan is insufficient\. At the same time, waypoint routing through the maze is a discrete shortest\-path sub\-problem on which a full\-fledged learned Planning Inverter would be overkill\. We therefore add a simple algorithmic Path Inverter \([BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) built exclusively on the offline\-data density\) at Level 2: it builds a free\-space occupancy grid from the training\-data density alone, runs a 4\-connected [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5), and emits one sub\-goal per corridor elbow; the Level 1 Inverter then executes chunk by chunk toward the current sub\-goal\. No maze geometry, simulator, or [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) call enters the planner\. App\. [A\.6](https://arxiv.org/html/2605.24152#A1.SS6) gives the full construction; Fig\. [9](https://arxiv.org/html/2605.24152#A1.F9) shows a representative plan overlaid on the data heatmap\.
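
The three planner steps described above (occupancy grid from data density, 4-connected BFS, one sub-goal per elbow) can be sketched as follows. This is an illustrative reconstruction, not the construction from App\. A\.6: the grid resolution, the `min_count` threshold, and all function names are assumptions.

```python
from collections import deque

import numpy as np


def occupancy_from_data(xy, bins, extent, min_count=1):
    """Mark a cell free if the offline data visited it at least `min_count` times."""
    counts, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=bins, range=extent)
    return counts >= min_count


def bfs_path(free, start, goal):
    """4-connected BFS shortest path on a boolean grid; returns cells or None."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk predecessors back to the start
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nxt[0] < free.shape[0] and 0 <= nxt[1] < free.shape[1]
            if inside and free[nxt] and nxt not in prev:
                prev[nxt] = cell
                queue.append(nxt)
    return None


def elbows(path):
    """Keep the cells where the step direction changes, plus the final goal."""
    subgoals = []
    for a, b, c in zip(path, path[1:], path[2:]):
        if (b[0] - a[0], b[1] - a[1]) != (c[0] - b[0], c[1] - b[1]):
            subgoals.append(b)
    subgoals.append(path[-1])
    return subgoals


# Toy U-shaped corridor: True = free cell.
free = np.array([[1, 1, 1],
                 [1, 0, 0],
                 [1, 1, 1]], dtype=bool)
path = bfs_path(free, (0, 2), (2, 2))
subgoals = elbows(path)

# Occupancy grid from two visited points on a 3x3 grid over [0, 3] x [0, 3].
demo_free = occupancy_from_data(
    np.array([[0.5, 0.5], [2.5, 0.5]]), bins=3, extent=[[0, 3], [0, 3]]
)
```

On the toy corridor the planner returns the two corner cells and the goal as sub-goals, which is the kind of sparse waypoint list a chunked low-level controller could then track one sub-goal at a time.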

Main numbers for medium and large at $K{=}16$ \(highest [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) across our $K$\-sweep on medium and large; Suppl\. Tab\. [10](https://arxiv.org/html/2605.24152#A1.T10)\) appear in Table [3](https://arxiv.org/html/2605.24152#S4.T3); full per\-method breakdowns are in App\. [A\.4](https://arxiv.org/html/2605.24152#A1.SS4)\. The stacked Inverter reaches $\mathbf{166.8 \pm 1.2}$ [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) on medium and $\mathbf{220.7 \pm 0.2}$ on large at $\mathbf{100/100}$ success across $4$ seed [IMs](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) – strictly ahead of every [CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) baseline and Diffuser – spending $\mathbf{72.9}$ and $\mathbf{93.7}$ ms of [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) compute per episode, roughly $\mathbf{10}$–$\mathbf{30\times}$ faster than step\-wise offline\-[RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) baselines and $\mathbf{39}$–$\mathbf{51\times}$ faster than Diffuser \(DecisionLLM reference for this task not available\)\. The only compute\-competitive method is [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58)† \(JAX\+JIT, $\sim 0.2$ ms/pass\), which still lags the Inverter on [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) by $63$ points on medium and $137$ on large\.
Sweeping $K \in \{16, 32, 64, 128, 256\}$ exposes a clean speed–accuracy trade\-off \(App\. [A\.7](https://arxiv.org/html/2605.24152#A1.SS7)\): smaller $K$ gives higher [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) at higher per\-episode wall\-time, and accuracy collapses once $K$ crosses the [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23)’s $128$\-step training horizon – interestingly reminiscent of Fitts’ law and submovement / iterative\-correction motor psychophysics \[[26](https://arxiv.org/html/2605.24152#bib.bib59), [19](https://arxiv.org/html/2605.24152#bib.bib60), [70](https://arxiv.org/html/2605.24152#bib.bib61)\]\.

### 4\.3 antmaze\-v2: Scaling to locomotion and first encounter with [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) hacking

Table 4: Antmaze summary: Inverter vs\. strongest [CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) competitor \([ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58)\) and fastest PyTorch baseline \([BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3)\-10%\)\. Top: [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) score \(%\) over 100 episodes on each antmaze\-v2 variant; both [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) and Inverter are mean $\pm$ std over $4$ seeds\. Bottom: ms per environment step \(single A40 GPU, PyTorch, batch 1, CUDA\-synced\)\. Best\-per\-column bolded\. Full 10\-baseline breakdown including JAX\+JIT [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) timing: Tab\. [9](https://arxiv.org/html/2605.24152#A1.T9), App\. [A\.4](https://arxiv.org/html/2605.24152#A1.SS4)\.

We find that the same setup \(Level 1 Locomotion Inverter \+ simple algorithmic Path Inverter at Level 2\) successfully transfers to all six antmaze\-v2 variants – 29\-dim [Multi\-Joint dynamics with Contact](https://arxiv.org/html/2605.24152#A1.SS1.43.43.43) \([MuJoCo](https://arxiv.org/html/2605.24152#A1.SS1.43.43.43)\) ant locomotion with 8\-dim torque actions on the same topologies plus play/diverse goal distributions – and matches within error bars or improves on the strongest [CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) baseline \([ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58)\) on every variant, with the gap widening on the large mazes \(Tab\. [4](https://arxiv.org/html/2605.24152#S4.T4) and App\. [A\.4](https://arxiv.org/html/2605.24152#A1.SS4); for comparison to published model\-based offline [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) baselines see App\. [A\.8](https://arxiv.org/html/2605.24152#A1.SS8)\)\.
In compute, the Locomotion Inverter emits a $16$\-step chunk per call at $\sim 2$ ms, so the total per\-env\-step cost \($\sim 0.13$–$0.14$ ms on every variant\) amortizes a small number of chunk\-level passes over the $\sim 600$–$700$ steps to the goal – $\sim 9\times$ faster per step than the fastest PyTorch baseline \([BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3)\-10%\) on every antmaze variant\.

One important new observation from antmaze suggested that a plain task\-reward Locomotion Inverter on antmaze may be *forward\-model hackable* – while the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) predicts successful task completion, in simulation the ant may tip, jam, or fall – a failure we suspected to be related to the fact that the offline training data contains relatively few failure cases and the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) is therefore uncalibrated off\-support\. We compensated with two additive, per\-time\-step auxiliary losses – a [BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3) action\-fidelity anchor and a body\-yaw regularizer – which compose linearly with the task gradient \(ablations in App\. [A\.9](https://arxiv.org/html/2605.24152#A1.SS9), Suppl\. Tab\. [12](https://arxiv.org/html/2605.24152#A1.T12)\)\. These additions work in practice but are conceptually unsatisfactory: anchoring toward dataset actions partially compromises [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22)’s defining property of being [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\-gradient\-driven rather than data\-anchored\. These observations motivated designing the new AntMan task \(Sec\. [4\.4](https://arxiv.org/html/2605.24152#S4.SS4)\), which lets us \(i\) control the offline training data ourselves and therefore study [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) hacking and its mitigations systematically, and \(ii\) demonstrate a full $n{=}2$ Hierarchical Inverter – two trained Inverters, one for planning and one for control\.

![Refer to caption](https://arxiv.org/html/2605.24152v1/x5.png)Figure 5: Simultaneous planning and control with a coupled Game and Locomotion Inverter on the AntMan task\. Panel A shows a representative rollout in the [MuJoCo](https://arxiv.org/html/2605.24152#A1.SS1.43.43.43) maze with an abstract rendering of the game situation and 3 snapshots showing the underlying motion of the [MuJoCo](https://arxiv.org/html/2605.24152#A1.SS1.43.43.43) Ant through the maze\. Successful planning and control was achieved when training on random training data, but not on training data only reflecting a heuristic game controller \(Panels B–D\)\. Metrics are averaged over 100 evaluation games \(25 each for no, 1, 2, and 3 ghosts\)\. Only random training yields above\-chance pellet\-collection efficiency and pellet counts at every difficulty, while replanning is comparable or lower\.

![Refer to caption](https://arxiv.org/html/2605.24152v1/x6.png)Figure 6: Random training data yields calibrated [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) rewards; narrow expert data induces [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) hacking\. Each panel compares [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\-predicted reward to realized game reward over 1000 sampled start states for the final high\-level [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) checkpoints; point color encodes the [Out\-of\-Distribution](https://arxiv.org/html/2605.24152#A1.SS1.50.50.50) \([OOD](https://arxiv.org/html/2605.24152#A1.SS1.50.50.50)\) score $\mathrm{OOD}(x) = d_1(x, \mathcal{T}) \,/\, \widetilde{d}_1(\mathcal{T})$, where $d_1(x, \mathcal{T}) = \min_{y \in \mathcal{T}} \|x - y\|$ is each sampled start state $x$’s nearest\-neighbor distance to the training set $\mathcal{T}$, and $\widetilde{d}_1(\mathcal{T}) = \operatorname{median}_{x' \in \mathcal{T}} \, \min_{y \in \mathcal{T} \setminus \{x'\}} \|x' - y\|$ is the typical nearest\-neighbor distance between training points themselves – so [OOD](https://arxiv.org/html/2605.24152#A1.SS1.50.50.50) $= 1$ means the sample is no further from the training set than a typical training point is from its own nearest neighbor, and [OOD](https://arxiv.org/html/2605.24152#A1.SS1.50.50.50) $> 1$ indicates an increasingly off\-support sample\. The pink diagonal marks perfect calibration, and dashed red lines show linear fits with 95% confidence bands\. Panel A \(random training data\) stays near the training support, has low [OOD](https://arxiv.org/html/2605.24152#A1.SS1.50.50.50) values, and shows strong correlation between [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) reward and actual game reward\. Panel B \(expert training data, without additional stabilizers such as a [BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3)\-style loss\) shows the opposite failure mode: the Inverter generates highly [OOD](https://arxiv.org/html/2605.24152#A1.SS1.50.50.50) plans and the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) assigns them large rewards that are no longer predictive of actual outcomes\. The green and gray x\-axis projections in Panel B show the realized game\-reward ranges for expert\-data and random\-data plans, with dashed connectors and circles marking the maximum game\-reward sample from each group; these ranges are very similar\. Thus the failure is not that expert\-data training removes high\-game\-reward plans from the sample, but that [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) reward becomes decoupled from actual reward\. For example, the highest [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\-reward sample in Panel B \(red circle\) achieves zero game reward\.

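
The OOD score defined in the Figure 6 caption can be written out directly. The sketch below is a brute-force version under two assumptions that the caption does not state: the norm is Euclidean, and the training set contains no exact duplicates (otherwise the median nearest-neighbor distance could be zero). Function names are hypothetical.

```python
import numpy as np


def nearest_neighbor_dist(queries, train, exclude_self=False):
    """Distance from each query row to its nearest row in `train`."""
    diff = queries[:, None, :] - train[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # Euclidean norm (assumed)
    if exclude_self:
        # Only valid when queries is the training set itself: row i is point i.
        np.fill_diagonal(dist, np.inf)
    return dist.min(axis=1)


def ood_score(x, train):
    """OOD(x) = d_1(x, T) / median nearest-neighbor distance within T."""
    d1 = nearest_neighbor_dist(x, train)
    typical = np.median(nearest_neighbor_dist(train, train, exclude_self=True))
    return d1 / typical


train = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
samples = np.array([[1.0, 0.0], [2.0, 3.0], [0.5, 0.0]])
scores = ood_score(samples, train)
```

In this toy example the typical within-set spacing is 1, so the scores equal the raw nearest-neighbor distances: 0 for a sample that coincides with a training point, 3 for a clearly off-support sample, and 0.5 for a sample lying between two training points.
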
### 4\.4 Hierarchical Inverse Learning for planning and control

To study a fully learned $n{=}2$ Hierarchical Inverter, we designed the new AntMan task, which requires pure offline learning of both low\-level locomotion control of the [MuJoCo](https://arxiv.org/html/2605.24152#A1.SS1.43.43.43) Ant and a Pac\-Man\-style game embodied through the same Ant placed in the antmaze\-large\-v2 setup\.

We solve this challenge with a *Level 1 Locomotion Inverter* which controls the Ant in antmaze\-large\-diverse\-v2, while a paired *Level 2 Game Inverter* controls the Locomotion Inverter by outputting $32$\-step waypoint directions from a 58\-dim game state \(ant pose, ghost states, pellet map, mode indicators\)\. A causal\-transformer [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) is trained on simulated AntMan games; the Game Inverter is then trained purely through [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) reward to collect as many pellets as possible\. Structurally, the Game [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) instantiates the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\-with\-reward\-heads option introduced in Sec\. [2](https://arxiv.org/html/2605.24152#S2): a dynamics head for ghost positions plus per\-step reward heads for pellet\-eaten and alive probabilities\. The Game Inverter’s Bolza loss backpropagates through the reward heads, $\mathcal{L} = -\mathbb{E}\big[\sum_{t} \sigma(\ell_{\text{pel},t})\,\sigma(\ell_{\text{alive},t})\big]$ over the $H{=}32$ chunk\. In our $n{=}2$ AntMan stack, the Bolza loss therefore flows through next\-state dynamics at the low level and through learned reward heads at the high level – two instances of the same [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)/[IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) building block\. To isolate the training\-distribution effect, two matched [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)/[IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) pairs differ only in the behavior policy that generated the offline training data: \(i\) heuristic \(chases pellets, avoids ghosts\) and \(ii\) random \(uniform random neighbor\-cell moves\)\.
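
The Game Inverter loss stated above can be evaluated numerically as a sanity check. The sketch only computes the forward value of $\mathcal{L} = -\mathbb{E}\big[\sum_t \sigma(\ell_{\text{pel},t})\,\sigma(\ell_{\text{alive},t})\big]$ from given logits; it does not include the FoM, the Inverter, or any gradient computation, which would need an autodiff framework. The array shapes and function names are assumptions.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping logits to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def game_loss(pellet_logits, alive_logits):
    """Negative expected count of pellets eaten while alive.

    Both inputs are assumed to have shape (batch, H): one logit per plan
    and per step of the H-step chunk, as produced by the reward heads.
    """
    per_step = sigmoid(pellet_logits) * sigmoid(alive_logits)
    return -per_step.sum(axis=1).mean()


H = 32  # chunk length quoted in the text
uninformative = game_loss(np.zeros((4, H)), np.zeros((4, H)))
confident = game_loss(np.full((4, H), 20.0), np.full((4, H), 20.0))
```

With all logits at zero each step contributes $0.5 \cdot 0.5$, so the loss is $-8$; with strongly positive logits it approaches the lower bound of $-32$, one pellet per step while alive. The product form means a plan earns credit for a pellet only to the extent that the alive head also predicts survival at that step.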

#### End\-to\-end hierarchical Inverse Learning for planning and control\.

The AntMan task demonstrates that a full $n{=}2$ Hierarchical Inverter, trained using [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\-predicted game reward, successfully plays AntMan games well above chance – proof of principle for increasing hierarchical depth and for extending the framework from continuous physics to symbolic/discrete planning\. Here we train each level separately, but the Inverter stack is end\-to\-end differentiable: jointly optimizing $(g_{\phi}^{(1)}, g_{\phi}^{(2)})$ – propagating Level\-2 task loss through Level\-1, and conversely allowing Level\-1 execution constraints to reshape Level\-2 waypoint emission – is a natural extension for regimes in which optimal plans depend on what the body can execute, or optimal motor commands depend on the upcoming plan\.

#### Forward\-model hacking and how to mitigate it\.

A second observation reverses the offline\-learning intuition that more competent demonstrations make better training data \(Fig\. [5](https://arxiv.org/html/2605.24152#S4.F5)\)\. With heuristic training data the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) is severely hacked – predicted $37.98$ pellets vs\. $4.48$ realized \($8.48\times$\) – and the Game Inverter drops to chance at 2–3 ghosts\. With random training data the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) is calibrated and the Inverter stays above chance across all difficulty levels\. The failure mechanism \(Fig\. [6](https://arxiv.org/html/2605.24152#S4.F6); extended discussion in App\. [A\.10](https://arxiv.org/html/2605.24152#A1.SS10)\) is that under expert\-skewed data, the Inverter generates off\-support plans for which the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\-predicted reward *decorrelates* from the actual game outcome\. The implication: inverse\-learning data has *different* optimality criteria than [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59)/imitation – not demonstrator competence but *safely diverse dynamics coverage*\. Our findings thus suggest that Inverters are particularly suited to settings where learning must proceed from random rather than expert data; the converse – combining Inverters with imitation learning beyond plain [BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3) – is an interesting open future direction\. In this context, as another topic for future work, joint $(f_{\theta}, g_{\phi})$ training \(Sec\. [2](https://arxiv.org/html/2605.24152#S2)\) would be an adaptive alternative to broad\-coverage data: it might keep the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) accurate on the Inverter’s evolving action distribution as the two co\-evolve\.

![Refer to caption](https://arxiv.org/html/2605.24152v1/x7.png)Figure 7: Single\-shot quantum gate synthesis on a 3\-level transmon under a known Lindbladian with a Pulse Inverter\. \(A\) Pulse $\Omega_x(t), \Omega_y(t)$ for one Haar\-sampled target $U$; Inverter: solid orange, [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21): dashed blue\. \(B\) Bloch trajectories from four axis\-aligned input states; stars mark $U|\psi\rangle$\. \(C\) Over 250 Haar U\(2\) targets: $1 - \bar{F}_{\mathrm{avg}}$, per\-input fidelity uniformity, and median wall\-clock per pulse on the same 128\-core CPU – $\sim 2700\times$ speedup over [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) \(App\. [A\.11](https://arxiv.org/html/2605.24152#A1.SS11)\)\.

### 4\.5 Application example: Quantum gate synthesis

The standard iterative\-numerical baseline for quantum gate synthesis, [Gradient Ascent Pulse Engineering](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) \([GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21)\) \[[54](https://arxiv.org/html/2605.24152#bib.bib81)\], is normally run *offline* to precompute a fixed gate library \(X, H, T, [CNOT](https://arxiv.org/html/2605.24152#A1.SS1.7.7.7), …\)\. An Inverter that emits pulses in one feedforward pass would be helpful for quantum computing applications where arbitrary unitaries are needed at $\mu$s timescales – variational/parameterized algorithms, [Quantum Error Correction](https://arxiv.org/html/2605.24152#A1.SS1.55.55.55) \([QEC](https://arxiv.org/html/2605.24152#A1.SS1.55.55.55)\), or adaptive feedback\.

Quantum gate synthesis exemplifies a regime where the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) is available in closed form \(the Lindblad channel of a given device\)\. We train an Inverter $g_{\phi}(U) \to \Omega$ to synthesize an $80$\-slice $(\Omega_x, \Omega_y)$ pulse for arbitrary single\-qubit targets $U \in \mathrm{U}(2)$ on a noisy 3\-level transmon, by minimizing $1 - \bar{F}_{\mathrm{avg}}$ through the analytic Lindblad channel as in Eq\. [2](https://arxiv.org/html/2605.24152#S2.E2) \(no [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21)\-pulse supervision; full setup, baselines, per\-target statistics, and the $\mathrm{U}(2)$/$\mathrm{SU}(2)$ sampling convention in App\. [A\.11](https://arxiv.org/html/2605.24152#A1.SS11)\)\. On 250 held\-out Haar U\(2\) targets \(Fig\. [7](https://arxiv.org/html/2605.24152#S4.F7)\), the Inverter ties [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) at the dissipation floor – the irreducible infidelity set by Lindbladian decoherence over the gate duration – \($1 - \bar{F}_{\mathrm{avg}} = 4.26 \times 10^{-4}$ vs\. $4.69 \times 10^{-4}$\) and matches its per\-input fidelity uniformity \(within $1.12\times$\), at $\sim 2700\times$ lower per\-gate cost \($2.1$ ms vs\. $5.6$ s\)\. The [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) wall\-time here reflects the open\-system dynamical map \(vectorized $9 \times 9$ propagator on the $3 \times 3$ density matrix\), 4\-input fidelity averaging, and convergence to the dissipation floor; closed\-system unitary\-synthesis benchmarks on small Hilbert spaces typically run faster \[[21](https://arxiv.org/html/2605.24152#bib.bib109)\]\. First two\-qubit results on Haar SU\(4\) reach $\bar{F}_{\mathrm{avg}} = 0.957$ \(vs\. a [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) floor at $0.998$\) at the same $\sim 4 \times 10^{4}\times$ inference speedup; closing the remaining fidelity gap is open and is discussed in App\. [A\.12](https://arxiv.org/html/2605.24152#A1.SS12)\. Concurrent work \[[67](https://arxiv.org/html/2605.24152#bib.bib108)\] reports a related [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) pulse compiler on a closed\-system [NMR](https://arxiv.org/html/2605.24152#A1.SS1.45.45.45) platform with an added risk\-averse re\-optimization layer, but with a restricted axis\-angle gate family and without a [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) accuracy / compute\-time comparison\.

## 5 Discussion, limitations, and outlook

The mammalian brain achieves fast, highly effective goal\-directed behavior leveraging paired forward/inverse internal models, open\-loop multi\-step motor commands, and the hierarchical organization of action\. Our findings show that the Inverter framework, built on the same three principles, enables fast and effective planning and control through a feedforward, sequence\-level [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\-and\-[IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) core that emits entire action sequences in single forward passes\. We find that Inverters offer consistently high task performance at a fraction of the inference compute time used by step\-wise [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) or iterative planners \(Suppl\. Fig\. [8](https://arxiv.org/html/2605.24152#A1.F8)\)\. Across the 9 [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) maze variants, Inverters closely match or improve on the strongest reported baseline on every task, by an average of $+24.2\%$ \($+19.9$ [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) points; range $-1.9\%$ to $+78.2\%$\); per\-task summary in Suppl\. Tab\. [6](https://arxiv.org/html/2605.24152#A1.T6)\. First, we discuss the contribution of the three initial, *a priori* brain\-inspired principles to these results\.

(1) Paired [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)/[IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) with exact gradients. Because the Inverter is trained by backpropagating an objective *through* a frozen [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) rather than imitating dataset actions, the action-space gradient $\partial\mathcal{J}/\partial a_{t,i}$ is exact in every dimension (Tab. [2](https://arxiv.org/html/2605.24152#S2.T2)). This gradient quality is what allows the Inverter to leave the data's action support and find solutions closer to the optimum (Fig. [4](https://arxiv.org/html/2605.24152#S4.F4)), such as synthesizing pulses that match the dissipation floor on arbitrary Haar U(2) gates without ever seeing a [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) solution (Sec. [4.5](https://arxiv.org/html/2605.24152#S4.SS5)). The matched failure mode – [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) hacking under narrow expert data (Sec. [4.4](https://arxiv.org/html/2605.24152#S4.SS4)) – is the same mechanism turning counter-productive, and motivates the data-coverage strategy (App. [A.10](https://arxiv.org/html/2605.24152#A1.SS10)).

(2) Open-loop multi-step action sequences ($T>1$). A $T$-step Inverter optimizing a Bolza objective propagates gradients across the whole action sequence, can optimize it holistically, and emits it in a single forward pass. This has two consequences. First, holistic *"gestalt-level"* optimization is reflected in the trajectories with the lowest curvature variance and lowest peak curvature (Fig. [3](https://arxiv.org/html/2605.24152#S4.F3), Panels G, H), and in the smoothest, most goal-coherent rollouts (Fig. [2](https://arxiv.org/html/2605.24152#S4.F2)). Second, ballistic execution of action sequences reduces inference compute from $\mathcal{O}(\text{horizon})$ to $\mathcal{O}(\text{horizon}/T)$ [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) forward passes per episode, yielding the observed substantial per-episode speedup.

(3) Sequenced chunks (hierarchical composition). Chaining chunks lets a higher-level Inverter target a lower-level one. Empirically this is what makes Secs. [4.2](https://arxiv.org/html/2605.24152#S4.SS2)–[4.4](https://arxiv.org/html/2605.24152#S4.SS4) work: a simple algorithmic Path Inverter routes chunks through long corridors on maze2d-medium/large once a single chunk no longer spans the path; and the fully learned $n=2$ AntMan stack – Game Inverter atop Locomotion Inverter, communicating via $32$-step waypoint plans on a 58-dimensional symbolic game state – carries the framework from continuous physics into hybrid symbolic/discrete planning.

In summary, all three of our *a priori* principles proved indispensable, each with a distinct failure mode when dropped: dropping (1) collapses to behavior cloning; dropping (2) reduces to Jordan & Rumelhart's single-step distal teacher, paying the full $T$-fold inference overhead and losing access to sequence-level beyond-data structure; dropping (3) caps the framework at a single Inverter, losing the hierarchical setups required to solve more complex, long-horizon tasks.
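The mechanism behind principle (1) can be illustrated with a deliberately tiny example. The sketch below is not the implementation used in this work: it assumes a hand-written, frozen 1-D integrator as the forward model (instead of a learned transformer FoM), a linear inverse model with one weight per time step, and illustrative values for `T`, `LAM`, the learning rate, and the sampling ranges. It shows only the training signal: the Bolza-style objective is differentiated through the forward model with respect to every action, and no dataset actions are imitated.

```python
import random

T, LAM = 4, 0.01  # chunk length and action-cost weight (illustrative assumptions)

def forward_model(s0, actions):
    """Frozen, differentiable toy forward model: s_T = s0 + sum(a_t)."""
    return s0 + sum(actions)

def inverter(w, s0, goal):
    """Toy inverse model: emits all T actions at once, a_t = w_t * (goal - s0)."""
    return [w_t * (goal - s0) for w_t in w]

def action_gradient(s0, goal, actions):
    """dJ/da_t for J = (s_T - goal)^2 + LAM * sum(a_t^2)."""
    err = forward_model(s0, actions) - goal
    # ds_T/da_t = 1 for this model, so the chain rule gives the exact gradient.
    return [2.0 * err + 2.0 * LAM * a for a in actions]

def train(steps=2000, lr=0.02, seed=0):
    rng = random.Random(seed)
    w = [0.0] * T
    for _ in range(steps):
        # Only start and goal states are sampled; no expert actions are used.
        s0, goal = rng.uniform(-1, 1), rng.uniform(-1, 1)
        actions = inverter(w, s0, goal)
        dJ_da = action_gradient(s0, goal, actions)
        # Backpropagate into the inverter weights: da_t/dw_t = (goal - s0).
        w = [w_t - lr * g * (goal - s0) for w_t, g in zip(w, dJ_da)]
    return w

w = train()
print(w, forward_model(0.0, inverter(w, 0.0, 1.0)))
```

For this toy problem the minimizer is known in closed form, $w_t = 1/(T+\lambda)$ for every $t$, so the learned weights can be checked against it; with the values above the rollout from $s_0=0$ toward goal $1$ ends slightly short of the goal because of the action-cost term.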

##### Neurosymbolic composition\.

A *fourth principle* also emerged *post hoc* from our implementations: *neurosymbolic composition*. Three of our five Inverters pair a symbolic substrate with a neural amortized inverter: (i) the algorithmic Path Inverter discretizes the data-support occupancy into a cardinal grid and routes through it by [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) over a finite cell vocabulary (Sec. [4.2](https://arxiv.org/html/2605.24152#S4.SS2)); (ii) the AntMan Game Inverter emits a sequence over a finite cardinal-direction alphabet $\{\mathrm{U}, \mathrm{D}, \mathrm{L}, \mathrm{R}\}$ via Gumbel-softmax composed with a precomputed cell $\times$ direction transition table, with wall-validity entering as a hard logical mask (Sec. [4.4](https://arxiv.org/html/2605.24152#S4.SS4)); (iii) the Pulse Inverter is conditioned on a discrete unitary-gate identity and inverts through a closed-form Lindblad master equation rather than a learned [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) (Sec. [4.5](https://arxiv.org/html/2605.24152#S4.SS5)). Therefore, we add *neurosymbolic composition* as a fourth, candidate organizing principle to the framework (Suppl. Tab. [14](https://arxiv.org/html/2605.24152#A1.T14)). Notably, our emergent use of symbolic structure is on the *representation-and-constraint* end of the neurosymbolic spectrum (discrete alphabets, [FSM](https://arxiv.org/html/2605.24152#A1.SS1.20.20.20) transitions, [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) search) rather than the *inference-and-synthesis* end (differentiable logical deduction, theorem proving, or program synthesis; e.g. DeepProbLog, [NS-CL](https://arxiv.org/html/2605.24152#A1.SS1.47.47.47)). The Inverter framework is in principle compatible with the latter – a Level-3 meta-Inverter performing differentiable program synthesis over Level-2 subgoal specifications is a natural extension. 
Neurosymbolic components emerged for engineering reasons, and algorithmic-level similarity does not by itself imply a deeper neuro-analogy \[[43](https://arxiv.org/html/2605.24152#bib.bib18)\]; yet it is interesting to note that the neurosymbolic components that turned out to be useful here all have parallels in symbolic processing in the mammalian brain: discretization of continuous space by place and grid cells \[[74](https://arxiv.org/html/2605.24152#bib.bib13), [40](https://arxiv.org/html/2605.24152#bib.bib14)\], sequential motor primitives organized in higher-order motor areas and reflected in muscle synergies and option-level chunking \[[7](https://arxiv.org/html/2605.24152#bib.bib12), [91](https://arxiv.org/html/2605.24152#bib.bib15), [20](https://arxiv.org/html/2605.24152#bib.bib16), [35](https://arxiv.org/html/2605.24152#bib.bib10), [90](https://arxiv.org/html/2605.24152#bib.bib54)\], and categorical, invariant single-cell representations in medial temporal cortex \[[79](https://arxiv.org/html/2605.24152#bib.bib17)\].
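The symbolic routing step in item (i) can be sketched as a plain breadth-first search over an occupancy grid. The grid, the function name `bfs_route`, and the cell coordinates below are hypothetical and chosen only for illustration; how the actual Path Inverter builds its grid from data support and turns cells into chunk targets is described in Sec. 4.2, not here.

```python
from collections import deque

# Cardinal moves as (row, column) offsets.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def bfs_route(free, start, goal):
    """Shortest cardinal-move path between two cells, or None if unreachable."""
    rows, cols = len(free), len(free[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:      # walk parent pointers back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        for dr, dc in MOVES.values():
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and free[nxt[0]][nxt[1]] and nxt not in parent):
                parent[nxt] = cell
                queue.append(nxt)
    return None

# 1 = cell covered by data support (traversable), 0 = wall or unsupported cell.
GRID = [
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
]
route = bfs_route(GRID, (0, 0), (2, 0))
print(route)
```

Because the search runs over a small finite cell vocabulary, its cost is negligible next to a network forward pass, which is one plausible reason such a component can be kept algorithmic rather than learned.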

##### Task\-specificity vs\. generalization\.

In addition to neurosymbolic composition, a complementary axis along which the implementations of different Inverters in our framework vary is task-specific adaptation vs. architectural generality. Across the 9 maze2d/antmaze-v2 benchmarks alone, the strongest offline-[RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) scores require switching among three structurally distinct algorithms ([AWAC](https://arxiv.org/html/2605.24152#A1.SS1.1.1.1), [SAC-N](https://arxiv.org/html/2605.24152#A1.SS1.62.62.62), [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58)), each selected per benchmark. Our framework uses a single universal architecture (universal causal-transformer [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) and Inverter PyTorch classes plus a simple Path Inverter) across all 9 maze variants, with the auxiliary-loss setup as the main per-task adaptation. The quantum case required an [MLP](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35)-based rather than transformer-based Pulse Inverter. A full list is given in Suppl. Tab. [15](https://arxiv.org/html/2605.24152#A1.T15). 
Such per-task adaptations may be useful and practical, but moving the framework toward more fully *learnable*, *differentiable*, and *neural-substrate-unified* versions – replacing remaining algorithmic, analytic, or non-neural components with learned differentiable neural alternatives – is a natural direction for future work, paralleled in [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) by meta-[RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59), learned [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) algorithms, and reformulations as conditional sequence modeling. [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) is a well-suited substrate for this aim: its Inverter core is differentiable end-to-end and its slots ([FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19), [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23), objective $\mathcal{J}$, hierarchy depth) are modular and well-defined. Suppl. Tab. [13](https://arxiv.org/html/2605.24152#A1.T13) outlines four complementary strategy classes – Data, Model, Objective, Deployment – along which this direction can be approached. 
In this way, shared organizational principles may be balanced with specialized instances, as in mammalian motor organization, where the same paired forward/inverse internal-model brain architecture supports radically different motor niches: primates carry a direct corticomotor neuronal pathway for independent finger control \[[64](https://arxiv.org/html/2605.24152#bib.bib62), [83](https://arxiv.org/html/2605.24152#bib.bib63)\], echolocating bats overlay a ms-scale audio-motor loop on the same paired forward/inverse architecture \[[85](https://arxiv.org/html/2605.24152#bib.bib64)\], rodents add a dedicated brainstem [Central Pattern Generator](https://arxiv.org/html/2605.24152#A1.SS1.10.10.10) ([CPG](https://arxiv.org/html/2605.24152#A1.SS1.10.10.10)) for $\sim$5–12 Hz whisking \[[57](https://arxiv.org/html/2605.24152#bib.bib65)\], and elephants control a hydrostatic trunk via a massively expanded facial motor nucleus \[[87](https://arxiv.org/html/2605.24152#bib.bib66)\] – without departing from the unifying planning and control framework.

##### Compute\-time\.

The $30$–$100\times$ per-episode compute-time reduction that we observed reflects a reduction in the *number* of [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) forward passes per episode; whether and how this transfers to practical advantages depends on the deployment regime. At batch size $1$ on a single device – our measured setup, and the relevant one for many edge applications or embedded targets – small models are kernel-launch-limited rather than FLOPs-limited \[[97](https://arxiv.org/html/2605.24152#bib.bib101), [78](https://arxiv.org/html/2605.24152#bib.bib100)\] (App. [A.3](https://arxiv.org/html/2605.24152#A1.SS3)), so the reduction in [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) invocations transfers directly into wall time. In batched cloud serving, by contrast, the regime flips to FLOPs-bound, and our $1.5$M-parameter $T=128$ transformer performs $\sim 4$–$10\times$ *more* FLOPs per episode than a small step-wise [MLP](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35). The practically interesting scenarios therefore include: (i) *saving energy* per episode on edge devices, dominated by [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) invocations rather than raw FLOPs at our model size; (ii) *resource sharing* on a real robot, where most control ticks become [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46)-free action playback, freeing the GPU/CPU for perception, [Simultaneous Localization and Mapping](https://arxiv.org/html/2605.24152#A1.SS1.64.64.64) ([SLAM](https://arxiv.org/html/2605.24152#A1.SS1.64.64.64)), or online learning; and (iii) *iterative-numerics replacement*, where the comparison is to non-amortized solvers and the speedups are largest (e.g. $\sim 4\times10^{4}$ for the Pulse Inverter over [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21), Sec. [4.5](https://arxiv.org/html/2605.24152#S4.SS5)). 
Across these regimes, the framework's deployment-relevant compute-time advantage is carried by the per-episode reduction in [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) invocations – a lever that can be compounded by the engineering directions outlined next.
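The regime flip described above can be made concrete with a back-of-the-envelope cost model. The sketch assumes that each forward pass costs the larger of a fixed launch floor and its arithmetic time, which is a simplification, and every number in it (launch floor, device throughput, FLOPs per call, horizon, chunk length) is a hypothetical placeholder rather than a measurement from this work. The placeholders are chosen so that the call-count ratio and the FLOPs ratio fall inside the ranges quoted in the text.

```python
def episode_time(n_calls, flops_per_call, launch_floor_s, flops_per_s):
    """Per-episode time under a simple max(launch floor, compute time) model."""
    per_call = max(launch_floor_s, flops_per_call / flops_per_s)
    return n_calls * per_call

# Hypothetical step-wise MLP: one small call per control tick.
MLP = dict(n_calls=1000, flops_per_call=1e5)
# Hypothetical chunked Inverter: 32 calls per episode, 5x the total FLOPs.
INV = dict(n_calls=32, flops_per_call=5 * 1000 * 1e5 / 32)

# Batch-1 edge regime: a 0.1 ms launch floor dominates both models.
edge_speedup = (episode_time(**MLP, launch_floor_s=1e-4, flops_per_s=1e12)
                / episode_time(**INV, launch_floor_s=1e-4, flops_per_s=1e12))

# Batched serving regime: launch cost amortized away, so only FLOPs count.
cloud_speedup = (episode_time(**MLP, launch_floor_s=0.0, flops_per_s=1e12)
                 / episode_time(**INV, launch_floor_s=0.0, flops_per_s=1e12))

print(edge_speedup, cloud_speedup)
```

Under these assumptions the chunked model is about 31 times faster per episode in the launch-bound regime and 5 times slower in the FLOPs-bound regime, matching the qualitative argument that the advantage is carried by the number of invocations.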

##### Accelerating real\-time applications\.

Two natural directions could push the framework toward sub-millisecond per-invocation inference on embedded targets. *First, inference-engine compilation*: torch.compile \[[4](https://arxiv.org/html/2605.24152#bib.bib99)\], CUDA graphs, ONNX/TensorRT export, or porting to JAX with jit \[[13](https://arxiv.org/html/2605.24152#bib.bib83)\] – the latter being what makes [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) the fastest per-pass entry in Tab. [5](https://arxiv.org/html/2605.24152#A1.T5) – can collapse the kernel-launch floor by another $5$–$10\times$, architecture-agnostically. *Second, long-horizon-friendly architectures*: structured state-space models \[[38](https://arxiv.org/html/2605.24152#bib.bib97)\] such as Mamba \[[37](https://arxiv.org/html/2605.24152#bib.bib96)\] or linear-attention transformers \[[52](https://arxiv.org/html/2605.24152#bib.bib98)\] replace attention's $\mathcal{O}(T^{2})$ cost with linear-in-$T$ scaling and admit streaming inference at constant per-step state. At the chunk lengths explored here ($T=16$–$128$) the asymptotic advantage of linear-scan architectures is small, but it may become crucial once [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) is scaled to $T\gg10^{3}$ – the regime relevant for long-horizon symbolic planning, dexterous manipulation, or long-range locomotion.

##### Ballistic execution\.

A second implication of chunked emission, distinct from the per-call latency budget above, is that the emitted action sequence can be *played out* at the actuator's native rate, independently of [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46)-inference latency *or* the sensory-feedback loop rate. This open-loop ballistic regime mirrors how mammalian motor control handles actions too fast for closed-loop correction: saccades, ballistic reaching, drumming, the kHz audio–motor loop of bats, and defensive reflexes. On the engineering side, the same decoupling becomes crucial whenever the closed loop is the bottleneck rather than the actuator: high-speed manipulation or insect-scale flight, where flight dynamics do not admit a full perception–[NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46)–actuation round-trip; surgical robotics combining kHz haptic loops with $\sim 100$ Hz vision; and reactive collision avoidance requiring sub-tens-of-ms responses. Closed-loop replanning with Inverters would enter at the sequence boundary rather than at every tick, leveraging the same coarse-feedback / fine-feedforward division biological motor systems use for fast skilled behavior.
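A minimal control loop makes the division between coarse feedback and fine feedforward explicit. In the sketch below, `plan_chunk` is a hypothetical stand-in for a single Inverter call and `step_env` for the plant; both are toy functions chosen for illustration, as are the horizon and chunk length. The state is observed only when the action buffer runs empty, so the number of planner calls per episode is $\lceil \text{horizon}/T \rceil$ and all other ticks are pure playback.

```python
import math
from collections import deque

def run_episode(plan_chunk, step_env, state, horizon, chunk_len):
    """Play chunked plans open-loop; use state feedback only at chunk boundaries."""
    buffer, nn_calls = deque(), 0
    for _ in range(horizon):
        if not buffer:                        # sequence boundary: re-plan
            buffer.extend(plan_chunk(state, chunk_len))
            nn_calls += 1
        state = step_env(state, buffer.popleft())   # planner-free playback tick
    return state, nn_calls

GOAL = 1.0

def plan_chunk(state, chunk_len):
    """Toy planner: spread half of the remaining gap evenly over one chunk."""
    return [0.5 * (GOAL - state) / chunk_len] * chunk_len

def step_env(state, action):
    """Toy plant: a 1-D integrator."""
    return state + action

final_state, nn_calls = run_episode(plan_chunk, step_env, 0.0,
                                    horizon=64, chunk_len=16)
print(final_state, nn_calls)
```

In a real system the playback branch would run at the actuator rate while the planner call is scheduled asynchronously; the sketch keeps both in one loop only to show where feedback enters.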

##### Extensions\.

Our experiments in the present work are deterministic, fully observable, and single-agent; when extending to stochastic, partially observable, and multi-agent settings, Inverters must accommodate the added uncertainty. A natural first direction therefore is probabilistic Inverters, in which any of the four Inverter components can independently become stochastic – the forward model $f_{\theta}$, the inverse model $g_{\phi}$, the task context $c$, or the input state $s_{0}$ (the [POMDP](https://arxiv.org/html/2605.24152#A1.SS1.54.54.54) / belief case) – with two cross-cutting axes parameterizing the objective: where preferences live (Bolza cost vs. prior over outcomes) and how uncertainty enters (pragmatic, risk-sensitive, or epistemic-aware). Stochastic optimal control, control-as-inference, Bayes-adaptive / posterior-sampling, and variational Inverters then appear as cells of that grid; active-inference Inverters would occupy the specific outcome-prior $\times$ expected-free-energy corner \[[28](https://arxiv.org/html/2605.24152#bib.bib4)\]. A second direction is latent Inverters, e.g., with bidirectional latent world models \[[88](https://arxiv.org/html/2605.24152#bib.bib47), [42](https://arxiv.org/html/2605.24152#bib.bib55)\] as a more abstract planning substrate. 
A third direction is deeper hierarchies: for example, a Level 3 Inverter above Level 2, treating the choice of subgoal specification for Level 2 as itself an inverse problem given a task description. On the neuro side, two natural next candidate principles for integration into our framework would be predictive coding \[[82](https://arxiv.org/html/2605.24152#bib.bib102), [9](https://arxiv.org/html/2605.24152#bib.bib103)\] and active sensing \[[100](https://arxiv.org/html/2605.24152#bib.bib104), [34](https://arxiv.org/html/2605.24152#bib.bib105)\] – both particularly relevant to the move from the offline regime studied here to online learning and control: predictive coding as an error-correction signal driven by [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) prediction errors, and active sensing as an alternative principled remedy for the narrow-data [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)-hacking failure mode (Sec. [4.4](https://arxiv.org/html/2605.24152#S4.SS4)). A third, complementary candidate is dual-process organization along the lines of the fast/slow distinction \[[51](https://arxiv.org/html/2605.24152#bib.bib110)\]: having established [FoMs](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)/[IMs](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) for a fast 'System 1' that emits a plan in a single feedforward pass, the same trained modules could be repurposed as differentiable substrates for a slower, deliberative 'System 2' (e.g., iterative rollouts, inner-loop search, or test-time refinement through the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)) when accuracy or safety budgets justify the additional cost. The same neural infrastructure would then span a continuum from amortized reactive control to iterative deliberation.
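One way to picture the 'System 2' option is test-time refinement: take the plan emitted by the fast pass and improve it with a few gradient steps through the forward model. The sketch below is an assumption-laden toy, not a proposed implementation: the forward model is a hand-written 1-D integrator, `fast_plan` is a made-up undershooting plan standing in for an Inverter output, and the step count, learning rate, and cost weight are arbitrary.

```python
def terminal_state(s0, actions):
    """Toy differentiable forward model: s_T = s0 + sum(a_t)."""
    return s0 + sum(actions)

def objective(s0, goal, actions, lam=0.01):
    """Bolza-style cost: terminal error plus a small running action cost."""
    err = terminal_state(s0, actions) - goal
    return err ** 2 + lam * sum(a * a for a in actions)

def refine(s0, goal, actions, steps=50, lr=0.05, lam=0.01):
    """Gradient descent on the plan itself, through the forward model."""
    actions = list(actions)
    for _ in range(steps):
        err = terminal_state(s0, actions) - goal
        grads = [2.0 * err + 2.0 * lam * a for a in actions]
        actions = [a - lr * g for a, g in zip(actions, grads)]
    return actions

fast_plan = [0.2, 0.2, 0.2, 0.2]        # hypothetical single-pass plan (undershoots)
slow_plan = refine(0.0, 1.0, fast_plan)
print(objective(0.0, 1.0, fast_plan), objective(0.0, 1.0, slow_plan))
```

The refinement cost grows with the number of steps, which is the trade-off named in the text: it would be spent only when accuracy or safety budgets justify leaving the single-pass regime.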

##### Inverse World Models\.

Together these extensions form a trajectory scaling the present [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)/[IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) core up to paired forward and inverse world models. Such inverse world models would apply the [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) paradigm not to a deterministic, fully observable dynamics model but to a world model in the modern sense – latent, stochastic, partially observable, possibly multi-agent, language-conditioned where useful – yielding an Inverter that, in a single feedforward pass, emits an action plan that inverts the forward world model to guide goal-directed behavior. In summary, we propose that the Inverter framework, based on the principles of neuro-inspired Inverse Learning, paves the way to a versatile class of world-interfaces, particularly for latency- and resource-critical embodied AI.

## References

- \[1\]B\. Amos, I\. Jimenez, J\. Sacks, B\. Boots, and J\. Z\. Kolter\(2018\)Differentiable mpc for end\-to\-end planning and control\.Advances in neural information processing systems31\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px2.p1.1)\.
- \[2\]B\. Amos and J\. Z\. Kolter\(2017\)OptNet: differentiable optimization as a layer in neural networks\.InProceedings of the 34th International Conference on Machine Learning \(ICML\),Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px2.p1.1)\.
- \[3\]G\. An, S\. Moon, J\. Kim, and H\. O\. Song\(2021\)Uncertainty\-based offline reinforcement learning with diversified Q\-ensemble\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.2](https://arxiv.org/html/2605.24152#A1.SS2.SSS0.Px2.p3.1),[Supplementary Table 5](https://arxiv.org/html/2605.24152#A1.T5.60.44.6),[Supplementary Table 5](https://arxiv.org/html/2605.24152#A1.T5.65.49.6)\.
- \[4\]J\. Ansel, E\. Yang, H\. He, N\. Gimelshein, A\. Jain, M\. Voznesensky, B\. Bao, P\. Bell, D\. Berard, E\. Burovski,et al\.\(2024\)Pytorch 2: faster machine learning through dynamic python bytecode transformation and graph compilation\.InProceedings of the 29th ACM international conference on architectural support for programming languages and operating systems, volume 2,pp\. 929–947\.Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx4.p1.7.7)\.
- \[5\]S\. Arridge, P\. Maass, O\. Öktem, and C\. Schönlieb\(2019\)Solving inverse problems using data\-driven models\.Acta numerica28,pp\. 1–174\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px1.p1.1.1)\.
- \[6\]M\. Athans and P\. L\. Falb\(1966\)Optimal control: an introduction to the theory and its applications\.McGraw\-Hill\.Cited by:[§A\.5](https://arxiv.org/html/2605.24152#A1.SS5.p2.1)\.
- \[7\]T\. Ball, A\. Schreiber, B\. Feige, M\. Wagner, C\. H\. Lücking, and R\. Kristeva\-Feige\(1999\)The role of higher\-order motor areas in voluntary movement as revealed by high\-resolution eeg and fmri\.Neuroimage10\(6\),pp\. 682–694\.Cited by:[3rd item](https://arxiv.org/html/2605.24152#S1.I1.i3.p1.1.2),[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx1.p1.3)\.
- \[8\]A\. Bar, G\. Zhou, D\. Tran, T\. Darrell, and Y\. LeCun\(2024\)Navigation world models\.External Links:2412\.03572,[Link](https://arxiv.org/abs/2412.03572)Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px4.p1.1)\.
- \[9\]A\. M\. Bastos, W\. M\. Usrey, R\. A\. Adams, G\. R\. Mangun, P\. Fries, and K\. J\. Friston\(2012\)Canonical microcircuits for predictive coding\.Neuron76\(4\),pp\. 695–711\.External Links:[Document](https://dx.doi.org/10.1016/j.neuron.2012.10.038)Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx6.p1.5.9)\.
- \[10\]R\. E\. Bellman\(1957\)Dynamic programming\.Princeton University Press\.Cited by:[§2](https://arxiv.org/html/2605.24152#S2.p1.13)\.
- \[11\]A\. Bemporad, M\. Morari, V\. Dua, and E\. N\. Pistikopoulos\(2002\)The explicit linear quadratic regulator for constrained systems\.Automatica38\(1\),pp\. 3–20\.External Links:[Document](https://dx.doi.org/10.1016/S0005-1098%2801%2900174-1)Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px2.p1.1.2)\.
- \[12\]O\. Bolza\(1909\)Vorlesungen über variationsrechnung\.B\. G\. Teubner,Leipzig and Berlin\.Cited by:[§2](https://arxiv.org/html/2605.24152#S2.p1.13)\.
- \[13\]J\. Bradbury, R\. Frostig, P\. Hawkins, M\. J\. Johnson, C\. Leary, D\. Maclaurin, G\. Necula, A\. Paszke, J\. VanderPlas, S\. Wanderman\-Milne,et al\.\(2018\)JAX: composable transformations of python\+ numpy programs\.Cited by:[§A\.11](https://arxiv.org/html/2605.24152#A1.SS11.SSS0.Px2.p1.12),[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx4.p1.7.7)\.
- \[14\]J\. Carius, F\. Farshidian, and M\. Hutter\(2020\)Mpc\-net: a first principles guided policy search\.IEEE Robotics and Automation Letters5\(2\),pp\. 2897–2904\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px2.p1.1)\.
- \[15\]L\. Chen, K\. Lu, A\. Rajeswaran, K\. Lee, A\. Grover, M\. Laskin, P\. Abbeel, A\. Srinivas, and I\. Mordatch\(2021\)Decision Transformer: reinforcement learning via sequence modeling\.InAdvances in Neural Information Processing Systems,Vol\.34\.Cited by:[Supplementary Table 5](https://arxiv.org/html/2605.24152#A1.T5.70.54.6),[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px4.p1.1)\.
- \[16\]C\. Chi, Z\. Xu, S\. Feng, E\. Cousineau, Y\. Du, B\. Burchfiel, R\. Tedrake, and S\. Song\(2025\)Diffusion policy: visuomotor policy learning via action diffusion\.The International Journal of Robotics Research44\(10\-11\),pp\. 1684–1704\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px4.p1.1.1),[Figure 4](https://arxiv.org/html/2605.24152#S4.F4),[Figure 4](https://arxiv.org/html/2605.24152#S4.F4.6.3.3)\.
- \[17\]H\. Chung, J\. Kim, M\. T\. Mccann, M\. L\. Klasky, and J\. C\. Ye\(2022\)Diffusion posterior sampling for general noisy inverse problems\.arXiv preprint arXiv:2209\.14687\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px1.p1.1.1)\.
- \[18\]K\. Clark, P\. Vicol, K\. Swersky, and D\. Fleet\(2024\)Directly fine\-tuning diffusion models on differentiable rewards\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 4793–4822\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px6.p1.1.1)\.
- \[19\]E\. R\. F\. W\. Crossman and P\. J\. Goodeve\(1983\)Feedback control of hand\-movement and Fitts’ law\.The Quarterly Journal of Experimental Psychology Section A35\(2\),pp\. 251–278\.External Links:[Document](https://dx.doi.org/10.1080/14640748308402133)Cited by:[§4\.2](https://arxiv.org/html/2605.24152#S4.SS2.p2.20)\.
- \[20\]A\. d’Avella, P\. Saltiel, and E\. Bizzi\(2003\)Combinations of muscle synergies in the construction of a natural motor behavior\.Nature neuroscience6\(3\),pp\. 300–308\.Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx1.p1.3)\.
- \[21\]M\. Dall’Ara, M\. Koppenhöfer, F\. Reiter, T\. Wellens, S\. Montangero, and W\. Hahn\(2026\)Random layers for quantum optimal control with exponential expressivity\.arXiv preprint arXiv:2603\.08948\.Cited by:[§4\.5](https://arxiv.org/html/2605.24152#S4.SS5.p2.15.2)\.
- \[22\]M\. P\. Deisenroth and C\. E\. Rasmussen\(2011\)PILCO: a model\-based and data\-efficient approach to policy search\.InProceedings of the 28th International Conference on Machine Learning,pp\. 465–472\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px3.p1.1)\.
- \[23\]M\. Desmurget and S\. Grafton\(2000\)Forward modeling allows feedback control for fast reaching movements\.Trends in Cognitive Sciences4\(11\),pp\. 423–431\.Cited by:[2nd item](https://arxiv.org/html/2605.24152#S1.I1.i2.p1.1)\.
- \[24\]J\. Diedrichsen and K\. Kornysheva\(2015\)Motor skill learning between selection and execution\.Trends in Cognitive Sciences19\(4\),pp\. 227–233\.Cited by:[3rd item](https://arxiv.org/html/2605.24152#S1.I1.i3.p1.1.2)\.
- \[25\]C\. Domingo i Enrich, M\. Drozdzal, B\. Karrer, and R\. T\. Chen\(2025\)Adjoint matching: fine\-tuning flow and diffusion generative models with memoryless stochastic optimal control\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 53791–53846\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px6.p1.1.1)\.
- \[26\]P\. M\. Fitts\(1954\)The information capacity of the human motor system in controlling the amplitude of movement\.Journal of Experimental Psychology47\(6\),pp\. 381–391\.Cited by:[§4\.2](https://arxiv.org/html/2605.24152#S4.SS2.p2.20)\.
- \[27\]P\. Florence, C\. Lynch, A\. Zeng, O\. A\. Ramirez, A\. Wahid, L\. Downs, A\. Wong, J\. Lee, I\. Mordatch, and J\. Tompson\(2021\)Implicit behavioral cloning\.In5th Annual Conference on Robot Learning,External Links:[Link](https://openreview.net/forum?id=rif3a5NAxU6)Cited by:[Figure 4](https://arxiv.org/html/2605.24152#S4.F4),[Figure 4](https://arxiv.org/html/2605.24152#S4.F4.6.3.3)\.
- \[28\]K\. Friston\(2010\)The free\-energy principle: a unified brain theory?\.Nature Reviews Neuroscience11\(2\),pp\. 127–138\.External Links:[Document](https://dx.doi.org/10.1038/nrn2787)Cited by:[§1](https://arxiv.org/html/2605.24152#S1.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx6.p1.5.5)\.
- \[29\]J\. Fu, A\. Kumar, O\. Nachum, G\. Tucker, and S\. Levine\(2020\)D4RL: datasets for deep data\-driven reinforcement learning\.External Links:2004\.07219,[Link](https://arxiv.org/abs/2004.07219)Cited by:[§A\.2](https://arxiv.org/html/2605.24152#A1.SS2.p1.1),[§A\.9](https://arxiv.org/html/2605.24152#A1.SS9.SSS0.Px1.p1.1)\.
- \[30\]S\. Fujimoto and S\. S\. Gu\(2021\)A minimalist approach to offline reinforcement learning\.InAdvances in Neural Information Processing Systems,Vol\.34,pp\. 20132–20145\.Cited by:[§A\.2](https://arxiv.org/html/2605.24152#A1.SS2.SSS0.Px2.p3.1),[Supplementary Table 5](https://arxiv.org/html/2605.24152#A1.T5.48.32.6)\.
- \[31\]C\. R\. Garrett, R\. Chitnis, R\. Holladay, B\. Kim, T\. Silver, L\. P\. Kaelbling, and T\. Lozano\-Pérez\(2021\)Integrated task and motion planning\.Annual review of control, robotics, and autonomous systems4\(1\),pp\. 265–293\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px5.p1.1.1)\.
- \[32\]S\. J\. Gershman, E\. J\. Horvitz, and J\. B\. Tenenbaum\(2015\)Computational rationality: a converging paradigm for intelligence in brains, minds, and machines\.Science349\(6245\),pp\. 273–278\.External Links:[Document](https://dx.doi.org/10.1126/science.aac6076)Cited by:[§1](https://arxiv.org/html/2605.24152#S1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.24152#S1.SS0.SSS0.Px2.p1.4.1)\.
- \[33\]G\. Gigerenzer and H\. Brighton\(2009\)Homo heuristicus: why biased minds make better inferences\.Topics in cognitive science1\(1\),pp\. 107–143\.Cited by:[§1](https://arxiv.org/html/2605.24152#S1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.24152#S1.SS0.SSS0.Px2.p1.4.1)\.
- \[34\]J\. Gottlieb, P\. Oudeyer, M\. Lopes, and A\. Baranes\(2013\)Information\-seeking, curiosity, and attention: computational and neural mechanisms\.Trends in Cognitive Sciences17\(11\),pp\. 585–593\.External Links:[Document](https://dx.doi.org/10.1016/j.tics.2013.09.001)Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx6.p1.5.9)\.
- \[35\]A\. M\. Graybiel\(1998\)The basal ganglia and chunking of action repertoires\.Neurobiology of Learning and Memory70\(1–2\),pp\. 119–136\.External Links:[Document](https://dx.doi.org/10.1006/nlme.1998.3843)Cited by:[3rd item](https://arxiv.org/html/2605.24152#S1.I1.i3.p1.1.2),[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx1.p1.3)\.
- \[36\]K\. Gregor and Y\. LeCun\(2010\)Learning fast approximations of sparse coding\.InProceedings of the 27th international conference on international conference on machine learning,pp\. 399–406\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px1.p1.1.1)\.
- \[37\]A\. Gu and T\. Dao\(2023\)Mamba: linear\-time sequence modeling with selective state spaces\.arXiv preprint arXiv:2312\.00752\.Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx4.p1.7.7)\.
- \[38\]A\. Gu, K\. Goel, and C\. Ré\(2021\)Efficiently modeling long sequences with structured state spaces\.arXiv preprint arXiv:2111\.00396\.Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx4.p1.7.7)\.
- \[39\]D\. Hafner, T\. Lillicrap, J\. Ba, and M\. Norouzi\(2020\)Dream to control: learning behaviors by latent imagination\.InInternational Conference on Learning Representations,Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px3.p1.1)\.
- \[40\]T\. Hafting, M\. Fyhn, S\. Molden, M\. Moser, and E\. I\. Moser\(2005\)Microstructure of a spatial map in the entorhinal cortex\.Nature436\(7052\),pp\. 801–806\.Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx1.p1.3)\.
- \[41\]N\. Hansen, H\. Su, and X\. Wang\(2024\)TD\-MPC2: scalable, robust world models for continuous control\.InInternational Conference on Learning Representations,Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px3.p1.1)\.
- \[42\]N\. Hansen, J\. S\. V, V\. Sobal, Y\. LeCun, X\. Wang, and H\. Su\(2025\)Hierarchical world models as visual whole\-body humanoid controllers\.InInternational Conference on Learning Representations,Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px5.p1.1),[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx6.p1.5)\.
- \[43\]D\. Hassabis, D\. Kumaran, C\. Summerfield, and M\. Botvinick\(2017\)Neuroscience\-inspired artificial intelligence\.Neuron95\(2\),pp\. 245–258\.Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx1.p1.3)\.
- \[44\]U\. Hasson\(2025\)Uncovering a timescale hierarchy by studying the brain in a natural context\.The Journal of Neuroscience45\(12\),pp\. e2368242025\.Cited by:[3rd item](https://arxiv.org/html/2605.24152#S1.I1.i3.p1.1.2)\.
- \[45\]N\. Heess, G\. Wayne, D\. Silver, T\. Lillicrap, T\. Erez, and Y\. Tassa\(2015\)Learning continuous control policies by stochastic value gradients\.InAdvances in Neural Information Processing Systems,Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px2.p1.1)\.
- \[46\]M\. T\. Jackson, U\. Berdica, J\. Liesen, S\. Whiteson, and J\. N\. Foerster\(2025\)A clean slate for offline reinforcement learning\.External Links:2504\.11453,[Link](https://arxiv.org/abs/2504.11453)Cited by:[§A\.15](https://arxiv.org/html/2605.24152#A1.SS15.SSS0.Px3),[§A\.15](https://arxiv.org/html/2605.24152#A1.SS15.SSS0.Px3.p1.2)\.
- \[47\]M\. Janner, Y\. Du, J\. B\. Tenenbaum, and S\. Levine\(2022\)Planning with diffusion for flexible behavior synthesis\.InProceedings of the 39th International Conference on Machine Learning \(ICML\),Proceedings of Machine Learning Research, Vol\.162,pp\. 9902–9915\.Cited by:[§A\.2](https://arxiv.org/html/2605.24152#A1.SS2.SSS0.Px2.p3.1),[§A\.4](https://arxiv.org/html/2605.24152#A1.SS4.SSS0.Px1.p1.7),[§A\.8](https://arxiv.org/html/2605.24152#A1.SS8.p1.1),[Supplementary Table 5](https://arxiv.org/html/2605.24152#A1.T5.76.60.7),[Supplementary Table 7](https://arxiv.org/html/2605.24152#A1.T7),[Supplementary Table 8](https://arxiv.org/html/2605.24152#A1.T8),[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px4.p1.1)\.
- \[48\]M\. Janner, Q\. Li, and S\. Levine\(2021\)Offline reinforcement learning as one big sequence modeling problem\.InAdvances in Neural Information Processing Systems,Vol\.34,pp\. 1273–1286\.Cited by:[§A\.8](https://arxiv.org/html/2605.24152#A1.SS8.SSS0.Px4.p1.6),[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px4.p1.1)\.
- \[49\]Z\. Jiang, T\. Zhang, M\. Janner, Y\. Li, T\. Rocktäschel, E\. Grefenstette, and Y\. Tian\(2022\)Efficient planning in a compact latent action space\.In3rd Offline RL Workshop: Offline RL as a “Launchpad”,External Links:[Link](https://openreview.net/forum?id=pVBETTS2av)Cited by:[§A\.8](https://arxiv.org/html/2605.24152#A1.SS8.SSS0.Px4.p1.6)\.
- \[50\]M\. I\. Jordan and D\. E\. Rumelhart\(1992\)Forward models: supervised learning with a distal teacher\.Cognitive Science16\(3\),pp\. 307–354\.External Links:[Document](https://dx.doi.org/10.1207/s15516709cog1603%5F1)Cited by:[1st item](https://arxiv.org/html/2605.24152#S1.I1.i1.p1.1),[§1](https://arxiv.org/html/2605.24152#S1.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.24152#S2.SS0.SSS0.Px1.p1.18),[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px3.p1.1),[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px6.p1.1.1),[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px7.p1.2.2)\.
- \[51\]D\. Kahneman\(2011\)Thinking, fast and slow\.Macmillan\.Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx6.p1.5.9)\.
- \[52\]A\. Katharopoulos, A\. Vyas, N\. Pappas, and F\. Fleuret\(2020\)Transformers are RNNs: fast autoregressive transformers with linear attention\.InInternational Conference on Machine Learning,pp\. 5156–5165\.Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx4.p1.7.7)\.
- \[53\]M\. Kawato\(1999\)Internal models for motor control and trajectory planning\.Current Opinion in Neurobiology9\(6\),pp\. 718–727\.External Links:[Document](https://dx.doi.org/10.1016/S0959-4388%2899%2900028-8)Cited by:[1st item](https://arxiv.org/html/2605.24152#S1.I1.i1.p1.1),[§1](https://arxiv.org/html/2605.24152#S1.SS0.SSS0.Px1.p1.1)\.
- \[54\]N\. Khaneja, T\. Reiss, C\. Kehlet, T\. Schulte\-Herbrüggen, and S\. J\. Glaser\(2005\)Optimal control of coupled spin dynamics: design of NMR pulse sequences by gradient ascent algorithms\.Journal of Magnetic Resonance172\(2\),pp\. 296–305\.Cited by:[§4\.5](https://arxiv.org/html/2605.24152#S4.SS5.p1.1)\.
- \[55\]P\. Kidger\(2022\)On neural differential equations\.arXiv preprint arXiv:2202\.02435\.Cited by:[§A\.11](https://arxiv.org/html/2605.24152#A1.SS11.SSS0.Px2.p1.12)\.
- \[56\]D\. E\. Kirk\(1970\)Optimal control theory: an introduction\.Prentice\-Hall\.Cited by:[§A\.5](https://arxiv.org/html/2605.24152#A1.SS5.p2.1)\.
- \[57\]D\. Kleinfeld and M\. Deschênes\(2011\)Neuronal basis for object location in the vibrissa scanning sensorimotor system\.Neuron72\(3\),pp\. 455–468\.External Links:[Document](https://dx.doi.org/10.1016/j.neuron.2011.10.009)Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx2.p1.2)\.
- \[58\]D\. C\. Knill and A\. Pouget\(2004\)The Bayesian brain: the role of uncertainty in neural coding and computation\.Trends in Neurosciences27\(12\),pp\. 712–719\.External Links:[Document](https://dx.doi.org/10.1016/j.tins.2004.10.007)Cited by:[§1](https://arxiv.org/html/2605.24152#S1.SS0.SSS0.Px1.p1.1)\.
- \[59\]I\. Kostrikov, A\. Nair, and S\. Levine\(2022\)Offline reinforcement learning with implicit Q\-learning\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=68n2s9ZJWF8)Cited by:[§A\.2](https://arxiv.org/html/2605.24152#A1.SS2.SSS0.Px2.p3.1),[Supplementary Table 5](https://arxiv.org/html/2605.24152#A1.T5.43.27.6)\.
- \[60\]A\. Kumar, J\. Hong, A\. Singh, and S\. Levine\(2022\)When should we prefer offline reinforcement learning over behavioral cloning?\.InInternational Conference on Learning Representations,Cited by:[Figure 4](https://arxiv.org/html/2605.24152#S4.F4),[Figure 4](https://arxiv.org/html/2605.24152#S4.F4.6.3.3)\.
- \[61\]A\. Kumar, A\. Zhou, G\. Tucker, and S\. Levine\(2020\)Conservative Q\-learning for offline reinforcement learning\.InAdvances in Neural Information Processing Systems,Vol\.33,pp\. 1179–1191\.Cited by:[§A\.2](https://arxiv.org/html/2605.24152#A1.SS2.SSS0.Px2.p3.1),[Supplementary Table 5](https://arxiv.org/html/2605.24152#A1.T5.28.12.6)\.
- \[62\]V\. Kurenkov and S\. Kolesnikov\(2022\)Showing your offline reinforcement learning work: online evaluation budget matters\.InProceedings of the 39th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.162,pp\. 11729–11752\.Cited by:[§A\.15](https://arxiv.org/html/2605.24152#A1.SS15.SSS0.Px3),[§A\.15](https://arxiv.org/html/2605.24152#A1.SS15.SSS0.Px3.p1.2),[§A\.15](https://arxiv.org/html/2605.24152#A1.SS15.SSS0.Px4.p1.3)\.
- \[63\]Y\. LeCun\(2022\)A path towards autonomous machine intelligence, version 0\.9\.2\.Note:OpenReview position paper, 2022\-06\-27\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px3.p1.1),[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px4.p1.1)\.
- \[64\]R\. N\. Lemon\(2008\)Descending pathways in motor control\.Annual Review of Neuroscience31,pp\. 195–218\.External Links:[Document](https://dx.doi.org/10.1146/annurev.neuro.31.060407.125547)Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx2.p1.2)\.
- \[65\]S\. Levine and V\. Koltun\(2013\)Guided policy search\.InInternational conference on machine learning,pp\. 1–9\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px2.p1.1)\.
- \[66\]F\. Lieder and T\. L\. Griffiths\(2020\)Resource\-rational analysis: understanding human cognition as the optimal use of limited computational resources\.Behavioral and brain sciences43,pp\. e1\.Cited by:[§1](https://arxiv.org/html/2605.24152#S1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.24152#S1.SS0.SSS0.Px2.p1.4.1)\.
- \[67\]A\. F\. Lipaei, E\. Khaleghian, S\. Aslan, G\. Göral, Z\. Lin, and Ö\. E\. Müstecaplıoğlu\(2026\)Fidelity\-informed neural pulse compilation of a continuous family of quantum gates with uncertainty\-margin analysis\.arXiv preprint arXiv:2604\.11314\.Cited by:[§4\.5](https://arxiv.org/html/2605.24152#S4.SS5.p2.18.4)\.
- \[68\]X\. Lv, Z\. Zhang, Y\. Li, Y\. Huo, S\. Ju, X\. Li, C\. Hong, T\. Wang, Y\. Wang, P\. Sun, C\. Yu, J\. Xu, and B\. Zheng\(2026\)DecisionLLM: large language models for long sequence decision exploration\.External Links:2601\.10148,[Link](https://arxiv.org/abs/2601.10148)Cited by:[Supplementary Table 5](https://arxiv.org/html/2605.24152#A1.T5.82.66.7),[§4\.1](https://arxiv.org/html/2605.24152#S4.SS1.p2.10)\.
- \[69\]T\. Marcucci and R\. Tedrake\(2020\)Warm start of mixed\-integer programs for model predictive control of hybrid systems\.IEEE Transactions on Automatic Control66\(6\),pp\. 2433–2448\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px5.p1.1.1)\.
- \[70\]D\. E\. Meyer, R\. A\. Abrams, S\. Kornblum, C\. E\. Wright, and J\. E\. K\. Smith\(1988\)Optimality in human motor performance: ideal control of rapid aimed movements\.Psychological Review95\(3\),pp\. 340–370\.External Links:[Document](https://dx.doi.org/10.1037/0033-295X.95.3.340)Cited by:[§4\.2](https://arxiv.org/html/2605.24152#S4.SS2.p2.20)\.
- \[71\]I\. Mordatch and E\. Todorov\(2014\)Combining the benefits of function approximation and trajectory optimization\.InRobotics: Science and Systems,External Links:[Document](https://dx.doi.org/10.15607/RSS.2014.X.052)Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px2.p1.1)\.
- \[72\]A\. Nair, A\. Gupta, M\. Dalal, and S\. Levine\(2020\)AWAC: accelerating online reinforcement learning with offline datasets\.External Links:2006\.09359Cited by:[§A\.2](https://arxiv.org/html/2605.24152#A1.SS2.SSS0.Px2.p3.1),[Supplementary Table 5](https://arxiv.org/html/2605.24152#A1.T5.38.22.6)\.
- \[73\]A\. Y\. Ng, S\. Russell,et al\.\(2000\)Algorithms for inverse reinforcement learning\.InICML,Vol\.1,pp\. 2\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px7.p1.2.2)\.
- \[74\]J\. O’Keefe and J\. Dostrovsky\(1971\)The hippocampus as a spatial map: preliminary evidence from unit activity in the freely\-moving rat\.Brain Research\.Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx1.p1.3)\.
- \[75\]G\. Ongie, A\. Jalal, C\. A\. Metzler, R\. G\. Baraniuk, A\. G\. Dimakis, and R\. Willett\(2020\)Deep learning techniques for inverse problems in imaging\.IEEE Journal on Selected Areas in Information Theory1\(1\),pp\. 39–56\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px1.p1.1.1)\.
- \[76\]T\. L\. Paine, C\. Paduraru, A\. Michi, C\. Gulcehre, K\. Zolna, A\. Novikov, Z\. Wang, and N\. de Freitas\(2020\)Hyperparameter selection for offline reinforcement learning\.External Links:2007\.09055Cited by:[§A\.15](https://arxiv.org/html/2605.24152#A1.SS15.SSS0.Px3.p1.2),[§A\.15](https://arxiv.org/html/2605.24152#A1.SS15.SSS0.Px4.p1.3)\.
- \[77\]L\. S\. Pontryagin, V\. G\. Boltyansky, R\. V\. Gamkrelidze, and E\. F\. Mishchenko\(1962\)The mathematical theory of optimal processes\.Interscience Publishers,New York\.Cited by:[§2](https://arxiv.org/html/2605.24152#S2.p1.13)\.
- \[78\]R\. Pope, S\. Douglas, A\. Chowdhery, J\. Devlin, J\. Bradbury, J\. Heek, K\. Xiao, S\. Agrawal, and J\. Dean\(2023\)Efficiently scaling transformer inference\.Proceedings of machine learning and systems5,pp\. 606–624\.Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx3.p1.8.8)\.
- \[79\]R\. Q\. Quiroga, L\. Reddy, G\. Kreiman, C\. Koch, and I\. Fried\(2005\)Invariant visual representation by single neurons in the human brain\.Nature435\(7045\),pp\. 1102–1107\.Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx1.p1.3)\.
- \[80\]A\. H\. Qureshi, A\. Simeonov, M\. J\. Bency, and M\. C\. Yip\(2019\)Motion planning networks\.InIEEE International Conference on Robotics and Automation,pp\. 2118–2124\.External Links:[Document](https://dx.doi.org/10.1109/ICRA.2019.8793889)Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px4.p1.1)\.
- \[81\]V\. Raman, A\. Donzé, M\. Maasoumy, R\. M\. Murray, A\. Sangiovanni\-Vincentelli, and S\. A\. Seshia\(2014\)Model predictive control with signal temporal logic specifications\.In53rd IEEE Conference on Decision and Control,pp\. 81–87\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px5.p1.1.1)\.
- \[82\]R\. P\. N\. Rao and D\. H\. Ballard\(1999\)Predictive coding in the visual cortex: a functional interpretation of some extra\-classical receptive\-field effects\.Nature Neuroscience2\(1\),pp\. 79–87\.External Links:[Document](https://dx.doi.org/10.1038/4580)Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx6.p1.5.9)\.
- \[83\]J\. Rathelot and P\. L\. Strick\(2009\)Subdivisions of primary motor cortex based on cortico\-motoneuronal cells\.Proceedings of the National Academy of Sciences106\(3\),pp\. 918–923\.External Links:[Document](https://dx.doi.org/10.1073/pnas.0808362106)Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx2.p1.2)\.
- \[84\]J\. Schmidhuber\(1990\)Making the world differentiable: on using self\-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non\-stationary environments\.Technical ReportFKI\-126\-90,Institut für Informatik, Technische Universität München\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px3.p1.1)\.
- \[85\]H\. Schnitzler and A\. Denzinger\(2011\)Auditory fovea and Doppler shift compensation: adaptations for flutter detection in echolocating bats using CF\-FM signals\.Journal of Comparative Physiology A197\(5\),pp\. 541–559\.External Links:[Document](https://dx.doi.org/10.1007/s00359-010-0569-6)Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx2.p1.2)\.
- \[86\]J\. Schrittwieser, I\. Antonoglou, T\. Hubert, K\. Simonyan, L\. Sifre, S\. Schmitt, A\. Guez, E\. Lockhart, D\. Hassabis, T\. Graepel, T\. P\. Lillicrap, and D\. Silver\(2020\)Mastering Atari, Go, chess and shogi by planning with a learned model\.Nature588\(7839\),pp\. 604–609\.External Links:[Document](https://dx.doi.org/10.1038/s41586-020-03051-4)Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px4.p1.1)\.
- \[87\]J\. Shoshani, W\. J\. Kupsky, and G\. H\. Marchant\(2006\)Elephant brain: part I: gross morphology, functions, comparative anatomy, and evolution\.Brain Research Bulletin70\(2\),pp\. 124–157\.External Links:[Document](https://dx.doi.org/10.1016/j.brainresbull.2006.03.016)Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx2.p1.2)\.
- \[88\]V\. Sobal, W\. Zhang, K\. Cho, R\. Balestriero, T\. G\. J\. Rudner, and Y\. LeCun\(2025\)Learning from reward\-free offline data: a case for planning with latent dynamics models\.External Links:2502\.14819Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px4.p1.1),[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx6.p1.5)\.
- \[89\]A\. Srinivas, A\. Jabri, P\. Abbeel, S\. Levine, and C\. Finn\(2018\)Universal planning networks: learning generalizable representations for visuomotor control\.InProceedings of the 35th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.80,pp\. 4732–4741\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px3.p1.1)\.
- \[90\]R\. S\. Sutton, D\. Precup, and S\. Singh\(1999\)Between MDPs and semi\-MDPs: a framework for temporal abstraction in reinforcement learning\.Artificial Intelligence112\(1–2\),pp\. 181–211\.External Links:[Document](https://dx.doi.org/10.1016/S0004-3702%2899%2900052-1)Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px5.p1.1),[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx1.p1.3)\.
- \[91\]J\. Tanji and K\. Shima\(1994\)Role for supplementary motor area cells in planning several movements ahead\.Nature371\(6496\),pp\. 413–416\.Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx1.p1.3)\.
- \[92\]D\. Tarasov, V\. Kurenkov, A\. Nikulin, and S\. Kolesnikov\(2023\)Revisiting the minimalist approach to offline reinforcement learning\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§A\.2](https://arxiv.org/html/2605.24152#A1.SS2.SSS0.Px2.p3.1),[Supplementary Table 5](https://arxiv.org/html/2605.24152#A1.T5.55.39.8)\.
- \[93\]D\. Tarasov, A\. Nikulin, D\. Akimov, V\. Kurenkov, and S\. Kolesnikov\(2023\)CORL: research\-oriented deep offline reinforcement learning library\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§A\.2](https://arxiv.org/html/2605.24152#A1.SS2.SSS0.Px2.p3.1),[§A\.4](https://arxiv.org/html/2605.24152#A1.SS4.SSS0.Px1.p1.7),[§A\.8](https://arxiv.org/html/2605.24152#A1.SS8.p1.1),[§A\.9](https://arxiv.org/html/2605.24152#A1.SS9.SSS0.Px1.p1.1),[§A\.9](https://arxiv.org/html/2605.24152#A1.SS9.SSS0.Px9.p1.5),[Supplementary Table 5](https://arxiv.org/html/2605.24152#A1.T5.23.7.6),[Supplementary Table 5](https://arxiv.org/html/2605.24152#A1.T5.33.17.6),[Supplementary Table 9](https://arxiv.org/html/2605.24152#A1.T9),[Figure 2](https://arxiv.org/html/2605.24152#S4.F2),[Figure 2](https://arxiv.org/html/2605.24152#S4.F2.4.2.2)\.
- \[94\]E\. Todorov and M\. I\. Jordan\(2002\)Optimal feedback control as a theory of motor coordination\.Nature Neuroscience5\(11\),pp\. 1226–1235\.External Links:[Document](https://dx.doi.org/10.1038/nn963)Cited by:[Figure 1](https://arxiv.org/html/2605.24152#S1.F1),[Figure 1](https://arxiv.org/html/2605.24152#S1.F1.14.7.7)\.
- \[95\]P\. Virtanen, R\. Gommers, T\. E\. Oliphant, M\. Haberland, T\. Reddy, D\. Cournapeau, E\. Burovski, P\. Peterson, W\. Weckesser, J\. Bright,et al\.\(2020\)SciPy 1\.0: fundamental algorithms for scientific computing in Python\.Nature Methods17\(3\),pp\. 261–272\.Cited by:[§A\.11](https://arxiv.org/html/2605.24152#A1.SS11.SSS0.Px3.p1.10)\.
- \[96\]C\. Wang, M\. Uehara, Y\. He, A\. Wang, A\. Lal, T\. Jaakkola, S\. Levine, A\. Regev, H\. Wang, and T\. Biancalani\(2025\)Fine\-tuning discrete diffusion models via reward optimization with applications to DNA and protein design\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 47871–47899\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px6.p1.1.1)\.
- \[97\]S\. Williams, A\. Waterman, and D\. Patterson\(2009\)Roofline: an insightful visual performance model for multicore architectures\.Communications of the ACM52\(4\),pp\. 65–76\.Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx3.p1.8.8)\.
- \[98\]D\. M\. Wolpert and M\. Kawato\(1998\)Multiple paired forward and inverse models for motor control\.Neural Networks11\(7–8\),pp\. 1317–1329\.External Links:[Document](https://dx.doi.org/10.1016/S0893-6080%2898%2900066-5)Cited by:[1st item](https://arxiv.org/html/2605.24152#S1.I1.i1.p1.1),[§1](https://arxiv.org/html/2605.24152#S1.SS0.SSS0.Px1.p1.1)\.
- \[99\]J\. Xu, X\. Liu, Y\. Wu, Y\. Tong, Q\. Li, M\. Ding, J\. Tang, and Y\. Dong\(2023\)ImageReward: learning and evaluating human preferences for text\-to\-image generation\.Advances in Neural Information Processing Systems36,pp\. 15903–15935\.Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px6.p1.1.1)\.
- \[100\]S\. C\. Yang, D\. M\. Wolpert, and M\. Lengyel\(2016\)Theoretical perspectives on active sensing\.Current Opinion in Behavioral Sciences11,pp\. 100–108\.External Links:[Document](https://dx.doi.org/10.1016/j.cobeha.2016.06.009)Cited by:[§5](https://arxiv.org/html/2605.24152#S5.SS0.SSS0.P0.SPx6.p1.5.9)\.
- \[101\]E\. P\. Zehr and D\. G\. Sale\(1994\)Ballistic movement: muscle activation and neuromuscular adaptation\.Canadian Journal of applied physiology19\(4\),pp\. 363–378\.Cited by:[2nd item](https://arxiv.org/html/2605.24152#S1.I1.i2.p1.1)\.
- \[102\]T\. Z\. Zhao, V\. Kumar, S\. Levine, and C\. Finn\(2023\)Learning fine\-grained bimanual manipulation with low\-cost hardware\.InRobotics: Science and Systems,External Links:[Document](https://dx.doi.org/10.15607/RSS.2023.XIX.016)Cited by:[§3](https://arxiv.org/html/2605.24152#S3.SS0.SSS0.Px4.p1.1.1)\.

## Appendix A Technical appendices and supplementary material

### A\.1 Abbreviations

- AWAC: Advantage-Weighted Actor-Critic
- AWG: Arbitrary Waveform Generator
- BC: Behavior Cloning
- BFGS: Broyden–Fletcher–Goldfarb–Shanno (quasi-Newton optimizer)
- BFS: Breadth-first search
- CBOP: Conservative Bayesian Model-based Value Expansion for Offline Policy Optimization
- CNOT: Controlled-NOT
- COMBO: Conservative Offline Model-Based Policy Optimization
- CORL: Clean Offline Reinforcement Learning
- CPG: Central Pattern Generator
- CQL: Conservative Q-Learning
- CT: Computed Tomography
- D4RL: Datasets for Deep Data-Driven Reinforcement Learning
- DL: Deep Learning
- DP: Dynamic Programming
- DT: Decision Transformer
- EDAC: Ensemble-Diversified Actor-Critic
- EMA: Exponential Moving Average
- FoM: Forward model
- FSM: Finite State Machine
- GRAPE: Gradient Ascent Pulse Engineering
- IL: Inverse Learning
- IM: Inverse model
- IQL: Implicit Q-Learning
- IRL: Inverse Reinforcement Learning
- ISL: Inverse Sequence Learning
- KAK: Cartan (KAK) decomposition
- KL: Kullback–Leibler (divergence)
- LEQ: Lower Expectile Q-learning
- LISTA: Learned Iterative Shrinkage-Thresholding Algorithm
- MAML: Model-Agnostic Meta-Learning
- MAPLE: Model-based Adaptable Policy LEarning
- MBRL: Model-Based Reinforcement Learning
- MILP: Mixed-Integer Linear Programming
- MLP: Multilayer perceptron
- MOBILE: MOdel-Bellman Inconsistency penalized offLinE Policy Optimization
- MOPO: Model-based Offline Policy Optimization
- MOReL: Model-based Offline Reinforcement Learning
- MPC: Model Predictive Control
- MPNet: Motion Planning Networks
- MPPI: Model Predictive Path Integral
- MRI: Magnetic Resonance Imaging
- MuJoCo: Multi-Joint dynamics with Contact
- NeSy: Neuro-Symbolic AI
- NMR: Nuclear Magnetic Resonance
- NN: Neural network
- NS-CL: Neuro-Symbolic Concept Learner
- OC: Optimal Control
- ODE: Ordinary Differential Equation
- OOD: Out-of-Distribution
- PD: Proportional–Derivative
- PG: Policy Gradient
- PILCO: Probabilistic Inference for Learning Control
- POMDP: Partially Observable Markov Decision Process
- QEC: Quantum Error Correction
- QP: Quadratic Programming
- RAMBO: Robust Adversarial Model-Based Offline RL
- ReBRAC: Re-evaluated Behavior-Regularized Actor Critic
- RL: Reinforcement Learning
- SAC: Soft Actor-Critic
- SAC-N: Soft Actor-Critic with $N$-critic ensemble
- SEM: Standard Error of the Mean
- SLAM: Simultaneous Localization and Mapping
- SNR: Signal-to-Noise Ratio
- SR: Success Rate
- TAP: Trajectory Autoencoding Planner
- TD: Temporal Difference
- TD-MPC2: Temporal Difference Learning for Model Predictive Control 2
- TD3+BC: Twin Delayed DDPG + Behavior Cloning
- TT: Trajectory Transformer

### A.2 Experimental details for `maze2d-umaze-v1`

The `maze2d-umaze-v1` benchmark from [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) \[[29](https://arxiv.org/html/2605.24152#bib.bib67)\] is a continuous-control navigation task in which a point mass must reach a goal position in a U-shaped corridor. The offline dataset consists of undirected trajectories generated by a [Proportional–Derivative](https://arxiv.org/html/2605.24152#A1.SS1.51.51.51) ([PD](https://arxiv.org/html/2605.24152#A1.SS1.51.51.51)) controller navigating between random waypoints; these follow the angular U-shaped geometry of the maze walls, visible in the heatmap in Figure [2](https://arxiv.org/html/2605.24152#S4.F2).

Our final [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) is a chunked causal Transformer with chunk length $L=16$, $d_{\text{model}}=128$, 4 attention heads, 4 layers, and 847k parameters, trained for 300 epochs with Gaussian state noise $\sigma=0.01$. We then freeze this model and train the final causal Transformer [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) for 2000 epochs through it, using segment length $16$, $d_{\text{model}}=192$, 6 heads, 4 layers. The [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) maps $(s_0, s_g) \mapsto a_{1:128}$ in one forward pass.

Training goal sampling: each minibatch sample is a $(s_0, s_g)$ pair with $s_0 = s_t$ drawn uniformly from the offline buffer and $s_g = s_{t+H}$ taken at a fixed horizon offset $H=128$ ahead in the same trajectory; the sampler is episode-aware (pairs that would cross an episode boundary are rejected). This matches the deployment query distribution: at test time the Path Inverter emits sub-goals at `wp_spacing` $= 6$ m on `maze2d-medium/large` and the goal directly on `umaze`, both well within one [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23)-horizon of typical dataset displacement.
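The episode-aware sampling described above can be illustrated with a minimal rejection-sampling sketch. It is not the authors' code: the flat-buffer layout, the function name `sample_goal_pairs`, and the use of rejection sampling to enforce the boundary rule are assumptions made for illustration.

```python
import random

def sample_goal_pairs(episode_lengths, horizon=128, batch_size=4, rng=None):
    """Sample (t, t + horizon) index pairs that never cross an episode boundary.

    episode_lengths: lengths of consecutive episodes stored in one flat buffer
    (a hypothetical layout). Returns (start_index, goal_index) pairs.
    """
    rng = rng or random.Random()
    # Flat-buffer [start, end) range of each episode.
    bounds, offset = [], 0
    for n in episode_lengths:
        bounds.append((offset, offset + n))
        offset += n
    total = offset
    if not any(end - start > horizon for start, end in bounds):
        raise ValueError("no episode is longer than the horizon")

    pairs = []
    while len(pairs) < batch_size:
        t = rng.randrange(total)  # uniform draw over the whole buffer
        start, end = next(b for b in bounds if b[0] <= t < b[1])
        if t + horizon < end:     # goal index lies in the same episode: accept
            pairs.append((t, t + horizon))
        # otherwise the pair would cross an episode boundary: reject and redraw
    return pairs
```

Because rejected draws are simply redrawn, the accepted start indices remain uniform over all valid positions in the buffer.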

#### Maze2d [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) training objective.

Let $\hat{a}_{1:H} = g_{\phi}(s_0, s_g)$ be the [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) output, $\hat{s}_{1:H} = f_{\theta}^{(1:H)}(s_0, \hat{a}_{1:H})$ the segmented [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) rollout (the rollout is chained across $\lceil H/T \rceil$ chunks of length $T=16$ so the gradient flows end-to-end through the frozen [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)), and $\hat{p}_t = \hat{s}_t^{(xy)}$ the predicted positions. We train $g_{\phi}$ to minimize

$$\mathcal{L}_{\text{IM-2d}} \;=\; \lambda_{\text{term}}\,\big\|\hat{s}_{H}-s_{g}\big\|^{2} \;+\; \lambda_{\text{dense}}\,\Big[-\tfrac{1}{H}\sum_{t=1}^{H}\exp\big(-\|\hat{p}_{t}-p_{g}\|\big)\Big] \;+\; \lambda_{\text{bnd}}\,\tfrac{1}{H}\sum_{t=1}^{H}\big(1-O(\hat{p}_{t})\big), \tag{3}$$

with $\lambda_{\text{term}}=0$, $\lambda_{\text{dense}}=\lambda_{\text{bnd}}=5$ in the final runs (the terminal term is absorbed into the dense reward via the exponential profile and was set to $0$ on `umaze`). $O: [x_{\text{min}}, x_{\text{max}}] \times [y_{\text{min}}, y_{\text{max}}] \to [0,1]$ is a precomputed support map of the offline data, queried at the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)-rolled-out positions by bilinear interpolation (`F.grid_sample`, differentiable in $\hat{p}_t$). Construction: discretize the data's $(x,y)$ extent into a $64 \times 64$ grid, count visits per cell, apply the chosen *occupancy mode* (`binary`: $\mathbb{1}[\text{count}>0]$ in published runs, or `frequency`: $\log(1+\text{count})$), Gaussian-smooth with $\sigma_{\text{cells}}=1.5$ for sub-cell gradients, and rescale to $[0,1]$.
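To make the structure of Eq. (3) and the support-map construction concrete, here is a dependency-free sketch in plain Python. It is an illustration under assumptions, not the paper's implementation: it is not differentiable (the paper queries the map with `F.grid_sample` so gradients reach the IM), it truncates the Gaussian kernel at roughly $3\sigma$, rescales by the maximum, and assumes the first two state entries are $(x, y)$. All function names are hypothetical.

```python
import math

def build_support_map(points, x_range, y_range, n=64, sigma_cells=1.5):
    """Binary occupancy grid over (x, y), Gaussian-smoothed, scaled to [0, 1]."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0.0] * n for _ in range(n)]
    for x, y in points:
        col = min(n - 1, max(0, int((x - x0) / (x1 - x0) * n)))
        row = min(n - 1, max(0, int((y - y0) / (y1 - y0) * n)))
        grid[row][col] = 1.0  # binary mode: 1[count > 0]

    # Separable Gaussian kernel, truncated at about 3 sigma (an assumption).
    radius = max(1, round(3 * sigma_cells))
    kernel = [math.exp(-0.5 * (d / sigma_cells) ** 2) for d in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [w / total for w in kernel]

    def blur_rows(g):
        # Convolve every row with the kernel, clamping indices at the borders.
        return [[sum(w * row[min(n - 1, max(0, c + d - radius))]
                     for d, w in enumerate(kernel))
                 for c in range(n)] for row in g]

    def transpose(g):
        return [list(col) for col in zip(*g)]

    grid = transpose(blur_rows(transpose(blur_rows(grid))))
    peak = max(max(row) for row in grid)
    return [[v / peak for v in row] for row in grid] if peak > 0 else grid

def support_at(grid, x, y, x_range, y_range):
    """Bilinear interpolation of the support map at a continuous position."""
    n = len(grid)
    (x0, x1), (y0, y1) = x_range, y_range
    # Continuous cell coordinates; cell centers sit at integer values.
    u = min(n - 1.0, max(0.0, (x - x0) / (x1 - x0) * n - 0.5))
    v = min(n - 1.0, max(0.0, (y - y0) / (y1 - y0) * n - 0.5))
    c0, r0 = int(u), int(v)
    c1, r1 = min(n - 1, c0 + 1), min(n - 1, r0 + 1)
    fu, fv = u - c0, v - r0
    top = grid[r0][c0] * (1 - fu) + grid[r0][c1] * fu
    bottom = grid[r1][c0] * (1 - fu) + grid[r1][c1] * fu
    return top * (1 - fv) + bottom * fv

def im_2d_loss(states, goal_state, support, x_range, y_range,
               lam_term=0.0, lam_dense=5.0, lam_bnd=5.0):
    """Eq. (3) on a predicted rollout s_1..s_H; states are tuples starting with (x, y)."""
    H = len(states)
    gx, gy = goal_state[:2]
    term = sum((a - b) ** 2 for a, b in zip(states[-1], goal_state))
    dense = -sum(math.exp(-math.hypot(s[0] - gx, s[1] - gy)) for s in states) / H
    bnd = sum(1.0 - support_at(support, s[0], s[1], x_range, y_range) for s in states) / H
    return lam_term * term + lam_dense * dense + lam_bnd * bnd
```

With the published weights, a rollout that stays on visited cells and approaches the goal scores lower than one that leaves the data support, since both the dense and the boundary term penalize the latter.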

#### Loss\-balance toggle\.

The Table [16](https://arxiv.org/html/2605.24152#A1.T16) entry “boundary mode: {binary, z-score}” is unrelated to the occupancy mode above; “z-score” refers to a separate, orthogonal balance toggle in which each raw loss term ($\mathcal{L}_{\text{term}}, \mathcal{L}_{\text{dense}}, \mathcal{L}_{\text{bnd}}, \mathcal{L}_{\text{fid}}$) is divided by its running [Exponential Moving Average](https://arxiv.org/html/2605.24152#A1.SS1.18.18.18) ([EMA](https://arxiv.org/html/2605.24152#A1.SS1.18.18.18)) standard deviation (momentum $0.99$, $100$-batch warm-up) before its $\lambda$ weight is applied, so $\lambda$ operates on unit-variance signals. The published `maze2d` runs leave this toggle off (raw losses) and use `binary` occupancy.
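One way such a balance toggle could work is sketched below. The text specifies only the momentum and the warm-up length, so the rest is assumed: the class name `EmaStdNormalizer`, estimating the standard deviation from EMAs of the first and second moments, and passing raw values through during warm-up.

```python
import math

class EmaStdNormalizer:
    """Divide a scalar loss by a running EMA estimate of its standard deviation."""

    def __init__(self, momentum=0.99, warmup=100, eps=1e-8):
        self.momentum, self.warmup, self.eps = momentum, warmup, eps
        self.count = 0
        self.mean = 0.0  # EMA of the loss value
        self.sq = 0.0    # EMA of the squared loss value

    def __call__(self, value):
        self.count += 1
        if self.count == 1:
            self.mean, self.sq = value, value * value
        else:
            m = self.momentum
            self.mean = m * self.mean + (1 - m) * value
            self.sq = m * self.sq + (1 - m) * value * value
        if self.count <= self.warmup:
            return value  # assumption: statistics are too noisy to use yet
        std = math.sqrt(max(self.sq - self.mean ** 2, 0.0))
        return value / (std + self.eps)
```

In use, one normalizer would be kept per loss term and the $\lambda$ weight applied to its output, for example `lam_dense * norm_dense(raw_dense)`.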

At evaluation time we consider two execution modes for the same final [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) checkpoint. In *one-shot* mode, the controller executes the generated $128$-step action sequence open-loop and only requests a new plan once that horizon has been consumed. In *replanning* mode, it executes the first $K=64$ actions, observes the new state, and replans if the goal has not yet been reached.
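Both execution modes can be viewed as one receding-horizon loop that differs only in how many actions of each plan are executed before replanning. The sketch below uses hypothetical callables (`plan`, `step`, `reached`) and a toy 1-D system in place of the IM and the maze environment; it illustrates the control flow only.

```python
def run_episode(plan, step, reached, state, goal, execute_k=64, max_steps=300):
    """Execute the first `execute_k` actions of each plan open-loop, then replan.

    plan(state, goal) -> list of actions (one forward pass of a planner)
    step(state, action) -> next state
    reached(state, goal) -> bool, checked only between chunks
    """
    steps = n_plans = 0
    while steps < max_steps and not reached(state, goal):
        actions = plan(state, goal)
        n_plans += 1
        for action in actions[:execute_k]:
            state = step(state, action)  # no feedback inside the chunk
            steps += 1
            if steps == max_steps:
                break
    return state, steps, n_plans


# Toy 1-D stand-ins (hypothetical): the planner emits 128 small steps toward the goal.
def toy_plan(x, goal, horizon=128, max_step=0.01):
    return [max(-max_step, min(max_step, (goal - x) / horizon))] * horizon

def toy_step(x, action):
    return x + action

def toy_reached(x, goal):
    return abs(x - goal) < 0.05

one_shot = run_episode(toy_plan, toy_step, toy_reached, 0.0, 10.0, execute_k=128)
replanning = run_episode(toy_plan, toy_step, toy_reached, 0.0, 10.0, execute_k=64)
```

Setting `execute_k` equal to the plan length gives one-shot execution; smaller values trade extra planner calls for more frequent feedback.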

For comparison we trained eight offline [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) baselines with the [CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) library \[[93](https://arxiv.org/html/2605.24152#bib.bib57)\]: [BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3), [TD3+BC](https://arxiv.org/html/2605.24152#A1.SS1.70.70.70) \[[30](https://arxiv.org/html/2605.24152#bib.bib68)\], [CQL](https://arxiv.org/html/2605.24152#A1.SS1.11.11.11) \[[61](https://arxiv.org/html/2605.24152#bib.bib69)\], [IQL](https://arxiv.org/html/2605.24152#A1.SS1.24.24.24) \[[59](https://arxiv.org/html/2605.24152#bib.bib71)\], [AWAC](https://arxiv.org/html/2605.24152#A1.SS1.1.1.1) \[[72](https://arxiv.org/html/2605.24152#bib.bib72)\], [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) \[[92](https://arxiv.org/html/2605.24152#bib.bib73)\], [SAC-N](https://arxiv.org/html/2605.24152#A1.SS1.62.62.62) \[[3](https://arxiv.org/html/2605.24152#bib.bib74)\], and Diffuser \[[47](https://arxiv.org/html/2605.24152#bib.bib44)\]. We verified that our reproduced scores are consistent with the reference values reported by [CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) and evaluated all methods on 100 episodes under the [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13)-official protocol (env-default goal, random initial state per `env.reset()`, score computed via `env.get_normalized_score()`).

Supplementary Table 5: Per-episode inference compute on `maze2d-umaze-v1` (100 episodes, 300 steps/ep). Each cell reports the number of [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) forward passes per episode, mean wall time per pass (CUDA-synced, GPU, PyTorch, batch 1), *other*, the per-episode non-[NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) overhead inside the algorithm (Inverter: per-chunk replan dispatch only (no Path Inverter on this maze); Diffuser: the [PD](https://arxiv.org/html/2605.24152#A1.SS1.51.51.51) tracker running at every env step; step-wise [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) baselines: 0), the *sum* $=$ #[NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46)-passes $\times$ ms/[NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46)-pass $+$ other, and $\times$ slower relative to our fastest configuration on this maze. † JAX+JIT; all other PyTorch. Inverter rows are mean $\pm$ std over $4$ seed [IMs](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23); the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) is not used at deployment, only the [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23)-related compute time is counted. ‡ DecisionLLM sum is an order-of-magnitude estimate from the paper.

| Method | score ↑ | std | #NN-passes | ms/NN-pass | other | sum | × slower |
| --- | --- | --- | --- | --- | --- | --- | --- |
| [BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3) \[[93](https://arxiv.org/html/2605.24152#bib.bib57)\] | 0.36 | 8.69 | 300 | 1.70 ms | – | 511.3 ms | 44.7× |
| [CQL](https://arxiv.org/html/2605.24152#A1.SS1.11.11.11) \[[61](https://arxiv.org/html/2605.24152#bib.bib69)\] | −8.90 | 6.11 | 300 | 1.84 ms | – | 551.2 ms | 48.2× |
| [BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3)-10% \[[93](https://arxiv.org/html/2605.24152#bib.bib57)\] | 12.18 | 4.29 | 300 | 1.25 ms | – | 374.9 ms | 32.8× |
| [AWAC](https://arxiv.org/html/2605.24152#A1.SS1.1.1.1) \[[72](https://arxiv.org/html/2605.24152#bib.bib72)\] | 82.67 | 28.30 | 300 | 1.28 ms | – | 382.5 ms | 33.4× |
| [IQL](https://arxiv.org/html/2605.24152#A1.SS1.24.24.24) \[[59](https://arxiv.org/html/2605.24152#bib.bib71)\] | 42.11 | 0.58 | 300 | 1.69 ms | – | 508.4 ms | 44.4× |
| [TD3+BC](https://arxiv.org/html/2605.24152#A1.SS1.70.70.70) \[[30](https://arxiv.org/html/2605.24152#bib.bib68)\] | 29.41 | 12.31 | 300 | 1.67 ms | – | 502.1 ms | 43.9× |
| [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) \[[92](https://arxiv.org/html/2605.24152#bib.bib73)\] | 106.87 | 22.16 | 300 | 0.21 ms† | – | 63.0 ms† | 5.5× |
| [SAC-N](https://arxiv.org/html/2605.24152#A1.SS1.62.62.62) \[[3](https://arxiv.org/html/2605.24152#bib.bib74)\] | 130.59 | 16.52 | 300 | 1.80 ms | – | 539.3 ms | 47.1× |
| [EDAC](https://arxiv.org/html/2605.24152#A1.SS1.17.17.17) \[[3](https://arxiv.org/html/2605.24152#bib.bib74)\] | 95.26 | 6.39 | 300 | 1.74 ms | – | 522.1 ms | 45.6× |
| [DT](https://arxiv.org/html/2605.24152#A1.SS1.16.16.16) \[[15](https://arxiv.org/html/2605.24152#bib.bib43)\] | 18.08 | 25.42 | 300 | 3.57 ms | – | 1.07 s | 93.7× |
| Diffuser \[[47](https://arxiv.org/html/2605.24152#bib.bib44)\] | 116.32 | 34.70 | 64 | 6.64 ms | 3.0 ms | 427.8 ms | 37.4× |
| DecisionLLM \[[68](https://arxiv.org/html/2605.24152#bib.bib58)\] | 145.20 | 35.32 | 300 | 33.30 ms | – | 10.00 s‡ | 873.9× |
| Inverter $K=16$ | **164.25** | **0.54** | 19 | 3.36 ms | 9.3 ms | **73.2 ms** | **6.4×** |
| Inverter $K=32$ | **164.77** | **0.34** | 10 | 2.67 ms | 4.3 ms | **31.0 ms** | **2.7×** |
| Inverter $K=64$ | **165.19** | **0.80** | 5 | 3.14 ms | 2.9 ms | **18.7 ms** | **1.6×** |
| Inverter $K=128$ | **161.64** | **2.17** | 3 | 3.19 ms | 1.9 ms | **11.4 ms** | **1.0×** |
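As a quick consistency check on the *sum* column, the short sketch below recomputes the Inverter rows from the caption's formula. The relation "one IM pass per $K$ executed actions over 300 steps" is inferred from the reported pass counts rather than stated in the text, and the numbers are copied from the table, so small differences come from rounding in the published per-pass times.

```python
import math

def episode_cost_ms(nn_passes, ms_per_pass, other_ms=0.0):
    """sum = #NN-passes x ms/NN-pass + other, all in milliseconds."""
    return nn_passes * ms_per_pass + other_ms

EPISODE_STEPS = 300
# (K, ms per pass, other ms, reported sum ms) copied from the Inverter rows.
inverter_rows = [(16, 3.36, 9.3, 73.2), (32, 2.67, 4.3, 31.0),
                 (64, 3.14, 2.9, 18.7), (128, 3.19, 1.9, 11.4)]

for k, ms, other, reported in inverter_rows:
    passes = math.ceil(EPISODE_STEPS / k)  # inferred: one IM pass per K executed actions
    total = episode_cost_ms(passes, ms, other)
    print(f"K={k:3d}: {passes:2d} passes, {total:6.2f} ms (table: {reported} ms)")

# Step-wise baseline for comparison: BC needs one pass per environment step.
bc_total = episode_cost_ms(300, 1.70)
print(f"BC: {bc_total:.1f} ms per episode (table: 511.3 ms)")
```

The comparison shows where the saving comes from: per-pass times are within a factor of about two of each other, while the number of passes per episode drops from 300 to between 3 and 19.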

### A.3 Timing measurement protocol

This appendix gives the exact protocol behind Table [5](https://arxiv.org/html/2605.24152#A1.T5) and explains why the Inverter transformer is not meaningfully slower than the baselines’ [MLPs](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35) on a single forward pass.

#### Terminology.

Throughout this paper, *“inference compute time”* refers to per-episode wall time on a single device at batch 1; in a launch-overhead-limited small-model regime this is the deployment-relevant metric, distinct from FLOPs. Per-pass parity between the Inverter transformer and the baseline [MLPs](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35) (see “Transformer vs. [MLP](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35) compute time” below) confirms we are in this regime on the architectures evaluated here.

#### Hardware and software.

All timings are measured on a single GPU (`cuda:0`) with PyTorch in `eval` mode under `torch.no_grad`. [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) is the sole exception: its inference is measured from JAX+JIT code and is therefore marked with † in Table [5](https://arxiv.org/html/2605.24152#A1.T5). [MuJoCo](https://arxiv.org/html/2605.24152#A1.SS1.43.43.43) stepping uses the default single-thread CPU implementation.

#### Per-pass wall time.

Each neural-network forward pass is timed using `time.perf_counter()` wrapped by `torch.cuda.synchronize()` on both sides, so the reported wall time reflects the actual completion of the GPU work, not just the kernel-launch queue. Before the 100-episode evaluation starts we run 10 warm-up forward passes on dummy inputs (to flush kernel autotuning and CUDA graph compilation) and then clear the timing buffers. The same CUDA-sync wrapper is used around `env.step` (to time the [MuJoCo](https://arxiv.org/html/2605.24152#A1.SS1.43.43.43) physics), around each denoising step for Diffuser, and around each [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) and [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) forward for the Inverter.
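The sync-wrapped timing described above can be sketched as follows. This is a minimal illustration, not the benchmark code used for the paper: the helper names `timed_call` and `benchmark` and the `sync` parameter are assumptions. On a CUDA device one would pass `sync=torch.cuda.synchronize`; the default no-op keeps the sketch runnable on CPU.

```python
import time


def timed_call(fn, *args, sync=lambda: None):
    """Run fn(*args) and return (result, elapsed_ms).

    `sync` is called before starting and before stopping the clock, so that
    asynchronously queued device work is included in the measurement.
    """
    sync()
    t0 = time.perf_counter()
    out = fn(*args)
    sync()
    return out, (time.perf_counter() - t0) * 1e3


def benchmark(fn, *args, warmup=10, repeats=100, sync=lambda: None):
    """Mean wall time per call in ms, after discarding warm-up calls."""
    for _ in range(warmup):  # warm-up passes are run but never timed
        fn(*args)
    sync()
    times = [timed_call(fn, *args, sync=sync)[1] for _ in range(repeats)]
    return sum(times) / len(times)


# Hypothetical stand-in for a network forward pass.
result, elapsed_ms = timed_call(sum, [1, 2, 3])
mean_ms = benchmark(sum, [1, 2, 3], warmup=2, repeats=5)
```

Synchronizing on both sides matters because a GPU launch returns to Python before the kernel finishes; without the second `sync` the timer would mostly measure queueing.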

#### Per-episode accounting.

We decompose an episode’s wall time into four additive components: (i) [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) forward passes, (ii) *other* algorithmic overhead – Python control loop, tensor prep, `.cpu().numpy()` transfers, replan dispatch (the Inverter) or the [PD](https://arxiv.org/html/2605.24152#A1.SS1.51.51.51) tracker (Diffuser), (iii) `env.step`, and (iv) a residual of a few milliseconds for episode-level bookkeeping. Table [5](https://arxiv.org/html/2605.24152#A1.T5)’s *sum* column reports (i) + (ii), i.e. the algorithm’s inference compute time budget. Environment stepping and framework glue are excluded on purpose: we want a quantity that is invariant to the simulator and portable to real-robot deployment, where `env.step` is replaced by physical dynamics.
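As a worked example of this accounting, the sketch below recomputes the *sum* and × slower columns from the pass counts, per-pass times, and *other* overheads of four Table 5 rows. The function name `episode_sum_ms` and the dictionary layout are illustrative assumptions. Because the published per-pass times are rounded to two decimals, the recomputed values match the table only approximately.

```python
def episode_sum_ms(n_passes, ms_per_pass, other_ms=0.0):
    """Components (i) + (ii): NN forward time plus algorithmic overhead, in ms."""
    return n_passes * ms_per_pass + other_ms


# (#NN-passes, ms/NN-pass, other ms) copied from Table 5; step-wise actors have other = 0.
rows = {
    "BC": (300, 1.70, 0.0),
    "Diffuser": (64, 6.64, 3.0),
    "Inverter K=16": (19, 3.36, 9.3),
    "Inverter K=128": (3, 3.19, 1.9),
}

sums = {name: episode_sum_ms(*row) for name, row in rows.items()}
fastest = min(sums.values())
slowdown = {name: total / fastest for name, total in sums.items()}

for name in rows:
    print(f"{name:>15}: sum = {sums[name]:7.1f} ms, {slowdown[name]:5.1f}x slower")
```

The fastest configuration defines the 1.0× reference, so every other row's factor is simply its *sum* divided by that minimum.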

#### Evaluation protocol.

All methods are evaluated on exactly 100 episodes of `maze2d-umaze-v1` under the [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13)-official protocol – env-default fixed target, random initial states from `env.reset()`, fixed evaluation seed so the 100 starts are reproduced across runs – with a hard cap of 300 environment steps per episode. Inverter numbers average over 4 independently-trained seed [IMs](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) (and their matching [FoMs](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)).

#### Transformer vs. [MLP](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35) compute time.

A natural question is how a 1.5M-parameter, 4-layer / 192-d-model transformer (the Inverter) ends up at essentially the same ∼1.5 ms-per-forward floor as a ∼200k-parameter 3-layer / 256-hidden [MLP](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35) ([BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3), [IQL](https://arxiv.org/html/2605.24152#A1.SS1.24.24.24), [SAC-N](https://arxiv.org/html/2605.24152#A1.SS1.62.62.62), [TD3+BC](https://arxiv.org/html/2605.24152#A1.SS1.70.70.70), …). The answer is that, at batch 1 GPU inference, neither network is compute-bound: both run a small number of dense kernels whose wall time is dominated by CUDA kernel-launch overhead (∼10 μs per launch) rather than by floating-point work. A 4-layer transformer incurs roughly 4–8× more launches than a 3-layer [MLP](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35), but on a modern GPU those extra launches still fit inside the same ∼1–2 ms fixed floor. Concretely, Table [5](https://arxiv.org/html/2605.24152#A1.T5) shows the [MLP](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35) baselines clustered between 0.21 and 3.57 ms/pass ([BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3) at 1.70, [IQL](https://arxiv.org/html/2605.24152#A1.SS1.24.24.24) at 1.69, [SAC-N](https://arxiv.org/html/2605.24152#A1.SS1.62.62.62) at 1.80, [TD3+BC](https://arxiv.org/html/2605.24152#A1.SS1.70.70.70) at 1.67), and the Inverter transformer at 1.51–1.68 ms/pass – inside that same band. The per-pass parity is also what makes the Inverter’s sum-per-episode advantage so large: because the transformer is *not* paying an extra order of magnitude per forward, the benefit of emitting a full 128-step action chunk per forward (instead of one action per env step) translates directly into a 30–100× reduction in total forwards per episode, and a matching reduction in sum wall time.

### A.4 Detailed performance tables (`maze2d-medium/large` and `antmaze`)

This appendix collects the per-method performance tables for `maze2d-medium-v1` (Table [7](https://arxiv.org/html/2605.24152#A1.T7)) and `maze2d-large-v1` (Table [8](https://arxiv.org/html/2605.24152#A1.T8)) referenced in Sec. [4.2](https://arxiv.org/html/2605.24152#S4.SS2), and for the six `antmaze-v2` variants (Table [9](https://arxiv.org/html/2605.24152#A1.T9)) referenced in Sec. [4.3](https://arxiv.org/html/2605.24152#S4.SS3). Table [6](https://arxiv.org/html/2605.24152#A1.T6) below summarizes the per-task improvement of the Inverter over the strongest reported baseline on each variant, both in absolute [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13)-score points and as a percentage of the baseline, with the row-wise mean across all 9 tasks.
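The two summary columns of Table 6 can be recomputed from its baseline and Inverter scores. The short sketch below does so under the table's own definitions (Δ pts = Inverter − baseline; Δ % = Δ / baseline, equally weighted mean over the 9 tasks). The variable and function names are illustrative, and the inputs are the rounded scores printed in the table.

```python
# (baseline score, Inverter score) per task, copied from Table 6.
table6 = {
    "maze2d-umaze-v1": (130.6, 161.6),
    "maze2d-medium-v1": (130.1, 166.8),
    "maze2d-large-v1": (209.1, 220.7),
    "antmaze-umaze-v2": (97.8, 99.5),
    "antmaze-umaze-diverse-v2": (83.5, 99.8),
    "antmaze-medium-play-v2": (89.5, 87.8),
    "antmaze-medium-diverse-v2": (83.5, 96.5),
    "antmaze-large-play-v2": (52.2, 93.0),
    "antmaze-large-diverse-v2": (64.0, 94.0),
}


def improvement(baseline, inverter):
    """Return (delta in D4RL points, delta as a percentage of the baseline)."""
    delta = inverter - baseline
    return delta, 100.0 * delta / baseline


deltas = {task: improvement(b, s) for task, (b, s) in table6.items()}
mean_pts = sum(d for d, _ in deltas.values()) / len(deltas)
mean_pct = sum(p for _, p in deltas.values()) / len(deltas)
print(f"mean delta: {mean_pts:+.1f} pts, {mean_pct:+.1f} %")
```

Note that the percentage mean averages the per-task percentages rather than dividing the mean point gain by a mean baseline, which is why the low-baseline `antmaze-large` tasks weigh heavily in it.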

Supplementary Table 6: Per-task [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) improvement of the Inverter over the strongest reported baseline, summarized across all 9 `maze2d` / `antmaze-v2` variants. Baselines per variant: [SAC-N](https://arxiv.org/html/2605.24152#A1.SS1.62.62.62) / Diffuser / [AWAC](https://arxiv.org/html/2605.24152#A1.SS1.1.1.1) are the per-maze winners on `maze2d` (Table [3](https://arxiv.org/html/2605.24152#S4.T3)); [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) is used as the comparator on every `antmaze-v2` variant (Table [4](https://arxiv.org/html/2605.24152#S4.T4)). Δ pts = Inverter − baseline; Δ % = Δ / baseline. ∗ On `antmaze-medium-play` the Inverter is nominally 1.7 points behind [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) (87.8 ± 8.0 vs. 89.5 ± 3.4), well within the joint errorbar. Final row: mean across all 9 tasks ([D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) points and %, equally weighted).

| Task | Baseline | Baseline score | Inverter | Δ pts | Δ % |
| --- | --- | --- | --- | --- | --- |
| `maze2d-umaze-v1` | [SAC-N](https://arxiv.org/html/2605.24152#A1.SS1.62.62.62) | 130.6 | **161.6** | +31.0 | +23.7 |
| `maze2d-medium-v1` | Diffuser | 130.1 | **166.8** | +36.7 | +28.2 |
| `maze2d-large-v1` | [AWAC](https://arxiv.org/html/2605.24152#A1.SS1.1.1.1) | 209.1 | **220.7** | +11.6 | +5.5 |
| `antmaze-umaze-v2` | [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) | 97.8 | **99.5** | +1.7 | +1.7 |
| `antmaze-umaze-diverse-v2` | [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) | 83.5 | **99.8** | +16.3 | +19.5 |
| `antmaze-medium-play-v2` ∗ | [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) | **89.5** | 87.8 | −1.7 | −1.9 |
| `antmaze-medium-diverse-v2` | [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) | 83.5 | **96.5** | +13.0 | +15.6 |
| `antmaze-large-play-v2` | [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) | 52.2 | **93.0** | +40.8 | +78.2 |
| `antmaze-large-diverse-v2` | [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) | 64.0 | **94.0** | +30.0 | +46.9 |
| Mean (n=9) | | | | **+19.9** | **+24.2** |

#### Protocol for `maze2d-medium/large`.

Same [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13)-official protocol as Table [5](https://arxiv.org/html/2605.24152#A1.T5): 100 episodes, env-default fixed target, random initial states from `env.reset()` with a fixed evaluation seed, 600-step cap on medium, 800-step cap on large. [CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) numbers are taken from the Tarasov et al. \[[93](https://arxiv.org/html/2605.24152#bib.bib57)\] “Last Scores” benchmark (we did not train our own medium/large [CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) checkpoints); per-pass times come from our local umaze benchmark (same actor architectures) and are scaled by the target maze’s step count for the *sum* column. Diffuser timings on medium/large are estimated from our measured umaze per-denoise-step timing scaled by horizon, since we do not yet have a pretrained Diffuser checkpoint for those mazes locally (flagged ∗); the Diffuser [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) numbers are from Janner et al. \[[47](https://arxiv.org/html/2605.24152#bib.bib44)\] Table 2. The *other* column (per-episode non-[NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) overhead) is ∼27 ms on medium and ∼36 ms on large because the algorithmic Path Inverter runs a 4-connected [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) on a ∼100-cell grid once per replan event and occasionally performs a full re-plan when the stuck-check triggers (App. [A.6](https://arxiv.org/html/2605.24152#A1.SS6)); even so, this is an order of magnitude below the [MLP](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35) cost of every step-wise baseline.

Supplementary Table 7: Per-episode inference compute on `maze2d-medium-v1` (100 episodes, 600 steps/ep). Each cell reports the number of [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) forward passes per episode, mean wall time per pass (CUDA-synced, GPU, PyTorch, batch 1), *other* (per-episode non-[NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) overhead inside the algorithm; Inverter: replan dispatch + the data-derived [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) Path Inverter; Diffuser: the [PD](https://arxiv.org/html/2605.24152#A1.SS1.51.51.51) tracker running at every env step; step-wise [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) baselines: 0), the *sum* = #[NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46)-passes × ms/[NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46)-pass + other, and × slower relative to our fastest configuration on this maze. † JAX+JIT; all other PyTorch. [CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) baselines have no trained `medium` checkpoint in our local benchmark; per-pass time is measured on `umaze` (same actor architectures) and the *sum* column scales it by 600 environment steps. The Diffuser row is measured locally on a checkpoint we trained on `maze2d-medium-v1` for 2M steps matching the configuration of Janner et al. \[[47](https://arxiv.org/html/2605.24152#bib.bib44)\]; the reported [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) score is the locally measured one. Inverter row is mean ± std over 4 seed [IMs](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23); *other* here is the per-episode cost of the data-derived cardinal-[BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) Path Inverter (Appendix [A.6](https://arxiv.org/html/2605.24152#A1.SS6)).

Supplementary Table 8: Per-episode inference compute on `maze2d-large-v1` (100 episodes, 800 steps/ep). Each cell reports the number of [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) forward passes per episode, mean wall time per pass (CUDA-synced, GPU, PyTorch, batch 1), *other* (per-episode non-[NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) overhead inside the algorithm; Inverter: replan dispatch + the data-derived [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) Path Inverter; Diffuser: the [PD](https://arxiv.org/html/2605.24152#A1.SS1.51.51.51) tracker running at every env step; step-wise [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) baselines: 0), the *sum* = #[NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46)-passes × ms/[NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46)-pass + other, and × slower relative to our fastest configuration on this maze. † JAX+JIT; all other PyTorch. [CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) baselines have no trained `large` checkpoint in our local benchmark; per-pass time is measured on `umaze` (same actor architectures) and the *sum* column scales it by 800 environment steps. The Diffuser row is measured locally on a checkpoint we trained on `maze2d-large-v1` for 2M steps matching the configuration of Janner et al. \[[47](https://arxiv.org/html/2605.24152#bib.bib44)\]; the reported [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) score is the locally measured one. Inverter row is mean ± std over 4 seed [IMs](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23); *other* here is the per-episode cost of the data-derived cardinal-[BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) Path Inverter (Appendix [A.6](https://arxiv.org/html/2605.24152#A1.SS6)).

Supplementary Table 9: Antmaze: [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) score and per-step inference compute. Top: [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) success-rate score (%) over 100 episodes on each `antmaze-v2` variant ([CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) baselines \[[93](https://arxiv.org/html/2605.24152#bib.bib57)\]: mean ± std over 4 training seeds; Inverter: mean ± std over 4 [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) seeds; best-per-column in bold). Bottom: × slower per env step against the fastest method on each maze, measured on a single A40 GPU (PyTorch, batch 1, CUDA-synced). For step-wise actors ms/step = ms/pass (the [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) runs once per env step) and is env-independent; ms/pass values are reported in Table [5](https://arxiv.org/html/2605.24152#A1.T5). For the Inverter, ms/step = (#[NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46)-pass × ms/pass + [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5)-Path Inverter overhead) / mean steps-to-goal, averaged across the 4 [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) seeds. † [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58): JAX+JIT timing. Best-per-column (**1.0×**) bolded.
![Refer to caption](https://arxiv.org/html/2605.24152v1/x8.png)

Figure 8: Per-episode trajectory overlays for the Inverter on every maze variant we evaluate. 3×3 grid: rows = (`antmaze` play / `antmaze` diverse / `maze2d`); columns = small / medium / large maze. Each panel overlays 100 evaluation trajectories from a single representative [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) seed (number-of-successes printed in the panel title). Antmaze evaluations (top two rows) use a fixed corner goal, giving tight beam-like overlays; `maze2d` evaluations (bottom row) randomize start/goal pairs per episode, giving fan-of-paths through the corridor graph.

### A.5 Minimum-time control under viscous damping

To interpret the smoother [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) trajectories in Figure [2](https://arxiv.org/html/2605.24152#S4.F2), it is useful to separate the geometry of the *state trajectory* from the structure of the *control signal*. Consider the idealized one-dimensional point-mass model

$$\dot{x}=v,\qquad\dot{v}=-\beta v+u,\qquad|u|\leq u_{\max},\qquad\beta>0,\tag{4}$$

where $\beta$ is a viscous damping coefficient. This captures bounded actuation and linear damping while abstracting away maze walls, goal radii, and replanning. We consider the minimum-time transfer from an initial state $(x_{0},v_{0})$ to the terminal state $(0,0)$.

The minimum-time transfer to the origin for this damped double integrator is a standard textbook result in optimal control \[[6](https://arxiv.org/html/2605.24152#bib.bib1), [56](https://arxiv.org/html/2605.24152#bib.bib2)\]. Pontryagin’s maximum principle dictates that the time-optimal input is strictly bang-bang ($u\in\{+u_{\max},-u_{\max}\}$) with at most one switch.

Furthermore, the exact switching curve is given in closed form by:

$$x=-\operatorname{sgn}(v)\left[\frac{|v|}{\beta}-\frac{u_{\max}}{\beta^{2}}\log\!\left(1+\frac{\beta|v|}{u_{\max}}\right)\right].\tag{5}$$

The optimal policy accelerates maximally toward the goal until it hits this curve, then brakes maximally to arrive at the target with zero velocity. In the limit $\beta\to 0$, Eq. ([5](https://arxiv.org/html/2605.24152#A1.E5)) reduces to the familiar undamped switching parabola $x=-\frac{1}{2u_{\max}}|v|v$. Thus, viscous damping changes the shape of the switching condition from a parabola to a logarithmic curve, but it does not alter the fundamental bang-bang nature of the minimum-time input.

This result should be interpreted as a *local straight-segment model*, not as an exact theorem for the full `maze2d` benchmark. The benchmark only requires entry into a finite goal region, not arrival with zero terminal velocity, and the full maze introduces walls and path constraints. In that setting the exact optimal law can differ, and the braking phase may disappear if first arrival to the goal set is all that matters.
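The closed-form switching curve of Eq. (5) can be checked numerically. The sketch below is an illustrative verification, not code from the paper: it evaluates the curve, integrates full braking ($u=-u_{\max}$, for $v_0>0$) from a point on it with a plain explicit Euler scheme, and confirms that the state arrives near the origin. The function names, the step size `dt`, and the test values of $\beta$ and $u_{\max}$ are all assumptions chosen for illustration.

```python
import math


def switching_curve(v, beta, u_max):
    """Position x on the switching curve of Eq. (5) for velocity v."""
    if v == 0.0:
        return 0.0
    a = abs(v)
    # log1p keeps the small-beta limit numerically stable.
    bracket = a / beta - (u_max / beta**2) * math.log1p(beta * a / u_max)
    return -math.copysign(bracket, v)


def brake_from_curve(v0, beta, u_max, dt=1e-4):
    """Start on the curve at v0 > 0, apply u = -u_max until v <= 0, return final x."""
    x, v = switching_curve(v0, beta, u_max), v0
    while v > 0.0:
        x += v * dt                       # x' = v
        v += (-beta * v - u_max) * dt     # v' = -beta * v + u with u = -u_max
    return x


x_final = brake_from_curve(v0=2.0, beta=0.5, u_max=1.0)
x_small_beta = switching_curve(2.0, beta=1e-4, u_max=1.0)
x_parabola = -abs(2.0) * 2.0 / (2 * 1.0)  # undamped limit -|v| v / (2 u_max)
print(x_final, x_small_beta, x_parabola)
```

With a small `beta` the curve value approaches the undamped parabola, which mirrors the $\beta\to 0$ limit stated after Eq. (5).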

### A.6 Simple algorithmic Path Inverter (Level 2): data-driven [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) over offline-data density

![Refer to caption](https://arxiv.org/html/2605.24152v1/x9.png)

Figure 9: The simple algorithmic Path Inverter uses only the offline training-data distribution, no maze geometry. Background: log-density of the training-data states (shared colorbars on the right; top row in qpos coordinates, bottom row in world coordinates). No maze walls or simulator information is drawn or provided to the Path Inverter – the corridors are visible purely because the offline data concentrates there. A cardinal-[BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) waypoint chain produced by our data-derived Path Inverter is overlaid in orange (circle markers = intermediate sub-goals); start and goal are shown in blue and orange. Top row: `maze2d-medium-v1` (1,988,111 training points) and `maze2d-large-v1` (3,983,273 points), with the maze2d planner settings (`resolution` = 0.5 m, τ = 2000 visits, `min_leg` = 1 m). Bottom row: `antmaze-medium-play-v2` and `antmaze-large-play-v2` (1,000,000 points each), with the antmaze planner settings (`resolution` = 1.0 m, τ = 100, `min_leg` = 0); same code path otherwise. In all four panels the chain turns through exactly those corners where the data concentration itself turns, demonstrating that deployment-time routing can be driven by training-data support alone.

For the larger `maze2d` and `antmaze` layouts, a single Inverter chunk no longer reaches the goal: corridors contain multiple 90° turns and the longest shortest path through free space exceeds the [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23)’s training horizon. We therefore couple the (unchanged) Inverter with a trivial algorithmic Path Inverter at Level 2 that emits intermediate sub-goals along a feasible corridor. The key constraint: the planner must be buildable from the *offline training data alone* – no access to maze geometry, simulator, or the underlying occupancy grid. A single class (`common.cardinal_bfs_planner.CardinalBFSPlanner`) drives both task families through different hyperparameters (Fig. [9](https://arxiv.org/html/2605.24152#A1.F9)).

#### Construction.

Five stages: (i) *Data-derived occupancy grid*: discretize qpos into a 2-D grid with *density-aligned origin* (sub-cell phase set to the 1-D marginals’ dominant peak, aligning corridor centerlines with cell centers; [SNR](https://arxiv.org/html/2605.24152#A1.SS1.65.65.65) jumps ∼10×); call a cell free if it received ≥ τ training points (true corridors get $10^{4}$–$10^{5}$ visits, wall-adjacent strays $10^{1}$–$10^{3}$, so τ = 2000 cleanly separates them); cell “centers” are the density-weighted qpos mean. (ii) *4-connected [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5)* from start to goal cell, tie-breaking equally-short paths by maximum cumulative distance-to-wall via a single backward [Dynamic Programming](https://arxiv.org/html/2605.24152#A1.SS1.15.15.15) ([DP](https://arxiv.org/html/2605.24152#A1.SS1.15.15.15)) pass over the [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) layers (tie-break weight 0 recovers plain [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5)). (iii) *Turn-based polyline*: keep only corner cells (where the [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) direction changes) plus start and goal, collapsing straight runs to their endpoints. (iv) *Perpendicular snap*: between consecutive corner cells the [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) moves on a single axis; snap segment endpoints to their shared perpendicular coordinate so segments become exactly axis-aligned. (v) *L-corner insertion and final-approach axis lock*: split any residual diagonal into two perpendicular legs (longer-axis-first, so the corner stays in free cells); project the goal onto the last axis-aligned segment if the final step would otherwise be diagonal. Optional `min_leg` filter (default 1 m on `maze2d`) drops wobble waypoints whose incoming and outgoing legs are both shorter than the threshold.
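Stages (i) to (iii) above can be sketched in a few lines. This is a deliberately simplified illustration and not the `CardinalBFSPlanner` implementation: it omits the density-aligned origin, the distance-to-wall tie-break, the perpendicular snap, L-corner insertion, and the `min_leg` filter. All function names and the toy data are assumptions.

```python
from collections import Counter, deque


def occupancy_cells(points, resolution, tau):
    """Stage (i), simplified: cells that received at least `tau` data points."""
    counts = Counter((int(x // resolution), int(y // resolution)) for x, y in points)
    return {cell for cell, n in counts.items() if n >= tau}


def bfs_path(free, start, goal):
    """Stage (ii), plain variant: shortest 4-connected path through free cells."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        cx, cy = cell
        for nxt in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
            if nxt in free and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None  # goal not reachable through data-supported cells


def corner_waypoints(path):
    """Stage (iii): keep start, goal, and cells where the step direction changes."""
    if len(path) <= 2:
        return list(path)
    keep = [path[0]]
    for prev, cur, nxt in zip(path, path[1:], path[2:]):
        if (cur[0] - prev[0], cur[1] - prev[1]) != (nxt[0] - cur[0], nxt[1] - cur[1]):
            keep.append(cur)
    keep.append(path[-1])
    return keep


# Toy "offline data": an L-shaped corridor with 3 points per cell plus one stray point.
corridor = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
points = [(i + 0.5, j + 0.5) for (i, j) in corridor for _ in range(3)] + [(0.5, 1.5)]
free = occupancy_cells(points, resolution=1.0, tau=2)
path = bfs_path(free, (0, 0), (2, 2))
waypoints = corner_waypoints(path)
```

The stray point falls below the visit threshold and is excluded from the grid, which is the same mechanism the text describes for separating true corridors from wall-adjacent strays.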

#### Per-chunk control loop (`maze2d`).

Polyline emitted once at episode start. Every K = 16 `env.step`s the controller (a) advances `wp_idx` while the agent lies within `wp_advance_dist` = 0.5 m of the current waypoint, (b) sets the [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) target to the current sub-goal (never the final goal directly), (c) runs one [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) forward and executes the first K of its 128-step plan. Terminal regulation: `wp_idx` clamps at the last waypoint and the loop keeps re-planning, parking the agent at the goal until `MAX_STEPS` – no separate [PD](https://arxiv.org/html/2605.24152#A1.SS1.51.51.51) tracker. If `wp_idx` fails to advance for ν consecutive chunks, [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) reruns from the current state and replaces the remaining polyline (a few hundred microseconds). On `maze2d` with ν = 3 the fallback fires 5–13 times per episode on average; disabling it drops `large` [Success Rate](https://arxiv.org/html/2605.24152#A1.SS1.66.66.66) ([SR](https://arxiv.org/html/2605.24152#A1.SS1.66.66.66)) from 100/100 to 93/100 on the same seeds.
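The waypoint bookkeeping in steps (a) and the stuck-check fallback can be sketched as below. This is an assumed reading of the description, not the authors' controller: `advance_waypoint` and `StuckCheck` are hypothetical names, the IM forward pass and the BFS rerun are left out, and the exact counting convention for "ν consecutive chunks" is a guess.

```python
import math


def advance_waypoint(pos, waypoints, wp_idx, advance_dist=0.5):
    """Step (a): skip past waypoints the agent has reached; clamp at the last one."""
    last = len(waypoints) - 1
    while wp_idx < last and math.dist(pos, waypoints[wp_idx]) <= advance_dist:
        wp_idx += 1
    return wp_idx


class StuckCheck:
    """Signals a replan when wp_idx has not advanced for `nu` consecutive chunks."""

    def __init__(self, nu=3):
        self.nu = nu
        self.prev_idx = None
        self.stalled = 0

    def update(self, wp_idx):
        self.stalled = self.stalled + 1 if wp_idx == self.prev_idx else 0
        self.prev_idx = wp_idx
        if self.stalled >= self.nu:
            self.stalled = 0  # reset after requesting a replan
            return True
        return False


waypoints = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
idx_a = advance_waypoint((0.1, 0.0), waypoints, 0)   # near waypoint 0, so target waypoint 1
idx_b = advance_waypoint((1.0, 0.9), waypoints, 1)   # still far from waypoint 1
idx_c = advance_waypoint((1.0, 0.2), waypoints, 1)   # reached waypoint 1, clamp at the last
check = StuckCheck(nu=3)
signals = [check.update(0) for _ in range(4)]
```

In a full loop, a `True` from `StuckCheck.update` would trigger a fresh BFS from the current state, and the sub-goal `waypoints[wp_idx]` would be passed to the inverse model once per chunk.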

#### Antmaze differences.

Pipeline is byte-identical; only constructor arguments change to address antmaze’s coarser data and the ant’s wider body / slower per-step displacement: `resolution` = 1.0 m, τ = 100, `min_leg` = 0, centerline tie-break enabled (`data_center_weight` = 1.0), `wp_advance_dist` = 1.5 m, `stuck_threshold` = 10 with ≤ 5 replans/episode, `wp_skip_dist` = 0.8 m (look-ahead skip when the wide ant body has cleared a corner), and `target_dist_min/max` = 1.0/2.0 m (carrot-on-a-stick along the heading, keeping the [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) input in its calibrated reach). Terminal regulation: `goal_reached_dist` = 0.5 m breaks the outer loop.

#### What this Path Inverter is *not*.

Not a learned module: plain [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) on a data-derived grid, no parameters, no gradients. Not an [MPC](https://arxiv.org/html/2605.24152#A1.SS1.39.39.39): no [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) call at deployment, no inner optimization loop. Its only job is to cut long corridors into chunks that fit inside the Inverter’s training horizon.

### A.7 Replan-horizon (K) sweep on `maze2d-medium-v1` and `maze2d-large-v1`

The summary rows of Tables [7](https://arxiv.org/html/2605.24152#A1.T7) and [8](https://arxiv.org/html/2605.24152#A1.T8) report a single operating point (K = 16). Table [10](https://arxiv.org/html/2605.24152#A1.T10) shows how [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) and wall-clock cost trade off as K varies across {16, 32, 64, 128, 256}. K ∈ {16, 32, 64, 128} all use the same paper-summary [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) (training horizon h = 128); the K = 256 row uses a separately-trained h = 256 [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) so it genuinely measures a 256-step open-loop commitment rather than collapsing to K = 128. Three features stand out: (i) [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) and success rate are high and essentially flat for K ≤ 64, drop sharply at K = 128 where the agent commits to a full 128-step plan with no in-horizon replanning, and drop further still at K = 256 as errors compound across twice as many open-loop steps; (ii) the gap between K = 128 and K = 256 is informative: a longer [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) training horizon is *not* a free lunch when the cost is foregoing replan opportunities – in-horizon course-correction is what matters; (iii) the wall-time sum drops monotonically with K (fewer [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) calls amortize the per-chunk overhead) but the difference between K = 16 and K = 64 is under a factor of four, so accuracy – not compute – is the binding constraint on this family of mazes. We pick K = 16 for the summary rows because it matches the Inverter’s per-pass floor at the highest achievable accuracy.
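The pass counts in the K-sweep follow a simple pattern. The sketch below assumes one IM forward every K environment steps over the full step cap, i.e. `ceil(steps / K)`; this formula is an inference from the reported #pass values rather than something stated in the text, and the function name is illustrative.

```python
import math


def passes_per_episode(max_steps, k):
    """Assumed pass count: one forward per K-step chunk, partial last chunk included."""
    return math.ceil(max_steps / k)


ks = (16, 32, 64, 128, 256)
medium = [passes_per_episode(600, k) for k in ks]  # maze2d-medium-v1, 600 steps/ep
large = [passes_per_episode(800, k) for k in ks]   # maze2d-large-v1, 800 steps/ep
print(medium, large)
```

Under this assumption the counts reproduce the #pass columns of Table 10, and the same rule with a 300-step cap reproduces the Inverter rows of Table 5, which is consistent with the terminal regulation that keeps re-planning until the step cap.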

Supplementary Table 10: Inverter K-sweep on `maze2d-medium-v1` and `maze2d-large-v1` (100 episodes per seed, [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13)-official protocol; [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) and [SR](https://arxiv.org/html/2605.24152#A1.SS1.66.66.66) reported as mean ± std over 4 [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) seeds). Each row is a single replan horizon K: the Inverter emits a fresh plan every K env steps toward the current data-derived [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) sub-goal. K ∈ {16, 32, 64, 128} uses the same paper-headline [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) (training horizon h = 128); the K = 256 row uses a separately-trained h = 256 [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) so the row genuinely measures a 256-step open-loop commitment instead of collapsing to K = 128. *#pass* is the number of [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) forward passes per episode; *ms/pass* is the CUDA-synced mean wall time per pass (batch 1); *other* is per-episode overhead outside the [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) forward (the data-[BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) Path Inverter + replan dispatch + CPU transfers); *sum* = #pass × ms/pass + other. [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) score and [SR](https://arxiv.org/html/2605.24152#A1.SS1.66.66.66) are high and flat up to K = 64, drop sharply at K = 128 where the agent commits to a full 128-step plan with no in-horizon replanning, and drop further still at K = 256 as errors compound across twice as many open-loop steps. Runtime drops monotonically as K grows because fewer [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) calls pay the per-chunk overhead, but the K ≥ 64 runs are all within one order of magnitude of each other in sum wall time, i.e., accuracy is the binding constraint, not speed.

`maze2d-medium-v1` (600 steps/ep):

| K | D4RL ↑ | SR | #pass | ms/pass | other | sum |
| --- | --- | --- | --- | --- | --- | --- |
| 16 | **166.8 ± 1.2** | **100 ± 0** | 38 | 1.62 ms | 11.2 ms | 72.9 ms |
| 32 | 157.9 ± 0.8 | **100 ± 0** | 19 | 1.57 ms | 5.5 ms | 35.3 ms |
| 64 | 141.1 ± 1.0 | **100 ± 0** | 10 | 2.21 ms | 3.8 ms | 25.9 ms |
| 128 | 76.0 ± 1.8 | 84 ± 3 | 5 | 2.79 ms | 2.6 ms | 16.6 ms |
| 256 | 41.7 ± 4.0 | 32 ± 10 | 3 | 2.42 ms | 1.5 ms | **8.7 ms** |

`maze2d-large-v1` (800 steps/ep):

| K | D4RL ↑ | SR | #pass | ms/pass | other | sum |
| --- | --- | --- | --- | --- | --- | --- |
| 16 | **220.7 ± 0.2** | **100 ± 0** | 50 | 1.59 ms | 14.3 ms | 93.7 ms |
| 32 | 202.3 ± 2.8 | **100 ± 0** | 25 | 1.88 ms | 8.7 ms | 55.6 ms |
| 64 | 191.8 ± 3.1 | **100 ± 0** | 13 | 2.33 ms | 5.4 ms | 35.7 ms |
| 128 | 95.6 ± 5.2 | 86 ± 2 | 7 | 2.34 ms | 2.9 ms | 19.3 ms |
| 256 | 27.5 ± 7.1 | 27 ± 7 | 4 | 2.27 ms | 1.7 ms | **10.7 ms** |

### A.8 Comparison to model-based offline [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) baselines

The performance tables above (Tables [5](https://arxiv.org/html/2605.24152#A1.T5)–[9](https://arxiv.org/html/2605.24152#A1.T9)) compare against the [CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) \[[93](https://arxiv.org/html/2605.24152#bib.bib57)\] offline-[RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) benchmark and Diffuser \[[47](https://arxiv.org/html/2605.24152#bib.bib44)\]. This subsection compares to [Model-Based Reinforcement Learning](https://arxiv.org/html/2605.24152#A1.SS1.33.33.33) ([MBRL](https://arxiv.org/html/2605.24152#A1.SS1.33.33.33)) methods on these specific [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) tasks. Citations in this subsection are listed in App. [Supplementary references (model-based comparison)](https://arxiv.org/html/2605.24152#Ax1) (separate reference list).

#### Coverage gap on `maze2d`.

To our knowledge, none of the canonical[MBRL](https://arxiv.org/html/2605.24152#A1.SS1.33.33.33)papers \([MOPO](https://arxiv.org/html/2605.24152#A1.SS1.37.37.37)\[S1\],[MOReL](https://arxiv.org/html/2605.24152#A1.SS1.38.38.38)\[S2\],[COMBO](https://arxiv.org/html/2605.24152#A1.SS1.8.8.8)\[S3\],[RAMBO](https://arxiv.org/html/2605.24152#A1.SS1.57.57.57)\[S4\],[MOBILE](https://arxiv.org/html/2605.24152#A1.SS1.36.36.36)\[S5\],[CBOP](https://arxiv.org/html/2605.24152#A1.SS1.6.6.6)\[S6\],[MAPLE](https://arxiv.org/html/2605.24152#A1.SS1.32.32.32)\[S7\], ARMOR \[S8\],[TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71)\[S9\],[TAP](https://arxiv.org/html/2605.24152#A1.SS1.67.67.67)\[S10\],[LEQ](https://arxiv.org/html/2605.24152#A1.SS1.29.29.29)\[S12\]\) report results on the single\-taskmaze2d \-umaze /medium /large \-v1benchmarks in their published evaluation tables; their evaluations focus on[D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13)[MuJoCo](https://arxiv.org/html/2605.24152#A1.SS1.43.43.43)locomotion and \(less often\) Adroit / NeoRL\. The only widely\-cited “model\-based planner” with publishedmaze2d \-v1numbers is Diffuser, which is already in our main tables \(113\.9 / 121\.5 / 123\.0[D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13)on umaze / medium / large from \[S11\] Table 1, vs\. our Inverter at 164\.25 / 166\.82 / 220\.66[D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13)\)\. 
We did not re\-run[MOPO](https://arxiv.org/html/2605.24152#A1.SS1.37.37.37)/[MOReL](https://arxiv.org/html/2605.24152#A1.SS1.38.38.38)/[COMBO](https://arxiv.org/html/2605.24152#A1.SS1.8.8.8)/[RAMBO](https://arxiv.org/html/2605.24152#A1.SS1.57.57.57)/[MOBILE](https://arxiv.org/html/2605.24152#A1.SS1.36.36.36)/[CBOP](https://arxiv.org/html/2605.24152#A1.SS1.6.6.6)/[TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71)/[TAP](https://arxiv.org/html/2605.24152#A1.SS1.67.67.67)/[LEQ](https://arxiv.org/html/2605.24152#A1.SS1.29.29.29)onmaze2d \-v1ourselves under our compute envelope \(App\.[A\.16](https://arxiv.org/html/2605.24152#A1.SS16)\); we flag this as a missing baseline rather than fabricate numbers\.

#### Coverage gap on antmaze: v0 vs\. v2\.

Every published [MBRL](https://arxiv.org/html/2605.24152#A1.SS1.33.33.33) number we could locate on antmaze is for the antmaze\-v0 datasets, not the antmaze\-v2 datasets that we and the [CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) benchmark use\. [RAMBO](https://arxiv.org/html/2605.24152#A1.SS1.57.57.57)’s authors explicitly state “we used the AntMaze\-v0 datasets” \(App\. B\.7 of \[S4\]\); [TAP](https://arxiv.org/html/2605.24152#A1.SS1.67.67.67) states “v2 version of the datasets for locomotion control and v0 for the other tasks” \(Sec\. 6 of \[S10\]\)\. The v0 vs\. v2 distinction matters – v2 corrected reward\-shaping and termination handling that affected v0 long\-horizon evaluations – so a same\-row comparison would not be apples\-to\-apples\. We therefore report the v0 numbers in Table [11](https://arxiv.org/html/2605.24152#A1.T11) with an explicit footnote, alongside our v2 Inverter row reproduced from Table [9](https://arxiv.org/html/2605.24152#A1.T9), and leave it to the reader to apply the appropriate caveat\.

Supplementary Table 11: Model\-based offline [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) baselines on antmaze \([D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) Normalized Score, %\)\. Top block: published [MBRL](https://arxiv.org/html/2605.24152#A1.SS1.33.33.33) numbers, all on antmaze\-v0 \(rows above the rule\), as reported in the cited papers\. Bottom block: our Inverter, evaluated on antmaze\-v2 \(reproduced from Table [9](https://arxiv.org/html/2605.24152#A1.T9)\)\. ∗ Antmaze\-v0 vs\. v2 mismatch: the v2 datasets we use have corrected reward and termination handling; published [MBRL](https://arxiv.org/html/2605.24152#A1.SS1.33.33.33) papers as of early 2025 evaluate on v0 only, so cross\-row comparison is approximate\. † [COMBO](https://arxiv.org/html/2605.24152#A1.SS1.8.8.8) row is the antmaze\-v0 baseline reported in [RAMBO](https://arxiv.org/html/2605.24152#A1.SS1.57.57.57) \[S4\] \(Table 1\); the original [COMBO](https://arxiv.org/html/2605.24152#A1.SS1.8.8.8) paper \[S3\] does not report antmaze\. § [TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71) result is the [TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71)\+Q variant \(Q\-guided beam search\) from \[S9\] Table 2\. ¶ [TAP](https://arxiv.org/html/2605.24152#A1.SS1.67.67.67) result is the [TAP](https://arxiv.org/html/2605.24152#A1.SS1.67.67.67)\+G variant \(goal\-conditioned\) from \[S10\] Table 6\. ∥ [LEQ](https://arxiv.org/html/2605.24152#A1.SS1.29.29.29) values are the original [LEQ](https://arxiv.org/html/2605.24152#A1.SS1.29.29.29) paper’s own results \[S12\] from Table 11, averaged over 5 seeds \(true termination function\)\. n/a entries indicate antmaze\-v0 was not reported in the source paper; we list n/a for [MOPO](https://arxiv.org/html/2605.24152#A1.SS1.37.37.37) \[S1\], [MOReL](https://arxiv.org/html/2605.24152#A1.SS1.38.38.38) \[S2\], [MOBILE](https://arxiv.org/html/2605.24152#A1.SS1.36.36.36) \[S5\], and [CBOP](https://arxiv.org/html/2605.24152#A1.SS1.6.6.6) \[S6\] rather than impute values, as none of these papers report antmaze in their published evaluation tables\. “0\.0” entries are reported as such in the source – consistent with the well\-known difficulty [MBRL](https://arxiv.org/html/2605.24152#A1.SS1.33.33.33) methods face on long\-horizon sparse\-reward tasks\.

#### Results\.

Two patterns are visible across the antmaze suite\. First, four canonical [MBRL](https://arxiv.org/html/2605.24152#A1.SS1.33.33.33) methods \([MOPO](https://arxiv.org/html/2605.24152#A1.SS1.37.37.37), [MOReL](https://arxiv.org/html/2605.24152#A1.SS1.38.38.38), [MOBILE](https://arxiv.org/html/2605.24152#A1.SS1.36.36.36), [CBOP](https://arxiv.org/html/2605.24152#A1.SS1.6.6.6)\) do not report antmaze in their original published evaluation tables, so the public [MBRL](https://arxiv.org/html/2605.24152#A1.SS1.33.33.33) coverage on this benchmark is sparse\. Second, the methods that do report antmaze\-v0 – [COMBO](https://arxiv.org/html/2605.24152#A1.SS1.8.8.8) \(via [RAMBO](https://arxiv.org/html/2605.24152#A1.SS1.57.57.57)’s baseline table\), [RAMBO](https://arxiv.org/html/2605.24152#A1.SS1.57.57.57) itself, [TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71)\+Q, [TAP](https://arxiv.org/html/2605.24152#A1.SS1.67.67.67)\+G, and [LEQ](https://arxiv.org/html/2605.24152#A1.SS1.29.29.29) \[S12\] – reach moderate\-to\-strong performance on a subset of variants, with [TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71)\+Q strong on the umaze/medium variants \(100 on u\-umaze\) and [TAP](https://arxiv.org/html/2605.24152#A1.SS1.67.67.67)\+G strong on the large variants \(74\.0/82\.0 on l\-play/l\-divrs\)\. On the umaze and medium variants the Inverter on antmaze\-v2 sits in a comparable [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) band to [TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71)\+Q on antmaze\-v0 \([TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71)\+Q nominally ahead by $\leq 12$ points on each, the v0/v2 caveat going both ways\)\.
On the two large variants the Inverter clearly leads \(93\.0/94\.0 vs\. [TAP](https://arxiv.org/html/2605.24152#A1.SS1.67.67.67)\+G’s 74\.0/82\.0, a 19\.0/12\.0 absolute\-point gap\), and beats [TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71)\+Q and [LEQ](https://arxiv.org/html/2605.24152#A1.SS1.29.29.29) on every large variant\. The $\sim$12–20 point lead on large is larger than typical v0 $\to$ v2 score shifts\. We emphasize the v0/v2 caveat once more: a clean comparison would require us to re\-run [RAMBO](https://arxiv.org/html/2605.24152#A1.SS1.57.57.57), [TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71)\+Q, [TAP](https://arxiv.org/html/2605.24152#A1.SS1.67.67.67)\+G, and [LEQ](https://arxiv.org/html/2605.24152#A1.SS1.29.29.29) on antmaze\-v2; the public codebases make this tractable, but the re\-runs lie outside our 8\-GPU / seven\-week compute envelope \(App\. [A\.16](https://arxiv.org/html/2605.24152#A1.SS16)\) for this submission\.

#### Amortized\-vs\.\-iterative\.

Beyond the summary numbers, [TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71) \[[48](https://arxiv.org/html/2605.24152#bib.bib42)\] and [TAP](https://arxiv.org/html/2605.24152#A1.SS1.67.67.67) \[[49](https://arxiv.org/html/2605.24152#bib.bib85)\] differ from the Inverter on two structural axes that organize the related\-work paragraph in Sec\. [3](https://arxiv.org/html/2605.24152#S3)\. \(i\) *Where the optimization runs\.* Both [TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71) and [TAP](https://arxiv.org/html/2605.24152#A1.SS1.67.67.67) place trajectory optimization at *inference*: [TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71) runs a beam search over per\-timestep tokens \(state and action dimensions discretized into a vocabulary, with a $Q$\-function added as a search heuristic in the [TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71)\+Q variant on antmaze\); [TAP](https://arxiv.org/html/2605.24152#A1.SS1.67.67.67) runs a beam search over a state\-conditioned VQ\-VAE’s discrete latent action codes \(length\-$L{=}3$ chunks, codebook size $K{=}512$\)\. Both therefore retain a sample\-time iteration whose cost grows with the search budget\. The Inverter folds the equivalent optimization into *training* \([FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\-gradient amortization\) and emits the full $T$\-step plan in a single feedforward pass at deployment, with no inner loop\. This is the amortized\-vs\.\-iterative axis our related work draws \(Sec\. [3](https://arxiv.org/html/2605.24152#S3)\)\.
\(ii\) *What the training loss is\.* Both [TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71) and [TAP](https://arxiv.org/html/2605.24152#A1.SS1.67.67.67) are behavior\-cloning derivatives at the loss level: [TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71) maximizes the trajectory likelihood $p_{\theta}(\tau)$ on the offline data; [TAP](https://arxiv.org/html/2605.24152#A1.SS1.67.67.67) minimizes a reconstruction MSE between offline trajectories and their VQ\-VAE\-decoded reconstructions\. Neither uses a [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) gradient\. By construction these objectives constrain the learned policy to the data manifold\. The Inverter explicitly differentiates a Bolza objective through a frozen [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19), which allows it to leave the data support when the data is sub\-optimal – this is the mechanism behind the maze2d\-umaze bang\-bang result \(Sec\. [4\.1](https://arxiv.org/html/2605.24152#S4.SS1)\), where the Inverter approaches the analytic time\-optimal control while the offline data is located in the interior of the action box\. [TT](https://arxiv.org/html/2605.24152#A1.SS1.71.71.71) and [TAP](https://arxiv.org/html/2605.24152#A1.SS1.67.67.67), by contrast, cannot exceed the demonstrator distribution they are trained to reproduce\.

### A\.9 Antmaze: data, forward and inverse models, and training objective

This appendix gives the technical details behind Sec\.[4\.3](https://arxiv.org/html/2605.24152#S4.SS3)\.

#### Datasets\.

We use the six antmaze\-v2 variants from [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) \[[93](https://arxiv.org/html/2605.24152#bib.bib57), [29](https://arxiv.org/html/2605.24152#bib.bib67)\]: umaze, umaze\-diverse, medium\-play, medium\-diverse, large\-play, large\-diverse\. Each dataset comprises $\sim$1M ant\-locomotion transitions\.

#### State and action encoding\.

Actions are the standard 8\-D joint\-torque vector $\mathbf{a} \in [-1, 1]^{8}$ \(8 actuated joints\)\. States are the 29\-dim observation supplied by the [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) env, factored \(using Python half\-open ranges\) as

- $\mathbf{s}_{0:2}$ – root $(x, y)$ position;
- $\mathbf{s}_{2:3}$ – root $z$ \(height\);
- $\mathbf{s}_{3:7}$ – root quaternion $(q_w, q_x, q_y, q_z)$;
- $\mathbf{s}_{7:15}$ – 8 hip / ankle joint angles;
- $\mathbf{s}_{15:18}$ – root linear velocity $(\dot{x}, \dot{y}, \dot{z})$;
- $\mathbf{s}_{18:21}$ – root angular velocity $(\omega_x, \omega_y, \omega_z)$;
- $\mathbf{s}_{21:29}$ – 8 joint velocities\.
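The factorization above maps directly onto Python slices\. A minimal sketch; the dictionary keys and the helper name are hypothetical labels chosen here for illustration, not identifiers from the authors' code\.

```python
# Half-open index ranges for the 29-dim antmaze observation, as listed above.
STATE_SLICES = {
    "root_xy": slice(0, 2),
    "root_z": slice(2, 3),
    "root_quat_wxyz": slice(3, 7),
    "joint_angles": slice(7, 15),
    "root_lin_vel": slice(15, 18),
    "root_ang_vel": slice(18, 21),
    "joint_vel": slice(21, 29),
}

def split_state(s):
    """Split a length-29 observation into its named components."""
    if len(s) != 29:
        raise ValueError("expected a 29-dim observation")
    return {name: s[sl] for name, sl in STATE_SLICES.items()}

parts = split_state(list(range(29)))
# The components tile the vector exactly: 2 + 1 + 4 + 8 + 3 + 3 + 8 = 29.
print({name: len(v) for name, v in parts.items()})
```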

#### Forward model\.

AntMazeFull29TransformerFM: causal transformer that maps $(s_0, a_{1:L})$ to $\hat{s}_{1:L}$ with chunk length $L = 16$, $d_{\text{model}} = 384$, 6 heads, 6 layers, ff\-mult 4, dropout 0\.1, totalling $\approx 11.1$M parameters\. Trained on the offline transitions for 300 epochs \(batch 512, AdamW with lr $= 3 \times 10^{-4}$, weight decay $10^{-4}$, $\sigma_{s_0} = 0.01$ initial\-state noise\)\. The training loss is a sum of 7 per\-component MSEs \($xy$, $z$, quaternion, joint angles, linear velocity, angular velocity, joint velocities\), each with unit weight; we use a running\-z\-score normalizer per dimension \(momentum 0\.99, 100\-epoch warm\-up\) so the per\-component losses share a comparable scale\.

#### Inverse model\.

AntMazeIWM\_L16\_Healthy29D: causal transformer mapping $(s_0, G_{xy})$ to a 16\-step action chunk $\hat{a}_{1:16}$, with $d_{\text{model}} = 192$, 6 heads, 4 layers, ff\-mult 3, dropout 0\.1, totalling $\approx 1.5$M parameters\. Trained for 200 epochs \(batch 256, AdamW with lr $= 3 \times 10^{-4}$, weight decay $10^{-4}$, 10\-epoch warm\-up\) by back\-propagation through the frozen [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\.
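As a rough plausibility check on the two parameter counts quoted above, a standard Transformer block has about $4d^2$ attention weights plus $2md^2$ feed\-forward weights \($m$ the ff\-mult\)\. The estimate below is ours, not the authors' accounting, and it ignores biases, layer norms, embeddings and output heads\.

```python
def approx_block_params(d_model, n_layers, ff_mult):
    """Weight-matrix parameter count of a stack of standard Transformer blocks."""
    attn = 4 * d_model * d_model            # Q, K, V and output projections
    ff = 2 * ff_mult * d_model * d_model    # two feed-forward matrices
    return n_layers * (attn + ff)

fm = approx_block_params(d_model=384, n_layers=6, ff_mult=4)   # forward model
im = approx_block_params(d_model=192, n_layers=4, ff_mult=3)   # inverse model
print(fm, im)   # 10616832 1474560
```

Both figures land slightly below the quoted $\approx 11.1$M and $\approx 1.5$M, which is the expected direction given the terms left out\.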

*Single\-pass non\-autoregressive decoding\.* The two\-token nominal input $(s_0, G_{xy})$ is first reduced to a single conditioning vector $c \in \mathbb{R}^{d_{\text{model}}}$ by a two\-layer [MLP](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35) \(Linear\-GELU\-LN $\times 2$\)\. An $L = 16$\-position token bank is then formed in one shot as $x_t = c + \mathrm{posemb}[t]$ for $t = 0, \ldots, L-1$, where $\mathrm{posemb} \in \mathbb{R}^{L \times d_{\text{model}}}$ is a learned position embedding; the same conditioning is thus broadcast to every output position\. These $L$ tokens are processed in parallel by a pre\-norm Transformer encoder with a triangular causal self\-attention mask \(token $t$ attends only to tokens $0..t$\), and a per\-position linear head followed by $\tanh$ emits one action vector at every position\. The whole chunk $\hat{a}_{1:16}$ thus comes out of a single feedforward pass; the “causal” label refers only to the attention pattern – there is no autoregressive sampling, no decoder cross\-attention, no iterative refinement\. The maze2d Motor Inverter \(App\. [A\.2](https://arxiv.org/html/2605.24152#A1.SS2)\) shares the same architecture pattern with $L = 128$\.
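The two structural ingredients of this decoder, the broadcast token bank and the triangular mask, can be sketched in a few lines\. This is an illustration in plain Python with hypothetical function names and toy sizes; it omits the MLP, the encoder layers and the output head\.

```python
def token_bank(c, posemb):
    """x_t = c + posemb[t]: broadcast one conditioning vector to all L positions."""
    return [[ci + pi for ci, pi in zip(c, row)] for row in posemb]

def causal_mask(L):
    """mask[t][j] is True where position t may attend to position j (j <= t)."""
    return [[j <= t for j in range(L)] for t in range(L)]

c = [1.0, 2.0]                         # toy conditioning vector (d_model = 2)
posemb = [[0.0, 0.0], [0.5, -0.5]]     # toy learned position embeddings (L = 2)
print(token_bank(c, posemb))           # [[1.0, 2.0], [1.5, 1.5]]
print(causal_mask(3))
```

Because every token is available up front, all $L$ positions can be computed in one pass; the mask only restricts which positions each token sees\.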

#### Training goal sampling\.

Each training minibatch sample is an $(s_0, G_{xy})$ pair with $s_0 = s_t$ drawn uniformly from the offline buffer and $G_{xy} = (s_{t+L})_{xy}$, the $(x, y)$ of the state at a fixed horizon offset $L = 16$ ahead in the *same* trajectory; the sampler is episode\-aware so pairs spanning an episode boundary are rejected\. The [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) internally encodes the goal as a displacement $G_{xy} - (s_0)_{xy}$ before its first projection layer, so the network never sees a global $(x, y)$ at input\. This $L$\-step\-ahead distribution is what aligns training with the deployment query distribution: the Sec\. [4\.3](https://arxiv.org/html/2605.24152#S4.SS3) Path Inverter emits sub\-goals at wp\_spacing $= 1.5$ m, which is within the typical 16\-step ant displacement in the offline data \($\sim 1.5$ m / chunk for md $\geq 0.5$ m / z $\geq 0.3$ m filtered states\)\.
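A minimal sketch of such an episode\-aware sampler, using rejection sampling\. The buffer layout \(`states`, `episode_ids`\) and both function names are hypothetical stand\-ins, and episodes are assumed to be stored contiguously\.

```python
import random

def sample_pair(states, episode_ids, L=16, rng=random):
    """Draw (s0, G_xy) with G_xy the xy of the state L steps later in the same episode."""
    n = len(states)
    if n <= L:
        raise ValueError("buffer shorter than the horizon offset")
    while True:   # assumes at least one episode is longer than L steps
        t = rng.randrange(n - L)
        if episode_ids[t] == episode_ids[t + L]:   # reject pairs crossing a boundary
            return states[t], states[t + L][:2]

def goal_displacement(s0, goal_xy):
    """Goal as encoded inside the IM: displacement from the current root xy."""
    return [goal_xy[0] - s0[0], goal_xy[1] - s0[1]]

# Toy buffer: two 20-step episodes moving one unit in x per step.
states = [[float(i), 0.0] for i in range(40)]
episode_ids = [0] * 20 + [1] * 20
s0, goal = sample_pair(states, episode_ids, rng=random.Random(0))
print(goal_displacement(s0, goal))   # [16.0, 0.0]
```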

#### Training objective\.

Let $\hat{a}_{1:L} = g_{\phi}(s_0, G_{xy})$ be the [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) output, $\hat{s}_{1:L} = f_{\theta}^{(1:L)}(s_0, \hat{a}_{1:L})$ the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) rollout, and $a_{1:L}^{\text{data}}$ the corresponding offline action chunk for the same $(s_0, G_{xy})$ pair\. We train $g_{\phi}$ to minimize

$$\mathcal{L}_{\text{IM}} = \lambda_{\text{term}}\,\|\hat{s}_{L}^{(xy)} - G_{xy}\| \;+\; \lambda_{\text{yaw}}\,\big\|(\hat{s}^{(q_w)}_{1:L},\,\hat{s}^{(q_z)}_{1:L}) - (1, 0)\big\|^{2} \;+\; \lambda_{\text{fid}}\,\big\|\hat{a}_{1:L} - a_{1:L}^{\text{data}}\big\|^{2}, \tag{6}$$

with $\lambda_{\text{term}} = \lambda_{\text{yaw}} = \lambda_{\text{fid}} = 5$ in the final runs\. *Note:* the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)’s training loss \(above\) and the [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23)’s loss in Eq\. [6](https://arxiv.org/html/2605.24152#A1.E6) are distinct losses for distinct networks; the [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23)’s terminal term is intentionally unsquared, see explanation below\. The first term is the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\-gradient task signal: minimize the predicted final\-distance to the goal\. The use of the *un\-squared* L2 norm here is intentional: $\partial\|x\|/\partial x = x/\|x\|$ has unit magnitude independent of $\|x\|$, so the goal\-reaching gradient remains effective whether the sub\-goal is 1\.5 or 6 m away; a squared $\|x\|^{2}$ would yield gradient $2x$ that vanishes as the agent approaches the sub\-goal and over\-weights distant ones\.
The second term is the *body\-yaw regularizer*: it pushes the predicted body orientation across the chunk toward $(q_w, q_z) = (1, 0)$, which corresponds to yaw $= 0$ and the upright orientation overwhelmingly represented in the offline data\. The third term is the *[BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3) action\-fidelity anchor*: it pulls each predicted action $\hat{a}_t$ toward the recorded data action $a_t^{\text{data}}$ at the same *time index* $t$ within the chunk\. Since $\hat{a}_{1:L}$ is generated open\-loop from $s_0$, the matching is purely temporal, not state\-based: the open\-loop predicted states $\hat{s}_t$ will in general drift from $s_t^{\text{data}}$, and the anchor still penalizes deviations of $\hat{a}_t$ from $a_t^{\text{data}}$ at every $t$ regardless of that drift\. Both regularizers act on bounded per\-step quantities, where squared\-norm scaling is harmless; the asymmetry between the terminal $\|\cdot\|$ and the regularizer $\|\cdot\|^{2}$ is therefore deliberate\. All three terms are per\-action, per\-time\-step, differentiable, and additive – no value function, no sample\-time guidance, no [KL](https://arxiv.org/html/2605.24152#A1.SS1.28.28.28) trust region\.
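A minimal sketch of Eq\. 6 evaluated on a single chunk, together with the gradient\-magnitude argument for the un\-squared terminal norm\. It uses plain Python lists rather than an autodiff framework; the function names are hypothetical, and the quaternion indices \($q_w$ at 3, $q_z$ at 6\) follow the state layout given earlier\. How the authors reduce the loss over a minibatch is not stated and is not modeled here\.

```python
import math

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def im_loss(s_hat, a_hat, a_data, goal_xy, lam_term=5.0, lam_yaw=5.0, lam_fid=5.0):
    """Eq. (6) for one chunk: s_hat is L predicted 29-dim states, a_* are L actions."""
    # Terminal term: un-squared distance of the final predicted xy to the goal.
    term = l2([s_hat[-1][0] - goal_xy[0], s_hat[-1][1] - goal_xy[1]])
    # Yaw term: squared deviation of (q_w, q_z) from (1, 0) over the whole chunk.
    yaw = sum((s[3] - 1.0) ** 2 + s[6] ** 2 for s in s_hat)
    # Fidelity term: squared deviation from the data action at the same time index.
    fid = sum((x - y) ** 2 for ah, ad in zip(a_hat, a_data) for x, y in zip(ah, ad))
    return lam_term * term + lam_yaw * yaw + lam_fid * fid

def grad_norm_unsquared(x):
    """Magnitude of d||x||/dx = x/||x||, which is 1 for any nonzero x."""
    n = l2(x)
    return l2([xi / n for xi in x])

def grad_norm_squared(x):
    """Magnitude of d||x||^2/dx = 2x, which scales with the distance."""
    return l2([2.0 * xi for xi in x])

for d in (0.1, 1.5, 6.0):
    print(d, grad_norm_unsquared([d, 0.0]), grad_norm_squared([d, 0.0]))
```

The printed values show the contrast the text describes: the un\-squared gradient stays at 1 for every distance, while the squared one shrinks near the goal and grows for distant sub\-goals\.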

#### Level 2 substitute \(deployment\)\.

The data\-only Path Inverter follows the maze2d construction \(Appendix [A\.6](https://arxiv.org/html/2605.24152#A1.SS6): density\-aligned origin, noise\-floor threshold, turn\-based polyline\) with one antmaze\-specific addition: among all equally\-short shortest paths, we pick the one whose interior cells have the maximum cumulative cardinal distance\-to\-wall, so the resulting waypoint chain stays on the corridor centerline rather than along the edges\. This is implemented as a single backward [DP](https://arxiv.org/html/2605.24152#A1.SS1.15.15.15) pass over the [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) layers and remains pure [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) in the sense that path length is always the minimum cell\-step count\. Antmaze\-specific hyperparameters: resolution $= 1.0$ m \(one antmaze cell\), $\tau = 100$ visits, data\_center\_weight $= 1.0$ \(centerline tie\-break enabled\), min\_leg $= 0$ \(sub\-cell wobble filter disabled, not needed at this grid resolution\), wp\_spacing $= 1.5$ m, wp\_advance\_dist $= 1.5$ m, wp\_skip\_dist $= 0.8$ m, target\-distance window $(1.0, 2.0)$ m, stuck\_threshold $= 10$ chunks, $\leq 5$ full [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) replans per episode\.
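The centerline tie\-break can be sketched as a BFS that, among equal\-length routes, keeps the predecessor with the larger summed clearance\. Note the assumptions: this sketch folds the tie\-break into the forward BFS sweep rather than a separate backward DP pass, it defines "cardinal distance\-to\-wall" as the smallest number of free cells along the four cardinal rays, and all names are hypothetical\.

```python
from collections import deque

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))

def is_free(grid, r, c):
    return 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == 0

def clearance(grid, r, c):
    """Smallest count of free cells between (r, c) and a wall along the cardinal rays."""
    best = None
    for dr, dc in MOVES:
        k, rr, cc = 0, r + dr, c + dc
        while is_free(grid, rr, cc):
            k, rr, cc = k + 1, rr + dr, cc + dc
        best = k if best is None else min(best, k)
    return best

def centered_shortest_path(grid, start, goal):
    """Among minimum-length 4-connected paths, return one maximizing summed clearance."""
    dist, score, parent = {start: 0}, {start: 0}, {start: None}
    queue = deque([start])
    while queue:   # cells are popped in non-decreasing BFS distance
        cell = queue.popleft()
        for dr, dc in MOVES:
            nxt = (cell[0] + dr, cell[1] + dc)
            if not is_free(grid, *nxt):
                continue
            cand = score[cell] + clearance(grid, *nxt)
            if nxt not in dist:
                dist[nxt], score[nxt], parent[nxt] = dist[cell] + 1, cand, cell
                queue.append(nxt)
            elif dist[nxt] == dist[cell] + 1 and cand > score[nxt]:
                score[nxt], parent[nxt] = cand, cell   # same length, better centered
    if goal not in dist:
        return None
    path, cell = [], goal
    while cell is not None:
        path.append(cell)
        cell = parent[cell]
    return path[::-1]

room = [[0] * 5 for _ in range(5)]   # open 5x5 room, 0 = free cell
print(centered_shortest_path(room, (0, 0), (4, 4)))
```

In the open room every corner\-to\-corner route has 8 steps, and the selected one passes through the central cell instead of hugging a wall\.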

#### Per\-chunk waypoint\-tracking and sub\-goal logic \(deployment\)\.

The [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) produces a polyline $\mathrm{wp}[0\!:\!N]$ once at episode start\. At every chunk boundary \(every $K = 16$ env\.step calls\), the controller executes the following loop, with the agent’s current $(x, y)$ position $p$:

1. Advance\. While $\|p - \mathrm{wp}[\texttt{wp\_idx}]\|_2 \leq \texttt{wp\_advance\_dist} = 1.5$ m, increment wp\_idx by 1 \(skip waypoints the ant has already cleared\)\.
2. Look\-ahead skip\. If $\texttt{wp\_idx} + 1 < N$, $\|p - \mathrm{wp}[\texttt{wp\_idx}{+}1]\|_2 < \|p - \mathrm{wp}[\texttt{wp\_idx}]\|_2$, *and* $\|p - \mathrm{wp}[\texttt{wp\_idx}{+}1]\|_2 \leq \texttt{wp\_skip\_dist} = 0.8$ m, increment wp\_idx by an additional 1 \(the wide ant body has effectively cleared a corner before the strict advance test fires\)\.
3. Target shaping \(carrot\-on\-a\-stick\)\. Let $\hat{u} = (\mathrm{wp}[\texttt{wp\_idx}] - p)/\|\mathrm{wp}[\texttt{wp\_idx}] - p\|$\. Sample $d \sim \mathcal{U}(\texttt{target\_dist\_min}, \texttt{target\_dist\_max}) = \mathcal{U}(1.0, 2.0)$ m and set the [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) target $G_{xy} \leftarrow p + d\,\hat{u}$\. This keeps the [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) input distance inside its calibrated locomotion\-reach window even when the agent sits on top of a waypoint, where a literal target $\mathrm{wp}[\texttt{wp\_idx}]$ would degenerate to zero distance and collapse joint\-torque magnitudes\.
4. Plan\. Run one [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) forward pass conditioned on $(p, G_{xy})$ to obtain a 16\-step action chunk $\hat{a}_{1:16}$\.
5. Execute\. Apply $\hat{a}_{1:16}$ to the env \(16 env\.step calls\), updating $p$\.
6. Stuck check\. If wp\_idx has not advanced for stuck\_threshold $= 10$ consecutive chunks and the global cap of $\leq 5$ replans/episode is not exhausted, re\-run [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) from the current $p$ and replace the remaining polyline with the fresh plan\.
7. Terminate\. If $\|p - G_{\text{episode}}\|_2 \leq \texttt{goal\_reached\_dist} = 0.5$ m, exit the outer loop; otherwise loop back to step 1\.

The wp\_skip\_dist test in step 2 is the only condition that can advance wp\_idx by more than one per chunk; the carrot\-on\-a\-stick in step 3 is the only deviation from passing the literal current waypoint into the [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) as a target\. Both are off \(wp\_skip\_dist $= 0$, target\_dist\_min/max $= 0$\) on maze2d, where the point\-mass dynamics do not require them\.
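Steps 1–3 of the loop above \(advance, look\-ahead skip, target shaping\) can be sketched as a single function\. The function name is hypothetical, and two details are assumptions added here rather than taken from the text: the advance loop is clamped at the last waypoint, and a zero\-distance waypoint falls back to the current position\.

```python
import math
import random

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def select_target(p, wp, wp_idx, rng=random,
                  advance=1.5, skip=0.8, d_min=1.0, d_max=2.0):
    """Steps 1-3 at one chunk boundary; returns (updated wp_idx, IM target G_xy)."""
    last = len(wp) - 1
    # Step 1: advance past waypoints within wp_advance_dist (clamped at the last one).
    while wp_idx < last and dist(p, wp[wp_idx]) <= advance:
        wp_idx += 1
    # Step 2: look-ahead skip if the next waypoint is closer and within wp_skip_dist.
    if wp_idx + 1 <= last:
        d_next = dist(p, wp[wp_idx + 1])
        if d_next < dist(p, wp[wp_idx]) and d_next <= skip:
            wp_idx += 1
    # Step 3: carrot-on-a-stick target at a sampled distance along the unit direction.
    gap = dist(p, wp[wp_idx])
    if gap == 0.0:
        return wp_idx, (p[0], p[1])   # assumed fallback: no direction is defined
    d = rng.uniform(d_min, d_max)
    ux, uy = (wp[wp_idx][0] - p[0]) / gap, (wp[wp_idx][1] - p[1]) / gap
    return wp_idx, (p[0] + d * ux, p[1] + d * uy)

wp = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
idx, target = select_target((0.2, 0.0), wp, 0, rng=random.Random(0))
print(idx, target)   # idx is 2; the target lies 1.0-2.0 m ahead along +x
```

The returned target is what step 4 would pass to the IM; steps 5–7 \(execution, stuck check, termination\) sit in the surrounding episode loop and are not shown\.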

#### Evaluation protocol\.

100 episodes per maze, default [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) env resets and targets, 700\-step cap on umaze\* variants and 1000\-step cap on medium\*/large\*\. All runs sequential on a single A40 GPU so per\-pass and per\-step timings are directly comparable\. [CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) scores in Table [9](https://arxiv.org/html/2605.24152#A1.T9) are taken from Tarasov et al\. \[[93](https://arxiv.org/html/2605.24152#bib.bib57)\] \(mean over 4 training seeds\); per\-pass time for the PyTorch [CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) actors is measured on maze2d\-umaze \(same actor architectures\), since per\-pass [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) cost is architecture\-bound, not env\-bound\. [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) is JAX\+JIT \(marked †\); all other baselines are PyTorch\.

#### Per\-step compute breakdown\.

The Inverter is the fastest method per env step on 5 of 6 variants \(Table [9](https://arxiv.org/html/2605.24152#A1.T9), bottom block\)\. Its per\-step cost is the per\-episode [NN](https://arxiv.org/html/2605.24152#A1.SS1.46.46.46) compute \($\sim 100$ ms total, dominated by $\sim 40$ chunk\-level [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) passes at $\sim 2.4$ ms each plus $\sim 20$ ms of [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5)\-planner overhead\) divided by the $\sim 600$–700 env steps the agent takes to reach the goal – $\sim 0.18$ ms/step on every variant other than u\-umaze\. The other PyTorch step\-wise baselines run their actor every env step at 1\.25–3\.57 ms/pass and are uniformly 7–20$\times$ slower per step\. [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58)’s JAX\+JIT actor at 0\.21 ms/pass is the one tight contender: it narrowly beats the Inverter on the short u\-umaze trajectories \(1\.4$\times$ slower for the Inverter there, because $\sim 289$ steps don’t fully amortize the chunked planning cost\), but on every other variant the Inverter is $\sim 1.1$–1\.2$\times$ faster per step\.
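The amortization arithmetic behind these figures can be reproduced from the approximate numbers quoted in the paragraph\. This is a back\-of\-the\-envelope sketch with hypothetical variable names; 650 steps is simply the midpoint of the quoted 600–700 range\.

```python
im_passes, ms_per_pass, bfs_ms = 40, 2.4, 20.0    # approximate figures quoted above
episode_ms = im_passes * ms_per_pass + bfs_ms      # about 116 ms, i.e. of order 100 ms

def per_step_ms(total_ms, env_steps):
    """Chunked planning cost amortized over the env steps of one episode."""
    return total_ms / env_steps

inverter = per_step_ms(episode_ms, 650)
print(round(inverter, 3))                          # about 0.18 ms/step

# Step-wise baselines pay one actor pass per env step.
for name, ms in (("slowest PyTorch", 3.57), ("fastest PyTorch", 1.25), ("ReBRAC", 0.21)):
    print(name, round(ms / inverter, 1))           # ratio of baseline to Inverter cost
```

The ratios come out near 20, 7 and 1\.2, matching the ranges stated in the text\.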

#### Loss\-component ablation\.

Table [12](https://arxiv.org/html/2605.24152#A1.T12) quotes the contribution of $\lambda_{\text{yaw}}$ on top of $\lambda_{\text{fid}}$ on large\-diverse\-v2\. The two ablated checkpoints share architecture, optimizer, and dataset with the published runs and differ only in the value of $\lambda_{\text{yaw}}$ \(0 vs 5\); both were evaluated under the canonical Path Inverter config above\. Specific run directories are listed in the code\-release manifest accompanying the paper\.

Supplementary Table 12: Loss\-component ablation on antmaze\-large\-diverse\-v2\. 100 episodes per seed, canonical waypoint config \(wp\_spacing $= 1.5$, target\-distance window 1\.0–2\.0 m\); [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) reported as mean $\pm$ std over 4 [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) seeds\. Removing the body\-yaw regularizer drops [D4RL](https://arxiv.org/html/2605.24152#A1.SS1.13.13.13) by 11 points; the [BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3) anchor alone is not sufficient\. We do not separate $\lambda_{\text{fid}} = 0$ configurations on antmaze because the offline\-data generation process is fixed by the benchmark; we instead study the underlying [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)\-hacking mechanism in AntMan \(Sec\. [4\.4](https://arxiv.org/html/2605.24152#S4.SS4)\), where the data distribution is under our control\.

### A\.10 AntMan forward\-model calibration scatter and extended discussion

This appendix shows the predicted\-vs\-realized reward scatter referenced in Sec\. [4\.4](https://arxiv.org/html/2605.24152#S4.SS4) and unpacks the implication for inverse\-learning data design\.

### A\.11 Quantum gate synthesis: setup and [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) baseline

#### Sampling convention \($\mathrm{U}(2)$ vs\. $\mathrm{SU}(2)$\)\.

We sample targets Haar\-uniformly on $\mathrm{U}(2)$\. Because $\bar{F}_{\mathrm{avg}}$ is invariant under a global $\mathrm{U}(1)$ phase on $U_{\mathrm{target}}$, this is equivalent for the learning objective to sampling on $\mathrm{PU}(2) = \mathrm{U}(2)/\mathrm{U}(1) = \mathrm{SU}(2)/\mathbb{Z}_2$\. The phrases “Haar $\mathrm{U}(2)$” and “Haar $\mathrm{SU}(2)$” are therefore used interchangeably in the paper in this sense\. The encodings \(App\. [A\.12](https://arxiv.org/html/2605.24152#A1.SS12)\) differ in whether they bake the global\-phase quotient in \(trig6, ck4 factor through $\mathrm{PU}(2)$\) or not \(real8 is a direct $\mathrm{U}(2)$ encoding\)\.
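The global\-phase invariance can be checked numerically in the simplest setting\. The sketch below uses the standard closed\-system formula $\bar{F}_{\mathrm{avg}} = (d + |\mathrm{Tr}(U^{\dagger}V)|^{2})/(d(d+1))$ for two unitaries with $d = 2$; this is an assumption made for illustration, since the paper evaluates fidelity through a Lindblad channel on a 3\-level system, and the function names are hypothetical\.

```python
import cmath
import math

def avg_gate_fidelity(U, V):
    """Closed-system average gate fidelity between two 2x2 unitaries (d = 2)."""
    # Tr(U^dagger V) = sum_ij conj(U[i][j]) * V[i][j]
    tr = sum(U[i][j].conjugate() * V[i][j] for i in range(2) for j in range(2))
    return (2 + abs(tr) ** 2) / 6

def with_phase(U, phi):
    """Multiply a matrix by the global phase exp(i * phi)."""
    g = cmath.exp(1j * phi)
    return [[g * u for u in row] for row in U]

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]          # Hadamard
X = [[0.0, 1.0], [1.0, 0.0]]   # Pauli-X

print(avg_gate_fidelity(H, X))                    # about 0.667
print(avg_gate_fidelity(H, with_phase(X, 1.3)))   # unchanged by the global phase
```

The phase enters only through $|\mathrm{Tr}(U^{\dagger}V)|$, so it drops out of the fidelity, which is why the target distribution is effectively one on $\mathrm{PU}(2)$\.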

#### System\.

3\-level transmon, anharmonicity $\alpha = -4\,\Omega_{\max}$, energy relaxation $T_1 = 10^{4}$, pure dephasing $T_{\phi} = 8 \times 10^{3}$, gate time $T_{\mathrm{gate}} = 2\pi$ \(units of $1/\Omega_{\max}$\)\. Pulses are 80 piecewise\-constant slices in $(\Omega_x, \Omega_y)$, each squashed by $\Omega_{\max}\tanh(\cdot)$\. The [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) is the dynamical map \(per\-slice matrix\-exponential of the Lindbladian, applied to a 4\-state axis\-aligned input set $\{|0\rangle, |1\rangle, |+\rangle, |+i\rangle\}$ used for the per\-input fidelity uniformity diagnostic; the reported $\bar{F}_{\mathrm{avg}}$ is computed analytically over the full Haar measure and does not depend on this set\); $\bar{F}_{\mathrm{avg}}$ is the analytic average gate fidelity\. The simulator is implemented from scratch in JAX \[[13](https://arxiv.org/html/2605.24152#bib.bib83)\] with adaptive\-step ODE integration via the diffrax library \[[55](https://arxiv.org/html/2605.24152#bib.bib82)\] \(Tsit5; trace and hermiticity preserved to $\sim 10^{-8}$\); we do not use external quantum\-simulation packages such as QuTiP, Qiskit, or Cirq\.
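A minimal sketch of the pulse parameterization described above: 80 piecewise\-constant slices over $T_{\mathrm{gate}} = 2\pi$, each amplitude squashed by $\Omega_{\max}\tanh(\cdot)$\. The constant and function names are hypothetical, and $\Omega_{\max} = 1$ is assumed as the unit; the simulator itself is not reproduced here\.

```python
import math

N_SLICES = 80                  # piecewise-constant slices per quadrature
T_GATE = 2.0 * math.pi         # gate time in units of 1/Omega_max
DT = T_GATE / N_SLICES         # duration of one slice

def squash(raw, omega_max=1.0):
    """Map unconstrained parameters to amplitudes bounded by +/- omega_max."""
    return [omega_max * math.tanh(r) for r in raw]

raw_x = [0.0, 0.5, 50.0, -50.0]   # toy unconstrained network outputs
print(round(DT, 4), squash(raw_x))
```

The squashing keeps every slice inside the drive limit without a hard clip, so the map from raw parameters to amplitudes stays differentiable\.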

#### [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) baseline\.

scipy\.optimize\.minimize \[[95](https://arxiv.org/html/2605.24152#bib.bib84)\] with [Broyden–Fletcher–Goldfarb–Shanno \(quasi\-Newton optimizer\)](https://arxiv.org/html/2605.24152#A1.SS1.4.4.4) \([BFGS](https://arxiv.org/html/2605.24152#A1.SS1.4.4.4)\) \(or specifically *L\-BFGS\-B* for box\-constrained problems\) at complex128/float64; $n_{\mathrm{restarts}} = 10$, maxiter $= 500$, ftol $= 10^{-12}$, gtol $= 10^{-9}$, init scale 0\.1, restart selection on the best\-of\-trajectory iterate \(the optimizer can drift past the optimum; the best step is not always the last\)\. We also run a *lean* variant \($n_{\mathrm{restarts}} = 3$, maxiter $= 200$, ftol $= 10^{-9}$\) that produces statistically indistinguishable infidelity but is faster – this is the configuration used for the speed comparison reported in the main text\. Per\-target single\-process median wall time on a 128\-core CPU \(50\-target sample\): lean 5\.64 s/gate, heavy 56 s/gate\.
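The multi\-restart, best\-of\-trajectory selection can be sketched with `scipy.optimize.minimize` and `method="L-BFGS-B"`\. This is an illustration only: the toy loss stands in for the infidelity, gradients come from SciPy's finite differences rather than an analytic GRAPE gradient, and the wrapper name and defaults \(the lean settings quoted above\) are assumptions\.

```python
import numpy as np
from scipy.optimize import minimize

def multi_restart_lbfgsb(loss, dim, n_restarts=3, maxiter=200, ftol=1e-9,
                         init_scale=0.1, seed=0):
    """Run L-BFGS-B from several random starts; keep the best iterate seen anywhere."""
    rng = np.random.default_rng(seed)
    best = {"x": None, "f": np.inf}

    def track(xk):
        # Called on every iterate, so the best point need not be the final one.
        f = loss(xk)
        if f < best["f"]:
            best["x"], best["f"] = np.array(xk, copy=True), f

    for _ in range(n_restarts):
        x0 = init_scale * rng.standard_normal(dim)
        track(x0)
        res = minimize(loss, x0, method="L-BFGS-B", callback=track,
                       options={"maxiter": maxiter, "ftol": ftol})
        track(res.x)
    return best["x"], best["f"]

def toy_loss(x):
    """Stand-in objective on tanh-squashed parameters, minimized at tanh(x) = 0.5."""
    return float(np.sum((np.tanh(x) - 0.5) ** 2))

x_best, f_best = multi_restart_lbfgsb(toy_loss, dim=8)
print(f_best)
```

Each restart adds a full optimization run, which is why per\-gate cost scales with $n_{\mathrm{restarts}}$ and maxiter in the timings quoted above\.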

#### Inverter\.

44\-layer[MLP](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35), hidden width512512,real8input encoding \(real and imaginary parts of the2×22\{\\times\}2matrix flattened,88\-dim; not global\-phase invariant\), trained for40004000Adam steps at lr2×10−32\{\\times\}10^\{\-3\}\(cosine to10−510^\{\-5\}\), batch128128of fresh HaarU​\(2\)\\mathrm\{U\}\(2\)samples per step\. Loss is1−F¯avg1\-\\bar\{F\}\_\{\\mathrm\{avg\}\}computed by the same analytic Lindblad channel as[GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21);*no*[GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21)\-pulse supervision \(the Inverter never sees a[GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21)pulse during training\)\. Median forward\-pass time2\.12\.1ms \(JIT\-cached, 50\-call median, same machine\)\.

#### Per-target paired statistics ($n = 250$ Haar U(2)).

Both methods saturate the dissipation floor. [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21): median $1 - \bar{F}_{\mathrm{avg}} = 4.26\times10^{-4}$, $\sigma_{\mathrm{across\ targets}} = 2.9\times10^{-5}$. Inverter: median $4.69\times10^{-4}$. Per-input fidelity uniformity ($\sigma_{\mathrm{across\ 4\ inputs}}(1 - F)$): [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) $1.14\times10^{-4}$, Inverter $1.28\times10^{-4}$ ($1.12\times$). Leakage to $|2\rangle$: [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) $1.30\times10^{-5}$, Inverter $7.97\times10^{-6}$ (Inverter $0.61\times$). Pulse bandwidth $f_{95}$ (frequency below which $95\%$ of $|\Omega_x|^2 + |\Omega_y|^2$ spectral power lies, in units of $\Omega_{\max}$): [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) $1.91$, Inverter $4.09$ ($2.14\times$). A bandwidth-penalty term $\lambda \sum_t |\Delta\Omega_t|^2$ added to $\mathcal{J}$ is the natural countermeasure if [Arbitrary Waveform Generator](https://arxiv.org/html/2605.24152#A1.SS1.2.2.2) ([AWG](https://arxiv.org/html/2605.24152#A1.SS1.2.2.2)) bandwidth becomes a deployment constraint; we leave a quantitative sweep to future work.
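The two bandwidth quantities can be illustrated with a short NumPy sketch. It rests on assumptions the text does not pin down: $\Delta\Omega_t$ is taken as the difference between consecutive slices summed over both quadratures, the spectrum is a plain discrete Fourier transform of the slice amplitudes, and frequencies are reported in cycles per unit time (the paper's exact $f_{95}$ convention may differ). `bandwidth_penalty` and `f95` are hypothetical names.

```python
import numpy as np


def bandwidth_penalty(pulse, lam):
    """lam * sum_t |Omega_{t+1} - Omega_t|^2 for a pulse of shape (n_slices, 2)."""
    diffs = np.diff(pulse, axis=0)
    return lam * float(np.sum(diffs ** 2))


def f95(pulse, t_gate, fraction=0.95):
    """Smallest frequency below which `fraction` of the spectral power lies."""
    n = pulse.shape[0]
    spec = np.abs(np.fft.rfft(pulse, axis=0)) ** 2   # power per quadrature
    power = spec.sum(axis=1)                         # |Omega_x|^2 + |Omega_y|^2
    weights = np.full(power.shape, 2.0)              # one-sided spectrum: double
    weights[0] = 1.0                                 # every bin except DC ...
    if n % 2 == 0:
        weights[-1] = 1.0                            # ... and Nyquist (even n)
    cum = np.cumsum(weights * power)
    idx = int(np.searchsorted(cum, fraction * cum[-1]))
    freqs = np.fft.rfftfreq(n, d=t_gate / n)         # cycles per unit time
    return float(freqs[idx])
```

A constant pulse has all of its power at zero frequency and zero penalty, whereas a pulse that oscillates from slice to slice raises both quantities, so adding the penalty to the objective trades a small amount of fidelity for smoother waveforms.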

#### What was tried beyond the main experiments\.

The main single-qubit result above is one slice of a larger design-space exploration that we report in App. [A.12](https://arxiv.org/html/2605.24152#A1.SS12). It comprises an input-encoding ablation comparing `real8` / `trig6` / `ck4` on Haar $\mathrm{U}(2)$ targets (geometry-respecting encodings, which bake in the global-phase $\mathrm{U}(1)$ quotient, dominate convergence at fixed compute), and a first two-qubit Haar SU(4) extension that already reaches $\bar{F} = 0.957$ ([GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) floor $0.998$) at the same $\sim 4\times10^{4}$ inference speedup. The latter is a promising starting point given the much wider symmetry surface of two-qubit Haar SU(4); several concrete paths to close the remaining gap to [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) belong to the standalone gate-synthesis project this work has spun off into.

### A\.12 Quantum gate synthesis: encoding ablation and two\-qubit extension

This appendix collects the design-space exploration sitting behind the main single-qubit result of App. [A.11](https://arxiv.org/html/2605.24152#A1.SS11): three target-encoding alternatives compared on the single-qubit problem (which one we use is the dominant factor at fixed compute); and a first extension to two-qubit Haar SU(4) gate synthesis with its own [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) baseline.

#### Single\-qubit input\-encoding ablation\.

Three encodings of the single-qubit target were compared at identical training budget ($4$-layer GeLU [MLP](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35), hidden $256$, $4000$ Adam steps, batch $128$ of fresh Haar $\mathrm{U}(2)$ samples, cosine LR $2\times10^{-3} \to 10^{-5}$). The encodings differ in whether they bake in the global-phase $\mathrm{U}(1)$ quotient that $\bar{F}_{\mathrm{avg}}$ is invariant to: `real8` (real and imaginary parts of the $2\times2$ matrix flattened, $8$-dim, a direct $\mathrm{U}(2)$ encoding that is *not* global-phase invariant); `trig6` ($\{\cos(\theta/2), \sin(\theta/2)\cdot\hat{n}\}$, $6$-dim, sign-canonicalized so it factors through $\mathrm{SU}(2)/\mathbb{Z}_2 = \mathrm{PU}(2)$); and `ck4` (Cayley–Klein parameters, $4$-dim, also sign-canonicalized to $\mathrm{PU}(2)$). On the same $100$-target reference set, after $4000$ steps mean ref fidelity was `trig6` $0.99953$ ($100\%$ at $\geq 0.999$), `ck4` $0.99948$ ($98\%$), `real8` $0.99875$ ($46\%$); the [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) floor for the same set sits at $0.99952$. `trig6` matches [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) within $1500$ steps, `ck4` reaches it by $\sim 3000$ steps, and `real8` is still $5\times$ further from the floor at the end of training. Reading: at fixed compute, geometry-respecting encodings that bake in the $\mathrm{U}(1)$ global-phase quotient remove an entire invariance the network would otherwise have to learn from data; this is the dominant lever on convergence speed. In the long-training/longer-batched regime reported in App. [A.11](https://arxiv.org/html/2605.24152#A1.SS11) the simpler `real8` also reaches the dissipation floor, but at a much higher compute cost; we keep `real8` in the main run for parameter-counting parity with the [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) pulse parameterization rather than because it is the recommended choice.
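What "baking in the global-phase quotient" means can be shown with a small sketch in the spirit of `ck4`. This is an illustration under stated assumptions, not the paper's encoder: it projects the target to an $\mathrm{SU}(2)$ representative by dividing out $\sqrt{\det U}$, reads off the two complex entries of the first column, and resolves the remaining $\pm$ ambiguity by making the first non-negligible component positive. Both the canonicalization rule and the name `phase_quotient_encoding` are hypothetical.

```python
import numpy as np


def phase_quotient_encoding(u, eps=1e-9):
    """4 reals describing a 2x2 unitary up to global phase (illustrative helper)."""
    su = u / np.sqrt(np.linalg.det(u) + 0j)      # SU(2) representative, sign undetermined
    a, b = su[0, 0], su[1, 0]                    # su = [[a, -conj(b)], [b, conj(a)]]
    v = np.array([a.real, a.imag, b.real, b.imag])
    lead = v[np.argmax(np.abs(v) > eps)]         # first non-negligible component
    return v if lead > 0 else -v                 # pick one of the two signs
```

For any unitary `u` and any phase `phi`, `phase_quotient_encoding(np.exp(1j * phi) * u)` agrees with `phase_quotient_encoding(u)` up to rounding (except in the degenerate case where the leading component sits at the `eps` threshold), so the network never sees two different inputs for the same physical gate.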

#### Two\-qubit extension\.

As a first step toward two-qubit gate synthesis, we extend the same Inverter recipe to a two-qubit transmon (Hilbert dim $4$, no leakage approximation, drift Hamiltonian $H_{\mathrm{drift}} = \tfrac{J}{2}\,\sigma_z \otimes \sigma_z$ with always-on $ZZ$ coupling $J = 0.3$ chosen so $J T_{\mathrm{gate}} \approx 2\pi$, four independent drive channels $(\Omega_x^{(i)}, \Omega_y^{(i)})_{i=1,2}$, Lindblad operators with $T_1 = 10^{4}$, $T_\phi = 8\times10^{3}$ on each qubit, gate time $T_{\mathrm{gate}} = 4\pi$, $80$ piecewise-constant slices; simulator validated against analytic $T_1$/$T_2$/$ZZ$ references to $\leq 10^{-5}$). [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) on a $100$-target Haar SU(4) reference set converges to a target-independent decoherence floor at $\bar{F} = 0.998$. The Inverter, trained end-to-end against the same Lindblad simulator with no [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) supervision (`real32_su4` SU(4)-projected encoding, $4$-layer [MLP](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35), batch $64$, $4000$ steps), reaches $\bar{F} = 0.957$ on the reference set at $421\,\mu$s per gate vs. [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21)'s $18.5$ s ($\sim 4.4\times10^{4}$ speedup, in line with single-qubit). This is a promising first result given the much wider symmetry surface of two-qubit Haar SU(4) (under uniform Lindblad noise the Haar-average channel itself sits at $F = 1/d = 0.25$, a saddle that initial sequence architectures with weight-sharing across the slice axis tend to fall into, while [MLP](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35) and U-Net escape it from random init). Initial recipe sweeps ([KAK](https://arxiv.org/html/2605.24152#A1.SS1.27.27.27) / Cartan-decomposition encoding; longer training and larger batches; [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21)-pretrain + simulator-in-the-loop fine-tune; multi-seed ensembles) all reach the $\sim 0.96$ band without yet closing the gap to [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21); promising next steps include $U$-conditioned per-position output heads (which retain the parameter efficiency of sequence architectures while breaking the Haar-average symmetry per slice), curriculum from a single fixed target to Haar SU(4), and longer training combined with [KAK](https://arxiv.org/html/2605.24152#A1.SS1.27.27.27) encoding (the only ablation whose training curve was still trending toward the [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) floor at our compute budget).
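The coherent part of the two-qubit model can be written down in a few lines. The sketch below builds only the per-slice Hamiltonian (always-on $ZZ$ drift plus four drive channels) and omits the Lindblad dissipators; the $\tfrac{1}{2}\Omega$ drive convention and the helper names are assumptions, not taken from the paper's code.

```python
import numpy as np

SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

J = 0.3  # always-on ZZ coupling strength quoted in the text


def drift_hamiltonian(j=J):
    """H_drift = (J / 2) * sigma_z (x) sigma_z on the 4-dim two-qubit space."""
    return 0.5 * j * np.kron(SZ, SZ)


def control_hamiltonian(amps):
    """Drive term for one slice; amps = (ox1, oy1, ox2, oy2), assumed Omega/2 convention."""
    ox1, oy1, ox2, oy2 = amps
    h1 = 0.5 * (ox1 * SX + oy1 * SY)     # acts on qubit 1
    h2 = 0.5 * (ox2 * SX + oy2 * SY)     # acts on qubit 2
    return np.kron(h1, I2) + np.kron(I2, h2)


def slice_hamiltonian(amps, j=J):
    """Total coherent Hamiltonian during one piecewise-constant slice."""
    return drift_hamiltonian(j) + control_hamiltonian(amps)
```

The drift term is diagonal with entries $\pm J/2$, so without drives it only imprints conditional phases; entangling and local rotations have to be shaped jointly by the four drive amplitudes across the slices.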

### A\.13 Summary of task\-specific adaptations

This subsection inventories the framework components that were specified per task family in this paper, alongside how each could become more general. Three groups: (A) auxiliary loss terms used in `maze2d`/`antmaze`; (B) Level 2 Inverter slot instances; (C) per-task architecture and parameterization. The rightmost column of Tab. [15](https://arxiv.org/html/2605.24152#A1.T15) maps each item to one of four broad strategy classes (Tab. [13](https://arxiv.org/html/2605.24152#A1.T13)) that organize the route toward more general solutions: operating on the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)'s training data (1), the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) and Inverter as models (2), the Bolza objective $\mathcal{J}$ (3), or the deployment-time hierarchical / adaptive control structure (4).

Supplementary Table 13: Four strategy classes for the path toward more general adaptations. Each class operates on a different lever of the inverse-learning framework. Classes are complementary, not alternatives: most concrete strategies combine two or more (e.g., the AntMan strategy combines Class 1 with Class 4).

Supplementary Table 14: Neurosymbolic components of the five Inverters (parallel to Tab. [1](https://arxiv.org/html/2605.24152#S1.T1)). For each Inverter we list its symbolic substrate, neural amortized component, the differentiable coupling between them, and the cumulative levels of the neurosymbolic spectrum (representation $\subseteq$ composition $\subseteq$ search $\subseteq$ inference $\subseteq$ synthesis) the symbolic piece occupies. The Motor and Locomotion Inverters are purely continuous, showing that the [IL](https://arxiv.org/html/2605.24152#A1.SS1.22.22.22) paradigm itself is not necessarily neurosymbolic; neurosymbolic structure enters at the implementation level in the other three.

Supplementary Table 15: Inventory of task-specific adaptations in this paper. The rightmost column maps each item to the strategy classes defined in Tab. [13](https://arxiv.org/html/2605.24152#A1.T13).

| Component | What is task-specific | Toward more general | Class |
| --- | --- | --- | --- |
| *A. Auxiliary loss terms* | | | |
| `maze2d` boundary loss, $\lambda_{\text{boundary}} = 5$ (App. [A.2](https://arxiv.org/html/2605.24152#A1.SS2)) | Support-derived per-step penalty pushing predicted states onto the data manifold near walls. | Broader [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) training data with wall-grazing trajectories; uncertainty-aware [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) regularization (same recipe as AntMan's random-mixed condition). | 1, 2 |
| `antmaze` body-yaw regularizer, $\lambda_{\text{yaw}} = 5$ (Eq. [6](https://arxiv.org/html/2605.24152#A1.E6)) | Per-step pull of predicted body quaternion toward upright. | Broader [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) training data including falls and recoveries (AntMan's random-mixed condition is the direct demonstration). | 1 |
| `antmaze` [BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3) action-fidelity anchor, $\lambda_{\text{fid}} = 5$ (Eq. [6](https://arxiv.org/html/2605.24152#A1.E6)) | Per-step pull of predicted actions toward recorded data actions. | Broader [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) training data; in AntMan, the random-mixed condition removes the need for an action anchor entirely. | 1 |
| `maze2d` dense intermediate-goal loss, $\lambda_{\text{dense}} = 5$ (App. [A.2](https://arxiv.org/html/2605.24152#A1.SS2)) | Per-step state-vs-goal supervision over the chunk (denser than terminal-only). | Value / cost-to-go term in the Bolza objective $\mathcal{J}$ (noted in App. [A.14](https://arxiv.org/html/2605.24152#A1.SS14)). | 3 |
| *B. Level 2 Inverter slot* | | | |
| Simple algorithmic Path Inverter (`maze2d-medium/large` + all six `antmaze-v2` variants, App. [A.6](https://arxiv.org/html/2605.24152#A1.SS6)) | 4-connected [BFS](https://arxiv.org/html/2605.24152#A1.SS1.5.5.5) on a data-density occupancy grid with polyline extraction, perpendicular snap, L-corner insertion, sub-cell wobble filter. | Learned closed-loop Level 2 Inverter (Sec. [4.4](https://arxiv.org/html/2605.24152#S4.SS4), AntMan); the algorithmic Path Inverter is the cheaper slot filling for static corridors with abundant data. | 4 |
| Per-chunk dispatch controller between the two Inverter levels (App. [A.6](https://arxiv.org/html/2605.24152#A1.SS6)) | Thresholds for waypoint progress, target-distance window, stuck-recovery, and goal-region behavior. | Closed-loop Level 2 Inverter re-queried per chunk; broader low-level training data with near-target conditions, or direction-conditioning instead of position-conditioning. | 4, 2, 1 |
| Replan-horizon $K$ choice ($K = 128$ `maze2d-umaze` one-shot; $K = 16$ `maze2d-medium/large` + all six `antmaze-v2` variants; $K = 80$ pulse slices for quantum, no deployment replan; App. [A.7](https://arxiv.org/html/2605.24152#A1.SS7)) | Per-task constant value chosen by sweep. | Adaptive $K$ conditioned on [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) uncertainty estimates. | 4, 2 |
| *C. Per-task architecture and parameterization* | | | |
| Per-task Inverter / [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) networks (Apps. [A.2](https://arxiv.org/html/2605.24152#A1.SS2), [A.9](https://arxiv.org/html/2605.24152#A1.SS9), [A.11](https://arxiv.org/html/2605.24152#A1.SS11); Sec. [4.4](https://arxiv.org/html/2605.24152#S4.SS4)) | `maze2d`/`antmaze`: causal-transformer [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) + Inverter at Level 1 (sized to state/action dim). AntMan: reuses Level-1 antmaze pair plus a causal-transformer Level-2 [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) and Transformer Level-2 Inverter ($32$-step waypoint-direction sequence). Quantum: 4-layer [MLP](https://arxiv.org/html/2605.24152#A1.SS1.35.35.35) Inverter, analytic Lindblad [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) (no learned [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)). | Cross-task pretraining of the [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19)-and-Inverter core with task-specific adapter heads (foundation-model pattern); or a meta-learned core ([MAML](https://arxiv.org/html/2605.24152#A1.SS1.31.31.31), hypernetworks, in-context adaptation) for few-shot specialization. | 2 |
| Quantum: $80$-slice piecewise-constant pulse, $\Omega_{\max}\tanh(\cdot)$ squash (App. [A.11](https://arxiv.org/html/2605.24152#A1.SS11)) | Trotterized piecewise-constant control parameterization with bounded amplitude (matches [GRAPE](https://arxiv.org/html/2605.24152#A1.SS1.21.21.21) 1-to-1). | Continuous-time spline / Fourier-basis parameterizations; bandwidth-penalty term in $\mathcal{J}$ if [AWG](https://arxiv.org/html/2605.24152#A1.SS1.2.2.2) bandwidth becomes a deployment constraint. | 3 |
| Quantum: `real8` $\mathrm{U}(2)$ input encoding, 8-dim (App. [A.11](https://arxiv.org/html/2605.24152#A1.SS11)) | Target unitary flattened to 8 real components. | Equivariant encodings (Bloch-vector, quaternion, Lie-algebra) that scale to higher-dim systems. | 2 |

### A\.14 Relation to Optimal Control and Reinforcement Learning

The Inverse Learning paradigm composes naturally with the two paradigms it sits between. On the [OC](https://arxiv.org/html/2605.24152#A1.SS1.48.48.48) side, for example, an Inverter may emit a feedforward warm start that an [MPC](https://arxiv.org/html/2605.24152#A1.SS1.39.39.39) / [MPPI](https://arxiv.org/html/2605.24152#A1.SS1.41.41.41) optimizer can reactively adapt at deployment. Similarly, on the [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) side, an [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) actor may use the Inverter's output as a behavior prior for online fine-tuning. Both would preserve the Inverter's amortized forward pass.

### A\.15 Hyperparameters

Our Inverter stack is assembled from two kinds of components: components *designed to be shared across tasks* (the chunked Transformer [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) and the Transformer inverse model), and components that are explicitly *task-specific* (the auxiliary loss terms and the Level 2 instance such as the Path Inverter).

#### What we actually tuned\.

Table [16](https://arxiv.org/html/2605.24152#A1.T16) lists the task-specific hyperparameters that were actively tuned during this project (via manual scans of $3$–$6$ settings on a single seed), while most architectural knobs (e.g., transformer depth, width, batch size, learning rates) were set once from standard defaults and kept fixed across all tasks.

Supplementary Table 16: Task-specific hyperparameters tuned during this project. "Range explored" shows the set of values checked before fixing the final value.

#### Counting methodology and comparison\.

While the actively tuned subset per task is small, the complete Inverter stack has more total configuration hyperparameters than any single offline [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) baseline. To quantify this precisely, we count the algorithmically active hyperparameters in each method's `maze2d-umaze-v1` YAML or `config.json`: learning rates, architectural sizes, loss weights, temperature/entropy/[KL](https://arxiv.org/html/2605.24152#A1.SS1.28.28.28) coefficients, target-update rates, normalization flags, and any algorithm-specific quantity. We exclude purely bookkeeping fields (seed, device, checkpoint path, project/group names, evaluation-episode counts, eval-frequency).

Table [17](https://arxiv.org/html/2605.24152#A1.T17) summarizes this total capacity budget. The Inverter stack factorizes its larger knob count into independently optimized modules ([FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19), [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23), Path Inverter) rather than a single end-to-end actor-critic objective.

Supplementary Table 17: Number of algorithmically active hyperparameters per `maze2d-umaze-v1` configuration. Counts exclude purely bookkeeping fields.

| Method | # HPs | Source |
| --- | --- | --- |
| [BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3) / [BC](https://arxiv.org/html/2605.24152#A1.SS1.3.3.3)-10 | 7 | `bc/maze2d/umaze_v1.yaml` |
| [AWAC](https://arxiv.org/html/2605.24152#A1.SS1.1.1.1) | 8 | `awac/maze2d/umaze_v1.yaml` |
| [TD3+BC](https://arxiv.org/html/2605.24152#A1.SS1.70.70.70) | 12 | `td3_bc/maze2d/umaze_v1.yaml` |
| [IQL](https://arxiv.org/html/2605.24152#A1.SS1.24.24.24) | 13 | `iql/maze2d/umaze_v1.yaml` |
| [SAC-N](https://arxiv.org/html/2605.24152#A1.SS1.62.62.62) | 13 | `sac_n/maze2d/umaze_v1.yaml` |
| [EDAC](https://arxiv.org/html/2605.24152#A1.SS1.17.17.17) | 14 | `edac/maze2d/umaze_v1.yaml` |
| [DT](https://arxiv.org/html/2605.24152#A1.SS1.16.16.16) | 18 | `dt/maze2d/umaze_v1.yaml` |
| [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) | 20 | `rebrac/maze2d/umaze_v1.yaml` |
| [CQL](https://arxiv.org/html/2605.24152#A1.SS1.11.11.11) | 27 | `cql/maze2d/umaze_v1.yaml` |
| [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) (ours) | 12 | `scripts/train_maze2d_fm_universal.py` |
| [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) (ours) | 17 | `scripts/train_iwm_maze2d_v1.py` |
| Deployment (ours) | 6 | `maze2d/navigate_maze2d.py` |
| Path Inverter (`medium` / `large`) | 5 | App. [A.6](https://arxiv.org/html/2605.24152#A1.SS6) |
| Total stack, `umaze` | 35 | [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) + [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) + deployment |
| Total stack, `medium`, `large` | 40 | + Path Inverter |

#### Online tuning budget (in the sense of Kurenkov and Kolesnikov \[[62](https://arxiv.org/html/2605.24152#bib.bib78)\], Jackson et al. \[[46](https://arxiv.org/html/2605.24152#bib.bib79)\]).

To audit our implicit online tuning budget \[[76](https://arxiv.org/html/2605.24152#bib.bib80), [62](https://arxiv.org/html/2605.24152#bib.bib78), [46](https://arxiv.org/html/2605.24152#bib.bib79)\], we enumerate every deployment-time hyperparameter-selection run in this project by scanning all generated run directories and uniformly assigning the nominal $100$ online evaluation episodes to each run. On average, we consumed $\sim 900$ episodes per task cell for hyperparameter selection.

#### Where this stands in the literature\.

At $\sim 900$ episodes per task cell, our hyperparameter-selection budget operates within the $10^{2}$–$10^{3}$ episodes-per-cell regime proposed as a deployment-realistic target by Paine et al. \[[76](https://arxiv.org/html/2605.24152#bib.bib80)\]. This is computationally well below the implicit budgets of many offline [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) methods audited by Kurenkov and Kolesnikov \[[62](https://arxiv.org/html/2605.24152#bib.bib78)\].

### A\.16 Compute environment and project timeline

All experiments reported in this paper (Inverter [FoM](https://arxiv.org/html/2605.24152#A1.SS1.19.19.19) and [IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) training, ILHL fine-tuning and reward modeling, the full $8$-baseline $\times$ $4$-layout $\times$ $3$-seed [CORL](https://arxiv.org/html/2605.24152#A1.SS1.9.9.9) baseline sweep, the AntMan-[IM](https://arxiv.org/html/2605.24152#A1.SS1.23.23.23) experiments, and every timing measurement in Table [5](https://arxiv.org/html/2605.24152#A1.T5)) were run on a single internal 8-GPU node between 2026-03-18 and 2026-05-06 ($50$ days of wall-clock project time, or roughly $7$ calendar weeks). No external cluster, TPU, or cloud GPU was used.

#### Hardware and software\.

- GPUs: $8\times$ NVIDIA A40 (GA102GL, $48$ GB GDDR6 per card, driver 535.274), all on one PCIe host.
- CPU: $2\times$ AMD EPYC 7513 ($32$ physical cores / $64$ threads each; $64$ physical cores / $128$ threads total).
- RAM: $503$ GiB DDR4 shared across sockets.
- Software: Ubuntu 24.04, Linux 6.8.0; PyTorch + CUDA for all learned components; JAX + JIT for the [ReBRAC](https://arxiv.org/html/2605.24152#A1.SS1.58.58.58) baseline only (marked † in Table [5](https://arxiv.org/html/2605.24152#A1.T5)); [MuJoCo](https://arxiv.org/html/2605.24152#A1.SS1.43.43.43) 2.3 for `maze2d`/`antmaze`, a [MuJoCo](https://arxiv.org/html/2605.24152#A1.SS1.43.43.43)-based AntMan environment for Sec. [4.4](https://arxiv.org/html/2605.24152#S4.SS4).

## Supplementary references (model-based [RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) comparison)

The following references are cited by the model-based-[RL](https://arxiv.org/html/2605.24152#A1.SS1.59.59.59) comparison in App. [A.8](https://arxiv.org/html/2605.24152#A1.SS8) (Table [11](https://arxiv.org/html/2605.24152#A1.T11)).

\[S1\]T\. Yu, G\. Thomas, L\. Yu, S\. Ermon, J\. Y\. Zou, S\. Levine, C\. Finn, T\. Ma\.*MOPO: Model\-based Offline Policy Optimization*\. NeurIPS 2020\. arXiv:2005\.13239\.

\[S2\]R\. Kidambi, A\. Rajeswaran, P\. Netrapalli, T\. Joachims\.*MOReL: Model\-Based Offline Reinforcement Learning*\. NeurIPS 2020\. arXiv:2005\.05951\.

\[S3\]T\. Yu, A\. Kumar, R\. Rafailov, A\. Rajeswaran, S\. Levine, C\. Finn\.*COMBO: Conservative Offline Model\-Based Policy Optimization*\. NeurIPS 2021\. arXiv:2102\.08363\.

\[S4\]M\. Rigter, B\. Lacerda, N\. Hawes\.*RAMBO\-RL: Robust Adversarial Model\-Based Offline Reinforcement Learning*\. NeurIPS 2022\. arXiv:2204\.12581\. Antmaze numbers in Table 1; v0 caveat in Appendix B\.7\.

\[S5\]Y\. Sun, J\. Zhang, C\. Jia, H\. Lin, J\. Ye, Y\. Yu\.*Model\-Bellman Inconsistency for Model\-based Offline RL*\. ICML 2023\.

\[S6\]J\. Jeong, X\. Wang, M\. Coskun, Q\. Kong\.*CBOP: Conservative Bayesian Model\-Based Value Expansion for Offline Policy Optimization*\. ICLR 2023\. arXiv:2210\.03802\.

\[S7\]X\. Chen, Y\. Yu, Q\. Zhu, Z\. Liu, L\. Yang, Y\. Li, P\. Zhao\.*MAPLE: Offline Model\-based Adaptable Policy Learning*\. NeurIPS 2021\.

\[S8\]M\. Bhardwaj, T\. Xie, B\. Boots, N\. Jiang, C\. Cheng\.*Adversarial Model for Offline Reinforcement Learning \(ARMOR\)*\. NeurIPS 2023\. arXiv:2302\.11048\.

\[S9\]M\. Janner, Q\. Li, S\. Levine\.*Offline Reinforcement Learning as One Big Sequence Modeling Problem \(Trajectory Transformer\)*\. NeurIPS 2021\. arXiv:2106\.02039\. Antmaze\-v0 numbers: Table 2\.

\[S10\]Z\. Jiang, T\. Zhang, M\. Janner, Y\. Li, T\. Rocktäschel, E\. Grefenstette, Y\. Tian\.*Efficient Planning in a Compact Latent Action Space \(TAP\)*\. ICLR 2023\. arXiv:2208\.10291\. Antmaze\-v0 numbers in Table 6; mixed v0/v2 protocol described in Sec\. 6\.

\[S11\]M\. Janner, Y\. Du, J\. B\. Tenenbaum, S\. Levine\.*Planning with Diffusion for Flexible Behavior Synthesis \(Diffuser\)*\. ICML 2022\. arXiv:2205\.09991\. Maze2D\-v1 numbers: Table 1\.

\[S12\]K\. Park, J\. Lee, M\. Tomar, V\. Naik, S\. Kadavath, B\. Eysenbach\.*Tackling Long\-Horizon Tasks with Model\-Based Offline Reinforcement Learning \(LEQ\)*\. ICLR 2025\. arXiv:2407\.00699\. Re\-runs of MOBILE and CBOP on antmaze\-v0 in Table 1\.
