A Goal-Set Characterization of Task Composition in the Boolean Task Algebra
Summary
This paper revisits the Boolean Task Algebra (BTA) for zero-shot task composition in reinforcement learning, proving that in deterministic MDPs all optimal extended Q-functions collapse to just two components (universal and empty tasks), making the originally proposed logarithmic base task set redundant. The authors introduce a goal-set-based composition method that reduces learning costs and composition time while preserving policy performance across multiple experimental domains.
View Cached Full Text
Cached at: 06/05/26, 02:19 AM
# A Goal-Set Characterization of Task Composition in the Boolean Task Algebra
Source: [https://arxiv.org/html/2606.04053](https://arxiv.org/html/2606.04053)
Eduardo Terrés\-Caballero1Herke van Hoof2 1Informatics Institute, University of Amsterdam 2AMLab, University of Amsterdam
###### Abstract
The Boolean Task Algebra \(BTA\) provides a principled framework for zero\-shot task composition in reinforcement learning by equipping goal\-reaching tasks with Boolean operations\. We revisit its structural assumptions and formalize a collapse in the space of optimal extended Q\-value functions: in deterministic MDPs, every such function is fully determined by the universal and empty tasks\. This makes the logarithmic set of base tasks proposed in the original BTA formulation redundant\. Building on this observation, we introduce a goal\-set\-based composition method that performs logical operations on goal sets and reconstructs composed value functions by selecting slices from the universal and empty value functions\. This reduces learning costs for standard BTA and reduces composition time for both BTA and Skill Machines, while preserving policy performance\. Experiments across tabular, visual, function\-approximation, and continuous\-control domains show that learning additional base tasks does not yield better performance\. Finally, we study the stochastic setting and provide a counterexample showing that this collapse need not hold, that is, optimal composition may require accounting for exponentially many policies in the number of goals\. Code is available at[https://github\.com/EduardoTerres/bta\_paper](https://github.com/EduardoTerres/bta_paper)\.
## 1Introduction
Compositionality is a central principle for developing agents that can reason about and solve multiple tasks within a shared environment\. When tasks exhibit structure, it becomes possible to express complex objectives as combinations of simpler ones, thereby enabling reuse of previously acquired behaviors\. In this direction,Nangue Tasseet al\.\[[2020](https://arxiv.org/html/2606.04053#bib.bib1)\]introduce a Boolean Task Algebra \(BTA\) and show that, under specific assumptions, the space of tasks admits the structure of a Boolean algebra\. By operating on reward functions and their associated extended value functions, the BTA enables exact logical composition of tasks, and provides a principled framework for zero\-shot derivation of optimal policies for composite tasks\. This algebraic structure further establishes a homomorphism between the task algebra and the algebra of optimal extended Q\-value functions, allowing compositions to be expressed directly in extended value\-function space\. This algebra requires learning a collection of base tasks whose compositions span the entire algebra\. The original formulation proposes learning𝒪\(⌈log\|𝒢\|⌉\)\\mathcal\{O\}\(\\lceil\\log\|\\mathcal\{G\}\|\\rceil\)such tasks in order to guarantee optimal representational completeness\.
Following an insight fromTasseet al\.\[[2022](https://arxiv.org/html/2606.04053#bib.bib35)\]we formalize why this collapse in the BTA happens and what simplifications in terms of computational time for composition can be made to methods that build on this framework \(e\.g\.Nangue Tasseet al\.\[[2024](https://arxiv.org/html/2606.04053#bib.bib33)\]\)\. We prove that every optimal extended Q\-function can be constructed from only two components: the extended Q\-function of the universal task, which assigns the highest terminal reward to every goal, and that of the empty task, which assigns the lowest terminal reward to every goal\.
This observation leads to a more direct formulation of the BTA\. Since a task is determined by the set of goals that yield the higher reward, composition can be performed by attending to the desired set of goals\. Extended Q\-value functions for composed tasks can then be reconstructed by selecting the appropriate slices from only two learned extended Q\-value functions, as long as the two cover the whole space of goals\. This eliminates the need to learn a Boolean basis of tasks, posing a series of computational advantages\. Additionally, we study how this optimization in composition time would work in the stochastic MDP setting\. In the deterministic case, composing in an optimal manner involves learning goal slices independently and then performing Boolean operations over the set of desired goals, which scales linearly with the size of the desired goal set\. For stochastic MDPs, we provide a counterexample showing that each desired goal set, a subset of the original goal set, may have a different optimal policy in the worst case\. Consequently, this independence between goals disappears, making the number of optimal policies to account for during composition to grow with the size of the power set of goals\.
##### Contributions\.
The main contributions of this work are:
1. 1\.We formalize a representational limitation in the BTA framework stated inTasseet al\.\[[2022](https://arxiv.org/html/2606.04053#bib.bib35)\]that drops training cost from𝒪\(⌈log2\|G\|⌉\)\\mathcal\{O\}\(\\lceil\\log\_\{2\}\|G\|\\rceil\)to constant\. We empirically show that this improvement does not hinder convergence in either tabular or function approximation settings, as well as show that training additional base tasks does not translate to better performance upon convergence\.
2. 2\.We propose a new method for task composition that requires only array lookups and no operations between learned value functions\. We empirically confirm the computational gains in tabular, function approximation, visual and continuous\-control environments\.
3. 3\.We identify a limitation of deterministic BTA\-style composition in stochastic MDPs\. While deterministic tasks can be composed by independently combining goal slices, this independence disappears under stochastic transitions and requires operating through an exponential number, with respect to goal set size, of optimal policies\.
## 2Related work
##### Zero\-shot compositional learning in Reinforcement Learning\.
Early work on composition in control theory\[Todorov,[2006](https://arxiv.org/html/2606.04053#bib.bib8),[2009](https://arxiv.org/html/2606.04053#bib.bib6)\]demonstrated that Linearly Solvable MDPs admit value and policy decompositions by linearizing the Bellman equation\. Building upon this,Barretoet al\.\[[2017](https://arxiv.org/html/2606.04053#bib.bib20)\]introduce Successor Features \(SFs\) by generalizing successor representations\[Dayan,[1993](https://arxiv.org/html/2606.04053#bib.bib22)\]\. This framework achieves zero\-shot compositionality by factorizing the value function into a dot product of environment dynamics \(successor features\) and task preferences \(reward vectors\), enabling the agent to solve novel tasks simply by linearly combining previously learned predictive maps\.
Schaulet al\.\[[2015](https://arxiv.org/html/2606.04053#bib.bib10)\]introduce Universal Value Function Approximator \(UVFA\) to enable generalization across continuous goal spaces by conditioning value functions on goal representations\.Borsaet al\.\[[2019](https://arxiv.org/html/2606.04053#bib.bib21)\]later unified this with SF through Universal Successor Features \(USFs\) and overcome the limitation of discrete task sets inherent in standard SFs by extending the dynamics\-reward decomposition to continuous goal spaces by employing a UVFA architecture to approximate the successor features themselves\.Nangue Tasseet al\.\[[2020](https://arxiv.org/html/2606.04053#bib.bib1)\]build on these ideas by introducingextendedQ\-value functions, which include a goal dimension in addition to the state and action dimensions\. This extra level of conditioning ultimately allows learning optimal policies\.
While the previous methods focus on composing rewards, another stream of research addresses the composition of time and logic\. This is primarily handled through Hierarchical Reinforcement Learning \(HRL\), which simplifies long\-horizon tasks by grouping actions into reusable behaviors\. Fundamental works like the Options framework\[Suttonet al\.,[1999](https://arxiv.org/html/2606.04053#bib.bib28)\]treat these behavior sequences as single steps, an idea extended byAndreaset al\.\[[2017](https://arxiv.org/html/2606.04053#bib.bib15)\]to chain sub\-policies using symbolic sketches, and byVezhnevetset al\.\[[2017](https://arxiv.org/html/2606.04053#bib.bib14)\]to separate high\-level planning from low\-level execution\. Finally, composition extends to formal specifications, where approaches replace fixed task identifiers with structured logic to enable generalization\. By directly grounding Linear Temporal Logic\[Vaezipooret al\.,[2021](https://arxiv.org/html/2606.04053#bib.bib12), Kuricet al\.,[2024](https://arxiv.org/html/2606.04053#bib.bib37)\]or natural language\[Hermannet al\.,[2017](https://arxiv.org/html/2606.04053#bib.bib13)\], these methods allow agents to interpret and execute novel complex constraints zero\-shot, simply by parsing the logical structure of the command\.
##### Predecessors and subsequent work on the Boolean Task Algebra\.
The Boolean Task Algebra builds on several works that established how value functions could be manipulated to represent logical operations\.Van Niekerket al\.\[[2019](https://arxiv.org/html/2606.04053#bib.bib3)\]demonstrated that while standard RL can bound value composition, Soft Q\-learning supports exact disjunctive \(∨\\lor\) composition, as shown byHaarnojaet al\.\[[2017](https://arxiv.org/html/2606.04053#bib.bib31)\]\. Parallel work byHaarnojaet al\.\[[2018](https://arxiv.org/html/2606.04053#bib.bib2)\]showed that logical conjunction \(∧\\land\) between tasks can be approximated by averaging soft value functions\. These compositional mechanisms were further rigorously analyzed byHuntet al\.\[[2019](https://arxiv.org/html/2606.04053#bib.bib16)\], who identified theoretical limitations in soft Q\-based composition and proposed corrections to ensure reliability\.
The Boolean Task Algebra\[Nangue Tasseet al\.,[2020](https://arxiv.org/html/2606.04053#bib.bib1)\]formalizes these binary notions into a cohesive framework, utilizing extended Q\-value functions that encode all goal\-reaching behaviors to support arbitrary Boolean compositions: disjunction \(∨\\lor\), conjunction \(∧\\land\) and negation \(¬\\neg\)\.Tasseet al\.\[[2022](https://arxiv.org/html/2606.04053#bib.bib35)\]later pointed out the specific structure of optimal value functions but did not analyze further simplifications in the composition process that derive from these\. The authors show that Boolean composition of learned skills enables efficient lifelong transfer and task\-space generalization, but in stochastic environments it provides bounded near\-optimality guarantees rather than exact compositional optimality\.Nangue Tasseet al\.\[[2024](https://arxiv.org/html/2606.04053#bib.bib33)\]incorporate these simplifications into the Skill Machines framework, which extends the BTA from purely logical task composition to temporal\-logic specifications by using reward\-machine\-guided\[Icarteet al\.,[2022](https://arxiv.org/html/2606.04053#bib.bib7)\]‘Skill Machines’ that sequence Boolean\-composed skills to solve complex long\-horizon tasks in a zero\-shot manner\.
## 3Framework of the Boolean Task Algebra
##### Task definition and assumptions\.
FollowingNangue Tasseet al\.\[[2020](https://arxiv.org/html/2606.04053#bib.bib1)\], we model a task as a Markov Decision Process \(MDP\),M=\(𝒮,𝒜,ρ,r\)M=\(\\mathcal\{S\},\\mathcal\{A\},\\rho,r\), where𝒮\\mathcal\{S\}is the state space,𝒜\\mathcal\{A\}is the action space,ρ\(s,a\)\\rho\(s,a\)the transition kernel, andr\(s,a\)r\(s,a\)the reward function\. We only consider undiscounted MDPs with deterministic transition dynamics and an absorbing set𝒢⊆𝒮\\mathcal\{G\}\\subseteq\\mathcal\{S\}, whose elements constitute the goal states of any given task\. We defineρ,𝒮\\rho,\\mathcal\{S\}and𝒜\\mathcal\{A\}and assume them fixed for all tasks\. Now, the set of tasks of our interest is defined as
ℳ=\{\(𝒮,𝒜,ρ,r\)\|r:𝒮×𝒜→ℝsuch that\\displaystyle\\mathcal\{M\}=\\big\\\{\(\\mathcal\{S\},\\mathcal\{A\},\\rho,r\)\\;\\big\|\\;r:\\mathcal\{S\}\\times\\mathcal\{A\}\\to\\mathbb\{R\}\\text\{ such that \}r\(s,a\)=rs,a\\displaystyle r\(s,a\)=r\_\{s,a\}∀s∉𝒢,a∈𝒜\\displaystyle\\forall s\\notin\\mathcal\{G\},a\\in\\mathcal\{A\}\(1\)andr\(g,a\)∈\{r∅,r𝒰\}\\displaystyle r\(g,a\)\\in\\\{r\_\{\\varnothing\},r\_\{\\mathcal\{U\}\}\\\}∀g∈𝒢,a∈𝒜\}\.\\displaystyle\\forall g\\in\\mathcal\{G\},a\\in\\mathcal\{A\}\\big\\\}\.Here,rs,ar\_\{s,a\}are fixed \(task\-independent\) non\-terminal rewards, andr∅≤r𝒰r\_\{\\varnothing\}\\leq r\_\{\\mathcal\{U\}\}denote the only two possible terminal rewards, hence inducing the Boolean nature of the formulation\.
##### Extended Q\-value functions\.
Standard value functions are insufficient to compose tasks under the Boolean framework, since they only represent the expected return with respect to the nearest goal\[Nangue Tasseet al\.,[2020](https://arxiv.org/html/2606.04053#bib.bib1)\]\. To enable reasoning about all possible goals, the extended reward functionr¯:𝒮×𝒢×𝒜→ℝ\\bar\{r\}:\\mathcal\{S\}\\times\\mathcal\{G\}\\times\\mathcal\{A\}\\to\\mathbb\{R\}is defined asr¯\(s,g,a\)=r\(s,a\)ifs∉𝒢∨s=g,andr¯\(s,g,a\)=r¯minifs∈𝒢∖\{g\},\\bar\{r\}\(s,g,a\)=r\(s,a\)\\text\{ if \}s\\notin\\mathcal\{G\}\\lor s=g,\\text\{ and \}\\bar\{r\}\(s,g,a\)=\\bar\{r\}\_\{\\min\}\\text\{ if \}s\\in\\mathcal\{G\}\\setminus\\\{g\\\},\. Herer¯min\\bar\{r\}\_\{\\min\}is a large negative penalty ensuring that reaching an undesired goal yields the worst possible return\. The corresponding extended Q\-value function is then given by
Q¯π\(s,g,a\)=r¯\(s,g,a\)\+∫𝒮V¯π\(s′,g\)ρ\(s,a\)\(ds′\),\\bar\{Q\}^\{\\pi\}\(s,g,a\)=\\bar\{r\}\(s,g,a\)\+\\int\_\{\\mathcal\{S\}\}\\bar\{V\}^\{\\pi\}\(s^\{\\prime\},g\)\\,\\rho\_\{\(s,a\)\}\(ds^\{\\prime\}\),whereV¯π\(s,g\)=𝔼π\[∑t=0∞r¯\(st,g,at\)\],\\bar\{V\}^\{\\pi\}\(s,g\)=\\mathbb\{E\}\_\{\\pi\}\\\!\\left\[\\sum\_\{t=0\}^\{\\infty\}\\bar\{r\}\(s\_\{t\},g,a\_\{t\}\)\\right\],with optimal counterpartQ¯∗=maxπQ¯π\\bar\{Q\}^\{\*\}=\\max\_\{\\pi\}\\bar\{Q\}^\{\\pi\}\. Intuitively,Q¯∗\(s,g,a\)\\bar\{Q\}^\{\*\}\(s,g,a\)measures the value of taking actionaain statesswhen pursuing goalgg, thus encoding how to reach every goal in the environment\. The standard optimal Q\-function for a taskMMcan be recovered asQM∗\(s,a\)=maxg∈𝒢Q¯M∗\(s,g,a\)Q^\{\*\}\_\{M\}\(s,a\)=\\max\_\{g\\in\\mathcal\{G\}\}\\bar\{Q\}^\{\*\}\_\{M\}\(s,g,a\), establishingQ¯∗\\bar\{Q\}^\{\*\}as a richer representation that enables the zero\-shot logical composition of tasks\.
##### Boolean algebras of tasks and value functions\.
We endow the set of tasksℳ\\mathcal\{M\}with a Boolean algebraic structure\. We first define two special tasks: the*universal task*\(always maximally rewarding\),ℳ𝒰=\(S,A,ρ,r𝒰\(s,a\)=maxM∈ℳrM\(s,a\)\)\\mathcal\{M\}\_\{\\mathcal\{U\}\}=\(S,A,\\rho,r\_\{\\mathcal\{U\}\}\(s,a\)=\\max\_\{M\\in\\mathcal\{M\}\}r\_\{M\}\(s,a\)\)and the*empty task*\(always minimally rewarding\),ℳ∅=\(S,A,ρ,r∅\(s,a\)=minM∈ℳrM\(s,a\)\)\\mathcal\{M\}\_\{\\varnothing\}=\(S,A,\\rho,r\_\{\\varnothing\}\(s,a\)=\\min\_\{M\\in\\mathcal\{M\}\}r\_\{M\}\(s,a\)\)\. Their corresponding optimal extended value functions,Q¯𝒰∗\\bar\{Q\}^\{\*\}\_\{\\mathcal\{U\}\}andQ¯∅∗\\bar\{Q\}^\{\*\}\_\{\\varnothing\}, represent the maximal and minimal achievable returns\.
For any pair of tasksM1,M2∈ℳM\_\{1\},M\_\{2\}\\in\\mathcal\{M\}, the Boolean operators are defined directly on their reward functions, given that the rest of the conforming elements of the MDP are fixed\. Analogously, composition rules for optimal extended Q\-value functions are also defined:
\(Negation¬\\neg\)r¬M\(s,a\)=\(r𝒰\(s,a\)\+r∅\(s,a\)\)−rM\(s,a\),\\displaystyle r\_\{\\lnot M\}\(s,a\)=\\big\(r\_\{\\mathcal\{U\}\}\(s,a\)\+r\_\{\\varnothing\}\(s,a\)\\big\)\-r\_\{M\}\(s,a\),Q¯¬M∗\(s,g,a\)=\(Q¯𝒰∗\(s,g,a\)\+Q¯∅∗\(s,g,a\)\)−Q¯M∗\(s,g,a\),\\displaystyle\\bar\{Q\}^\{\*\}\_\{\\lnot M\}\(s,g,a\)=\\big\(\\bar\{Q\}^\{\*\}\_\{\\mathcal\{U\}\}\(s,g,a\)\+\\bar\{Q\}^\{\*\}\_\{\\varnothing\}\(s,g,a\)\\big\)\-\\bar\{Q\}^\{\*\}\_\{M\}\(s,g,a\),\(Disjunction∨\\lor\)rM1∨M2\(s,a\)=max\{rM1\(s,a\),rM2\(s,a\)\},\\displaystyle r\_\{M\_\{1\}\\lor M\_\{2\}\}\(s,a\)=\\max\\\{r\_\{M\_\{1\}\}\(s,a\),\\,r\_\{M\_\{2\}\}\(s,a\)\\\},Q¯M1∨M2∗\(s,g,a\)=max\{Q¯M1∗\(s,g,a\),Q¯M2∗\(s,g,a\)\},\\displaystyle\\bar\{Q\}^\{\*\}\_\{M\_\{1\}\\lor M\_\{2\}\}\(s,g,a\)=\\max\\\{\\bar\{Q\}^\{\*\}\_\{M\_\{1\}\}\(s,g,a\),\\,\\bar\{Q\}^\{\*\}\_\{M\_\{2\}\}\(s,g,a\)\\\},\(Conjunction∧\\land\)rM1∧M2\(s,a\)=min\{rM1\(s,a\),rM2\(s,a\)\},\\displaystyle r\_\{M\_\{1\}\\land M\_\{2\}\}\(s,a\)=\\min\\\{r\_\{M\_\{1\}\}\(s,a\),\\,r\_\{M\_\{2\}\}\(s,a\)\\\},Q¯M1∧M2∗\(s,g,a\)=min\{Q¯M1∗\(s,g,a\),Q¯M2∗\(s,g,a\)\}\.\\displaystyle\\bar\{Q\}^\{\*\}\_\{M\_\{1\}\\land M\_\{2\}\}\(s,g,a\)=\\min\\\{\\bar\{Q\}^\{\*\}\_\{M\_\{1\}\}\(s,g,a\),\\,\\bar\{Q\}^\{\*\}\_\{M\_\{2\}\}\(s,g,a\)\\\}\.
With these operations and identity elementsℳ𝒰\\mathcal\{M\}\_\{\\mathcal\{U\}\}andℳ∅\\mathcal\{M\}\_\{\\varnothing\}for tasks, andQ¯𝒰∗\\bar\{Q\}^\{\*\}\_\{\\mathcal\{U\}\}andQ¯∅∗\\bar\{Q\}^\{\*\}\_\{\\varnothing\}for value functions,Nangue Tasseet al\.\[[2020](https://arxiv.org/html/2606.04053#bib.bib1)\]show in Theorems 1 and 2 that the tuples\(ℳ,∨,∧,¬,ℳ𝒰,ℳ∅\)and\(Q¯∗,∨,∧,¬,Q¯𝒰∗,Q¯∅∗\)\(\\mathcal\{M\},\\lor,\\land,\\lnot,\\mathcal\{M\}\_\{\\mathcal\{U\}\},\\mathcal\{M\}\_\{\\varnothing\}\)\\quad\\text\{and\}\\quad\(\\bar\{Q\}^\{\*\},\\lor,\\land,\\lnot,\\bar\{Q\}^\{\*\}\_\{\\mathcal\{U\}\},\\bar\{Q\}^\{\*\}\_\{\\varnothing\}\)form the Boolean Task Algebra and the Boolean value function algebra, respectively\.
##### Homomorphism between algebras\.
A central contribution ofNangue Tasseet al\.\[[2020](https://arxiv.org/html/2606.04053#bib.bib1)\]is establishing a homomorphism between the algebra of tasks and value functions\. That is, the following constitutes a homomorphism,ℱ:ℳ⟶Q¯∗\\mathcal\{F\}:\\mathcal\{M\}\\longrightarrow\\bar\{Q\}^\{\*\}withℱ\(M\)=Q¯M∗,\\mathcal\{F\}\(M\)=\\bar\{Q\}^\{\*\}\_\{M\},whereQ¯∗=\{Q¯M∗\|M∈ℳ\}\\bar\{Q\}^\{\*\}=\\\{\\bar\{Q\}^\{\*\}\_\{M\}\|M\\in\\mathcal\{M\}\\\}is the set of possible optimal extended Q\-value functions\. Ultimately, this provides a way for composing value functions, where givenM1,M2∈ℳM\_\{1\},M\_\{2\}\\in\\mathcal\{M\},
Q¯¬M∗=¬Q¯M∗,Q¯M1∨M2∗=Q¯M1∗∨Q¯M2∗,Q¯M1∧M2∗=Q¯M1∗∧Q¯M2∗\.\\bar\{Q\}^\{\*\}\_\{\\lnot M\}=\\lnot\\bar\{Q\}\_\{M\}^\{\*\},\\quad\\bar\{Q\}^\{\*\}\_\{M\_\{1\}\\lor M\_\{2\}\}=\\bar\{Q\}\_\{M\_\{1\}\}^\{\*\}\\lor\\bar\{Q\}\_\{M\_\{2\}\}^\{\*\},\\quad\\bar\{Q\}^\{\*\}\_\{M\_\{1\}\\land M\_\{2\}\}=\\bar\{Q\}\_\{M\_\{1\}\}^\{\*\}\\land\\bar\{Q\}\_\{M\_\{2\}\}^\{\*\}\.This result formally connects logical task composition to the algebraic composition of optimal extended value functions, implying that once the agent has learnedQ¯∗\\bar\{Q\}^\{\*\}for a small set of base tasks, it can derive the optimal policy for any Boolean combination of them without further learning\. The original formulation proves this result for a broader set of tasks that allow for more than two terminal rewardsr∅r\_\{\\varnothing\},r𝒰r\_\{\\mathcal\{U\}\}but keep the other restrictions present inℳ\\mathcal\{M\}\.
##### BTA in Skill Machines\.
Skill machines\[Nangue Tasseet al\.,[2024](https://arxiv.org/html/2606.04053#bib.bib33)\]extends the BTA to handle tasks specified in Linear Temporal Logic \(LTL\), a formal language that combines Boolean operators with temporal operators such as𝐅\\mathbf\{F\}\(eventually\) and𝐆\\mathbf\{G\}\(always\) to express ordered sequences of goals\. Let𝒮\\mathcal\{S\}denote the environment state space,𝒫\\mathcal\{P\}the set of propositional symbols representing high\-level environment features,L:𝒮→2𝒫L:\\mathcal\{S\}\\to 2^\{\\mathcal\{P\}\}the labelling function assigning truth values to environment states,G=2𝒫∪𝒞G=2^\{\\mathcal\{P\}\\cup\\mathcal\{C\}\}the set of absorbing goal states, andσu\\sigma\_\{u\}the Boolean expression over𝒫∪𝒞\\mathcal\{P\}\\cup\\mathcal\{C\}assigned to skill machine stateuuvia planning over the reward machine\. To support this, skill primitivesQp∗\(⟨s,c⟩,g,⟨a,aτ⟩\)Q^\{\*\}\_\{p\}\(\\langle s,c\\rangle,g,\\langle a,a\_\{\\tau\}\\rangle\)are defined over an augmented state space𝒮G=\(𝒮×2𝒞\)∪2𝒫∪𝒞\\mathcal\{S\}\_\{G\}=\(\\mathcal\{S\}\\times 2^\{\\mathcal\{C\}\}\)\\cup 2^\{\\mathcal\{P\}\\cup\\mathcal\{C\}\}, where𝒞\\mathcal\{C\}tracks violated constraints andaτ∈\{0,1\}a\_\{\\tau\}\\in\\\{0,1\\\}is a termination action\. Since all primitives share the same state space and dynamics, the BTA operators apply directly to yieldQσ∗Q^\{\*\}\_\{\\sigma\}for any Boolean expressionσ\\sigmaover𝒫∪𝒞\\mathcal\{P\}\\cup\\mathcal\{C\}\. A skill machine is then a finite state machine⟨U,u0,δu,δQ⟩\\langle U,u\_\{0\},\\delta\_\{u\},\\delta\_\{Q\}\\ranglewhere each stateu∈Uu\\in Uexecutes a spatially composed skillδQ\(u\)=maxg∈GQσu∗\(⟨s,c⟩,g,⟨a,0⟩\)\\delta\_\{Q\}\(u\)=\\max\_\{g\\in G\}Q^\{\*\}\_\{\\sigma\_\{u\}\}\(\\langle s,c\\rangle,g,\\langle a,0\\rangle\)and transitions tou′=δu\(u,L\(s′\)\)u^\{\\prime\}=\\delta\_\{u\}\(u,L\(s^\{\\prime\}\)\)upon satisfying the required propositions, thereby chaining Boolean\-composed skills sequentially to solve any LTL\-specified task in a zero\-shot manner\.
## 4An alternative method for composition
Having introduced the Boolean Task Algebra framework we now present a series of limitations regarding the representational capacity of this framework\.
Tasseet al\.\[[2022](https://arxiv.org/html/2606.04053#bib.bib35)\]note a collapse in the optimal extended value functions under the BTA framework and state that the optimal extended Q\-value function can only take two forms, that of the universal or empty value functions\. However, they do not formalize this phenomenon nor its implications for the framework\. We formalize this issue and further argue that, as a consequence, it is not necessary to learn the full set of base tasks defined in the BTA framework presented, see Section[3](https://arxiv.org/html/2606.04053#S3)\. In particular, the assumptions restrict the space of possible tasks and optimal Q\-value functions such that it suffices to learn the value functions of two tasks, regardless of the number of goals in𝒢\\mathcal\{G\}\. Proposition[1](https://arxiv.org/html/2606.04053#Thmproposition1)embodies this limitation by showing that the extended Q\-value function can be written as the sum of rewards accumulated until reaching a terminal state \(Gs:g,a∗G^\{\*\}\_\{s:g,a\}, which is shared across tasks\) plus the reward for reaching the goal \(r¯M\(s′,g,a′\)\\bar\{r\}\_\{M\}\(s^\{\\prime\},g,a^\{\\prime\}\)\)\. In Section[5](https://arxiv.org/html/2606.04053#S5), we empirically investigate whether learning a logarithmic number of base tasks nevertheless improves performance sufficiently to justify the additional training cost\.
###### Proposition 1\.
\(Corollary 1 fromNangue Tasseet al\.\[[2020](https://arxiv.org/html/2606.04053#bib.bib1)\]\) DenoteGs:g,a∗G^\{\*\}\_\{s:g,a\}as the sum of rewards starting fromssand taking actionaaup until, but not including,gg\. Then letM∈ℳM\\in\\mathcal\{M\}andQ¯M∗\\bar\{Q\}^\{\*\}\_\{M\}be the extendedQQ\-value function\. Then for alls∈𝒮s\\in\\mathcal\{S\},g∈𝒢g\\in\\mathcal\{G\},a∈𝒜a\\in\\mathcal\{A\}, there exists aGs:g,a∗∈ℝG^\{\*\}\_\{s:g,a\}\\in\\mathbb\{R\}such that
Q¯M∗\(s,g,a\)=Gs:g,a∗\+r¯M\(s′,g,a′\),wheres′∈𝒢anda′=argmaxb∈𝒜r¯M\(s′,g,b\)\.\\bar\{Q\}^\{\*\}\_\{M\}\(s,g,a\)=G^\{\*\}\_\{s:g,a\}\+\\bar\{r\}\_\{M\}\(s^\{\\prime\},g,a^\{\\prime\}\),\\quad\\text\{where \}s^\{\\prime\}\\in\\mathcal\{G\}\\text\{ and \}a^\{\\prime\}=\\arg\\max\_\{b\\in\\mathcal\{A\}\}\\bar\{r\}\_\{M\}\(s^\{\\prime\},g,b\)\.
We assume the optimal Q\-value functions for tasksℳ𝒰\{\\mathcal\{M\}\_\{\\mathcal\{U\}\}\}andℳ∅\{\\mathcal\{M\}\_\{\\varnothing\}\},Q¯𝒰∗\\bar\{Q\}\_\{\\mathcal\{U\}\}^\{\*\}andQ¯∅∗\\bar\{Q\}\_\{\\varnothing\}^\{\*\}, respectively, are known, and present an alternative way to obtainQ¯M∗\\bar\{Q\}\_\{M\}^\{\*\}for any taskMM\.
###### Definition 1\.
LetM∈ℳM\\in\\mathcal\{M\}and define𝒢M\+=\{g∈𝒢:r¯M\(g,g,a\)=r𝒰∀a∈𝒜\}\\mathcal\{G\}^\{\+\}\_\{M\}=\\\{g\\in\\mathcal\{G\}:\\bar\{r\}\_\{M\}\(g,g,a\)=\{r\_\{\\mathcal\{U\}\}\}\\;\\forall a\\in\\mathcal\{A\}\\\}, i\.e\.,𝒢M\+\\mathcal\{G\}^\{\+\}\_\{M\}is the set of desired goals for taskMM\. Denote𝒢i\+\\mathcal\{G\}^\{\+\}\_\{i\}for the set of goals for taskMiM\_\{i\}\.
###### Lemma 1\.
\(Proof in Appendix[A](https://arxiv.org/html/2606.04053#A1)\) Letℳ\\mathcal\{M\}be the set of tasks defined in Equation[1](https://arxiv.org/html/2606.04053#S3.E1)and𝒢M\+\\mathcal\{G\}^\{\+\}\_\{M\}like in Definition[1](https://arxiv.org/html/2606.04053#Thmdefinition1)\. For anyg∈𝒢g\\in\\mathcal\{G\}andM∈ℳM\\in\\mathcal\{M\},
Q¯M∗\(s,g,a\)=\{Q¯𝒰∗\(s,g,a\)ifg∈𝒢M\+∀\(s,a\)∈𝒮×𝒜,Q¯∅∗\(s,g,a\)ifg∈𝒢∖𝒢M\+∀\(s,a\)∈𝒮×𝒜\.\\bar\{Q\}^\{\*\}\_\{M\}\(s,g,a\)=\\begin\{cases\}\\bar\{Q\}\_\{\\mathcal\{U\}\}^\{\*\}\(s,g,a\)\\;\\;\\text\{if $g\\in\\mathcal\{G\}^\{\+\}\_\{M\}$\}&\\forall\(s,a\)\\in\\mathcal\{S\}\\times\\mathcal\{A\},\\\\ \\bar\{Q\}\_\{\\varnothing\}^\{\*\}\(s,g,a\)\\;\\;\\text\{if $g\\in\\mathcal\{G\}\\setminus\\mathcal\{G\}^\{\+\}\_\{M\}$\}&\\forall\(s,a\)\\in\\mathcal\{S\}\\times\\mathcal\{A\}\.\\end\{cases\}In other words, ifggis a goal for taskMM, i\.e\.g∈𝒢\+g\\in\\mathcal\{G\}^\{\+\}, then the state\-action slice for goalggand taskMMis that of the universal task for goalgg\. Similarly, ifggisnota goal for taskMM, i\.e\.g∉𝒢\+g\\notin\\mathcal\{G\}^\{\+\}, then it is the slice of the empty task\.
Lemma[1](https://arxiv.org/html/2606.04053#Thmlemma1)states that for anygg, theQ¯M∗\(⋅,g,⋅\)\\bar\{Q\}^\{\*\}\_\{M\}\(\\cdot,g,\\cdot\)slice is not arbitrary; it is either theQ¯𝒰∗\(⋅,g,⋅\)\\bar\{Q\}^\{\*\}\_\{\\mathcal\{U\}\}\(\\cdot,g,\\cdot\)slice or theQ¯∅∗\(⋅,g,⋅\)\\bar\{Q\}^\{\*\}\_\{\\varnothing\}\(\\cdot,g,\\cdot\)slice\. We can obtain the optimal extended Q\-value function for any given task in the BTA by independently selecting the state\-action slices corresponding to the universal task if the goal is active for taskMMand to the empty task otherwise\. It then follows that it is not necessary to learn2\+⌈log2\(\|𝒢\|\)⌉2\+\\lceil\\log\_\{2\}\(\\mathcal\{\|G\|\}\)\\rceilbase tasks to obtain composition because we would be learning duplicate information\. Figure[1](https://arxiv.org/html/2606.04053#S4.F1)portrays this idea by conceptually representing the optimal value functions as a 3D tensor for two tasks,M1M\_\{1\}andM2M\_\{2\}, that pursue goals𝒢M1\+=\{g2,g3\}\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\}=\\\{g\_\{2\},g\_\{3\}\\\}and𝒢M2\+=\{g1,g2\}\\mathcal\{G\}^\{\+\}\_\{M\_\{2\}\}=\\\{g\_\{1\},g\_\{2\}\\\}, respectively\. In particular, both tasks share a goalg2g\_\{2\}and hence the optimal state\-action slice corresponding to this goal coincides\. Moreover, the slices for shared non\-goals, i\.e\. elements of𝒢\\mathcal\{G\}that neitherM1M\_\{1\}norM2M\_\{2\}pursue, likeg4g\_\{4\}are also shared across tasks\.
Figure 1:Optimal extended Q\-value functions for two tasksM1M\_\{1\}andM2M\_\{2\}that pursue goals\{g2,g3\}\\\{g\_\{2\},g\_\{3\}\\\}and\{g1,g2\}\\\{g\_\{1\},g\_\{2\}\\\}, respectively\. Pursued goal slices are colored green and non\-pursued red\.It is important to note that we decide to learn the universal and empty tasks because they constitute an intuitive separation of the goal space but we can learn any two tasksM1,M2∈ℳM\_\{1\},M\_\{2\}\\in\\mathcal\{M\}such that𝒢1\+⊎𝒢2\+=𝒢\\mathcal\{G\}^\{\+\}\_\{1\}\\uplus\\mathcal\{G\}^\{\+\}\_\{2\}=\\mathcal\{G\}111Here⊎\\uplusdenotes disjoint union\., which means we learn a state\-action slice for pursuing and not pursuing every goal\. In summary, any combination of tasks can be chosen as long as the agent learns the expected return of reaching the goal when it is desired and undesired in the task\.
Under this new formulation, composition becomes straightforward since it suffices to obtain the set of desired goals for the composed task and construct the extended Q\-value function using Lemma[1](https://arxiv.org/html/2606.04053#Thmlemma1)\. We formalize this notion by defining, in Lemma[2](https://arxiv.org/html/2606.04053#Thmlemma2), an isomorphism between the set of tasks,ℳ\\mathcal\{M\}, and the power set of goals,2𝒢2^\{\\mathcal\{G\}\}, allowing us to express task composition via goal set composition in Theorem[1](https://arxiv.org/html/2606.04053#Thmtheorem1)\.
###### Lemma 2\.
\(Proof in Appendix[A](https://arxiv.org/html/2606.04053#A1)\) Letℱ:ℳ→𝒫\(𝒢\)\\mathcal\{F\}:\\mathcal\{M\}\\to\\mathcal\{P\}\(\\mathcal\{G\}\)be the map fromℳ\\mathcal\{M\}to𝒫\(𝒢\)\\mathcal\{P\}\(\\mathcal\{G\}\)such thatℱ\(M\)=𝒢M\+\\mathcal\{F\}\(M\)=\\mathcal\{G\}^\{\+\}\_\{M\}\. Thenℱ\\mathcal\{F\}is an isomorphism between the task Boolean algebra\(ℳ,∨,∧,¬,ℳ𝒰,ℳ∅\)\(\\mathcal\{M\},\\lor,\\land,\\neg,\\mathcal\{M\}\_\{\\mathcal\{U\}\},\\mathcal\{M\}\_\{\\varnothing\}\)and the power set algebra for goals\(𝒫\(𝒢\),∪,∩,′,𝒢,∅\)\(\\mathcal\{P\}\(\\mathcal\{G\}\),\\cup,\\cap,\\\>^\{\\prime\},\\mathcal\{G\},\\varnothing\)\.
###### Theorem 1\.
\(Proof in Appendix[A](https://arxiv.org/html/2606.04053#A1)\) LetM1,M2∈ℳM\_\{1\},M\_\{2\}\\in\\mathcal\{M\}be two tasks, and𝒢M1\+,𝒢M2\+\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\},\\mathcal\{G\}^\{\+\}\_\{M\_\{2\}\}the respective sets of desired goals\. Then,
1. \(i\)Q¯¬M1∗\(s,g,a\)=\{Q¯𝒰∗\(s,g,a\)∀\(s,a\)ifg∈𝒢∖𝒢M1\+,Q¯∅∗\(s,g,a\)∀\(s,a\)ifg∈𝒢M1\+\.\\;\\;\\bar\{Q\}^\{\*\}\_\{\\neg M\_\{1\}\}\(s,g,a\)=\\begin\{cases\}\\bar\{Q\}^\{\*\}\_\{\\mathcal\{\\mathcal\{U\}\}\}\(s,g,a\)\\;\\;\\forall\(s,a\)\\quad\\text\{if $g\\in\\mathcal\{G\}\\setminus\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\}$\},\\\\ \\bar\{Q\}^\{\*\}\_\{\\mathcal\{\\varnothing\}\}\(s,g,a\)\\;\\;\\forall\(s,a\)\\quad\\text\{if $g\\in\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\}$\}\.\\end\{cases\}
2. \(ii\)Q¯M1∨M2∗\(s,g,a\)=\{Q¯𝒰∗\(s,g,a\)∀\(s,a\)ifg∈𝒢M1\+∪𝒢M2\+,Q¯∅∗\(s,g,a\)∀\(s,a\)ifg∈𝒢∖\(𝒢M1\+∪𝒢M2\+\)\.\\;\\;\\bar\{Q\}^\{\*\}\_\{M\_\{1\}\\lor M\_\{2\}\}\(s,g,a\)=\\begin\{cases\}\\bar\{Q\}^\{\*\}\_\{\\mathcal\{\\mathcal\{U\}\}\}\(s,g,a\)\\;\\;\\forall\(s,a\)\\quad\\text\{if $g\\in\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\}\\cup\\mathcal\{G\}^\{\+\}\_\{M\_\{2\}\}$\},\\\\ \\bar\{Q\}^\{\*\}\_\{\\mathcal\{\\varnothing\}\}\(s,g,a\)\\;\\;\\forall\(s,a\)\\quad\\text\{if $g\\in\\mathcal\{G\}\\setminus\(\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\}\\cup\\mathcal\{G\}^\{\+\}\_\{M\_\{2\}\}\)$\}\.\\end\{cases\}
3. \(iii\)Q¯M1∧M2∗\(s,g,a\)=\{Q¯𝒰∗\(s,g,a\)∀\(s,a\)ifg∈𝒢M1\+∩𝒢M2\+,Q¯∅∗\(s,g,a\)∀\(s,a\)ifg∈𝒢∖\(𝒢M1\+∩𝒢M2\+\)\.\\;\\;\\bar\{Q\}^\{\*\}\_\{M\_\{1\}\\land M\_\{2\}\}\(s,g,a\)=\\begin\{cases\}\\bar\{Q\}^\{\*\}\_\{\\mathcal\{\\mathcal\{U\}\}\}\(s,g,a\)\\;\\;\\forall\(s,a\)\\quad\\text\{if $g\\in\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\}\\cap\\mathcal\{G\}^\{\+\}\_\{M\_\{2\}\}$\},\\\\ \\bar\{Q\}^\{\*\}\_\{\\mathcal\{\\varnothing\}\}\(s,g,a\)\\;\\;\\forall\(s,a\)\\quad\\text\{if $g\\in\\mathcal\{G\}\\setminus\(\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\}\\cap\\mathcal\{G\}^\{\+\}\_\{M\_\{2\}\}\)$\}\.\\end\{cases\}
Theorem[1](https://arxiv.org/html/2606.04053#Thmtheorem1)presents an alternative way to obtain task composition, characterized by the set of desired goals for each task\. Furthermore it brings a series of advantages with respect to the original method of composition presented in the paper\.
##### Computational cost\.
The formulation presented in Theorem[1](https://arxiv.org/html/2606.04053#Thmtheorem1)offers computational advantages over the original BTA composition method, both in learning and composition\. The original BTA framework requires learning⌈log2\|𝒢\|⌉\\lceil\\log\_\{2\}\|\\mathcal\{G\}\|\\rceilbase tasks, in addition toℳ𝒰\{\\mathcal\{M\}\_\{\\mathcal\{U\}\}\}andℳ∅\{\\mathcal\{M\}\_\{\\varnothing\}\}, whereas in our new formulation it suffices to learn only two tasks:ℳ𝒰\{\\mathcal\{M\}\_\{\\mathcal\{U\}\}\}andℳ∅\{\\mathcal\{M\}\_\{\\varnothing\}\}\. This reduces the number of parameters required forlearningfrom𝒪\(⌈log2\|𝒢\|⌉×\|𝒢\|×\|𝒮\|×\|𝒜\|\)\\mathcal\{O\}\\left\(\\lceil\\log\_\{2\}\|\\mathcal\{G\}\|\\rceil\\times\|\\mathcal\{G\}\|\\times\|\\mathcal\{S\}\|\\times\|\\mathcal\{A\}\|\\right\)to𝒪\(\|𝒢\|×\|𝒮\|×\|𝒜\|\)\\mathcal\{O\}\\left\(\|\\mathcal\{G\}\|\\times\|\\mathcal\{S\}\|\\times\|\\mathcal\{A\}\|\\right\)\. Thecost of compositionis also reduced by instead of performing a sequence of⌈log2\|𝒢\|⌉\\lceil\\log\_\{2\}\|\\mathcal\{G\}\|\\rceilelement\-wise operations, with worst\-case cost𝒪\(⌈log2\|𝒢\|⌉×\|𝒢\|×\|𝒮\|×\|𝒜\|\)\\mathcal\{O\}\\left\(\\lceil\\log\_\{2\}\|\\mathcal\{G\}\|\\rceil\\times\|\\mathcal\{G\}\|\\times\|\\mathcal\{S\}\|\\times\|\\mathcal\{A\}\|\\right\), performing set operations on the desired goals and memory access/copying of the corresponding per\-goal state\-action slices, with cost𝒪\(\|𝒢\|×\|𝒮\|×\|𝒜\|\)\\mathcal\{O\}\\left\(\|\\mathcal\{G\}\|\\times\|\\mathcal\{S\}\|\\times\|\\mathcal\{A\}\|\\right\)\. Corollary[1](https://arxiv.org/html/2606.04053#Thmcorollary1)provides the key insight, i\.e\., leveraging the order between task slices, the computations over the state\-action dimensions can be done in𝒪\(1\)\\mathcal\{O\}\(1\)time by simply choosing the bigger slice\.
###### Corollary 1\.
\(Proof in Appendix[A](https://arxiv.org/html/2606.04053#A1)\) The universal slice is uniformly greater than the empty slice for all goals, states and actions\.Q¯∅∗\(s,g,a\)≤Q¯𝒰∗\(s,g,a\)∀\(s,g,a\)∈𝒮×𝒢×𝒜\.\\bar\{Q\}^\{\*\}\_\{\\varnothing\}\(s,g,a\)\\leq\\bar\{Q\}^\{\*\}\_\{\\mathcal\{U\}\}\(s,g,a\)\\;\\;\\forall\(s,g,a\)\\in\\mathcal\{S\}\\times\\mathcal\{G\}\\times\\mathcal\{A\}\.
##### What happens for stochastic MDPs?
This work addresses the matter of how many value functions are needed to construct every optimal extended Q\-value function in*deterministic*MDPs\. We can also wonder how many value functions would be needed to construct every optimal extended Q\-value function in*stochastic*MDPs\. In the former case we know that whatever the goal set, at most we need the functions under\|𝒢\|\|\\mathcal\{G\}\|different policies to obtain the optimal policy through composition\. For a task with a given set of desired goals we can operate independently learned slices for each set of goals\. However, this does not hold in a stochastic MDP, because this independence between goals inside the desired goal set disappears\. Proposition[2](https://arxiv.org/html/2606.04053#Thmproposition2)proves these points by showing that the nature of optimal policies in the stochastic setting is exponential with respect to the number of total goals\.
###### Proposition 2\.
\(Proof in Appendix[A](https://arxiv.org/html/2606.04053#A1)\.\) There exist MDPs with stochastic transition dynamics and goal set𝒢\\mathcal\{G\}for which there will be a different optimal action for each of the2\|𝒢\|−12^\{\|\\mathcal\{G\}\|\}\-1desired goal sets,𝒢\+⊆𝒢\\mathcal\{G\}^\{\+\}\\subseteq\\mathcal\{G\}, with\|𝒢\+\|≥1\|\\mathcal\{G\}^\{\+\}\|\\geq 1\.
Consequently, we argue that there can be no strategy for the stochastic case that, in each state, takes a maximum \(or another type of convex combination\) over Q\-value functions of a ‘small’ set of policies \(linear or even polynomial in\|𝒢\|\|\\mathcal\{G\}\|\)\. In the worst case, a different action is optimal for each possible goal set \(of which there can be exponentially many\), meaning that a convex combination would need to be taken over an exponential number of ‘primitive’ Q\-value functions222Note that there are special cases where such constructions are possible, see e\.g\.Infanteet al\.\[[2022](https://arxiv.org/html/2606.04053#bib.bib34)\]\.\. In fact,Tasseet al\.\[[2022](https://arxiv.org/html/2606.04053#bib.bib35)\]extend the BTA framework to stochastic MDPs, but optimality of composed value functions no longer holds, and only approximate guarantees can be established\.
## 5Experiments
Thus far, we formalized and proposed a theoretical improvement in both the learning and composition costs of the BTA, respectively\. The aim of the experiments presented here is to compare our proposed method to that of the original BTA\[Nangue Tasseet al\.,[2020](https://arxiv.org/html/2606.04053#bib.bib1)\]in order to confirm these improvements empirically\. Additionally, we compare the advantage of this new method with respect to that of Skill Machines\[Nangue Tasseet al\.,[2024](https://arxiv.org/html/2606.04053#bib.bib33)\], where we aim to show improvements only in composition cost, since this framework already takes advantage of training improvements by training two extended value functions \(Algorithm 1 in\[Nangue Tasseet al\.,[2024](https://arxiv.org/html/2606.04053#bib.bib33)\]\)\. In particular we study these methods insub\-optimal scenarioswith increasing number of training iterations andupon convergence\.
### 5\.1Experimental setup
![[Uncaptioned image]](https://arxiv.org/html/2606.04053v1/figures/rooms_comparison_grid.png)##### Rooms environment\[Sutton and Barto,[1998](https://arxiv.org/html/2606.04053#bib.bib18)\]\.
We extend the four rooms environment\[Sutton and Barto,[1998](https://arxiv.org/html/2606.04053#bib.bib18)\]to incorporate an exponentially growing number of rooms so we can place one goal per room and study the scaling law of both methods\. Consider a grid\-world in which the agent navigates between rooms separated by walls with narrow doorways\. The state space consists of all navigable grid cells and actions are the four cardinal movement directions plus a stay action\. We instantiate three variants: a2×22\\times 2rooms layout with\|𝒢\|=4\|\\mathcal\{G\}\|=4goals \(one per room\), a3×33\\times 3rooms layout with\|𝒢\|=8\|\\mathcal\{G\}\|=8goals \(the middle room contains no goals\), and a4×44\\times 4rooms layout with\|𝒢\|=16\|\\mathcal\{G\}\|=16goals\. We setr𝒰=2r\_\{\\mathcal\{U\}\}=2,r∅=−0\.1r\_\{\\varnothing\}=\-0\.1, a step reward of−0\.1\-0\.1, and a discount factor of11\. Given the exponentially growing number of tasks, we randomly sample at most55tasks for each goal set length for diversity and train the universal, empty, base tasks as well as all tasks separately to obtain their optimal policies\. All tasks are trained with the same number of maximum iterations, meaning that the base task method uses more compute since it requires learning more tasks\.
![[Uncaptioned image]](https://arxiv.org/html/2606.04053v1/figures/map.png)
##### Boxman environment\[Nangue Tasseet al\.,[2020](https://arxiv.org/html/2606.04053#bib.bib1)\]\.
A high\-dimensional object collection environment with pixel\-based observations of size84×84×384\\times 84\\times 3\(RGB\)\. The agent must collect objects defined by combinations of colors \(beige, blue and purple\) and shapes \(squares and circles\), which constitute skill primitives\. Tasks are learned using deep Q\-learning with a CNN\+MLP, where we train for 3 full runs and save intermediate checkpoints\. We then take each checkpoint and evaluate in the environment, averaging over runs\. This allows us to understand how the learned policies converge under different methods\.
![[Uncaptioned image]](https://arxiv.org/html/2606.04053v1/figures/office_map.png)
##### Office Gridworld\[Icarteet al\.,[2022](https://arxiv.org/html/2606.04053#bib.bib7)\]for Skill Machines\.
A tabular grid\-world containing labelled locations: roomsA,B,C,DA,B,C,D, a coffee machine, a mail room, an office, and decoration cells\. States are grid cells, actions are the four cardinal directions, and the goal set𝒢\\mathcal\{G\}consists of the labelled locations\. Tasks are temporal logic specifications over these propositions, such as delivering coffee to the office before picking up mail or avoiding decoration cells\.
![[Uncaptioned image]](https://arxiv.org/html/2606.04053v1/figures/safety_map.png)Safety Gym Domain\[Achiam and Amodei,[2019](https://arxiv.org/html/2606.04053#bib.bib36)\]for Skill Machines\.A continuous control environment in which a point\-mass robot navigates to various regions using a continuous state space𝒮=ℝ60\\mathcal\{S\}=\\mathbb\{R\}^\{60\}\(comprising joint velocities, LiDAR readings, and other sensory inputs\) and a continuous action space𝒜=ℝ2\\mathcal\{A\}=\\mathbb\{R\}^\{2\}\(direction and force\)\. Tasks are defined over\|𝒫\|=3\|\\mathcal\{P\}\|=3propositions corresponding to a red cylinder, green buttons, and blue regions, with a single constraint𝒞=\{c\}\\mathcal\{C\}=\\\{c\\\}tracking blue\-region violations, yielding\|𝒫∪𝒞\|=4\|\\mathcal\{P\}\\cup\\mathcal\{C\}\|=4skill primitives trained using TD3\.
In both of these environments we measure successes in the LTL specification as well as total return\. We evaluatethree methodsin this domain: theoriginalSkill Machines approach, which learns22value functions and performs composition via Boolean operations \(ultimately these are min, max, sum\); the goal\-set based approach, which learns22value functions and composes via slice selection; and the base tasks method, which learns more than22tasks and composes via Boolean operations\.
### 5\.2Results\.
##### Rooms environment\.
Figure[2](https://arxiv.org/html/2606.04053#S5.F2)shows the evolution of the average returns of the sampled tasks as we increase the number of training iterations\. We observe that returns given by the goal\-set method are consistently higher than the base task approach\. With sufficient iterations, both methods learn the optimal policy, but composition through the universal and empty tasks converges to the optimal policy faster\. We hypothesize this behavior is due to the mixing of the value functions obtained via themax\\maxandmin\\minoperations being detrimental\. In this sense, Corollary[1](https://arxiv.org/html/2606.04053#Thmcorollary1)is not guaranteed to hold until the Q\-value functions converge, thus these operations might select an arbitrary number of parameters from one slice and the rest from the other\. In turn, full slice selection still maintains the relative ordering of slices, which we show is a fundamental property of the optimal value function\.
Figure 2:Evolution of the mean \(±\\pmstd\) return for each goal set length with the number of maximum iterations of the training algorithm\. The returns for each task are measured on10001000episodes and averaged over55tasks\. The number of training iterations \(x\-axis\) is per\-value function\.![[Uncaptioned image]](https://arxiv.org/html/2606.04053v1/x3.png)Temporal cost of composition\.We measure the time taken for performing composition across all tasks using base tasks versus using only the universal and empty tasks in the goal\-set based formulation\. The right figure shows thatUniv\./Emptyconsistently achieves lower composition times and exhibits substantially less variability than theBase Tasksapproach across all problem sizes\. Moreover, the gap between the two methods widens as the number of goals increases\. Note that the y\-axis is in logarithmic scale\.
##### Boxman environment\.
Figure[3](https://arxiv.org/html/2606.04053#S5.F3)shows convergence of both methods in the Boxman environment, where we include two representations of the same training runs\. Figure[3\(a\)](https://arxiv.org/html/2606.04053#S5.F3.sf1)shows the evolution of the average return as a function of the number of training iterations per UVFA\. We observe that both methods reach a similar final return, but the original base task method converges to the same optimal policy with fewer iterations per value function\. However, this advantage disappears when we correct for the total number of training iterations per method\. Figure[3\(b\)](https://arxiv.org/html/2606.04053#S5.F3.sf2)shows the same results as a function of the total number of training iterations summed across all UVFAs\. Since the base task method requires learning more value functions, its total training cost is higher, and it remains at a suboptimal return for a significantly longer total budget, whereas our proposed method reaches near\-optimal performance earlier\. The figures also confirm that in a function approximation setting, learning additional base tasks does not translate to improved performance upon convergence\.
\(a\)Comparison for a fixed per\-UVFA budget\.
\(b\)Comparison for a fixed total training budget\.
Figure 3:Evolution of the average reward \(±\\pmstd\) across all tasks in the Boxman environment\. The x\-axis represents the number of training iterations\. Results are averaged over tasks and33runs\.
##### Office Gridworld environment with Skill Machines\.
Figure[4\(a\)](https://arxiv.org/html/2606.04053#S5.F4.sf1)depicts the success rate and average episode return of the LTL specification for the average of the different tasks in the office environment\. We see that all methods converge almost in an equivalent manner, confirming that extra efforts to train base tasks yield no additional returns\. Figure[4\(b\)](https://arxiv.org/html/2606.04053#S5.F4.sf2)confirms that the goal\-set based approach \(Univ\./Empty\) has a lower composition cost, since it consists only of slice selection\. We further observe that the original and base task methods have similar cost given that their method for composition is the same\. See Appendix[B\.1](https://arxiv.org/html/2606.04053#A2.SS1)and[B\.2\.3](https://arxiv.org/html/2606.04053#A2.SS2.SSS3)for task explanations and per\-task metrics\.
\(a\)Average returns \(left axis\) and success rate \(right axis\)\.
\(b\)Violin plot of composition time \(log sc\.\)\.
Figure 4:Evolution of average \(±\\pmstd\) successes and returns across33runs and44tasks in the temporal composition of actions for the Office Gridworld environment\.
##### Safety Gym environment\.
The results for the safety gym environment further confirm the conclusions of the office Gridworld and can be found in Appendix[B\.2\.4](https://arxiv.org/html/2606.04053#A2.SS2.SSS4)\.
## 6Conclusion
In this work we revisited the Boolean Task Algebra and studied structural properties of its environmental assumptions that lead to a simplified formulation of the optimal value functions, see Section[4](https://arxiv.org/html/2606.04053#S4)\. First, we proved that the expressive space of extended optimal Q\-value functions is fully determined by the universal and empty tasks, thereby confirming our first contribution in Lemma[1](https://arxiv.org/html/2606.04053#Thmlemma1), that is, the effective collapse of the representation space to two fundamental value functions\. Building on this insight, we introduced an isomorphism in Lemma[2](https://arxiv.org/html/2606.04053#Thmlemma2)between tasks and subsets of goals, enabling a goal\-set\-based characterization of composition\. This establishes our second contribution, provided in Theorem[1](https://arxiv.org/html/2606.04053#Thmtheorem1), as it allows logical operations to be performed directly on goal sets rather than through element\-wise manipulation of extended Q\-value functions\. We further analyze whether our proposed method would extend to stochastic MDPs and propose a counterexample in Proposition[2](https://arxiv.org/html/2606.04053#Thmproposition2)where we show that the number of optimal policies to account for in order to achieve optimal composition is exponential in the number of total goals\.
Finally, Section[5](https://arxiv.org/html/2606.04053#S5)shows that our goal\-set method reduces composition time across tabular, function approximation, visual and continuous control environments, for both standard and LTL task specifications\. This includes training and composition time improvements for standard BTA, and composition\-only gains for Skill Machines\.
## References
- Benchmarking safe exploration in deep reinforcement learning\.External Links:[Link](https://api.semanticscholar.org/CorpusID:208283920)Cited by:[§5\.1](https://arxiv.org/html/2606.04053#S5.SS1.SSS0.Px3.p2.5.1)\.
- J\. Andreas, D\. Klein, and S\. Levine \(2017\)Modular multitask reinforcement learning with policy sketches\.InProceedings of the 34th International Conference on Machine Learning,D\. Precup and Y\. W\. Teh \(Eds\.\),Proceedings of Machine Learning Research, Vol\.70,pp\. 166–175\.Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px1.p3.1)\.
- A\. Barreto, W\. Dabney, R\. Munos, J\. J\. Hunt, T\. Schaul, H\. van Hasselt, and D\. Silver \(2017\)Successor features for transfer in reinforcement learning\.InAdvances in Neural Information Processing Systems,I\. Guyon, U\. V\. Luxburg, S\. Bengio, H\. Wallach, R\. Fergus, S\. Vishwanathan, and R\. Garnett \(Eds\.\),Vol\.30,pp\.\.Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px1.p1.1)\.
- D\. Borsa, A\. Barreto, J\. Quan, D\. J\. Mankowitz, H\. van Hasselt, R\. Munos, D\. Silver, and T\. Schaul \(2019\)Universal successor features approximators\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px1.p2.1)\.
- P\. Dayan \(1993\)Improving generalization for temporal difference learning: the successor representation\.Neural Computation5\(4\),pp\. 613–624\.External Links:ISSN 0899\-7667,[Document](https://dx.doi.org/10.1162/neco.1993.5.4.613)Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Haarnoja, V\. Pong, A\. Zhou, M\. Dalal, P\. Abbeel, and S\. Levine \(2018\)Composable deep reinforcement learning for robotic manipulation\.In2018 IEEE International Conference on Robotics and Automation \(ICRA\),Vol\.,pp\. 6244–6251\.External Links:[Document](https://dx.doi.org/10.1109/ICRA.2018.8460756)Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px2.p1.2)\.
- T\. Haarnoja, H\. Tang, P\. Abbeel, and S\. Levine \(2017\)Reinforcement learning with deep energy\-based policies\.InProceedings of the 34th International Conference on Machine Learning,D\. Precup and Y\. W\. Teh \(Eds\.\),Proceedings of Machine Learning Research, Vol\.70,pp\. 1352–1361\.Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px2.p1.2)\.
- K\. M\. Hermann, F\. Hill, S\. Green, F\. Wang, R\. Faulkner, H\. Soyer, D\. Szepesvari, W\. M\. Czarnecki, M\. Jaderberg, D\. Teplyashin, M\. Wainwright, C\. Apps, D\. Hassabis, and P\. Blunsom \(2017\)Grounded language learning in a simulated 3d world\.External Links:1706\.06551Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px1.p3.1)\.
- J\. Hunt, A\. Barreto, T\. Lillicrap, and N\. Heess \(2019\)Composing entropic policies using divergence correction\.InProceedings of the 36th International Conference on Machine Learning,K\. Chaudhuri and R\. Salakhutdinov \(Eds\.\),Proceedings of Machine Learning Research, Vol\.97,pp\. 2911–2920\.Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px2.p1.2)\.
- R\. T\. Icarte, T\. Q\. Klassen, R\. Valenzano, and S\. A\. McIlraith \(2022\)Reward machines: exploiting reward function structure in reinforcement learning\.Journal of Artificial Intelligence Research73,pp\. 173–208\.Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px2.p2.3),[§5\.1](https://arxiv.org/html/2606.04053#S5.SS1.SSS0.Px3)\.
- G\. Infante, A\. Jonsson, and V\. Gómez \(2022\)Globally optimal hierarchical reinforcement learning for linearly\-solvable markov decision processes\.Vol\.36,pp\. 6970–6977\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v36i6.20655)Cited by:[footnote 2](https://arxiv.org/html/2606.04053#footnote2)\.
- D\. Kuric, G\. Infante, V\. Gómez, A\. Jonsson, and H\. van Hoof \(2024\)Planning with a learned policy basis to optimally solve complex tasks\.InInternational Conference on Automated Planning and Scheduling,Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px1.p3.1)\.
- G\. Nangue Tasse, S\. James, and B\. Rosman \(2020\)A boolean task algebra for reinforcement learning\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 9497–9507\.Cited by:[§A\.1](https://arxiv.org/html/2606.04053#A1.SS1.1.p1.12),[§A\.3](https://arxiv.org/html/2606.04053#A1.SS3.1.p1.7),[§B\.1](https://arxiv.org/html/2606.04053#A2.SS1.p1.1),[§1](https://arxiv.org/html/2606.04053#S1.p1.1),[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px1.p2.1),[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px2.p2.3),[§3](https://arxiv.org/html/2606.04053#S3.SS0.SSS0.Px1.p1.8),[§3](https://arxiv.org/html/2606.04053#S3.SS0.SSS0.Px2.p1.3),[§3](https://arxiv.org/html/2606.04053#S3.SS0.SSS0.Px3.p3.5),[§3](https://arxiv.org/html/2606.04053#S3.SS0.SSS0.Px4.p1.4),[§5\.1](https://arxiv.org/html/2606.04053#S5.SS1.SSS0.Px2),[§5](https://arxiv.org/html/2606.04053#S5.p1.1),[Proposition 1](https://arxiv.org/html/2606.04053#Thmproposition1.p1.11.11)\.
- G\. Nangue Tasse, D\. Jarvis, S\. James, and B\. Rosman \(2024\)Skill machines: temporal logic skill composition in reinforcement learning\.External Links:2205\.12532Cited by:[§B\.1](https://arxiv.org/html/2606.04053#A2.SS1.SSS0.Px4.p1.13),[§B\.1](https://arxiv.org/html/2606.04053#A2.SS1.p1.1),[§1](https://arxiv.org/html/2606.04053#S1.p2.1),[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px2.p2.3),[§3](https://arxiv.org/html/2606.04053#S3.SS0.SSS0.Px5.p1.20),[§5](https://arxiv.org/html/2606.04053#S5.p1.1)\.
- T\. Schaul, D\. Horgan, K\. Gregor, and D\. Silver \(2015\)Universal value function approximators\.InProceedings of the 32nd International Conference on Machine Learning,F\. Bach and D\. Blei \(Eds\.\),Proceedings of Machine Learning Research, Vol\.37,Lille, France,pp\. 1312–1320\.Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px1.p2.1)\.
- R\. S\. Sutton and A\. G\. Barto \(1998\)Reinforcement learning: an introduction\.Second edition,The MIT Press\.Cited by:[§5\.1](https://arxiv.org/html/2606.04053#S5.SS1.SSS0.Px1),[§5\.1](https://arxiv.org/html/2606.04053#S5.SS1.SSS0.Px1.p1.11)\.
- R\. S\. Sutton, D\. Precup, and S\. Singh \(1999\)Between mdps and semi\-mdps: a framework for temporal abstraction in reinforcement learning\.Artificial Intelligence112\(1\),pp\. 181–211\.External Links:ISSN 0004\-3702,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/S0004-3702%2899%2900052-1)Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px1.p3.1)\.
- G\. N\. Tasse, S\. James, and B\. Rosman \(2022\)Generalisation in lifelong reinforcement learning through logical composition\.InInternational Conference on Learning Representations,Cited by:[item 1](https://arxiv.org/html/2606.04053#S1.I1.i1.p1.1),[§1](https://arxiv.org/html/2606.04053#S1.p2.1),[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px2.p2.3),[§4](https://arxiv.org/html/2606.04053#S4.SS0.SSS0.Px2.p2.1),[§4](https://arxiv.org/html/2606.04053#S4.p2.3)\.
- E\. Todorov \(2006\)Linearly\-solvable markov decision problems\.InAdvances in Neural Information Processing Systems,B\. Schölkopf, J\. Platt, and T\. Hoffman \(Eds\.\),Vol\.19,pp\.\.Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px1.p1.1)\.
- E\. Todorov \(2009\)Compositionality of optimal control laws\.InAdvances in Neural Information Processing Systems,Y\. Bengio, D\. Schuurmans, J\. Lafferty, C\. Williams, and A\. Culotta \(Eds\.\),Vol\.22,pp\.\.Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Vaezipoor, A\. C\. Li, R\. A\. T\. Icarte, and S\. A\. Mcilraith \(2021\)LTL2Action: generalizing ltl instructions for multi\-task rl\.InProceedings of the 38th International Conference on Machine Learning,M\. Meila and T\. Zhang \(Eds\.\),Proceedings of Machine Learning Research, Vol\.139,pp\. 10497–10508\.Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px1.p3.1)\.
- B\. Van Niekerk, S\. James, A\. Earle, and B\. Rosman \(2019\)Composing value functions in reinforcement learning\.InProceedings of the 36th International Conference on Machine Learning,K\. Chaudhuri and R\. Salakhutdinov \(Eds\.\),Proceedings of Machine Learning Research, Vol\.97,pp\. 6401–6409\.Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px2.p1.2)\.
- A\. S\. Vezhnevets, S\. Osindero, T\. Schaul, N\. Heess, M\. Jaderberg, D\. Silver, and K\. Kavukcuoglu \(2017\)FeUdal networks for hierarchical reinforcement learning\.InProceedings of the 34th International Conference on Machine Learning,D\. Precup and Y\. W\. Teh \(Eds\.\),Proceedings of Machine Learning Research, Vol\.70,pp\. 3540–3549\.Cited by:[§2](https://arxiv.org/html/2606.04053#S2.SS0.SSS0.Px1.p3.1)\.
## Appendix AProofs
In this section we present proofs for the different mathematical statements\. We begin by restating the following result and definition\.
See[1](https://arxiv.org/html/2606.04053#Thmproposition1)
See[1](https://arxiv.org/html/2606.04053#Thmdefinition1)
### A\.1Characterizing optimal extended value functions
See[1](https://arxiv.org/html/2606.04053#Thmlemma1)
###### Proof\.
\(Lemma[1](https://arxiv.org/html/2606.04053#Thmlemma1)\)LetM∈ℳM\\in\\mathcal\{M\},g∈𝒢g\\in\\mathcal\{G\},\(s,a\)∈𝒮×𝒜\(s,a\)\\in\\mathcal\{S\}\\times\\mathcal\{A\}\. Ifggis unreachable fromss, from the proof of Lemma 2 fromNangue Tasseet al\.\[[2020](https://arxiv.org/html/2606.04053#bib.bib1)\], we know thatMg=ℳ𝒰,gM\_\{g\}=\\mathcal\{M\}\_\{\\mathcal\{U\},g\}, whereMgM\_\{g\}andℳ𝒰,g\\mathcal\{M\}\_\{\\mathcal\{U\},g\}are MDPs with the same transition dynamics asMMandℳ𝒰\\mathcal\{M\}\_\{\\mathcal\{U\}\}and with reward functionsrMg\(s,a\)=r¯M\(s,g,a\)r\_\{M\_\{g\}\}\(s,a\)=\\bar\{r\}\_\{M\}\(s,g,a\)andrℳ𝒰,g=r¯ℳ𝒰\(s,g,a\)r\_\{\\mathcal\{M\}\_\{\\mathcal\{U\},g\}\}=\\bar\{r\}\_\{\{\\mathcal\{M\}\_\{\\mathcal\{U\}\}\}\}\(s,g,a\), respectively\. By uniqueness of the solution to the Bellman optimality equation, their optimal Q\-functions must be identical\.
We now study the case whereggis reachable from statess\. From Corollary 1 we know that for theses,g,a,∃Gs:g,a∗∈ℝs,g,a,\\;\\exists\\\!\\;G\_\{s:g,a\}^\{\*\}\\in\\mathbb\{R\}such thatQ¯M∗\(s,g,a\)=Gs:g,a∗\+r¯M\(s′,g,a′\)\\bar\{Q\}\_\{M\}^\{\*\}\(s,g,a\)=G\_\{s:g,a\}^\{\*\}\+\\bar\{r\}\_\{M\}\(s^\{\\prime\},g,a^\{\\prime\}\), wheres′∈𝒢s^\{\\prime\}\\in\\mathcal\{G\}anda′=argmaxb∈𝒜r¯M\(s′,g,b\)a^\{\\prime\}=\\underset\{b\\in\\mathcal\{A\}\}\{\\arg\\max\}\\,\\bar\{r\}\_\{M\}\(s^\{\\prime\},g,b\)\. Furthermore, from the proof of Lemma 2 we know thats′=gs^\{\\prime\}=g, henceQ¯M∗\(s,g,a\)=Gs:g,a∗\+r¯M\(g,g,a′\)\\bar\{Q\}\_\{M\}^\{\*\}\(s,g,a\)=G\_\{s:g,a\}^\{\*\}\+\\bar\{r\}\_\{M\}\(g,g,a^\{\\prime\}\)\. Note thatGs:g,a∗G\_\{s:g,a\}^\{\*\}is independent of the task\. We now proceed by cases\.
■\\blacksquareIfggis a goal for taskMM, we have that by definition ofr¯M\\bar\{r\}\_\{M\}andr¯ℳ𝒰\\bar\{r\}\_\{\{\\mathcal\{M\}\_\{\\mathcal\{U\}\}\}\},r¯M\(g,g,a′\)=rM\(g,a′\)\\bar\{r\}\_\{M\}\(g,g,a^\{\\prime\}\)=r\_\{M\}\(g,a^\{\\prime\}\)andr¯ℳ𝒰\(g,g,a′\)=rℳ𝒰\(g,a′\)\\bar\{r\}\_\{\{\\mathcal\{M\}\_\{\\mathcal\{U\}\}\}\}\(g,g,a^\{\\prime\}\)=r\_\{\{\\mathcal\{M\}\_\{\\mathcal\{U\}\}\}\}\(g,a^\{\\prime\}\)\. From the definition ofℳ\\mathcal\{M\},rM\(g,a′\),rℳ𝒰\(g,a′\)∈\{r∅,r𝒰\}r\_\{M\}\(g,a^\{\\prime\}\),r\_\{\{\\mathcal\{M\}\_\{\\mathcal\{U\}\}\}\}\(g,a^\{\\prime\}\)\\in\\\{\{r\_\{\\varnothing\}\},\{r\_\{\\mathcal\{U\}\}\}\\\}, and more concretelyrM\(g,a′\)=rℳ𝒰\(g,a′\)=r𝒰r\_\{M\}\(g,a^\{\\prime\}\)=r\_\{\{\\mathcal\{M\}\_\{\\mathcal\{U\}\}\}\}\(g,a^\{\\prime\}\)=\{r\_\{\\mathcal\{U\}\}\}given thatggis a goal for bothMMandℳ𝒰\{\\mathcal\{M\}\_\{\\mathcal\{U\}\}\}\. Consequently, we have thatr¯M\(g,g,a′\)=r¯ℳ𝒰\(g,g,a′\)\\bar\{r\}\_\{M\}\(g,g,a^\{\\prime\}\)=\\bar\{r\}\_\{\{\\mathcal\{M\}\_\{\\mathcal\{U\}\}\}\}\(g,g,a^\{\\prime\}\)and together with Corollary 1 as well as the fact thatgganda′a^\{\\prime\}are arbitrary, we obtain
Q¯M∗\(s,g,a\)=Gs:g,a∗\+r¯M\(g,g,a′\)=Gs:g,a∗\+r¯ℳ𝒰\(g,g,a′\)=Q¯𝒰∗\(s,g,a\)\.\\displaystyle\\bar\{Q\}\_\{M\}^\{\*\}\(s,g,a\)=G\_\{s:g,a\}^\{\*\}\+\\bar\{r\}\_\{M\}\(g,g,a^\{\\prime\}\)=G\_\{s:g,a\}^\{\*\}\+\\bar\{r\}\_\{\{\\mathcal\{M\}\_\{\\mathcal\{U\}\}\}\}\(g,g,a^\{\\prime\}\)=\\bar\{Q\}\_\{\\mathcal\{U\}\}^\{\*\}\(s,g,a\)\.\(2\)
■\\blacksquareIfggis not a goal for taskMM, the reasoning is analogous,r¯M\(g,g,a′\)=rM\(g,a′\)=r∅=rℳ∅\(g,a′\)=r¯ℳ∅\(g,g,a′\)\\bar\{r\}\_\{M\}\(g,g,a^\{\\prime\}\)=r\_\{M\}\(g,a^\{\\prime\}\)=\{r\_\{\\varnothing\}\}=r\_\{\{\\mathcal\{M\}\_\{\\varnothing\}\}\}\(g,a^\{\\prime\}\)=\\bar\{r\}\_\{\{\\mathcal\{M\}\_\{\\varnothing\}\}\}\(g,g,a^\{\\prime\}\)\. Finally,
Q¯M∗\(s,g,a\)=Gs:g,a∗\+r¯M\(g,g,a′\)=Gs:g,a∗\+r¯ℳ∅\(g,g,a′\)=Q¯∅∗\(s,g,a\)\.\\displaystyle\\bar\{Q\}\_\{M\}^\{\*\}\(s,g,a\)=G\_\{s:g,a\}^\{\*\}\+\\bar\{r\}\_\{M\}\(g,g,a^\{\\prime\}\)=G\_\{s:g,a\}^\{\*\}\+\\bar\{r\}\_\{\{\\mathcal\{M\}\_\{\\varnothing\}\}\}\(g,g,a^\{\\prime\}\)=\\bar\{Q\}\_\{\\varnothing\}^\{\*\}\(s,g,a\)\.\(3\)∎
### A\.2A new goal\-set based method for composition
See[2](https://arxiv.org/html/2606.04053#Thmlemma2)
###### Proof\.
\(Lemma[2](https://arxiv.org/html/2606.04053#Thmlemma2)\)Bijectivity follows directly from Definition[1](https://arxiv.org/html/2606.04053#Thmdefinition1)\. We now prove thatℱ\\mathcal\{F\}is a homomorphism\. LetM1∈ℳM\_\{1\}\\in\\mathcal\{M\}\. It suffices to prove the following: \(i\)ℱ\(M1\)′=ℱ\(¬M1\)\\mathcal\{F\}\(M\_\{1\}\)^\{\\prime\}=\\mathcal\{F\}\(\\neg M\_\{1\}\), \(ii\)ℱ\(M1∧M2\)=ℱ\(M1\)∩ℱ\(M2\)\\mathcal\{F\}\(M\_\{1\}\\land M\_\{2\}\)=\\mathcal\{F\}\(M\_\{1\}\)\\cap\\mathcal\{F\}\(M\_\{2\}\)and \(iii\)ℱ\(M1∨M2\)=ℱ\(M1\)∪ℱ\(M2\)\\mathcal\{F\}\(M\_\{1\}\\lor M\_\{2\}\)=\\mathcal\{F\}\(M\_\{1\}\)\\cup\\mathcal\{F\}\(M\_\{2\}\)\.
1. \(i\)We first prove the set inclusionℱ\(¬M1\)⊆ℱ\(M1\)′\\mathcal\{F\}\(\\neg M\_\{1\}\)\\subseteq\\mathcal\{F\}\(M\_\{1\}\)^\{\\prime\}\. Letg∈ℱ\(¬M1\)g\\in\\mathcal\{F\}\(\\neg M\_\{1\}\), thenr¯¬M1\(g,g,a\)=r¬M1\(g,a\)=r𝒰\\bar\{r\}\_\{\\neg M\_\{1\}\}\(g,g,a\)=r\_\{\\neg M\_\{1\}\}\(g,a\)=\{r\_\{\\mathcal\{U\}\}\}\. Now,r¯¬\(¬M1\)\(g,g,a\)=r¬\(¬M1\)\(g,a\)=\(rℳ𝒰\(g,a\)\+rℳ∅\(g,a\)\)−r¬M1\(g,a\)=r𝒰\+r∅−r𝒰=r∅\\bar\{r\}\_\{\\neg\(\\neg M\_\{1\}\)\}\(g,g,a\)=r\_\{\\neg\(\\neg M\_\{1\}\)\}\(g,a\)=\(r\_\{\\mathcal\{M\}\_\{\\mathcal\{U\}\}\}\(g,a\)\+r\_\{\\mathcal\{M\}\_\{\\varnothing\}\}\(g,a\)\)\-r\_\{\\neg M\_\{1\}\}\(g,a\)=\{r\_\{\\mathcal\{U\}\}\}\+\{r\_\{\\varnothing\}\}\-\{r\_\{\\mathcal\{U\}\}\}=\{r\_\{\\varnothing\}\}\. Hence,g∉ℱ\(¬\(¬M1\)\)=ℱ\(M1\)g\\notin\\mathcal\{F\}\(\\neg\(\\neg M\_\{1\}\)\)=\\mathcal\{F\}\(M\_\{1\}\), and thusg∈ℱ\(M1\)′g\\in\\mathcal\{F\}\(M\_\{1\}\)^\{\\prime\}\. The previous reasoning can be traversed in reverse order to obtainℱ\(M1\)′⊆ℱ\(¬M1\)\\mathcal\{F\}\(M\_\{1\}\)^\{\\prime\}\\subseteq\\mathcal\{F\}\(\\neg M\_\{1\}\)\.
2. \(ii\)ℱ\(M1\)∩ℱ\(M2\)⊆ℱ\(M1∧M2\)\\mathcal\{F\}\(M\_\{1\}\)\\cap\\mathcal\{F\}\(M\_\{2\}\)\\subseteq\\mathcal\{F\}\(M\_\{1\}\\land M\_\{2\}\)follows directly from the definition of operator∧\\landoverℳ\\mathcal\{M\}\. We now prove thatℱ\(M1∧M2\)⊆ℱ\(M1\)∩ℱ\(M2\)\\mathcal\{F\}\(M\_\{1\}\\land M\_\{2\}\)\\subseteq\\mathcal\{F\}\(M\_\{1\}\)\\cap\\mathcal\{F\}\(M\_\{2\}\)\. Letg∈ℱ\(M1∧M2\)g\\in\\mathcal\{F\}\(M\_\{1\}\\land M\_\{2\}\), thenr¯M1∧M2\(g,g,a\)=r𝒰\\bar\{r\}\_\{M\_\{1\}\\land M\_\{2\}\}\(g,g,a\)=\{r\_\{\\mathcal\{U\}\}\}\. We now proceed by contradiction, suppose thatg∉ℱ\(M1\)∩ℱ\(M2\)g\\notin\\mathcal\{F\}\(M\_\{1\}\)\\cap\\mathcal\{F\}\(M\_\{2\}\), this isr¯M1\(g,g,a\)=rM1\(g,a\)=r∅\\bar\{r\}\_\{M\_\{1\}\}\(g,g,a\)=r\_\{M\_\{1\}\}\(g,a\)=\{r\_\{\\varnothing\}\}orr¯M2\(g,g,a\)=rM2\(g,a\)=r∅\\bar\{r\}\_\{M\_\{2\}\}\(g,g,a\)=r\_\{M\_\{2\}\}\(g,a\)=\{r\_\{\\varnothing\}\}\. Then,r¯M1∧M2\(g,g,a\)=rM1∧M2\(g,a\)=min\{rM1\(g,a\),rM2\(g,a\)\}=r∅\\bar\{r\}\_\{M\_\{1\}\\land M\_\{2\}\}\(g,g,a\)=r\_\{M\_\{1\}\\land M\_\{2\}\}\(g,a\)=\\min\\\{r\_\{M\_\{1\}\}\(g,a\),r\_\{M\_\{2\}\}\(g,a\)\\\}=\{r\_\{\\varnothing\}\}, which is a contradiction\. Consequentlyg∈ℱ\(M1\)∩ℱ\(M2\)g\\in\\mathcal\{F\}\(M\_\{1\}\)\\cap\\mathcal\{F\}\(M\_\{2\}\)\.
3. \(iii\)ℱ\(M1\)∪ℱ\(M2\)⊆ℱ\(M1∨M2\)\\mathcal\{F\}\(M\_\{1\}\)\\cup\\mathcal\{F\}\(M\_\{2\}\)\\subseteq\\mathcal\{F\}\(M\_\{1\}\\lor M\_\{2\}\)follows directly from the definition of operator∨\\loroverℳ\\mathcal\{M\}\. We now provex thatℱ\(M1∨M2\)⊆ℱ\(M1\)∪ℱ\(M2\)\\mathcal\{F\}\(M\_\{1\}\\lor M\_\{2\}\)\\subseteq\\mathcal\{F\}\(M\_\{1\}\)\\cup\\mathcal\{F\}\(M\_\{2\}\)in an analogous manner to \(ii\)\. Letg∈ℱ\(M1∨M2\)g\\in\\mathcal\{F\}\(M\_\{1\}\\lor M\_\{2\}\), thenr¯M1∨M2\(g,g,a\)=r𝒰\\bar\{r\}\_\{M\_\{1\}\\lor M\_\{2\}\}\(g,g,a\)=\{r\_\{\\mathcal\{U\}\}\}\. Suppose thatg∉ℱ\(M1\)∪ℱ\(M2\)g\\notin\\mathcal\{F\}\(M\_\{1\}\)\\cup\\mathcal\{F\}\(M\_\{2\}\), this isr¯M1\(g,g,a\)=r¯M2\(g,g,a\)=r∅\\bar\{r\}\_\{M\_\{1\}\}\(g,g,a\)=\\bar\{r\}\_\{M\_\{2\}\}\(g,g,a\)=\{r\_\{\\varnothing\}\}\. Then,r¯M1∨M2\(g,g,a\)=rM1∨M2\(g,a\)=max\{rM1\(g,a\),rM2\(g,a\)\}=r∅\\bar\{r\}\_\{M\_\{1\}\\lor M\_\{2\}\}\(g,g,a\)=r\_\{M\_\{1\}\\lor M\_\{2\}\}\(g,a\)=\\max\\\{r\_\{M\_\{1\}\}\(g,a\),r\_\{M\_\{2\}\}\(g,a\)\\\}=\{r\_\{\\varnothing\}\}leads to another contradiction, finally giving us thatg∈ℱ\(M1\)∪ℱ\(M2\)g\\in\\mathcal\{F\}\(M\_\{1\}\)\\cup\\mathcal\{F\}\(M\_\{2\}\)\.
∎
Theorem[1](https://arxiv.org/html/2606.04053#Thmtheorem1)builds on Lemmas[1](https://arxiv.org/html/2606.04053#Thmlemma1)and[2](https://arxiv.org/html/2606.04053#Thmlemma2)and formalizes the new proposed method for composition\. See[1](https://arxiv.org/html/2606.04053#Thmtheorem1)
###### Proof\.
\(Theorem[1](https://arxiv.org/html/2606.04053#Thmtheorem1)\)Using Lemma[1](https://arxiv.org/html/2606.04053#Thmlemma1)we write the extended Q\-value functions as a selection of state\-action slices from the universal and empty tasks\. We then use the isomorphism defined in Lemma[2](https://arxiv.org/html/2606.04053#Thmlemma2)to write the set of desired goals for the composite tasks, this is𝒢¬M1\+,𝒢M1∨M2\+\\mathcal\{G\}^\{\+\}\_\{\\neg M\_\{1\}\},\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\\lor M\_\{2\}\}and𝒢M1∧M2\+\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\\land M\_\{2\}\}, as composition of the sets of desired goals for tasksM1M\_\{1\}andM2M\_\{2\}, i\.e\.,𝒢¬M1\+=\(𝒢M1\+\)′=𝒢∖𝒢M1\+\\mathcal\{G\}^\{\+\}\_\{\\neg M\_\{1\}\}=\(\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\}\)^\{\\prime\}=\\mathcal\{G\}\\setminus\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\},𝒢M1∨M2\+=𝒢M1\+∪𝒢M2\+\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\\lor M\_\{2\}\}=\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\}\\cup\\mathcal\{G\}^\{\+\}\_\{M\_\{2\}\}and𝒢M1∧M2\+=𝒢M1\+∩𝒢M2\+\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\\land M\_\{2\}\}=\\mathcal\{G\}^\{\+\}\_\{M\_\{1\}\}\\cap\\mathcal\{G\}^\{\+\}\_\{M\_\{2\}\}\. ∎
### A\.3Proof that the optimal slices are ordered\.
See[1](https://arxiv.org/html/2606.04053#Thmcorollary1)
###### Proof\.
\(Corollary[1](https://arxiv.org/html/2606.04053#Thmcorollary1)\)Letg∈𝒢g\\in\\mathcal\{G\}\. Ifggis unreachable fromss, from the proof of Lemma 2 fromNangue Tasseet al\.\[[2020](https://arxiv.org/html/2606.04053#bib.bib1)\]we know thatℳ∅,g=ℳ𝒰,g\\mathcal\{M\}\_\{\\varnothing,g\}=\\mathcal\{M\}\_\{\\mathcal\{U\},g\}\. By uniqueness of the solution to the Bellman optimality equation, their optimal Q\-functions must be identical, thus obtaining equality in the statement\. Ifggis reachable fromss, by Corollary 1 fromNangue Tasseet al\.\[[2020](https://arxiv.org/html/2606.04053#bib.bib1)\], the extended Q\-value function decomposes for alls,as,aas follows:
Q¯∅∗\(s,g,a\)\\displaystyle\\bar\{Q\}^\{\*\}\_\{\\varnothing\}\(s,g,a\)=Gs:g,a∗\+r¯ℳ∅\(g,g,a′\)and\\displaystyle=G^\{\*\}\_\{s:g,a\}\+\\overline\{r\}\_\{\{\\mathcal\{M\}\_\{\\varnothing\}\}\}\(g,g,a^\{\\prime\}\)\\;\\;\\text\{and\}Q¯𝒰∗\(s,g,a\)\\displaystyle\\bar\{Q\}^\{\*\}\_\{\\mathcal\{U\}\}\(s,g,a\)=Gs:g,a∗\+r¯ℳ𝒰\(g,g,a′\)\.\\displaystyle=G^\{\*\}\_\{s:g,a\}\+\\overline\{r\}\_\{\{\\mathcal\{M\}\_\{\\mathcal\{U\}\}\}\}\(g,g,a^\{\\prime\}\)\.The path\-dependent return,Gs:g,a∗G^\{\*\}\_\{s:g,a\}, is identical for both tasks as the optimal path toggis task\-independent\. We now compare the terminal rewards\. By definition of the tasks,r¯ℳ∅\(g,g,a′\)=r∅\\bar\{r\}\_\{\{\\mathcal\{M\}\_\{\\varnothing\}\}\}\(g,g,a^\{\\prime\}\)=\{r\_\{\\varnothing\}\}andr¯ℳ𝒰\(g,g,a′\)=r𝒰\\bar\{r\}\_\{\{\\mathcal\{M\}\_\{\\mathcal\{U\}\}\}\}\(g,g,a^\{\\prime\}\)=\{r\_\{\\mathcal\{U\}\}\}\. By assumptionr∅≤r𝒰\{r\_\{\\varnothing\}\}\\leq\{r\_\{\\mathcal\{U\}\}\}\. SinceGs:g,a∗G^\{\*\}\_\{s:g,a\}is a shared constant, it follows directly thatQ¯∅∗\(s,g,a\)≤Q¯𝒰∗\(s,g,a\)\\bar\{Q\}^\{\*\}\_\{\\varnothing\}\(s,g,a\)\\leq\\bar\{Q\}^\{\*\}\_\{\\mathcal\{U\}\}\(s,g,a\)\. ∎
### A\.4Extending the Boolean Task Algebra to stochastic MDPs\.
Here we present the counterexample of extending the BTA framework to stochastic MDPs\. See[2](https://arxiv.org/html/2606.04053#Thmproposition2)
###### Proof\.
We will consider the following example MDP: We havenngoals and 1 initial non\-goal states0s\_\{0\}\. We consider all2𝒢2^\{\\mathcal\{G\}\}possible goal sets\. We also consider\|2𝒢\|\|2^\{\\mathcal\{G\}\}\|actions\. Each action is associated with a set of states𝒢~∈2𝒢\\tilde\{\\mathcal\{G\}\}\\in 2^\{\\mathcal\{G\}\}, so actions can be indexed accordingly asa𝒢~a\_\{\\tilde\{\\mathcal\{G\}\}\}\. The MDP is not discounted\. Transition probabilities and rewardsrs,ar\_\{s,a\}for non\-terminal states are defined as:
rs,a𝒢~=0\.5/\|𝒢~\|2,T\(s,a𝒢~,s′\)=\{1/\|𝒢~\|ifs′∈𝒢~0otherwise\.r\_\{s,a\_\{\\tilde\{\\mathcal\{G\}\}\}\}=0\.5/\|\\tilde\{\\mathcal\{G\}\}\|^\{2\},\\quad T\(s,a\_\{\\tilde\{\\mathcal\{G\}\}\},s^\{\\prime\}\)=\\begin\{cases\}1/\|\\tilde\{\\mathcal\{G\}\}\|&\\textnormal\{if \}s^\{\\prime\}\\in\\tilde\{\\mathcal\{G\}\}\\\\ 0&\\textnormal\{otherwise\.\}\\end\{cases\}
Let’s consider a goal set𝒢\+\\mathcal\{G\}^\{\+\}and actionsa𝒢\+a\_\{\\mathcal\{G\}^\{\+\}\}anda𝒢~,a\_\{\\tilde\{\\mathcal\{G\}\}\},with𝒢\+⊂𝒢~\\mathcal\{G\}^\{\+\}\\subset\\tilde\{\\mathcal\{G\}\}\. The return fora𝒢\+a\_\{\\mathcal\{G\}^\{\+\}\}will be−0\.5/\|𝒢\+\|2\+1\-0\.5/\|\\mathcal\{G\}^\{\+\}\|^\{2\}\+1, whereas the return fora𝒢~a\_\{\\tilde\{\\mathcal\{G\}\}\}will be−0\.5/\(\|𝒢\+\|\+δ\)2\+\|𝒢\+\|/\(\|𝒢\+\|\+δ\),\-0\.5/\(\|\\mathcal\{G\}^\{\+\}\|\+\\delta\)^\{2\}\+\|\\mathcal\{G\}^\{\+\}\|/\(\|\\mathcal\{G\}^\{\+\}\|\+\\delta\),withδ=\|𝒢~∖𝒢\+\|\\delta=\|\\tilde\{\\mathcal\{G\}\}\\setminus\\mathcal\{G\}^\{\+\}\|\. We now look at the difference between these two returns multiplied by\|𝒢\+\|2\(\|𝒢\+\|\+δ\)2\|\\mathcal\{G\}^\{\+\}\|^\{2\}\(\|\\mathcal\{G\}^\{\+\}\|\+\\delta\)^\{2\}:
\(\|𝒢\+\|2\(\|𝒢\+\|\+δ\)2\)\(−0\.5/\|𝒢\+\|2\+1\+0\.5/\(\|𝒢\+\|\+δ\)2−\|𝒢\+\|/\(\|𝒢\+\|\+δ\)\)\\displaystyle\(\|\\mathcal\{G\}^\{\+\}\|^\{2\}\(\|\\mathcal\{G\}^\{\+\}\|\+\\delta\)^\{2\}\)\\left\(\-0\.5/\|\\mathcal\{G\}^\{\+\}\|^\{2\}\+1\+0\.5/\(\|\\mathcal\{G\}^\{\+\}\|\+\\delta\)^\{2\}\-\|\\mathcal\{G\}^\{\+\}\|/\(\|\\mathcal\{G\}^\{\+\}\|\+\\delta\)\\right\)=−0\.5\(\|𝒢\+\|\+δ\)2\+\|𝒢\+\|2\(\|𝒢\+\|\+δ\)2\+0\.5\|𝒢\+\|2−\|𝒢\+\|3\(\|𝒢\+\|\+δ\)\\displaystyle\{\}=\-0\.5\(\|\\mathcal\{G\}^\{\+\}\|\+\\delta\)^\{2\}\+\|\\mathcal\{G\}^\{\+\}\|^\{2\}\(\|\\mathcal\{G\}^\{\+\}\|\+\\delta\)^\{2\}\+0\.5\|\\mathcal\{G\}^\{\+\}\|^\{2\}\-\|\\mathcal\{G\}^\{\+\}\|^\{3\}\(\|\\mathcal\{G\}^\{\+\}\|\+\\delta\)=\|𝒢\+\|3δ\+\|𝒢\+\|2δ2−\|𝒢\+\|δ−0\.5δ2=\(\|𝒢\+\|3−\|𝒢\+\|\)δ\+\(\|𝒢\+\|2−0\.5\)δ2\.\\displaystyle\{\}=\|\\mathcal\{G\}^\{\+\}\|^\{3\}\\delta\+\|\\mathcal\{G\}^\{\+\}\|^\{2\}\\delta^\{2\}\-\|\\mathcal\{G\}^\{\+\}\|\\delta\-0\.5\\delta^\{2\}=\(\|\\mathcal\{G\}^\{\+\}\|^\{3\}\-\|\\mathcal\{G\}^\{\+\}\|\)\\delta\+\(\|\\mathcal\{G\}^\{\+\}\|^\{2\}\-0\.5\)\\delta^\{2\}\.Since\|𝒢\+\|3≥\|𝒢\+\|\|\\mathcal\{G\}^\{\+\}\|^\{3\}\\geq\|\\mathcal\{G\}^\{\+\}\|and\|𝒢\+\|2\>0\.5\|\\mathcal\{G\}^\{\+\}\|^\{2\}\>0\.5when\|𝒢\+\|≥1\|\\mathcal\{G\}^\{\+\}\|\\geq 1andδ\>0\\delta\>0, this scaled difference is larger than zero, so choosing an action associated with𝒢~⊃𝒢\+\\tilde\{\\mathcal\{G\}\}\\supset\\mathcal\{G\}^\{\+\}is sub\-optimal\. We also know that
- ■\\blacksquareActionsa𝒢~a\_\{\\tilde\{\\mathcal\{G\}\}\}with𝒢~⊂𝒢\+\\tilde\{\\mathcal\{G\}\}\\subset\\mathcal\{G\}^\{\+\}are suboptimal, as they will achieve the same goal\-reaching reward but a more negative immediate reward ats0s\_\{0\}\.
- ■\\blacksquareActionsa𝒢~a\_\{\\tilde\{\\mathcal\{G\}\}\}with𝒢~⊄𝒢\+\\tilde\{\\mathcal\{G\}\}\\not\\subset\\mathcal\{G\}^\{\+\}and𝒢~⊅𝒢\+\\tilde\{\\mathcal\{G\}\}\\not\\supset\\mathcal\{G\}^\{\+\}, always achieve a worse return than actions associated with a set with the same cardinality, but that include more states from𝒢\+\\mathcal\{G\}^\{\+\}since the expected goal\-reaching reward will be higher\. So they certainly achieve a worse return thana𝒢\+a\_\{\\mathcal\{G\}^\{\+\}\}\.
Thus, for any goal set𝒢\+\\mathcal\{G\}^\{\+\}, only the policy that always picks actiona𝒢\+a\_\{\\mathcal\{G\}\}^\{\+\}is optimal\. ∎
## Appendix BAdditional experimental results
### B\.1Detailed Experimental Design
Unless stated otherwise, we use the default hyperparameters from the original environment or algorithm implementations used in the corresponding papers\[Nangue Tasseet al\.,[2020](https://arxiv.org/html/2606.04053#bib.bib1),[2024](https://arxiv.org/html/2606.04053#bib.bib33)\]\. This appendix only records the choices that are specific to our convergence experiments\.
##### Rooms environment\.
Training is based on tabular goal\-oriented Q\-learning\. The trained value functions are the universal task, the empty task, and the⌈log24⌉=2\\lceil\\log\_\{2\}4\\rceil=2binary basis tasks\. Training uses sparse rewards with shared terminal states\. Goal\-oriented Q\-learning uses the implementation defaults: discount11, learning rate11,ϵ\\epsilon\-greedy exploration withϵ=1\\epsilon=1, and a maximum of100100steps per episode\. The training budgets range between1010and50005000episodes per trained task, as we empirically see that this range is enough for convergence\. Evaluation samples tasks proportionally by task cardinality: with44goals this gives1515sampled tasks, of which the empty and universal tasks are excluded from the composed\-task evaluation, leaving1313tasks\. For each method we evaluate on10001000rollouts, with step horizon100100\.
##### Boxman environment\.
We train four DQN value functions per independent run:on,off,blue\(BB\), andsquare\(SS\)\. These are evaluated on five composed tasks:BB,SS,B\+SB\+S,B\.SB\.S, andBxorSB\\operatorname\{xor\}S\. We run33independent full runs\. Each DQN receives the RGB observation concatenated with the RGB goal image, giving66input channels\. The network is the original Boxman DQN architecture: convolutional layers3232filters of size88and stride44,6464filters of size44and stride22,6464filters of size33and stride11, followed by a fully connected layer of width512512and a linear action head\. We use Adam with learning rate10−410^\{\-4\}, discount0\.990\.99, batch size128128, replay buffer size3⋅1053\\cdot 10^\{5\}, learning starts after10410^\{4\}transitions, training frequency44, target network update frequency10001000, and Huber loss with gradient clipping to\[−1,1\]\[\-1,1\]\. Evaluation uses100100episodes per task, method, checkpoint, and completed run, with maximum trajectory length2020\.
##### Office\-World environment\.
Office\-World uses tabular value functions\. The evaluation tasks are Coffee, Patrol, Coffee and Mail, and Long and the primitive library isℬ=\{0,1\}∪\{pφ:φ∈\{A,B,C,D,c,d,m,o,tm,to\}\}∪\{cd\},\\mathcal\{B\}=\\\{0,1\\\}\\cup\\\{p\_\{\\varphi\}:\\varphi\\in\\\{A,B,C,D,c,d,m,o,tm,to\\\}\\\}\\cup\\\{c\_\{d\}\\\},so each run trains1313tabular world value functions\. The propositionsA,B,C,DA,B,C,Ddenote rooms,cccoffee,mmmail,oooffice,tmtmandtotothe dynamic mail/person availability propositions, andddthe decoration constraint\. We use Q\-learning with learning rate0\.10\.1, discount0\.90\.9,ϵ\\epsilon\-greedy exploration withϵ=0\.5\\epsilon=0\.5, and zero initialization\. We run33independent seeds\. Evaluation uses5050episodes per task, method, and checkpoint, with horizon10001000and evaluation discount0\.90\.9\. We compare Original, Univ\./empty, and Base tasks; the first two use only the learned0and11primitives, while Base tasks can use all primitives inℬ\\mathcal\{B\}\. The LTL specifications in the environment are:
1. Coffee:Deliver coffee to the office without breaking any decorations\.
2. Patrol:Visit rooms A, B, C, and D in sequence without breaking any decorations\.
3. Coffee & Mail:Deliver both coffee and mail to the office, in either order, without breaking any decorations\.
4. Long:Deliver mail until none remains, then deliver coffee while people are present in the office, and finally patrol rooms A, B, C, D, and A, without breaking any decorations\.
##### Safety\-Gym environment\.
Safety\-Gym is the continuous\-control function\-approximation experiment\. The learned primitive library is\{0,1,pbuttons,pgoal,phazards,chazards\},\\\{0,1,p\_\{\\textsc\{buttons\}\},p\_\{\\textsc\{goal\}\},p\_\{\\textsc\{hazards\}\},c\_\{\\textsc\{hazards\}\}\\\},so each run trains66world value functions\. Each primitive is trained with TD3\. The actor and critic use hidden layers\[2024,2024,2024\]\[2024,2024,2024\]and Gaussian action noise has standard deviation0\.20\.2\. The main training parameters are learning rate10−610^\{\-6\}, discount0\.990\.99, batch size3232and a replay buffer size10610^\{6\}for2×1062\\times 10^\{6\}steps\. We run22independent seeds and evaluation uses500500episodes per task, method, and checkpoint, with horizon10001000and evaluation discount0\.990\.99\. We now describe the LTL tasks from this experiment as defined inNangue Tasseet al\.\[[2024](https://arxiv.org/html/2606.04053#bib.bib33)\]:
1. Task 1: Navigate to a button and then to a cylinder\.
2. Task 2: Navigate to a button and then to a cylinder while never entering blue regions\.
3. Task 3: Navigate to a button, then to a cylinder without entering blue regions, then to a button inside a blue region, and finally to a cylinder again\.
4. Task 4: Navigate to a button and then to a cylinder in a blue region\.
5. Task 5: Navigate to a cylinder, then to a button in a blue region, and finally to a cylinder again\.
6. Task 6: Navigate to a blue region, then to a button with a cylinder, and finally to a cylinder while avoiding blue regions\.
### B\.2Additional experimental results
#### B\.2\.1Additional results for Rooms environment
##### Illustrating learned policies\.
In Figure[5](https://arxiv.org/html/2606.04053#A2.F5)we illustrate the policies derived from the learned value functions for the rooms environment\. Figure[5](https://arxiv.org/html/2606.04053#A2.F5)shows the optimal policy derived from maximizing over actions in the slice corresponding to a given goalgg, this isQ¯𝒰∗\(⋅,g,⋅\)\\bar\{Q\}\_\{\\mathcal\{U\}\}^\{\*\}\(\\cdot,g,\\cdot\)\. Similarly, Figure[5](https://arxiv.org/html/2606.04053#A2.F5)illustrates the optimal policy derived from not followinggg, this isQ¯∅∗\(⋅,g,⋅\)\\bar\{Q\}\_\{\\varnothing\}^\{\*\}\(\\cdot,g,\\cdot\)\. These goal slices are then combined via maximization over the goal and actions dimensions to obtain the policies illustrated in Figures[5](https://arxiv.org/html/2606.04053#A2.F5)and[5](https://arxiv.org/html/2606.04053#A2.F5)\. We note that both the slices and the global policies are similar in shape, the key difference being the range of rewards associated to each of them\.
\(c\)Optimal policy for pursuinggg\.
\(d\)Optimal policy for not pursuinggg\.
\(e\)Optimal policy for pursuing any goal in𝒢\\mathcal\{G\}\.
\(f\)Optimal policy for not pursuing any goal in𝒢\\mathcal\{G\}\.
Figure 5:Optimal policies derived from the learned value functions for the 16 goals environment\.
##### Performance upon convergence\.
Figure[6](https://arxiv.org/html/2606.04053#A2.F6)provides a side\-by\-side comparison of the returns obtained with the original and proposed composition methods\. We observe that the returns obtained for both methods are identical, given that we have provided enough maximum iterations for convergence and confirm that theoretical optimality also holds in a practical setting\. We also observe that expected returns are higher as the number of goals in the task increases, which can be explained by the fact that the agent requires less steps on average to reach any of the goals, hence incurring in lower time\-step penalties\.
Figure 6:Boxplots showing the quartiles of the returns of the original composition method \(using base tasks\) and composition using the universal and empty tasks\. We aggregate the results per goal set length\. The returns for each task is measured on10410^\{4\}episodes and the value functions are obtained with50005000maximum iterations of the training algorithm\.
#### B\.2\.2Additional results for Boxman environment
Figure[7](https://arxiv.org/html/2606.04053#A2.F7)shows per task returns and successes for the Boxman environment\. We confirm a similar pattern among all tasks, where the base tasks method converges faster to the local optimum but both methods eventually stabilize at a similar performance level\.
Figure 7:Average returns \(±\\pmstd\) for the boxman environment\. The x\-axis shows the number of training iterations per UVFA, hence base tasks requires twice as much compute power given that it needs to train more value functions\.
#### B\.2\.3Additional results for Office Gridworld environment
Figure[8](https://arxiv.org/html/2606.04053#A2.F8)shows per task returns and successes for the Office Gridworld domain\. It is more informative to look at the success graphs in Figure[8\(b\)](https://arxiv.org/html/2606.04053#A2.F8.sf2), where we see that the percentage of successes in the LTL specification converged to 1, meaning that the agent correctly learn to execute the task\. It is important to note that we re\-use the universal and empty value functions learned during the base task learning, hence there is a correlation in the performance of all three methods given by the particularities of the three runs \(e\.g\. sudden collapse Figure[8\(a\)](https://arxiv.org/html/2606.04053#A2.F8.sf1)for the Long and Patrol tasks, or the jittery pattern of Coffee and Mail\)\. However, if we attend to the average presented in Figure[4\(a\)](https://arxiv.org/html/2606.04053#S5.F4.sf1)we see a more stable convergence across iterations\.
\(a\)Total rewards disaggregated per task\.
\(b\)Total successes of the LTL specification disaggregated per task\.
Figure 8:Evolution of average \(±\\pmstd\) successes and returns across runs in the temporal composition of actions for the main 4 tasks of the Office Gridworld\. Results are reported over33runs\. See Appendix[B\.1](https://arxiv.org/html/2606.04053#A2.SS1)for explanation on the LTL task specifications\.
#### B\.2\.4Additional results for Safety gym environment
We ran the Safety Gym experiment in order to evaluate the temporal\-composition setting with continuous control and function approximation\. Figure[9](https://arxiv.org/html/2606.04053#A2.F9)shows the evolution of the average return and success rate as a function of the number of training iterations per extended value function\. The three methods exhibit similar learning trends, with performance improving as training progresses\. This confirms that the computational advantages derived from learning extra base tasks does not translate into a better overall performance as compared to using only the universal and empty tasks\. Moreover, this also means that the improved composition cost does not translate to a worse performance \(Original vs Univ\./Empty\)\.
Figure[10](https://arxiv.org/html/2606.04053#A2.F10)complements these results by reporting the composition\-time distribution for the same Safety Gym setting\. The Univ\./empty construction substantially reduces the cost of composing extended value functions, while preserving comparable return and success performance\. Thus, the Safety Gym experiment supports the main empirical conclusion: using only the universal and empty tasks improves computational efficiency without degrading the quality of the composed policy\.
\(a\)Average episode return\.
\(b\)Average success rate\.
Figure 9:Evolution of average return and success rate in the Safety Gym environment\. Results compare methods for a fixed number of training iterations per UVFA\.Figure 10:Distribution of composition times in the Safety Gym environment\. The Univ\./empty composition reduces the computational cost of composing extended value functions compared with the original Skill Machines composition, while maintaining comparable performance as shown in Figure[9](https://arxiv.org/html/2606.04053#A2.F9)\. See Appendix[B\.1](https://arxiv.org/html/2606.04053#A2.SS1)for explanation on the LTL task specifications\.Similar Articles
As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs
This paper investigates how instruction-tuned LLMs combine persona and task specifications in the residual stream, finding that near answer formation the combination is approximately additive, enabling substitution with minimal KL divergence, but this additive regime does not account for the full multi-token generation mechanism.
TD-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition
Proposes TD-Grokking, a training-time decomposition framework that recursively breaks down intractable zero-reward problems into verifiable subproblems, enabling LLMs to learn from failed trajectories. Outperforms vanilla GRPO and baselines on mathematical and medical reasoning tasks.
Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP
A controlled study of compound LLM agent design in an adversarial POMDP (CybORG CAGE-2), systematically varying context, reasoning, and hierarchy across five model families. Key findings: programmatic state abstraction yields large returns per token, hierarchy without deliberation tools achieves best absolute performance, and context engineering is more cost-effective than deeper reasoning.
BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems
Introduces BOHM, a zero-cost hierarchical attribution method for compound AI systems that extracts attribution from routing weights, outperforming Shapley-based methods in many real-world deployments.
Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization
This paper proposes PTD-PO, a privileged tutoring distillation framework that provides dense token-level supervision for reinforcement learning with verifiable rewards in multimodal reasoning tasks, without exposing the answer. It uses structured hints and a Top-K JS divergence objective to stabilize training, consistently outperforming existing methods on 2B-8B LVLMs.