Performance-Driven Environment Abstraction with Multi-Timescale Learning

arXiv cs.LG Papers

Summary

This paper proposes a performance-driven state abstraction method for reinforcement learning that directly optimizes decision quality, using a multi-timescale framework to jointly adapt the policy and a tree-structured abstraction. The algorithm refines or aggregates state space based on Q-value discrepancies, achieving better sample efficiency and faster replanning than baselines.

arXiv:2606.17377v1 Announce Type: new Abstract: We study performance-driven environment abstraction for decision-making in large Markov decision processes. Rather than preserving geometric or topological structure, we seek abstractions that directly optimize decision quality. We model abstraction as a controlled approximation obtained by aggregating the state space and enforcing a shared action distribution within each aggregated state. For a fixed partition, we establish a performance guarantee that separates value-function approximation error from the loss introduced by action sharing. Guided by this analysis, we develop a multi-timescale reinforcement learning framework that jointly adapts the policy and a tree-structured environment abstraction. The resulting algorithm refines and coarsens regions of the state space based on Q-value discrepancies, balancing performance against abstraction size and complexity. Empirical results demonstrate substantial state compression, improved sample efficiency, and faster replanning compared to actor-critic baselines.
Original Article
View Cached Full Text

Cached at: 06/17/26, 05:37 AM

# Performance-Driven Environment Abstraction with Multi-Timescale Learning
Source: [https://arxiv.org/html/2606.17377](https://arxiv.org/html/2606.17377)
[Dipankar Maity](https://arxiv.org/html/2606.17377v1/mailto:%[email protected]%3E?Subject=Your%20UAI%202026%20paper)Department of Electrical and Computer Engineering University of North Carolina at Charlotte Charlotte, NC, USA[Panagiotis Tsiotras](https://arxiv.org/html/2606.17377v1/mailto:%[email protected]%3E?Subject=Your%20UAI%202026%20paper)School of Aerospace Engineering Georgia Institute of Technology Atlanta, GA, USA

###### Abstract

We study performance\-driven environment abstraction for decision\-making in large Markov decision processes\. Rather than preserving geometric or topological structure, we seek abstractions that directly optimize decision quality\. We model abstraction as a controlled approximation obtained by aggregating the state space and enforcing a shared action distribution within each aggregated state\. For a fixed partition, we establish a performance guarantee that separates value\-function approximation error from the loss introduced by action sharing\. Guided by this analysis, we develop a multi\-timescale reinforcement learning framework that jointly adapts the policy and a tree\-structured environment abstraction\. The resulting algorithm refines and coarsens regions of the state space based on Q\-value discrepancies, balancing performance against abstraction size and complexity\. Empirical results demonstrate substantial state compression, improved sample efficiency, and faster replanning compared to actor–critic baselines\.

## 1Introduction

Humans and other intelligent species mitigate complexity by forming reduced representations of their environment, discarding irrelevant details while retaining task\-relevant structure\[eppe2022intelligent\]\. A key mechanism is*hierarchical representation*\[hughes2024foundations\], where decisions are made at a coarse resolution and refined only when necessary\. By focusing reasoning at an appropriate level of abstraction, intelligent agents manage complex tasks while remaining adaptive to new tasks and evolving environments\. Consequently, the ability to automatically construct hierarchical representations online is widely regarded as a fundamental component of intelligence\[simon1962architecture,botvinick2008hierarchical\]\.

This hierarchical reasoning capability is increasingly required in modern large\-scale autonomy, where planning directly in the original state space becomes computationally infeasible and may fail to meet safety\-critical reaction times\. Existing approaches often compress the environment using information\-theoretic\[larsson2020q,ravichandran2022hierarchical\]or structural heuristics\[machado2017eigenoption,dean1995decomposition,guan2022hierarchical\]\. However, the appropriate abstraction is inherently task\- and performance\-dependent, as different tasks and performance requirements induce different compressed representations\.

This paper studies*performance\-driven state abstraction*\. Rather than preserving geometry or topology, we seek abstractions that directly optimize decision quality\. We treat environment abstraction as a controlled state space aggregation, and we adaptively refine and coarsen the state space only when doing so improves agent’s performance\. We derive performance guarantees identifying two sources of suboptimality: \(i\) value\-function approximation error and \(ii\) loss induced by enforcing a shared action distribution within each aggregated state\. Based on this analysis, we design a multi\-timescale reinforcement learning scheme in which the policy converges under a slowly varying environment abstraction which evolves through a Q\-value–guided refinement and aggregation mechanism\. The resulting algorithm continuously restructures the state space aggregation to maintain a compact representation while improving decision performance\. Empirically, the learned abstraction improves performance and transfers across related tasks\.

#### Contributions\.

The contribution of this work is threefold: \(i\) a performance bound for decision\-making under an environment abstraction, which separates value approximation error from action\-sharing error within aggregated states, \(ii\) a Q\-value–based criterion providing principled refinement and aggregation rules for tree\-based abstractions, and \(iii\) a multi\-timescale learning algorithm that jointly adapts the policy and the abstraction\.

### 1\.1Related Work

A central line of work in hierarchical decision\-making focuses on*temporal abstraction*\. The option framework\[sutton1999between\]and feudal hierarchies\[dayan1992feudal\]introduce multi\-level policies in which high\-level controllers select subgoals executed by lower\-level policies until termination\. Subsequent extensions\[mcgovern2001automatic,csimcsek2005identifying,mahadevan2007proto\]improve exploration and sample efficiency but operate over the original state space rather than compressing it\.

In contrast,*state space abstraction*methods aim to merge states into superstates and reason directly at the abstract level\. Proto\-value functions\[mahadevan2007proto,machado2017eigenoption\], information\-bottleneck\-based tree searches\[larsson2020q,larsson2023information\]and heuristic\-based methods\[ravichandran2022hierarchical,hughes2024foundations\]construct hierarchical representations that preserve topological or informational structure\. However, these abstractions are typically domain\-specific and not explicitly optimized for downstream task performance\.

Closer to our work are*performance\-driven abstractions*, where partitions are adapted to improve value rather than preserve geometry\. Baras and Borkar\[baras2000learning\]proposed a quantizer\-based hierarchical actor–critic refined via multi\-timescale stochastic approximation, later extended to continuous domains with value\-based distances and entropy regularization\[mavridis2021maximum\]\. While effective, these approaches do not explicitly analyze the performance loss induced by enforcing a shared action distribution within each aggregated state\. We provide theoretical guarantees that separate value\-function approximation error from action\-sharing error, yielding a Q\-value–guided aggregation criterion\.

Tree\-structured abstraction methods such as U\-Tree\[mccallum1996reinforcement,jonsson2000automated\]and conditional abstraction trees\[dadvar2023conditional\]adaptively split states using TD\-error statistics\. These approaches rely on iterative off\-policy procedures, and support only refinement but not aggregation\. In contrast, we jointly adapt the abstraction and policy via a unified multi\-timescale scheme, justify Q\-values as the structural criterion, and introduce a one\-step lookahead/backward mechanism enabling principled refinement and collapse operations\.

Overall, prior work either relies on fixed abstractions, adapts partitions heuristically without performance guarantees, or focuses on temporal rather than spatial abstraction\. Our framework addresses this gap by explicitly characterizing the performance loss and using it to guide principled refinement and aggregation within a unified learning scheme\.

## 2Problem Formulation

We consider a discounted Markov decision process \(MDP\)⟨𝒮,𝒜,R,P,β⟩\\langle\\mathcal\{S\},\\mathcal\{A\},R,P,\\beta\\rangle, where𝒮\\mathcal\{S\}and𝒜\\mathcal\{A\}are finite state and action spaces,P​\(s′\|s,a\)P\(s^\{\\prime\}\|s,a\)is the transition kernel,R​\(s\)R\(s\)is the reward function, andβ∈\[0,1\)\\beta\\in\[0,1\)is the discount factor\. A stationary policyπ​\(a\|s\)\\pi\(a\|s\)induces the value function

Vπ​\(s\)=𝔼π​\[∑t=0∞βt​R​\(St\)\|S0=s\]\.V^\{\\pi\}\(s\)=\\mathbb\{E\}\_\{\\pi\}\\\!\\left\[\\sum\_\{t=0\}^\{\\infty\}\\beta^\{t\}R\(S\_\{t\}\)\\penalty 10000\\ \|\\penalty 10000\\ S\_\{0\}=s\\right\]\.\(1\)We aim to find a policy maximizing the performanceJ​\(π;μ0\)=∑sμ0​\(s\)​Vπ​\(s\)J\(\\pi;\\mu\_\{0\}\)\\\!=\\\!\\sum\_\{s\}\\mu\_\{0\}\(s\)V^\{\\pi\}\(s\), under the initial state distributionμ0\\mu\_\{0\}\.

One way to construct such an optimal policy is through the optimal value functionV∗V^\{\*\}, which is the unique fixed point of the Bellman operator

\(ℬ𝒮​V∗\)​\(s\)=maxa∈𝒜⁡\[R​\(s\)\+β​∑s′∈𝒮P​\(s′\|s,a\)​V∗​\(s′\)\]\.\\big\(\\mathcal\{B\}\_\{\\mathcal\{S\}\}V^\{\*\}\\big\)\(s\)\\\!=\\\!\\max\_\{a\\in\\mathcal\{A\}\}\\Big\[R\(s\)\+\\beta\\sum\_\{s^\{\\prime\}\\in\\mathcal\{S\}\}P\(s^\{\\prime\}\|s,a\)V^\{\*\}\(s^\{\\prime\}\)\\Big\]\.\(2\)However, evaluating \([2](https://arxiv.org/html/2606.17377#S2.E2)\) becomes computationally prohibitive when\|𝒮\|\|\\mathcal\{S\}\|is large, motivating reduced representations of the state space\.

#### State aggregation\.

We approximate the decision problem using a*state aggregation*Γ=\{γ1,…,γm\}\\Gamma=\\\{\\gamma\_\{1\},\\dots,\\gamma\_\{m\}\\\}, where∪k=1mγk=𝒮\\cup\_\{k=1\}^\{m\}\\gamma\_\{k\}=\\mathcal\{S\}andγi∩γj=∅\\gamma\_\{i\}\\cap\\gamma\_\{j\}=\\emptysetfori≠ji\\neq j\. Eachγk⊆𝒮\\gamma\_\{k\}\\subseteq\\mathcal\{S\}is called a*superstate*, and the aggregation mapϕΓ:𝒮→Γ\\phi\_\{\\Gamma\}:\\mathcal\{S\}\\to\\Gammaassigns each state to its superstate\. Conceptually,Γ\\Gammadefines an abstraction of the environment and serves as a reduced state representation for decision making\.

LetπΓ:Γ×𝒜→\[0,1\]\\pi\_\{\\Gamma\}:\\Gamma\\times\\mathcal\{A\}\\to\[0,1\]denote an aggregated policy\. Under this formulation, all states within the same superstate share the same action distribution, which we refer to as the*same\-action\-distribution \(SAD\)*constraint\. An aggregated policy always induces a policy on𝒮\\mathcal\{S\}via

π​\(a\|s\)=πΓ​\(a\|ϕΓ​\(s\)\)\.\\pi\(a\|s\)=\\pi\_\{\\Gamma\}\(a\|\\phi\_\{\\Gamma\}\(s\)\)\.
For aggregated policies, decision\-making is performed overΓ\\Gammainstead of𝒮\\mathcal\{S\}\. This reduces the policy search space and allows executing policies with “minimum attention"\[brockett1997minimum\]where the state representation is “coarse"\. Figure[1](https://arxiv.org/html/2606.17377#S2.F1)illustrates the effect of the SAD constraint\.

![Refer to caption](https://arxiv.org/html/2606.17377v1/x1.png)Figure 1:Effect of the SAD constraint in a grid\-world\.Left: Coarse aggregation around the bottom\-right forces action randomization, causing the agent to potentially stall at the red star\.Right: Refining aggregation removes the need for randomization, enabling faster navigation\.
#### Abstraction–policy optimization\.

The above formulation leads to the following optimization problem, where we jointly select a state aggregation and aggregated policy, balancing performance and representation complexity:

maxπΓ,Γ⁡J​\(πΓ;μ0\)−λ​\|Γ\|,\\max\_\{\\pi\_\{\\Gamma\},\\,\\Gamma\}\\;J\(\\pi\_\{\\Gamma\};\\mu\_\{0\}\)\-\\lambda\|\\Gamma\|,\(3\)whereλ\>0\\lambda\>0penalizes the number of superstates\.

The above problem is inherently coupled\. The SAD constraint introduces approximation error whose magnitude depends on the chosen aggregation\. Conversely, evaluating the performance of an aggregated policy requires sufficient resolution of the value function in regions where actions differ\. This intrinsic coupling betweenΓ\\GammaandπΓ\\pi\_\{\\Gamma\}makes it challenging to solve \([3](https://arxiv.org/html/2606.17377#S2.E3)\) directly\.

## 3Learning Policies under Fixed Aggregations

We introduce the aggregated Bellman operator and learn an optimal policy under a fixed aggregationΓ\\Gamma\. We then quantify the performance loss induced by the SAD constraint\.

Given a state distributionμ\\mu, define the aggregated reward and transition kernel

RΓ,μ​\(γ\)\\displaystyle R\_\{\\Gamma,\\mu\}\(\\gamma\)=∑s∈γμγ​\(s\)​R​\(s\),\\displaystyle=\\sum\_\{s\\in\\gamma\}\\mu\_\{\\gamma\}\(s\)R\(s\),PΓ,μ​\(γ′\|γ,a\)\\displaystyle P\_\{\\Gamma,\\mu\}\(\\gamma^\{\\prime\}\|\\gamma,a\)=∑s∈γ∑s′∈γ′μγ​\(s\)​P​\(s′\|s,a\),\\displaystyle=\\sum\_\{s\\in\\gamma\}\\sum\_\{s^\{\\prime\}\\in\\gamma^\{\\prime\}\}\\mu\_\{\\gamma\}\(s\)P\(s^\{\\prime\}\|s,a\),whereμγ​\(s\)=μ​\(s\)/∑s∈γμ​\(s\)\\mu\_\{\\gamma\}\(s\)=\\mu\(s\)\\big/\\sum\_\{s\\in\\gamma\}\\mu\(s\)is the normalized distribution withinγ\\gamma\. Then, the performance of an aggregated policyπΓ\\pi\_\{\\Gamma\}evaluated overΓ\\Gammasatisfies

V¯Γ,μπΓ​\(γ\)=RΓ,μ​\(γ\)\+β​∑a∈𝒜πΓ​\(a\|γ\)​∑γ′∈ΓPΓ,μ​\(γ′\|γ,a\)​V¯Γ,μπΓ​\(γ′\)\.\\bar\{V\}\_\{\\Gamma,\\mu\}^\{\\pi\_\{\\Gamma\}\}\(\\gamma\)\\\!=\\\!R\_\{\\Gamma,\\mu\}\(\\gamma\)\\\!\+\\\!\\beta\\\!\\sum\_\{a\\in\\mathcal\{A\}\}\\\!\\pi\_\{\\Gamma\}\(a\|\\gamma\)\\\!\\sum\_\{\\gamma^\{\\prime\}\\in\\Gamma\}\\\!P\_\{\\Gamma,\\mu\}\(\\gamma^\{\\prime\}\|\\gamma,a\)\\bar\{V\}\_\{\\Gamma,\\mu\}^\{\\pi\_\{\\Gamma\}\}\(\\gamma^\{\\prime\}\)\.
The optimal aggregated value function is the fixed point of

\(ℬΓ,μ​V¯\)​\(γ\)=maxa∈𝒜⁡\[RΓ,μ​\(γ\)\+β​∑γ′∈ΓPΓ,μ​\(γ′\|γ,a\)​V¯​\(γ′\)\],\(\\mathcal\{B\}\_\{\\Gamma,\\mu\}\\bar\{V\}\)\(\\gamma\)\\\!=\\\!\\max\_\{a\\in\\mathcal\{A\}\}\\Big\[R\_\{\\Gamma,\\mu\}\(\\gamma\)\+\\\!\\beta\\\!\\sum\_\{\\gamma^\{\\prime\}\\in\\Gamma\}\\\!P\_\{\\Gamma,\\mu\}\(\\gamma^\{\\prime\}\|\\gamma,a\)\\bar\{V\}\(\\gamma^\{\\prime\}\)\\Big\],\(4\)and we refer toℬΓ,μ\\mathcal\{B\}\_\{\\Gamma,\\mu\}as the*aggregated Bellman operator*, whose unique fixed point is denoted asV¯Γ,μ∗\\bar\{V\}\_\{\\Gamma,\\mu\}^\{\*\}\.

### 3\.1Aggregated Actor–Critic\.

To compute the optimal aggregated policy under a fixed partitionΓ\\Gamma, we employ a two\-timescale actor–critic \(AC\) algorithm\[borkar1997stochastic\]\. The critic estimates the aggregated value function, while the actor updates a softmax policy parameterization\.

LetΓt=ϕΓ​\(St\)\\Gamma\_\{t\}=\\phi\_\{\\Gamma\}\(S\_\{t\}\)denote the superstate containingStS\_\{t\}at timett\. The aggregated critic and actor updates are given by

V¯t\+1​\(Γt\)=\(1−ξt\)​V¯t​\(Γt\)\+ξt​\(Rt\+β​V¯t​\(Γt\+1\)\),\\displaystyle\\bar\{V\}\_\{t\+1\}\(\\Gamma\_\{t\}\)=\(1\-\\xi\_\{t\}\)\\bar\{V\}\_\{t\}\(\\Gamma\_\{t\}\)\+\\xi\_\{t\}\\big\(R\_\{t\}\+\\beta\\bar\{V\}\_\{t\}\(\\Gamma\_\{t\+1\}\)\\big\),\(5\)Q¯t\+1​\(Γt,At\)=Pq​\(Q¯t​\(Γt,At\)\+ζt​\(Rt\+β​V¯t​\(Γt\+1\)−V¯t​\(Γt\)\)\),\\displaystyle\\bar\{Q\}\_\{t\+1\}\\\!\(\\Gamma\_\{t\},A\_\{t\}\\\!\)\\\!=\\\!\\mathrm\{P\}\_\{q\}\\\!\\Big\(\\\!\\bar\{Q\}\_\{t\}\(\\Gamma\_\{t\},A\_\{t\}\\\!\)\\\!\+\\\!\\zeta\_\{t\}\\big\(\\\!R\_\{t\}\\\!\+\\\!\\\!\\beta\\bar\{V\}\_\{t\}\(\\Gamma\_\{t\+1\}\\\!\)\\\!\-\\\!\\bar\{V\}\_\{t\}\(\\Gamma\_\{t\}\)\\big\)\\\!\\Big\),wherePq\\mathrm\{P\}\_\{q\}projects onto\[−q,q\]\[\-q,q\]to ensure boundedness\.

The actionAtA\_\{t\}is sampled from the Boltzmann policy

ψQ¯t​\(Γt,a\)=eQ¯t​\(Γt,a\)∑a′eQ¯t​\(Γt,a′\)\.\\psi\_\{\\bar\{Q\}\_\{t\}\}\(\\Gamma\_\{t\},a\)=\\frac\{e^\{\\bar\{Q\}\_\{t\}\(\\Gamma\_\{t\},a\)\}\}\{\\sum\_\{a^\{\\prime\}\}e^\{\\bar\{Q\}\_\{t\}\(\\Gamma\_\{t\},a^\{\\prime\}\)\}\}\.\(6\)
To ensure convergence, the critic operates on a faster timescale than the actor, i\.e\.,limt→∞ζt/ξt=0\\lim\_\{t\\to\\infty\}\\zeta\_\{t\}/\\xi\_\{t\}=0\. This on\-policy scheme implicitly estimates the state distributionμ\\muused in \([4](https://arxiv.org/html/2606.17377#S3.E4)\) induced byπΓ\\pi\_\{\\Gamma\}through trajectory samples\.

Under the following standard conditions on step sizes, convergence follows from classical two\-timescale theory\.

###### Assumption 1\.

The learning rates\{ξt\}\\\{\\xi\_\{t\}\\\}and\{ζt\}\\\{\\zeta\_\{t\}\\\}satisfy

∑t=0∞ξt=∑t=0∞ζt=∞,∑t=0∞ξt2<∞,∑t=0∞ζt2<∞,\\displaystyle\\sum\_\{t=0\}^\{\\infty\}\\xi\_\{t\}=\\sum\_\{t=0\}^\{\\infty\}\\zeta\_\{t\}=\\infty,\\quad\\sum\_\{t=0\}^\{\\infty\}\\xi\_\{t\}^\{2\}<\\infty,\\quad\\sum\_\{t=0\}^\{\\infty\}\\zeta\_\{t\}^\{2\}<\\infty,limt→∞ξt\+1/ξt=1,limt→∞ζt/ξt=0\.\\displaystyle\\lim\_\{t\\to\\infty\}\\xi\_\{t\+1\}/\{\\xi\_\{t\}\}=1,\\qquad\\lim\_\{t\\to\\infty\}\{\\zeta\_\{t\}\}/\{\\xi\_\{t\}\}=0\.

###### Lemma 1\(\[baras2000learning\]\)\.

Under Assumption[1](https://arxiv.org/html/2606.17377#Thmassumption1), the aggregated actor–critic converges almost surely toQ¯∗\\bar\{Q\}^\{\*\}\. The converged actor induces a policyπΓ∗\\pi\_\{\\Gamma\}^\{\*\}and stationary distributionμ∗\\mu^\{\*\}satisfying

V¯Γ,μ∗∗​\(γ\)≤V¯Γ,μ∗πΓ∗​\(γ\)≤V¯Γ,μ∗∗​\(γ\)\+ϵ​\(q\),\\bar\{V\}\_\{\\Gamma,\\mu^\{\*\}\}^\{\*\}\(\\gamma\)\\leq\\bar\{V\}\_\{\\Gamma,\\mu^\{\*\}\}^\{\\pi\_\{\\Gamma\}^\{\*\}\}\(\\gamma\)\\leq\\bar\{V\}\_\{\\Gamma,\\mu^\{\*\}\}^\{\*\}\(\\gamma\)\+\\epsilon\(q\),whereϵ​\(q\)→0\\epsilon\(q\)\\to 0asq→∞q\\to\\infty\.

### 3\.2Performance Loss under fixed aggregations

We now quantify the performance loss due to state aggregation by comparing the optimal value functionV∗∈ℝ\|𝒮\|V^\{\*\}\\in\\mathbb\{R\}^\{\|\\mathcal\{S\}\|\}with its aggregated counterpartV¯Γ,μ∗∗∈ℝ\|Γ\|\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\\in\\mathbb\{R\}^\{\|\\Gamma\|\}\. Because the two value functions lie on different domains, we introduce the value projection and itsμ\\mu\-weighted pseudo\-inverse

\(ΨΓ​V¯\)​\(s\)=V¯​\(ϕ​\(s\)\),\\displaystyle\(\\Psi\_\{\\Gamma\}\\bar\{V\}\{\}\)\(s\)=\\bar\{V\}\{\}\(\\phi\(s\)\),\(ΨΓ,μ∗\+​V\)​\(γ\)=∑s∈γμγ∗​\(s\)​V​\(s\)\.\\displaystyle\(\\Psi^\{\+\}\_\{\\Gamma,\\mu^\{\*\}\}V\)\(\\gamma\)=\\sum\_\{s\\in\\gamma\}\\mu^\{\*\}\_\{\\gamma\}\(s\)V\(s\)\.
We then have the the following theorem, which quantifies the discrepancy betweenV¯Γ,μ∗∗\\bar\{V\}\_\{\\Gamma,\\mu^\{\*\}\}^\{\*\}andV∗V^\{\*\}, and provides the first insight toward an adaptive state aggregation algorithm\.

###### Theorem 2\.

Consider an MDP with state space𝒮\\mathcal\{S\}and a fixed aggregationΓ\\Gamma\. LetV∗∈ℝ\|𝒮\|V^\{\*\}\\in\\mathbb\{R\}^\{\|\\mathcal\{S\}\|\}denote the optimal value function of the MDP, and letV¯Γ,μ∗∗∈ℝ\|Γ\|\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\\in\\mathbb\{R\}^\{\|\\Gamma\|\}be the fixed point of \([4](https://arxiv.org/html/2606.17377#S3.E4)\)\. Then, we have

‖ΨΓ​V¯Γ,μ∗∗−V∗‖∞≤2​ϵΓ\+δΓ1−β,\{\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\-V^\{\*\}\\\|\}\_\{\\infty\}\\;\\leq\\;\\frac\{2\\epsilon\_\{\\Gamma\}\+\\delta\_\{\\Gamma\}\}\{1\-\\beta\},\(7\)where we haveϵΓ=minV¯∈ℝ\|Γ\|⁡‖ΨΓ​V¯−V∗‖∞\\epsilon\_\{\\Gamma\}\\\!=\\\!\\min\_\{\\bar\{V\}\\in\\mathbb\{R\}^\{\|\\Gamma\|\}\}\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}\-V^\{\*\}\\\|\_\{\\infty\}, andδΓ=‖ΨΓ,μ∗\+∘ℬ𝒮∘ΨΓ​V¯opt−ℬΓ,μ∗​V¯opt‖∞\\delta\_\{\\Gamma\}=\\\|\\Psi^\{\+\}\_\{\\Gamma,\\mu^\{\*\}\}\\circ\\mathcal\{B\}\_\{\\mathcal\{S\}\}\\circ\\Psi\_\{\\Gamma\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\-\\mathcal\{B\}\_\{\\Gamma,\\mu^\{\*\}\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\_\{\\infty\}withV¯opt∈argminV¯∈ℝ\|Γ\|‖ΨΓ​V¯−V∗‖∞\\bar\{V\}\_\{\\mathrm\{opt\}\}\\in\\operatorname\*\{argmin\}\_\{\\bar\{V\}\\in\\mathbb\{R\}^\{\|\\Gamma\|\}\}\{\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}\-V^\{\*\}\\\|\}\_\{\\infty\}\.

###### Proof\.

The detailed proof is delayed to Appendix[Appendix A](https://arxiv.org/html/2606.17377#A1)\. ∎

While Theorem[2](https://arxiv.org/html/2606.17377#Thmtheorem2)characterizes suboptimality, the quantitiesϵΓ\\epsilon\_\{\\Gamma\}andδΓ\\delta\_\{\\Gamma\}depend on the unknown optimal valueV∗V^\{\*\}\. We therefore derive a computable upper bound based on Bellman residual mismatch\.

###### Corollary 3\.

Consider an MDP and a fixed aggregationΓ\\Gamma, we have that

‖ΨΓ​V¯Γ,μ∗−V∗‖∞≤‖ΨΓ​ℬΓ,μ∗​V¯Γ,μ∗∗−ℬ𝒮​ΨΓ​V¯Γ,μ∗∗‖∞1−β\.\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu\}\\\!\-\\\!V^\{\*\}\\\|\_\{\\infty\}\\\!\\leq\\\!\\frac\{\\\|\\Psi\_\{\\Gamma\}\\mathcal\{B\}\_\{\\Gamma,\\mu^\{\*\}\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\\\!\-\\\!\\mathcal\{B\}\_\{\\mathcal\{S\}\}\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\\\|\_\{\\infty\}\}\{1\-\\beta\}\.\(8\)

###### Proof\.

The proof is presented in Appendix[Appendix B](https://arxiv.org/html/2606.17377#A2)\. ∎

In the sequel, we denote

ϵ¯Γ=‖ΨΓ​ℬΓ,μ​V¯Γ,μ∗−ℬ𝒮​ΨΓ​V¯Γ,μ∗‖∞\.\\bar\{\\epsilon\}\_\{\\Gamma\}=\{\\\|\\Psi\_\{\\Gamma\}\\mathcal\{B\}\_\{\\Gamma,\\mu\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu\}\-\\mathcal\{B\}\_\{\\mathcal\{S\}\}\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu\}\\\|\}\_\{\\infty\}\.
As discussed in Remarks[1](https://arxiv.org/html/2606.17377#Thmremark1)and[2](https://arxiv.org/html/2606.17377#Thmremark2),Γ=𝒮\\Gamma=\\mathcal\{S\}yields optimal performance but no reduction in state space complexity, whereas coarser partitions reduce complexity at the cost of suboptimality\. This trade\-off motivates a multi\-timescale algorithm that adaptively refines or aggregates regions of the state space to reduce estimated performance loss\.

## 4Tree\-based Adaptive State Aggregation

Thus far, our analysis assumed a fixed aggregationΓ\\Gamma\. We now introduce a tree\-structured scheme that adapts the aggregation online while the actor–critic operates on a faster timescale\.

We represent the state space𝒮\\mathcal\{S\}using a hierarchical tree𝒯=\(𝒱,ℰ\)\\mathcal\{T\}=\(\\mathcal\{V\},\\mathcal\{E\}\)\. Each leafγ∈𝒱leaf​\(𝒯\)\\gamma\\in\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\)defines a superstate, thereby inducing a state aggregation\. We use𝒯\\mathcal\{T\}interchangeably with the partition it induces\. Let𝒯𝒮\\mathcal\{T\}\_\{\\mathcal\{S\}\}denote the finest tree whose leaves are singletons \(i\.e\.,𝒱leaf​\(𝒯𝒮\)=𝒮\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\_\{\\mathcal\{S\}\}\)=\\mathcal\{S\}\)\. Throughout, we use quadtrees for concreteness, although the method extends to general hierarchical trees\. An example of a quadtree is presented in Figure[2](https://arxiv.org/html/2606.17377#S4.F2)\.

#### Tree operations\.

For a nodeγ\\gamma, letPrnt​\(γ\)\\mathrm\{Prnt\}\(\\gamma\)denote its parent andChldrn​\(γ\)\\mathrm\{Chldrn\}\(\\gamma\)its children\. A leafγ\\gammais*expandable*ifγ∉𝒱leaf​\(𝒯𝒮\)\\gamma\\notin\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\_\{\\mathcal\{S\}\}\)\. ForA⊆𝒱leaf​\(𝒯\)A\\subseteq\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\), the expansion operatorExpd​\(𝒯,A\)\\mathrm\{Expd\}\(\\mathcal\{T\},A\)replaces eachγ∈A\\gamma\\in Awith its children\. A nodeγ\\gammais*collapsible*if all its children are leaves of𝒯\\mathcal\{T\}\. ForBBa set of collapsible nodes, the collapse operatorCllps​\(𝒯,B\)\\mathrm\{Cllps\}\(\\mathcal\{T\},B\)merges children back into their parent\. For example, in Figure[2](https://arxiv.org/html/2606.17377#S4.F2)one may obtain𝒯𝒮=Expd​\(𝒯,\{γ2,γ3\}\)\\mathcal\{T\}\_\{\\mathcal\{S\}\}=\\mathrm\{Expd\}\(\\mathcal\{T\},\\\{\\gamma\_\{2\},\\gamma\_\{3\}\\\}\)\.

![Refer to caption](https://arxiv.org/html/2606.17377v1/x2.png)Figure 2:Tree representations of grid environments\.
### 4\.1Q\-Estimation for Tree Updates

The key research question in designing an adaptive state aggregation scheme is determining which nodes in tree𝒯\\mathcal\{T\}should be expanded \(refined\) and which should be collapsed \(coarsened\)\. Our approach addresses this by maintaining Q\-value estimates at the leaves and using them to guide dynamic refinement or coarsening of the partition\.

Given a tree representation𝒯\\mathcal\{T\}of the finite state space𝒮\\mathcal\{S\}, we maintain a Q\-value estimate at each leaf node, alongside the actor and critic, to evaluate policy performance locally\. Suppose the process is currently in stateStS\_\{t\}with corresponding superstateΓt=ϕ𝒯​\(St\)∈𝒱leaf​\(𝒯\)\\Gamma\_\{t\}=\\phi\_\{\\mathcal\{T\}\}\(S\_\{t\}\)\\in\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\)\. The Q\-estimate atΓt\\Gamma\_\{t\}is updated as

Q¯t\+1est​\(Γt,At\)=\(1−ηt\)​Q¯test​\(Γt,At\)\+ηt​\(Rt\+β​\(𝟙St\+1∈Γt​Q¯test​\(Γt,At\)\+𝟙St\+1∉Γt​maxa⁡Q¯test​\(Γt\+1,a\)\)\),\\displaystyle\\bar\{Q\}\_\{t\+1\}^\{\\mathrm\{est\}\}\(\\Gamma\_\{t\},A\_\{t\}\)=\(1\-\\eta\_\{t\}\)\\bar\{Q\}\_\{t\}^\{\\mathrm\{est\}\}\(\\Gamma\_\{t\},A\_\{t\}\)\+\\eta\_\{t\}\\Big\(R\_\{t\}\+\\beta\\big\(\\mathds\{1\}\_\{S\_\{t\+1\}\\in\\Gamma\_\{t\}\}\\bar\{Q\}\_\{t\}^\{\\mathrm\{est\}\}\(\\Gamma\_\{t\},A\_\{t\}\)\+\\mathds\{1\}\_\{S\_\{t\+1\}\\notin\\Gamma\_\{t\}\}\\max\_\{a\}\\bar\{Q\}\_\{t\}^\{\\mathrm\{est\}\}\(\\Gamma\_\{t\+1\},a\)\\big\)\\Big\),\(9\)whereAtA\_\{t\}is drawn according to \([6](https://arxiv.org/html/2606.17377#S3.E6)\) from the aggregated actor, andΓt\+1∈𝒱leaf​\(𝒯\)\\Gamma\_\{t\+1\}\\in\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\)denotes the next superstate reached under𝒯\\mathcal\{T\}\. We update the estimated Q\-values on a slower timescale than the actor to ensure that the partition remains approximately fixed \(quasi\-static\) while the actor learns the aggregated policy under the current partition\. Specifically, we enforce thatlimt→∞ηt/ζt=0\\lim\_\{t\\to\\infty\}\\eta\_\{t\}/\\zeta\_\{t\}=0, whereζt\\zeta\_\{t\}is the step size for the actor update\.

To assess whether a leaf node should be expanded, we maintain Q\-estimates not only for the node itself but also for its prospective children\. The intuition is that the discrepancies between parent and child Q\-values reflect the potential benefit of refining the representation in that region\. Formally, letΓt∈𝒱leaf​\(𝒯\)\\Gamma\_\{t\}\\in\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\)be an expandable leaf, and define the locally expanded tree𝒯~=Expd​\(𝒯,Γt\)\\widetilde\{\\mathcal\{T\}\}=\\mathrm\{Expd\}\(\\mathcal\{T\},\\Gamma\_\{t\}\)\.

LetΓ~t=ϕ𝒯~​\(St\)\\widetilde\{\\Gamma\}\_\{t\}=\\phi\_\{\\widetilde\{\\mathcal\{T\}\}\}\(S\_\{t\}\)andΓ~t\+1=ϕ𝒯~​\(St\+1\)\\widetilde\{\\Gamma\}\_\{t\+1\}=\\phi\_\{\\widetilde\{\\mathcal\{T\}\}\}\(S\_\{t\+1\}\)\. By construction, we always haveΓ~t∈Chldrn​\(Γt\)\\widetilde\{\\Gamma\}\_\{t\}\\in\\mathrm\{Chldrn\}\(\\Gamma\_\{t\}\)\. For the next time step, ifSt\+1∈ΓtS\_\{t\+1\}\\in\\Gamma\_\{t\}, the agent remains within the expanded region, soΓ~t\+1\\widetilde\{\\Gamma\}\_\{t\+1\}is eitherΓ~t\\widetilde\{\\Gamma\}\_\{t\}as before or a sibling ofΓ~t\\widetilde\{\\Gamma\}\_\{t\}\. IfSt\+1∉ΓtS\_\{t\+1\}\\notin\\Gamma\_\{t\}, then the agent has left the expanded region, andΓ~t\+1\\widetilde\{\\Gamma\}\_\{t\+1\}coincides with the superstateΓt\+1\\Gamma\_\{t\+1\}in the original process\. The Q\-estimate for a childΓ~t∈Chldrn​\(Γt\)\\widetilde\{\\Gamma\}\_\{t\}\\in\\mathrm\{Chldrn\}\(\\Gamma\_\{t\}\)is then updated as

Q¯t\+1est\(Γ~t,At\)=\(1−ηt\)Q¯test\(Γ~t,At\)\+ηt\(Rt\+β\(𝟙St\+1∈Γ~tQ¯test\(Γ~t,At\)\\displaystyle\\bar\{Q\}\_\{t\+1\}^\{\\mathrm\{est\}\}\(\\widetilde\{\\Gamma\}\_\{t\},A\_\{t\}\)=\(1\-\\eta\_\{t\}\)\\,\\bar\{Q\}\_\{t\}^\{\\mathrm\{est\}\}\(\\widetilde\{\\Gamma\}\_\{t\},A\_\{t\}\)\+\\eta\_\{t\}\\Big\(R\_\{t\}\+\\beta\\big\(\\mathds\{1\}\_\{S\_\{t\+1\}\\in\\widetilde\{\\Gamma\}\_\{t\}\}\\bar\{Q\}\_\{t\}^\{\\mathrm\{est\}\}\(\\widetilde\{\\Gamma\}\_\{t\},A\_\{t\}\)\+𝟙St\+1∈Γt∖Γ~t​maxa⁡Q¯test​\(Γ~t\+1,a\)\\displaystyle\+\\mathds\{1\}\_\{S\_\{t\+1\}\\in\\Gamma\_\{t\}\\setminus\\widetilde\{\\Gamma\}\_\{t\}\}\\max\_\{a\}\\bar\{Q\}\_\{t\}^\{\\mathrm\{est\}\}\(\\widetilde\{\\Gamma\}\_\{t\+1\},a\)\(10\)\+𝟙St\+1∉ΓtmaxaQ¯test\(Γt\+1,a\)\)\),\\displaystyle\+\\mathds\{1\}\_\{S\_\{t\+1\}\\notin\\Gamma\_\{t\}\}\\max\_\{a\}\\bar\{Q\}\_\{t\}^\{\\mathrm\{est\}\}\(\\Gamma\_\{t\+1\},a\)\\big\)\\Big\),where the three indicators correspond to the three cases discussed above\. An example of this update rule is shown in Figure[3](https://arxiv.org/html/2606.17377#S4.F3), which corresponds to the third case, where the agent leaves the parentΓt\\Gamma\_\{t\}after the transition\.

![Refer to caption](https://arxiv.org/html/2606.17377v1/x3.png)Figure 3:Illustration of the child Q\-estimate update\. Red circles mark the state transition and the corresponding superstates/nodes\{Γt\}\\\{\\Gamma\_\{t\}\\\}and\{Γ~t\}\\\{\\widetilde\{\\Gamma\}\_\{t\}\\\}when the tree is expanded\.For a collapsible parentΓ′=Prnt​\(Γt\)\\Gamma^\{\\prime\}=\\mathrm\{Prnt\}\(\\Gamma\_\{t\}\), we define𝒯~=Cllps​\(𝒯,Γ′\)\\widetilde\{\\mathcal\{T\}\}=\\mathrm\{Cllps\}\(\\mathcal\{T\},\\Gamma^\{\\prime\}\)and update the merged node Q\-estimate analogously to \([10](https://arxiv.org/html/2606.17377#S4.E10)\)\.

### 4\.2Node Expansion and Collapse Criteria

By Lemma[1](https://arxiv.org/html/2606.17377#Thmtheorem1), the aggregated actor–critic converges to the optimal aggregated policy for a fixed tree𝒯\\mathcal\{T\}\. Since the structural updates evolve on a slower timescale, we adopt the standard multi\-timescale argument: when updating the partition, the actor–critic is treated as converged\. Thus, the structural optimization reduces to selecting a tree𝒯\\mathcal\{T\}, with the corresponding optimal policyπ𝒯∗\\pi\_\{\\mathcal\{T\}\}^\{\*\}obtained implicitly\.

We therefore consider the regularized objective

max𝒯⁡J​\(π𝒯∗;μ0\)−λ​\|𝒱leaf​\(𝒯\)\|\.\\max\_\{\\mathcal\{T\}\}\\;\\;J\(\\pi\_\{\\mathcal\{T\}\}^\{\*\};\\mu\_\{0\}\)\-\\lambda\|\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\)\|\.\(11\)
With only the aggregated value in hand, we use the approximationJ​\(π𝒯∗;μ0\)≈∑γ∑s∈γμ0​\(s\)​V¯𝒯,μ∗∗​\(γ\)J\(\\pi^\{\*\}\_\{\\mathcal\{T\}\};\\mu\_\{0\}\)\\approx\\sum\_\{\\gamma\}\\sum\_\{s\\in\\gamma\}\\mu\_\{0\}\(s\)\\bar\{V\}^\{\*\}\_\{\\mathcal\{T\},\\mu^\{\*\}\}\(\\gamma\)\. Subtracting the constantV∗​\(s\)V^\{\*\}\(s\)and bounding the difference by its absolute value yields the equivalent minimization

min𝒯​∑γ∈𝒱leaf​\(𝒯\)∑s∈γμ0​\(s\)​\|V¯𝒯,μ∗∗​\(γ\)−V∗​\(s\)\|\+λ​\|𝒱leaf​\(𝒯\)\|\.\\min\_\{\\mathcal\{T\}\}\\sum\_\{\\gamma\\in\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\)\}\\sum\_\{s\\in\\gamma\}\\mu\_\{0\}\(s\)\\big\|\\bar\{V\}^\{\*\}\_\{\\mathcal\{T\},\\mu^\{\*\}\}\(\\gamma\)\-V^\{\*\}\(s\)\\big\|\+\\lambda\|\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\)\|\.Further bounding the weighted sum by the sup norm gives

min𝒯⁡‖Ψ𝒯​V¯𝒯,μ∗∗−V∗‖∞\+λ​\|𝒱leaf​\(𝒯\)\|\.\\min\_\{\\mathcal\{T\}\}\{\\\|\\Psi\_\{\\mathcal\{T\}\}\\bar\{V\}^\{\*\}\_\{\\mathcal\{T\},\\mu^\{\*\}\}\-V^\{\*\}\\\|\}\_\{\\infty\}\+\\lambda\|\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\)\|\.Applying Corollary[3](https://arxiv.org/html/2606.17377#Thmtheorem3)leads to the final optimization problem:

min𝒯⁡ϵ¯𝒯1−β\+λ​\|𝒱leaf​\(𝒯\)\|,\\min\_\{\\mathcal\{T\}\}\\penalty 10000\\ \\frac\{\\bar\{\\epsilon\}\_\{\\mathcal\{T\}\}\}\{1\-\\beta\}\+\\lambda\|\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\)\|,\(12\)whereϵ¯𝒯=‖Ψ𝒯​ℬ𝒯,μ∗​V¯𝒯,μ∗∗−ℬ𝒮​Ψ𝒯​V¯𝒯,μ∗∗‖∞\\bar\{\\epsilon\}\_\{\\mathcal\{T\}\}=\{\\\|\\Psi\_\{\\mathcal\{T\}\}\\mathcal\{B\}\_\{\\mathcal\{T\},\\mu^\{\*\}\}\\bar\{V\}^\{\*\}\_\{\\mathcal\{T\},\\mu^\{\*\}\}\-\\mathcal\{B\}\_\{\\mathcal\{S\}\}\\Psi\_\{\\mathcal\{T\}\}\\bar\{V\}^\{\*\}\_\{\\mathcal\{T\},\\mu^\{\*\}\}\\\|\}\_\{\\infty\}\. This objective still captures the fundamental trade\-off: reducing abstraction error versus maintaining a compact tree\.

#### Expansion criterion\.

Suppose the current quadtree is𝒯\\mathcal\{T\}, and let𝒯′\\mathcal\{T\}^\{\\prime\}be obtained by expanding a leafγ∈𝒱leaf​\(𝒯\)\\gamma\\in\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\)\. The error \([12](https://arxiv.org/html/2606.17377#S4.E12)\) of the two trees are

ϵ¯𝒯1−β\+λ​\|𝒱leaf​\(𝒯\)\|andϵ¯𝒯′1−β\+λ​\|𝒱leaf​\(𝒯′\)\|\.\\frac\{\\bar\{\\epsilon\}\_\{\\mathcal\{T\}\}\}\{1\-\\beta\}\+\\lambda\|\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\)\|\\quad\\text\{and\}\\quad\\frac\{\\bar\{\\epsilon\}\_\{\\mathcal\{T\}^\{\\prime\}\}\}\{1\-\\beta\}\+\\lambda\|\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}^\{\\prime\}\)\|\.Since expanding one leaf in a quadtree adds three leaf nodes, a greedy one\-step lookahead therefore expandsγ\\gammaif

Δ​ϵ¯𝒯,γ:=ϵ¯𝒯−ϵ¯𝒯′≥3​\(1−β\)​λ\.\\Delta\\bar\{\\epsilon\}\_\{\\mathcal\{T\},\\gamma\}:=\\bar\{\\epsilon\}\_\{\\mathcal\{T\}\}\-\\bar\{\\epsilon\}\_\{\\mathcal\{T\}^\{\\prime\}\}\\;\\geq\\;3\(1\-\\beta\)\\lambda\.
Direct evaluation ofΔ​ϵ¯𝒯,γ\\Delta\\bar\{\\epsilon\}\_\{\\mathcal\{T\},\\gamma\}requires the full model and state\-level Bellman operatorℬ𝒮\\mathcal\{B\}\_\{\\mathcal\{S\}\}, which defeats the purpose of aggregation\. Instead, we approximate the improvement using Q\-estimates\. SinceΔ​ϵ¯𝒯,γ\\Delta\\bar\{\\epsilon\}\_\{\\mathcal\{T\},\\gamma\}captures the benefit of allowing children ofγ\\gammato adopt distinct action distributions \(rather than a shared one\), we approximate this gain by comparing the best action performance atγ\\gammawith the best performance attainable by its children:

Δ^​ϵ¯𝒯,γ:=maxγ′∈Chldrn​\(γ\)⁡maxa∈𝒜⁡Q¯test​\(γ′,a\)−maxa∈𝒜⁡Q¯test​\(γ,a\)\.\\hat\{\\Delta\}\\bar\{\\epsilon\}\_\{\\mathcal\{T\},\\gamma\}\\\!:=\\\!\\\!\\\!\\\!\\\!\\max\_\{\\gamma^\{\\prime\}\\in\\mathrm\{Chldrn\}\(\\gamma\)\}\\max\_\{a\\in\\mathcal\{A\}\}\\bar\{Q\}\_\{t\}^\{\\mathrm\{est\}\}\(\\gamma^\{\\prime\},a\)\\\!\-\\\!\\max\_\{a\\in\\mathcal\{A\}\}\\bar\{Q\}\_\{t\}^\{\\mathrm\{est\}\}\(\\gamma,a\)\.\(13\)This leads to the following expansion criterion:

Δ^​ϵ¯𝒯,γ≥3​\(1−β\)​λ\.\\hat\{\\Delta\}\\bar\{\\epsilon\}\_\{\\mathcal\{T\},\\gamma\}\\;\\geq\\;3\(1\-\\beta\)\\lambda\.\(14\)
Upon expansion, the children superstates inherit actor and critic values from the parent via a soft update\. TheExpansionroutine is presented in Appendix[Appendix C](https://arxiv.org/html/2606.17377#A3)\.

#### Collapse criterion\.

Analogous to the expansion case, let𝒯′\\mathcal\{T\}^\{\\prime\}denote the tree obtained by collapsing a nodeγ\\gamma\. Collapse is beneficial ifϵ¯𝒯′−ϵ¯𝒯<3​\(1−β\)​λ\.\\bar\{\\epsilon\}\_\{\\mathcal\{T\}^\{\\prime\}\}\-\\bar\{\\epsilon\}\_\{\\mathcal\{T\}\}<3\(1\-\\beta\)\\lambda\.Using the same Q\-based approximation, we collapseγ\\gammawhen

Δ^​ϵ¯𝒯′,γ<3​\(1−β\)​λ\.\\hat\{\\Delta\}\\bar\{\\epsilon\}\_\{\\mathcal\{T\}^\{\\prime\},\\gamma\}<3\(1\-\\beta\)\\lambda\.\(15\)
When a node is collapsed, its actor\-critic value is reinitialized from its children through a soft update, as specified in theCollapseroutine in Appendix[Appendix C](https://arxiv.org/html/2606.17377#A3)\.

The complete learning procedure, integrating adaptive aggregation with aggregated actor–critic updates, is summarized in Algorithm[3](https://arxiv.org/html/2606.17377#algorithm3)in Appendix[Appendix C](https://arxiv.org/html/2606.17377#A3)\.

## 5Numerical Examples

We evaluate the proposed method on both discrete navigation and continuous\-control tasks and compare its performance against several learning\-based baselines\.

Figure[4](https://arxiv.org/html/2606.17377#S5.F4)shows the sequence of abstractions learned until convergence at 15,000 episodes\. The color map represents the aggregated value functionV¯\\bar\{V\}, with warmer colors indicating higher values\.

The refinement process begins near the goal and propagates outward as value information spreads through the state space\. Furthermore, the algorithm adaptively refines the abstraction near obstacles and narrow corridors, allowing fine\-grained maneuvers in critical regions while preserving coarser partitions elsewhere\.

![Refer to caption](https://arxiv.org/html/2606.17377v1/x4.png)Figure 4:Sequence of learned quadtrees in a16×1616\\\!\\times\\\!16grid\.We next relocate the goal from the top\-left to the bottom\-right corner as shown in Figure[5](https://arxiv.org/html/2606.17377#S5.F5)\. While the actor–critic values are reinitialized, the algorithm is warm\-started with the previously learned abstraction\. As a result, the algorithm converges much faster and quickly adapts by refining near the new goal and coarsening regions no longer requiring high resolution\. This illustrates the benefit of retaining task\-relevant structure across related tasks\. Nonetheless, some regions near the previous goal remain overly refined due to exploration bias: since action selections are biased toward fine\-grained maneuvers, alternative actions are not adequately evaluated\.

![Refer to caption](https://arxiv.org/html/2606.17377v1/x5.png)Figure 5:Sequence of learned quadtree abstractions, starting with a previously learned abstraction\.Figure[6](https://arxiv.org/html/2606.17377#S5.F6)shows learned policies and state aggregations in randomly generated mazes\. Subplot \(a\) presents an8×88\\times 8maze\. Subplots \(b\) and \(c\) depict a refined16×1616\\times 16version of the same maze augmented with two additional narrow corridors highlighted in cyan\. Abstractions and policies in \(b\) and \(c\) are learned withλ=10\\lambda=10andλ=5\\lambda=5, respectively\. Subplot \(d\) shows a32×3232\\times 32maze\.

![Refer to caption](https://arxiv.org/html/2606.17377v1/x6.png)Figure 6:Learned aggregated policies and state aggregations in random mazes\. \(a\)8×88\\times 8maze\. \(b\)–\(c\)16×1616\\times 16maze with added corridors, learned withλ=10\\lambda=10andλ=5\\lambda=5\.The learned abstractions significantly compress the state space: 28 superstates out of 64 states in \(a\), 34/256 in \(b\), 55/256 in \(c\), and 232/1024 in \(d\)\. Despite this compression, the algorithm preserves task\-relevant topological structures\. Relating to robotics applications, the16×1616\\times 16grid in \(b\) and \(c\) can be interpreted as a higher\-resolution environment model of the8×88\\times 8model in \(a\)\. The policies and partitions in \(a\) and \(b\) are nearly identical except near the goal, because the algorithm in \(b\), withλ=10\\lambda=10, ignores the narrow corridors and produces a suboptimal policy in exchange for fewer aggregated states\. In contrast, the lower penalty \(λ=5\\lambda=5\) in \(c\) encourages use of the added corridors, yielding different policies that exploit the shorter paths\.

Figure[7](https://arxiv.org/html/2606.17377#S5.F7)compares the proposed method with a flat actor–critic baseline, the quantizer\-based hierarchical actor–critic\[baras2000learning\], and the CAT algorithm\[dadvar2023conditional\]across 40 randomly generated16×1616\\times 16navigation grids, a128×128128\\times 128Mars terrain map, and a discretized Mountain Car control problem\. We evaluate three variants of the proposed tree\-based actor–critic algorithm:Tree\-AC, initialized with a single superstate corresponding to the root node;Tree\-AC\-Depth2, initialized with a depth\-2 tree; andTree\-AC\-Replan, warm\-started from a previously learned partition after the goal location is changed\.

Across all applicable scenarios,Tree\-AC\-Replanconverges the fastest, demonstrating the benefit of reusing a previously learned abstraction to adapt to task changes\.111We do not evaluateTree\-AC\-Replanon the Mountain Car task because the policy learned for the original goal can already reach all target positions\.Moreover,Tree\-AC\-Depth2converges faster thanTree\-AC, demonstrating the benefit of the proposed approach’s flexibility in allowing the user to specify the initial abstraction granularity based on the task\.

![Refer to caption](https://arxiv.org/html/2606.17377v1/x7.png)Figure 7:Training curves of algorithms on discrete navigation tasks and the discretized Mountain\-Car control problem\.In the16×1616\\times 16grid\-navigation task,BaseActorCriticinitially converges faster thanTree\-ACinitialized at depth 0, indicating that additional training is required to learn an effective abstraction\. TheTree\-AC\-Depth2variant learns faster thanBaseActorCriticduring the early stages of training but slows during the intermediate stage, as it must further refine the abstraction before converging to a high\-performing policy\. TheTree\-AC\-Replanvariant substantially outperformsBaseActorCritic, highlighting the benefit of reusing a learned environment representation when adapting to a new task\. Moreover, bothTree\-ACvariants outperform their correspondingCATvariants, as well asQuantization\-AC\.

The advantages ofTree\-ACbecome more pronounced in the larger128×128128\\times 128Mars terrain\-navigation task and the Mountain Car control problem\. In particular,BaseActorCriticstruggles to learn effective policies in both settings, whileCATexhibits training instability, especially in the Mountain Car task\.

Finally, we investigate how the initial tree depth affects the number of training episodes required to achieve a 95% success rate in the grid\-navigation problem\. The U\-shaped trend in the left subplot of Figure[8](https://arxiv.org/html/2606.17377#S5.F8)reveals a trade\-off between abstraction granularity and sample efficiency\. When initialized at a large depth, the algorithm begins with a fine\-grained abstraction that enables more informative feedback but introduces a high\-dimensional decision space, thereby slowing learning\. At the other extreme, aggregating all states into a single superstate provides insufficient resolution for collecting informative feedback and requires additional episodes to expand the abstraction to an effective level\. Consequently, initializing the algorithm at an intermediate tree depth minimizes the number of training episodes required to achieve a high success rate, consistent with the training curves shown in Figure[7](https://arxiv.org/html/2606.17377#S5.F7)\.

The right subplot of Figure[8](https://arxiv.org/html/2606.17377#S5.F8)presents the Pareto frontier between abstraction size and policy performance, highlighting the trade\-off between the two\. As the abstraction becomes finer, the algorithm achieves a 99% success rate before fully refining the policy\. Consequently, the two solutions with the largest abstraction sizes are dominated by smaller abstractions with comparable performance and therefore do not lie on the Pareto frontier\. At the other extreme, abstractions containing fewer than 20% as many superstates as the original state space contains states fail to achieve a 60% success rate\. These low\-performing solutions are omitted from the plot for clarity\.

![Refer to caption](https://arxiv.org/html/2606.17377v1/x8.png)Figure 8:Left:Effect of the initial tree depth on the episodes required to reach a 95% success rate\.Right:The Pareto frontier between abstraction size and policy performance\.
## 6Conclusions

We studied performance\-driven abstractions for decision\-making in large state spaces\. For a fixed partition, we established a performance guarantee that separates value\-approximation error from the loss induced by the same\-action\-distribution \(SAD\) constraint\. Motivated by this analysis, we developed a multi\-timescale learning framework that jointly optimizes the policy and a tree\-based state abstraction\. The resulting algorithm adaptively refines and coarsens regions of the state space to balance performance and representational complexity\. Empirical results demonstrate that the learned hierarchical structure enables substantial state compression, accelerates replanning, and improves adaptability compared to flat actor–critic baselines\. A key future direction is to move beyond the current one\-level greedy lookahead and establish theoretical guarantees for multi\-level or globally optimal abstraction updates\. Another promising extension is to apply the framework beyond grid domains, including unstructured environments and joint state spaces in multi\-agent systems\.

## References

## Appendix Appendix AProof of Theorem[2](https://arxiv.org/html/2606.17377#Thmtheorem2)

See[2](https://arxiv.org/html/2606.17377#Thmtheorem2)

###### Proof\.

We first decompose the error‖ΨΓ​V¯Γ,μ∗∗−V∗‖\{\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\-V^\{\*\}\\\|\}into two parts\.

‖ΨΓ​V¯Γ,μ∗∗−V∗‖∞\\displaystyle\{\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\-V^\{\*\}\\\|\}\_\{\\infty\}≤‖ΨΓ​V¯Γ,μ∗∗−ΨΓ​V¯opt‖∞\+‖ΨΓ​V¯opt−V∗‖∞\\displaystyle\\leq\{\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\-\\Psi\_\{\\Gamma\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}\+\{\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\-V^\{\*\}\\\|\}\_\{\\infty\}=‖ΨΓ​V¯Γ,μ∗∗−ΨΓ​V¯opt‖∞\+ϵΓ\\displaystyle=\{\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\-\\Psi\_\{\\Gamma\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}\+\\epsilon\_\{\\Gamma\}≤‖V¯Γ,μ∗∗−V¯opt‖∞\+ϵΓ,\\displaystyle\\leq\{\\\|\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\-\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}\+\\epsilon\_\{\\Gamma\},\(16\)where we use the non\-expansiveness ofΨΓ\\Psi\_\{\\Gamma\}for the last inequality\. To bound the first term, we consider

‖V¯opt−ℬΓ,μ∗​V¯opt‖∞\\displaystyle\{\\\|\\bar\{V\}\_\{\\mathrm\{opt\}\}\-\\mathcal\{B\}\_\{\\Gamma,\\mu^\{\*\}\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}≤‖ΨΓ,μ∗\+∘ΨΓ​V¯opt−ΨΓ,μ∗\+∘ℬ𝒮∘ΨΓ​V¯opt‖∞\+‖ΨΓ,μ∗\+∘ℬ𝒮∘ΨΓ​V¯opt−ℬΓ,μ∗​V¯opt‖∞⏟δΓ\\displaystyle\\leq\{\\\|\\Psi^\{\+\}\_\{\\Gamma,\\mu^\{\*\}\}\\circ\\Psi\_\{\\Gamma\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\-\\Psi^\{\+\}\_\{\\Gamma,\\mu^\{\*\}\}\\circ\\mathcal\{B\}\_\{\\mathcal\{S\}\}\\circ\\Psi\_\{\\Gamma\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}\+\\underbrace\{\{\\\|\\Psi^\{\+\}\_\{\\Gamma,\\mu^\{\*\}\}\\circ\\mathcal\{B\}\_\{\\mathcal\{S\}\}\\circ\\Psi\_\{\\Gamma\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\-\\mathcal\{B\}\_\{\\Gamma,\\mu^\{\*\}\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}\}\_\{\\delta\_\{\\Gamma\}\}≤\(i\)‖ΨΓ​V¯opt−ℬ𝒮∘ΨΓ​V¯opt‖∞\+δΓ\\displaystyle\\stackrel\{\{\\scriptstyle\(\\text\{i\}\)\}\}\{\{\\leq\}\}\{\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\-\\mathcal\{B\}\_\{\\mathcal\{S\}\}\\circ\\Psi\_\{\\Gamma\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}\+\\delta\_\{\\Gamma\}≤‖ΨΓ​V¯opt−V∗‖∞\+‖V∗−ℬ𝒮∘ΨΓ​V¯opt‖∞\+δΓ\\displaystyle\\leq\{\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\-V^\{\*\}\\\|\}\_\{\\infty\}\+\{\\\|V^\{\*\}\-\\mathcal\{B\}\_\{\\mathcal\{S\}\}\\circ\\Psi\_\{\\Gamma\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}\+\\delta\_\{\\Gamma\}=\(ii\)‖ΨΓ​V¯opt−V∗‖∞\+‖ℬ𝒮​V∗−ℬ𝒮∘ΨΓ​V¯opt‖∞\+δΓ\\displaystyle\\stackrel\{\{\\scriptstyle\(\\text\{ii\}\)\}\}\{\{=\}\}\{\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\-V^\{\*\}\\\|\}\_\{\\infty\}\+\{\\\|\\mathcal\{B\}\_\{\\mathcal\{S\}\}V^\{\*\}\-\\mathcal\{B\}\_\{\\mathcal\{S\}\}\\circ\\Psi\_\{\\Gamma\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}\+\\delta\_\{\\Gamma\}≤\(iii\)ϵΓ\+β​ϵΓ\+δΓ,\\displaystyle\\stackrel\{\{\\scriptstyle\(\\text\{iii\}\)\}\}\{\{\\leq\}\}\\epsilon\_\{\\Gamma\}\+\\beta\\epsilon\_\{\\Gamma\}\+\\delta\_\{\\Gamma\},where inequality \(i\) follows from the non\-expansiveness ofΨΓ,μ∗\+\\Psi^\{\+\}\_\{\\Gamma,\\mu^\{\*\}\}; equality \(ii\) uses the fact thatV∗V^\{\*\}is the fixed point of the Bellman operatorℬ𝒮\\mathcal\{B\}\_\{\\mathcal\{S\}\}, i\.e\.,ℬ𝒮​V∗=V∗\\mathcal\{B\}\_\{\\mathcal\{S\}\}V^\{\*\}=V^\{\*\}, and inequality \(iii\) is due toℬ𝒮\\mathcal\{B\}\_\{\\mathcal\{S\}\}being aβ\\beta\-contractive operator\.

Then, we have

‖V¯Γ,μ∗∗−V¯opt‖∞\\displaystyle\{\\\|\\bar\{V\}\_\{\\Gamma,\\mu^\{\*\}\}^\{\*\}\-\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}≤‖V¯Γ,μ∗∗−ℬΓ,μ∗​V¯opt‖∞\+‖ℬΓ,μ∗​V¯opt−V¯opt‖∞\\displaystyle\\leq\{\\\|\\bar\{V\}\_\{\\Gamma,\\mu^\{\*\}\}^\{\*\}\-\\mathcal\{B\}\_\{\\Gamma,\\mu^\{\*\}\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}\+\{\\\|\\mathcal\{B\}\_\{\\Gamma,\\mu^\{\*\}\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\-\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}=‖ℬΓ,μ∗​VΓ,μ∗∗−ℬΓ,μ∗​V¯opt‖∞\+‖ℬΓ,μ∗​V¯opt−V¯opt‖∞\\displaystyle=\{\\\|\\mathcal\{B\}\_\{\\Gamma,\\mu^\{\*\}\}V\_\{\\Gamma,\\mu^\{\*\}\}^\{\*\}\-\\mathcal\{B\}\_\{\\Gamma,\\mu^\{\*\}\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}\+\{\\\|\\mathcal\{B\}\_\{\\Gamma,\\mu^\{\*\}\}\\bar\{V\}\_\{\\mathrm\{opt\}\}\-\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}≤β​‖V¯Γ,μ∗∗−V¯opt‖∞\+\(1\+β\)​ϵΓ\+δΓ,\\displaystyle\\leq\\beta\{\\\|\\bar\{V\}\_\{\\Gamma,\\mu^\{\*\}\}^\{\*\}\-\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}\+\(1\+\\beta\)\\epsilon\_\{\\Gamma\}\+\\delta\_\{\\Gamma\},which implies that

‖V¯Γ,μ∗∗−V¯opt‖∞≤\(1\+β\)​ϵΓ\+δΓ1−β\.\{\\\|\\bar\{V\}\_\{\\Gamma,\\mu^\{\*\}\}^\{\*\}\-\\bar\{V\}\_\{\\mathrm\{opt\}\}\\\|\}\_\{\\infty\}\\leq\\frac\{\(1\+\\beta\)\\epsilon\_\{\\Gamma\}\+\\delta\_\{\\Gamma\}\}\{1\-\\beta\}\.\(17\)Substituting into \([16](https://arxiv.org/html/2606.17377#A1.E16)\), we obtain

‖ΨΓ​V¯Γ,μ∗∗−V∗‖∞≤\(1\+β\)​ϵΓ\+δΓ1−β\+ϵΓ=2​ϵΓ\+δΓ1−β\.\{\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\-V^\{\*\}\\\|\}\_\{\\infty\}\\leq\\frac\{\(1\+\\beta\)\\epsilon\_\{\\Gamma\}\+\\delta\_\{\\Gamma\}\}\{1\-\\beta\}\+\\epsilon\_\{\\Gamma\}=\\frac\{2\\epsilon\_\{\\Gamma\}\+\\delta\_\{\\Gamma\}\}\{1\-\\beta\}\.\(18\)∎

## Appendix Appendix BProof of Corollary[3](https://arxiv.org/html/2606.17377#Thmtheorem3)

See[3](https://arxiv.org/html/2606.17377#Thmtheorem3)

###### Proof\.

Notice thatV¯Γ,μ∗∗\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}is the fixed point ofℬΓ,μ∗\\mathcal\{B\}\_\{\\Gamma,\\mu^\{\*\}\}andV∗V^\{\*\}is the fixed point ofℬ𝒮\\mathcal\{B\}\_\{\\mathcal\{S\}\}\. Thus, we have

‖ΨΓ​V¯Γ,μ∗∗−V∗‖∞\\displaystyle\{\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\-V^\{\*\}\\\|\}\_\{\\infty\}=‖ΨΓ​ℬΓ,μ∗​V¯Γ,μ∗∗−ℬ𝒮​V∗‖∞\\displaystyle=\{\\\|\\Psi\_\{\\Gamma\}\\mathcal\{B\}\_\{\\Gamma,\\mu^\{\*\}\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\-\\mathcal\{B\}\_\{\\mathcal\{S\}\}V^\{\*\}\\\|\}\_\{\\infty\}≤‖ΨΓ​ℬΓ,μ∗​V¯Γ,μ∗∗−ℬ𝒮​ΨΓ​V¯Γ,μ∗∗‖∞\+‖ℬ𝒮​ΨΓ​V¯Γ,μ∗∗−ℬ𝒮​V∗‖∞\\displaystyle\\leq\{\\\|\\Psi\_\{\\Gamma\}\\mathcal\{B\}\_\{\\Gamma,\\mu^\{\*\}\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\-\\mathcal\{B\}\_\{\\mathcal\{S\}\}\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\\\|\}\_\{\\infty\}\+\{\\\|\\mathcal\{B\}\_\{\\mathcal\{S\}\}\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\-\\mathcal\{B\}\_\{\\mathcal\{S\}\}V^\{\*\}\\\|\}\_\{\\infty\}≤‖ΨΓ​ℬΓ,μ∗​V¯Γ,μ∗∗−ℬ𝒮​ΨΓ​V¯Γ,μ∗∗‖∞\+β​‖ΨΓ​V¯Γ,μ∗∗−V∗‖∞\.\\displaystyle\\leq\{\\\|\\Psi\_\{\\Gamma\}\\mathcal\{B\}\_\{\\Gamma,\\mu^\{\*\}\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\-\\mathcal\{B\}\_\{\\mathcal\{S\}\}\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\\\|\}\_\{\\infty\}\+\\beta\{\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\-V^\{\*\}\\\|\}\_\{\\infty\}\.Rearranging terms yields

‖ΨΓ​V¯Γ,μ∗∗−V∗‖∞≤‖ΨΓ​ℬΓ,μ∗​V¯Γ,μ∗∗−ℬ𝒮​ΨΓ​V¯Γ,μ∗∗‖∞1−β\.\{\\\|\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\-V^\{\*\}\\\|\}\_\{\\infty\}\\leq\\frac\{\{\\\|\\Psi\_\{\\Gamma\}\\mathcal\{B\}\_\{\\Gamma,\\mu^\{\*\}\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\-\\mathcal\{B\}\_\{\\mathcal\{S\}\}\\Psi\_\{\\Gamma\}\\bar\{V\}^\{\*\}\_\{\\Gamma,\\mu^\{\*\}\}\\\|\}\_\{\\infty\}\}\{1\-\\beta\}\.
∎

## Appendix Appendix CPseudo\-code for the Learning Algorithm

1

2Function*ExpandNode\(*𝒯,γ,Q¯,V¯,Q¯est\\mathcal\{T\},\\gamma,\\bar\{Q\},\\bar\{V\},\\bar\{Q\}^\{\\mathrm\{est\}\}*\)*:

3

𝒯←Expd​\(𝒯,γ\)\\mathcal\{T\}\\leftarrow\\mathrm\{Expd\}\(\\mathcal\{T\},\\gamma\);

4

Q¯​\(γ′,a\)←\(1−ρ\)​Q¯​\(γ,a\)\+ρ​Q¯est​\(γ′,a\)\\bar\{Q\}\(\\gamma^\{\\prime\},a\)\\leftarrow\(1\-\\rho\)\\bar\{Q\}\(\\gamma,a\)\+\\rho\\bar\{Q\}^\{\\mathrm\{est\}\}\(\\gamma^\{\\prime\},a\)for all

γ′∈Chldrn​\(γ\)\\gamma^\{\\prime\}\\in\\mathrm\{Chldrn\}\(\\gamma\)and

a∈𝒜a\\in\\mathcal\{A\};

5

V¯​\(γ′\)←\(1−ρ\)​V¯​\(γ\)\+ρ​maxa∈𝒜⁡Q¯est​\(γ′,a\)\\bar\{V\}\(\\gamma^\{\\prime\}\)\\leftarrow\(1\-\\rho\)\\bar\{V\}\(\\gamma\)\+\\rho\\max\_\{a\\in\\mathcal\{A\}\}\\bar\{Q\}^\{\\mathrm\{est\}\}\(\\gamma^\{\\prime\},a\)for all

γ′∈Chldrn​\(γ\)\\gamma^\{\\prime\}\\in\\mathrm\{Chldrn\}\(\\gamma\);

6

7

Algorithm 1Expansion1

2Function*CollapseNode\(*𝒯,γ,Q¯,V¯,Q¯est\\mathcal\{T\},\\gamma,\\bar\{Q\},\\bar\{V\},\\bar\{Q\}^\{\\mathrm\{est\}\}*\)*:

3

𝒯←Cllps​\(𝒯,γ\)\\mathcal\{T\}\\leftarrow\\mathrm\{Cllps\}\(\\mathcal\{T\},\\gamma\);

4

Q¯​\(γ,a\)←\(1−ρ\)​∑γ′∈Chldrn​\(γ\)1\|Chldrn​\(γ\)\|​Q¯​\(γ′,a\)\+ρ​Q¯est​\(γ,a\)\\bar\{Q\}\(\\gamma,a\)\\leftarrow\(1\-\\rho\)\\sum\_\{\\gamma^\{\\prime\}\\in\\mathrm\{Chldrn\}\(\\gamma\)\}\\frac\{1\}\{\|\\mathrm\{Chldrn\}\(\\gamma\)\|\}\\bar\{Q\}\(\\gamma^\{\\prime\},a\)\+\\rho\\bar\{Q\}^\{\\mathrm\{est\}\}\(\\gamma,a\)for all

a∈𝒜a\\in\\mathcal\{A\};

5

V¯​\(γ′\)←\(1−ρ\)​∑γ′∈Chldrn​\(γ\)1\|Chldrn​\(γ\)\|​V¯​\(γ′\)\+ρ​maxa∈𝒜⁡Q¯est​\(γ,a\)\\bar\{V\}\(\\gamma^\{\\prime\}\)\\leftarrow\(1\-\\rho\)\\sum\_\{\\gamma^\{\\prime\}\\in\\mathrm\{Chldrn\}\(\\gamma\)\}\\frac\{1\}\{\|\\mathrm\{Chldrn\}\(\\gamma\)\|\}\\bar\{V\}\(\\gamma^\{\\prime\}\)\+\\rho\\max\_\{a\\in\\mathcal\{A\}\}\\bar\{Q\}^\{\\mathrm\{est\}\}\(\\gamma,a\);

6

7

Algorithm 2Collapse1

2Function*Main\(\)*:

3Initialize tree

𝒯\\mathcal\{T\};

4Initialize

V¯0​\(γ\)\\bar\{V\}\_\{0\}\(\\gamma\),

Q¯0​\(γ,a\)\\bar\{Q\}\_\{0\}\(\\gamma,a\)for all

γ∈𝒱leaf​\(𝒯\)\\gamma\\in\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\)and

a∈𝒜a\\in\\mathcal\{A\};

5Initialize

Q¯0est​\(γ′,a\)\\bar\{Q\}\_\{0\}^\{\\mathrm\{est\}\}\(\\gamma^\{\\prime\},a\)for all

γ′∈Chldrn​\(γ\)∪Prnt​\(γ\)\\gamma^\{\\prime\}\\in\\mathrm\{Chldrn\}\(\\gamma\)\\cup\\mathrm\{Prnt\}\(\\gamma\),

γ∈𝒱leaf​\(𝒯\)\\gamma\\in\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\);

6Initialize learning rates

ξ0,ζ0,η0\\xi\_\{0\},\\zeta\_\{0\},\\eta\_\{0\};

7for*ep=0​to​epmax\\mathrm\{ep\}=0\\text\{ to \}\\mathrm\{ep\}\_\{\\mathrm\{max\}\}*do

8Initialize starting state

StS\_\{t\};

9while*StS\_\{t\}is not terminal*do

10Find current superstate

Γt=ϕ𝒯​\(St\)\\Gamma\_\{t\}=\\phi\_\{\\mathcal\{T\}\}\(S\_\{t\}\)and sample action

AtA\_\{t\}according to

ψQ¯​\(Γt,⋅\)\\psi\_\{\\bar\{Q\}\}\(\\Gamma\_\{t\},\\cdot\);

11Take action

AtA\_\{t\}, observe reward

RtR\_\{t\}and next state

St\+1S\_\{t\+1\}and superstate

Γt\+1=ϕ𝒯​\(St\+1\)\\Gamma\_\{t\+1\}=\\phi\_\{\\mathcal\{T\}\}\(S\_\{t\+1\}\);

12Update Aggregated Actor\-Critic according to \([5](https://arxiv.org/html/2606.17377#S3.E5)\);

13PerformUpdateQest\(*𝒯input=𝒯,St,At,St\+1,ηt\\mathcal\{T\}\_\{\\mathrm\{input\}\}=\\mathcal\{T\},S\_\{t\},A\_\{t\},S\_\{t\+1\},\\eta\_\{t\}*\);

14

15if*Γt\\Gamma\_\{t\}is expandable*then

16PerformUpdateQest\(*𝒯input=Expd​\(𝒯,Γt\),St,At,St\+1,ηt\\mathcal\{T\}\_\{\\mathrm\{input\}\}=\\mathrm\{Expd\}\(\\mathcal\{T\},\\Gamma\_\{t\}\),S\_\{t\},A\_\{t\},S\_\{t\+1\},\\eta\_\{t\}*\);

17

18if*Prnt​\(Γt\)\\mathrm\{Prnt\}\(\\Gamma\_\{t\}\)is collapsible*then

19PerformUpdateQest\(*𝒯input=Cllps​\(𝒯,Prnt​\(Γt\)\),St,At,St\+1,ηt\\mathcal\{T\}\_\{\\mathrm\{input\}\}=\\mathrm\{Cllps\}\(\\mathcal\{T\},\\mathrm\{Prnt\}\(\\Gamma\_\{t\}\)\),S\_\{t\},A\_\{t\},S\_\{t\+1\},\\eta\_\{t\}*\);

20

21

t←t\+1t\\leftarrow t\+1;

22

23if*tmodtree\_update\_frequency=0t\\bmod\\text\{tree\\\_update\\\_frequency\}=0*then

24for*γ∈𝒱leaf​\(𝒯\)\\gamma\\in\\mathcal\{V\}\_\{\\mathrm\{leaf\}\}\(\\mathcal\{T\}\)*do

25if*CheckExpand\(*𝒯,Q¯est,γ\\mathcal\{T\},\\bar\{Q\}^\{\\mathrm\{est\}\},\\gamma*\)*then

26append

γ\\gammatoexpand\_node\_list;

27

28if*CheckCollapse\(*𝒯,Q¯est,Prnt​\(γ\)\\mathcal\{T\},\\bar\{Q\}^\{\\mathrm\{est\}\},\\mathrm\{Prnt\}\(\\gamma\)*\)*then

29append

Prnt​\(γ\)\\mathrm\{Prnt\}\(\\gamma\)tocollapse\_node\_list;

30

31

32ExpandNode\(*𝒯,*expand\_node\_list*,Q¯,V¯,Q¯est\\mathcal\{T\},\\texttt\{expand\\\_node\\\_list\},\\bar\{Q\},\\bar\{V\},\\bar\{Q\}^\{\\mathrm\{est\}\}*\);

33CollapseNode\(*𝒯,*collapse\_node\_list*,Q¯,V¯,Q¯est\\mathcal\{T\},\\texttt\{collapse\\\_node\\\_list\},\\bar\{Q\},\\bar\{V\},\\bar\{Q\}^\{\\mathrm\{est\}\}*\);

34

35

36Return:Tree

𝒯\\mathcal\{T\}, Policy

ΨQ¯\\Psi\_\{\\bar\{Q\}\}
37

38

39Function*UpdateQest\(*𝒯input,St,At,St\+1,ηt\\mathcal\{T\}\_\{\\mathrm\{input\}\},S\_\{t\},A\_\{t\},S\_\{t\+1\},\\eta\_\{t\}*\)*:

40

Γt=ϕ𝒯input​\(St\)\\Gamma\_\{t\}=\\phi\_\{\\mathcal\{T\}\_\{\\mathrm\{input\}\}\}\(S\_\{t\}\),

Γt\+1=ϕ𝒯input​\(St\+1\)\\Gamma\_\{t\+1\}=\\phi\_\{\\mathcal\{T\}\_\{\\mathrm\{input\}\}\}\(S\_\{t\+1\}\);

41Update Q\-estimate according to \([9](https://arxiv.org/html/2606.17377#S4.E9)\);

42

43Function*CheckExpand\(*𝒯input,St,At,St\+1,ηt\\mathcal\{T\}\_\{\\mathrm\{input\}\},S\_\{t\},A\_\{t\},S\_\{t\+1\},\\eta\_\{t\}*\)*:

44

Γt=ϕ𝒯input​\(St\)\\Gamma\_\{t\}=\\phi\_\{\\mathcal\{T\}\_\{\\mathrm\{input\}\}\}\(S\_\{t\}\);

45Evaluate

Δ^​ϵ¯𝒯input,Γt\\hat\{\\Delta\}\\bar\{\\epsilon\}\_\{\\mathcal\{T\}\_\{\\mathrm\{input\}\},\\Gamma\_\{t\}\}according to \([13](https://arxiv.org/html/2606.17377#S4.E13)\);

46Return:Trueif

Δ^​ϵ¯𝒯input,γ≥3​\(1−β\)​λ\\hat\{\\Delta\}\\bar\{\\epsilon\}\_\{\\mathcal\{T\}\_\{\\mathrm\{input\}\},\\gamma\}\\geq 3\(1\-\\beta\)\\lambda,Falseotherwise\.

47Function*CheckCollapse\(*𝒯input,St,At,St\+1,ηt\\mathcal\{T\}\_\{\\mathrm\{input\}\},S\_\{t\},A\_\{t\},S\_\{t\+1\},\\eta\_\{t\}*\)*:

48

Γt=ϕ𝒯input​\(St\)\\Gamma\_\{t\}=\\phi\_\{\\mathcal\{T\}\_\{\\mathrm\{input\}\}\}\(S\_\{t\}\),

Γt′=Prnt​\(Γt\)\\Gamma^\{\\prime\}\_\{t\}=\\mathrm\{Prnt\}\(\\Gamma\_\{t\}\),

𝒯′=Cllps​\(𝒯input,Γt′\)\\mathcal\{T\}^\{\\prime\}=\\mathrm\{Cllps\}\(\\mathcal\{T\}\_\{\\mathrm\{input\}\},\\Gamma^\{\\prime\}\_\{t\}\);

49Evaluate

Δ^​ϵ¯𝒯′,Γt′\\hat\{\\Delta\}\\bar\{\\epsilon\}\_\{\\mathcal\{T\}^\{\\prime\},\\Gamma^\{\\prime\}\_\{t\}\}according to \([13](https://arxiv.org/html/2606.17377#S4.E13)\);

50Return:Trueif

Δ^​ϵ¯𝒯′,Γt′<3​\(1−β\)​λ\\hat\{\\Delta\}\\bar\{\\epsilon\}\_\{\\mathcal\{T\}^\{\\prime\},\\Gamma^\{\\prime\}\_\{t\}\}<3\(1\-\\beta\)\\lambda,Falseotherwise\.

Algorithm 3Tree\-based Abstraction Learning

Similar Articles

Towards Scalable Multi-Task Reinforcement Learning with Large Decision Models

arXiv cs.LG

This paper introduces LDM-v0, a large decision model trained offline on trajectories from thousands of diverse reinforcement learning environments, demonstrating that a single transformer policy can match the performance of task-specific policies across robotics, autonomous driving, inventory management, cybersecurity, trading, and video games.

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

arXiv cs.CL

This paper proposes the LLM-as-Environment-Engineer framework, where a policy model analyzes failures to automatically redesign the training environment for reinforcement learning, and introduces MAPF-FrozenLake as a controllable testbed. The framework, using Qwen3-4B, outperforms larger models like GPT and Gemini, showing that policy learning improves the model's ability to diagnose weaknesses.

Finding the Time to Think: Learning Planning Budgets in Real-Time RL

arXiv cs.LG

This paper introduces variable-delay real-time RL, where agents decide how long to deliberate in environments that progress during decision-making, and proposes a lightweight gating policy to select state-dependent planning budgets, outperforming fixed-budget and heuristic baselines in several real-time games.