Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL

arXiv cs.AI Papers

Summary

This paper introduces CARL, a method for offline hierarchical reinforcement learning that exploits local dynamics regularity to learn reusable skills. The approach clusters state-goal pairs requiring similar action sequences, enabling more effective skill reuse and improved performance on complex humanoid tasks.

arXiv:2605.26371v1 Announce Type: new Abstract: Hierarchical Reinforcement Learning (HRL) promises to solve long-horizon Reinforcement Learning (RL) tasks more efficiently than non-hierarchical counterparts by discovering and reusing temporally-extended skills. However, obtaining skills that are actually reusable remains an open challenge. Towards this end, we focus on abstractions that exploit the intuition of local dynamics: local transitions in different global contexts require similar kinds of action sequences. By aligning these contexts with the action sequences they require, we are able to learn which skills to reuse and where to reuse them. In principle, this information should benefit many HRL algorithms, where high-level policies have to reason about the low-level skills they use. The resulting algorithm CARL (Contrastive Action-based Representations for Reusable Local Control) shows both qualitative clustering of meaningful skills in complex humanoid environments and improved downstream performance on the OGBench benchmark when integrated with HIQL.

Cached at: 05/27/26, 09:04 AM

# Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL
Source: [https://arxiv.org/html/2605.26371](https://arxiv.org/html/2605.26371)
Sarthak Dayal (Department of Computer Science, University of Texas at Austin, sarthak@utexas.edu), Abhinav Peri (Department of Computer Science, University of Texas at Austin, app2452@utexas.edu), Carl Qi (UT Austin), Claas Voelcker (UT Austin), Alexander Levine (OpenAI), Caleb Chuck (UT Austin), Amy Zhang (UT Austin)

###### Abstract

Hierarchical Reinforcement Learning (HRL) promises to solve long-horizon Reinforcement Learning (RL) tasks more efficiently than non-hierarchical counterparts by discovering and reusing temporally-extended skills. However, obtaining skills that are actually reusable remains an open challenge. Towards this end, we focus on abstractions that exploit the intuition of local dynamics: local transitions in different global contexts require similar kinds of action sequences. By aligning these contexts with the action sequences they require, we are able to learn which skills to reuse and where to reuse them. In principle, this information should benefit many HRL algorithms, where high-level policies have to reason about the low-level skills they use. The resulting algorithm CARL (Contrastive Action-based Representations for Reusable Local Control) shows both qualitative clustering of meaningful skills in complex humanoid environments and improved downstream performance on the OGBench benchmark when integrated with HIQL. We visualize additional results and video rollouts on our accompanying website ([https://sites.google.com/view/behavior-rep/home](https://sites.google.com/view/behavior-rep/home)).

## 1 Introduction

Hierarchical Reinforcement Learning (HRL) methods offer a framework (Sutton et al., [1999](https://arxiv.org/html/2605.26371#bib.bib6)) for abstracting low-level policies or action sequences, commonly referred to as *skills*, so that they can be reused to accomplish complex tasks. However, this promise of reusable skills has often been outweighed by the limitations of existing approaches (Zhang et al., [2020](https://arxiv.org/html/2605.26371#bib.bib51)), which include training instability when co-learning skills with the policies that use those skills (Levy et al., [2019](https://arxiv.org/html/2605.26371#bib.bib9); Wang et al., [2025](https://arxiv.org/html/2605.26371#bib.bib52)), and failure to leverage reusability. One promising direction involves representing low-level skills as goal-conditioned policies (Park et al., [2023](https://arxiv.org/html/2605.26371#bib.bib5); Hafner et al., [2022](https://arxiv.org/html/2605.26371#bib.bib50); Eysenbach et al., [2019](https://arxiv.org/html/2605.26371#bib.bib54)), which decouples low- and high-level policy training. However, representing skills via a goal-conditioned low-level policy loses many of the desirable properties of skills, such as temporal abstraction and consistent reusability. In this work, we formalize the idea of local dynamics, the idea that many local regions of the state space share similar dynamics, to enable reuse of low-level goal-conditioned skills across the state space.

![Refer to caption](https://arxiv.org/html/2605.26371v1/x1.png)

Figure 1: Learning state–goal representations for skill reuse. Humanoid poses left-to-right in each pane reflect progress in time. CARL learns a representation $\phi(s,g)$ that clusters state–goal pairs if they admit the same kinds of $k$-step action sequences. CARL clusters "walking backward" and "standing up" behaviors for humanoids situated in globally different parts of the maze.

To formalize the idea of local dynamics, we define local, finite-horizon MDPs for each state to capture its surrounding states and their dynamics. We then draw on bisimulation as a principled way to define state equivalences based on these MDPs. To derive a practical algorithm from this principle, we leverage the idea of *behavioral similarity*: two state-goal pairs $(s_1,g_1)$ and $(s_2,g_2)$ are considered similar when a policy can use the same skill $(a_1,a_2,a_3,\dots,a_k)$ to achieve the transition from $s_1$ to $g_1$ and from $s_2$ to $g_2$. Our hypothesis is that behavioral similarity will naturally emerge when many states share local dynamics structure, making it a useful heuristic for understanding when skills can be reused. Our method is built upon these insights and leverages a contrastive learning objective to align state-goal pairs with the action sequences that achieve them.

We introduce CARL (Contrastive Action-based Representations for Reusable Local Control) to learn representations of local dynamics from offline datasets at a fixed skill horizon. By recognizing the underlying local dynamics structure, CARL identifies where skills can be reused instead of forcing HRL methods to relearn low-level policies from scratch. We demonstrate this by integrating these abstractions into existing HRL algorithms, HIQL and HGCBC (Park et al., [2023](https://arxiv.org/html/2605.26371#bib.bib5)), which yields clear performance benefits on the OGBench benchmark. Furthermore, our objective shapes the latent geometry so that state–goal pairs requiring similar local skills are embedded nearby, even when they occur in different regions of the environment. Figure [1](https://arxiv.org/html/2605.26371#S1.F1) illustrates this in a humanoid maze environment, where CARL groups skills together regardless of where they occur in the maze.

As noted, CARL relies on an offline dataset and fixed horizon to capture local dynamics, which may constrain expressible skills when data coverage is poor or the horizon fails to capture useful structure\. We include ablations analyzing the effects of coverage, imbalance, and horizon length\. Overall, our work shows that local dynamics structure provides a principled basis for skill extraction and reuse, helping HRL reconnect with the temporal abstractions central to long\-horizon decision\-making\.

## 2 Related Work

### 2.1 Hierarchical Reinforcement Learning

Hierarchical Reinforcement Learning studies how to solve long-horizon Reinforcement Learning (RL) tasks by introducing *temporal abstraction*, which refers to high-level decisions that invoke temporally extended behaviors. One of the first formulations of this idea is the options framework, which models a skill as a policy with an initiation set and termination condition, enabling re-use of temporally extended actions within an SMDP view of control (Sutton et al., [1999](https://arxiv.org/html/2605.26371#bib.bib6)). Early HRL work explored a variety of hierarchical decompositions and skill abstractions, including feudal-style manager–worker architectures (Dayan and Hinton, [1992](https://arxiv.org/html/2605.26371#bib.bib53)), value-function decomposition methods such as MAXQ (Dietterich, [2000](https://arxiv.org/html/2605.26371#bib.bib7)), and policy-constraining formalisms such as hierarchies of abstract machines (HAMs) (Parr and Russell, [1997](https://arxiv.org/html/2605.26371#bib.bib56)).

A core challenge in learning hierarchies *end-to-end* is non-stationarity. As the low-level policy changes, the effective dynamics faced by the high-level policy shift, often destabilizing joint optimization. Many modern approaches therefore decouple the training of the hierarchy by using subgoals (goals in state space) to train a local goal-conditioned policy while learning a high-level policy that proposes useful targets (Nachum et al., [2018](https://arxiv.org/html/2605.26371#bib.bib57); Levy et al., [2019](https://arxiv.org/html/2605.26371#bib.bib9)). Recent offline and model-based hierarchical methods further improve this paradigm by extracting effective policies from offline data or learned dynamics models (Hafner et al., [2022](https://arxiv.org/html/2605.26371#bib.bib50); Park et al., [2023](https://arxiv.org/html/2605.26371#bib.bib5)). Although this paradigm substantially improves the training stability of HRL methods, reasoning directly in the global state–goal space makes it difficult for the low-level policy to recognize when the same behavior can be reused. To overcome this limitation, we focus on learning representations of state–goal pairs that highlight the short-horizon action structure and thus enable appropriate reuse of behaviors.

### 2.2 Abstractions via Behavioral Equivalences and Invariances

In this section, we discuss methods that seek to generalize by learning behavior-preserving abstractions, where distinct states (or state-goal contexts) are mapped to similar representations when they are interchangeable (Agarwal et al., [2021](https://arxiv.org/html/2605.26371#bib.bib13); Hansen-Estruch et al., [2022](https://arxiv.org/html/2605.26371#bib.bib41); Islam et al., [2023](https://arxiv.org/html/2605.26371#bib.bib46); Ajay et al., [2021](https://arxiv.org/html/2605.26371#bib.bib22); Park et al., [2025b](https://arxiv.org/html/2605.26371#bib.bib27)). A common theme is to define an equivalence relation that preserves global decision-relevant structure, for example via value-, policy-, or model-based notions of similarity, and then learn representations that collapse equivalent situations while discarding irrelevant variation. This perspective is closely related to bisimulation-based ideas, which characterize when two states can be treated as equivalent without materially changing long-horizon outcomes (Zhang et al., [2021](https://arxiv.org/html/2605.26371#bib.bib40); Rudolph et al., [2024](https://arxiv.org/html/2605.26371#bib.bib48); Hansen-Estruch et al., [2022](https://arxiv.org/html/2605.26371#bib.bib41); Castro, [2020](https://arxiv.org/html/2605.26371#bib.bib43)).

Our work fits into this line of research, but fundamentally targets a complementary notion of equivalence grounded in*local reuse*\. Rather than requiring states to match in long\-horizon value or reward structure, we ask when different state\-goal pairs admit similar local goal\-reaching behaviors, which allows us to cluster exactly the state\-goal pairs which are the same from the perspective of short\-horizon control, regardless of the long\-horizon effects\.

### 2.3 Contrastive Representations for Control

Contrastive Representation Learning provides a general mechanism for extracting structure in embeddings that score positive pairs higher than selected negative pairs, typically trained via objectives such as InfoNCE or NCE (van den Oord et al., [2019](https://arxiv.org/html/2605.26371#bib.bib38); Radford et al., [2021](https://arxiv.org/html/2605.26371#bib.bib42)). In RL, contrastive methods have been applied to learn task-relevant structure and enable downstream generalization (Eysenbach et al., [2022](https://arxiv.org/html/2605.26371#bib.bib36); Agarwal et al., [2021](https://arxiv.org/html/2605.26371#bib.bib13); Laskin et al., [2020](https://arxiv.org/html/2605.26371#bib.bib37)). In particular, Contrastive RL (Eysenbach et al., [2022](https://arxiv.org/html/2605.26371#bib.bib36)) connects contrastive objectives to goal-conditioned RL by learning representations in which reaching a goal corresponds to matching future states under a learned similarity metric. Our method is inspired by this viewpoint, but captures a different structure: instead of learning representations that reflect goal reachability, we capture when similar action sequences are admitted by the state-goal transitions that the low-level policy aims to achieve. This greatly changes the contrastive structure compared to previous approaches that instead reason about the long-horizon effects of actions.

## 3 Preliminaries

### 3.1 Offline Goal-Conditioned RL

In this work, we investigate behavioral equivalences in the context of offline goal-conditioned RL (OGCRL), which is defined by the Markov decision process $\mathcal{M} \coloneqq (\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p: \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S})$ is the transition function, $\Delta(\mathcal{S})$ is the set of distributions over states, $r: \mathcal{S} \times \mathcal{G} \rightarrow \mathbb{R}$ is a reward function, and $\gamma \in (0,1)$ is a discount factor. We consider the case where $\mathcal{G} \subseteq \mathcal{S}$. The objective is to learn a goal-conditioned policy $\pi: \mathcal{S} \times \mathcal{G} \rightarrow \Delta(\mathcal{A})$ such that, for time horizon $H$, the expected discounted reward is maximized: $\max_{\pi} J(\pi) = \max_{\pi} \mathbb{E}_{g \sim \rho(g),\, \tau \sim \rho^{\pi}(\tau)}\left[\sum_{t=0}^{H} \gamma^{t} r(s_t, g)\right]$. We define a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots, s_H)$ to be a sequence of states and actions with a fixed horizon $H$. Trajectories are sampled according to $\rho^{\pi}(\tau) = \mu(s_0) \prod_{t=0}^{H-1} \pi(a_t \mid s_t, g)\, p(s_{t+1} \mid s_t, a_t)$, where $\mu$ is the initial distribution over states, and $\rho(g)$ is the goal distribution.
We write $p(s' \mid s, \mathbf{a_k}) = \sum_{s_1, \ldots, s_{k-1} \in \mathcal{S}} \prod_{i=0}^{k-1} p(s_{i+1} \mid s_i, a_i)$, with $s_0 = s$ and $s_k = s'$, to mean the $k$-step transition function, where $\mathbf{a_k}$ denotes a $k$-step action sequence starting at state $s$, and $a_i$ denote individual actions in that sequence. We aim to use upper case letters for random variables, script upper case for spaces, and lower case for values and functions.
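As a minimal sketch of the $k$-step transition function, the sum over intermediate states can be computed by pushing a distribution over states forward one action at a time. The tabular representation (`p` as a dictionary), the helper `make_chain`, and the three-state chain are illustrative assumptions and do not come from the paper.

```python
def k_step_transition(p, s, actions):
    """Return {s': p(s' | s, actions)} for a tabular MDP.

    p maps (state, action) to a dict {next_state: probability}.
    Propagating the state distribution action by action is equivalent to
    summing the product of one-step probabilities over all intermediate states.
    """
    dist = {s: 1.0}
    for a in actions:
        nxt = {}
        for state, prob in dist.items():
            for s_next, q in p[(state, a)].items():
                nxt[s_next] = nxt.get(s_next, 0.0) + prob * q
        dist = nxt
    return dist


def make_chain(n=3, success=0.9):
    """Hypothetical chain: 'right' advances with probability `success`, else stays."""
    p = {}
    for s in range(n):
        if s == n - 1:
            p[(s, "right")] = {s: 1.0}  # the last state is absorbing
        else:
            p[(s, "right")] = {s + 1: success, s: 1.0 - success}
    return p


chain = make_chain()
two_step = k_step_transition(chain, 0, ["right", "right"])
# two_step is approximately {2: 0.81, 1: 0.18, 0: 0.01}
```

In continuous, high-dimensional domains this enumeration is not available, which is part of why the method later avoids modeling the dynamics directly.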

### 3.2 Local Dynamics

Here we formalize our intuition of local dynamics as a state equivalence relation. Specifically, we use the transition dynamics within a fixed horizon $k$ to characterize a state’s local dynamics.

###### Definition 3.1 ($k$-ball).

For any $k \in \mathbb{N}$ and $s \in \mathcal{S}$, the *$k$-ball* of $s$, denoted $\mathcal{B}_k(s)$, is the set of states reachable from $s$ within $k$ steps. Formally, $s' \in \mathcal{B}_k(s)$ if there exist $t \in \{0, 1, \dots, k\}$ and an action sequence $\mathbf{a_t}$ such that $p(s' \mid s, \mathbf{a_t}) > 0$.
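For a small tabular MDP, the $k$-ball can be computed with a breadth-first expansion over positive-probability transitions. The sketch below assumes the same dictionary layout as a lookup table `p[(state, action)]`; the corridor environment and the helper names are hypothetical.

```python
def k_ball(p, actions, s, k):
    """Return the set of states reachable from s within k steps.

    p maps (state, action) to a dict {next_state: probability}. A state
    belongs to the ball if some action sequence of length t <= k reaches it
    with positive probability; t = 0 contributes s itself.
    """
    ball = {s}
    frontier = {s}
    for _ in range(k):
        nxt = set()
        for state in frontier:
            for a in actions:
                for s_next, prob in p[(state, a)].items():
                    if prob > 0 and s_next not in ball:
                        nxt.add(s_next)
        ball |= nxt
        frontier = nxt
    return ball


def make_corridor(n):
    """Hypothetical deterministic corridor with cells 0..n-1 and walls at both ends."""
    p = {}
    for s in range(n):
        p[(s, "left")] = {max(s - 1, 0): 1.0}
        p[(s, "right")] = {min(s + 1, n - 1): 1.0}
    return p


corridor = make_corridor(7)
ball_center = k_ball(corridor, ["left", "right"], 3, 2)  # {1, 2, 3, 4, 5}
ball_wall = k_ball(corridor, ["left", "right"], 0, 2)    # {0, 1, 2}
```

The two balls differ in size because the wall at cell 0 blocks movement to the left, so a state next to a wall and a state in open space have different local structure.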

###### Definition 3.2 ($k$-ball MDP).

The *$k$-ball MDP* rooted at $s$ is the finite-horizon MDP

$$\mathcal{M}_s^{(k)} = \bigl(\mathcal{B}_k(s),\ \mathcal{A},\ p,\ r,\ k\bigr),$$

whose transition dynamics $p$, action space $\mathcal{A}$, and reward $r$ are inherited from $\mathcal{M}$.

We now use the $k$-ball MDP to formalize a notion of state similarity grounded in local dynamics. Prior work adapting bisimulation to RL (Ferns et al., [2004](https://arxiv.org/html/2605.26371#bib.bib65); Zhang et al., [2021](https://arxiv.org/html/2605.26371#bib.bib40); Rudolph et al., [2024](https://arxiv.org/html/2605.26371#bib.bib48)) measures similarity between MDPs through the sequences of rewards or single-step controllability metrics under identical action sequences. Our formalism instead considers dynamics directly, capturing whether two MDPs are interchangeable for goal-reaching behavior.

###### Definition 3.3 (Dynamics Bisimilarity).

Let $\mathcal{M}_1 = (\mathcal{S}_1, \mathcal{A}, p_1, r_1)$ and $\mathcal{M}_2 = (\mathcal{S}_2, \mathcal{A}, p_2, r_2)$ be two MDPs sharing an action space $\mathcal{A}$. We say that $\mathcal{M}_1$ and $\mathcal{M}_2$ are *dynamics-bisimilar* if there exists a total relation $B \subseteq \mathcal{S}_1 \times \mathcal{S}_2$ such that every $(x, x') \in B$ and every $a \in \mathcal{A}$ satisfy

$$\begin{aligned}
&\forall\, y \in \mathcal{S}_1,\ \exists\, y' \in \mathcal{S}_2 \text{ (and vice versa)}: \quad p_1(y \mid x, a) = p_2(y' \mid x', a), \\
&\forall\, C \in \mathcal{S}_1 / B: \quad \bar{p_1}(C \mid x, a) = \bar{p_2}(B^{-1}(C) \mid x', a),
\end{aligned}$$

where $\mathcal{S}_1 / B$ is the partition of $\mathcal{S}_1$ induced by $B$ (the set of equivalence classes $C$ of $B$-related states in $\mathcal{S}_1$), $B^{-1}(C) = \{y' \in \mathcal{S}_2 : \exists\, y \in C,\ (y, y') \in B\}$ is the corresponding subset of $\mathcal{S}_2$, and $\bar{p}(C \mid s, a) = \sum_{s' \in C} p(s' \mid s, a)$ extends $p$ to subsets.

We say that two states $s$ and $s'$ share local dynamics if their corresponding $k$-ball MDPs, $\mathcal{M}_s^{(k)}$ and $\mathcal{M}_{s'}^{(k)}$, are *dynamics-bisimilar*.

## 4 Method

### 4.1 Learning Representation for Skill Reuse

Although our state equivalence relates local dynamics, directly estimating this quantity in a practical algorithm for skill reuse is challenging because it would require constructing the $k$-ball MDP around each state and comparing their transition structure under all relevant actions. In continuous, high-dimensional domains, the $k$-ball MDP is especially hard to obtain and enumerate for comparison. In the offline setting, this problem is compounded by the fact that we only observe a finite set of trajectories, which may sparsely cover the relevant local transitions. As a result, we seek an approximation that preserves the key ideas from our formalism while remaining simple to implement.

Our core insight is that offline datasets with near-expert behavior can reveal local dynamics structure without requiring us to directly model dynamics-bisimulation relations as described in Definition [3.3](https://arxiv.org/html/2605.26371#S3.Thmtheorem3). Intuitively, when two states have dynamics-bisimilar $k$-ball MDPs, the same $k$-step action sequence accomplishes equivalent transitions from both. A near-expert data-collecting policy will exploit this, leaving repeated action-sequence structure as a footprint in the dataset. We call this footprint *behavioral similarity*: two state-goal pairs $(s_1, g_1)$ and $(s_2, g_2)$ are behaviorally similar if they are solved by similar $k$-step action sequences in the dataset. This yields a much simpler heuristic that lets us identify skill-reuse opportunities directly from the dataset.
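To make behavioral similarity concrete, the sketch below groups state-goal pairs from a toy dataset by the action sequence that connects them. It is a deliberate simplification: it uses exact matching of discrete action sequences, whereas the paper speaks of *similar* sequences and learns the grouping contrastively. The rooms, offsets, and helper names are hypothetical.

```python
from collections import defaultdict

# Deterministic grid moves; walls are omitted for brevity.
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def rollout(state, actions):
    """Apply a sequence of grid moves to an (x, y) state."""
    x, y = state
    for a in actions:
        dx, dy = MOVES[a]
        x, y = x + dx, y + dy
    return (x, y)


def group_by_action_sequence(segments):
    """Group (state, goal) pairs that are connected by the same action sequence."""
    groups = defaultdict(list)
    for s, a_k, g in segments:
        groups[tuple(a_k)].append((s, g))
    return dict(groups)


# Hypothetical dataset: three identical rooms placed 10 cells apart along x.
room_offsets = [0, 10, 20]
segments = []
for off in room_offsets:
    for a_k in (["R", "R", "U"], ["L", "D", "D"]):
        s = (off + 3, 3)
        segments.append((s, a_k, rollout(s, a_k)))

groups = group_by_action_sequence(segments)
# groups[("R", "R", "U")] holds one (state, goal) pair per room, even though
# the global coordinates of those pairs differ.
```

Pairs in the same group are far apart in global coordinates but identical from the perspective of short-horizon control, which is the structure the learned representation is meant to expose.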

This motivates our method CARL (Contrastive Action-based Representations for Reusable Local Control), which learns to relate skills with the global contexts in which they are used. To do this, we sample batches of size $B$ from a diverse offline dataset to learn representations $\phi(s, g_k)$ and $\psi(\mathbf{a_k})$ contrastively such that the loss below is minimized.

$$\mathcal{L}_{\mathrm{InfoNCE}}\big(\{(s^i, g_k^i, \mathbf{a_k}^i)\}_{i=1}^{B};\, \phi, \psi\big) = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\left(\langle \phi(s^i, g_k^i), \psi(\mathbf{a_k}^i) \rangle / \tau\right)}{\sum_{j=1}^{B} \exp\left(\langle \phi(s^i, g_k^i), \psi(\mathbf{a_k}^j) \rangle / \tau\right)}. \tag{1}$$

Algorithm [1](https://arxiv.org/html/2605.26371#algorithm1) summarizes the training procedure, which alternates between training the encoders via contrastive loss and training a goal-conditioned policy in the learned representation space.
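The InfoNCE loss in Eq. (1) can be sketched in a few lines once the embeddings are given. The version below is a plain-Python illustration that takes precomputed embedding vectors as lists; the function name, the list-based layout, and the toy embeddings are assumptions, and a practical implementation would use an array library and learned encoders.

```python
import math


def info_nce(phi, psi, temperature=0.1):
    """InfoNCE loss over a batch of paired embeddings.

    phi[i] is the embedding of the i-th state-goal pair and psi[i] the
    embedding of the action sequence that achieved it. Row i treats psi[i]
    as the positive and every other psi[j] in the batch as a negative.
    """
    batch = len(phi)
    total = 0.0
    for i in range(batch):
        logits = [
            sum(x * y for x, y in zip(phi[i], psi[j])) / temperature
            for j in range(batch)
        ]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += logits[i] - log_denominator
    return -total / batch


# Toy embeddings: matched pairs point in the same direction.
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = info_nce(aligned, aligned, temperature=1.0)
loss_swapped = info_nce(aligned, aligned[::-1], temperature=1.0)
```

The loss is lower when each state-goal embedding is closest to its own action-sequence embedding (`loss_aligned`) than when the pairing is swapped (`loss_swapped`).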

1. Input: offline dataset $\mathcal{D}$, horizon $k$, temperature $\tau$
2. Initialize: encoders $\phi(s, g)$ and $\psi(\mathbf{a_k})$
3. Initialize: goal-conditioned learner $\pi(\cdot \mid \phi(s, g))$
4. while *not converged* do
   1. Sample batch $b \sim \mathcal{D}$ of $k$-step segments $(s_t, \mathbf{a_k}, s_{t+k})$
   2. Train encoders: $\mathcal{L}_{\mathrm{InfoNCE}}(b; \phi, \psi)$ (Eq. ([1](https://arxiv.org/html/2605.26371#S4.E1)))
   3. Train policy: $\mathbb{E}_b[J(\pi \mid \phi)]$

Algorithm 1: CARL

1. Input: dataset $\mathcal{D}$, horizon $k$, temperature $\tau$
2. Initialize: value $V_{\theta_V}(s, \phi(s, g))$, policies $\pi^{h}_{\theta_h}(\cdot \mid s, g)$, $\pi^{\ell}_{\theta_\ell}(\cdot \mid s, \phi(s, s'))$
3. Initialize: encoders $\phi_{\theta_\phi}(s, s')$, $\psi_{\theta_\psi}(a_\tau)$
4. while *not converged* do
   1. Sample minibatch $(s_t, a_{t:t+k-1}, s_{t+k}, g) \sim \mathcal{D}$, indices $j \sim \mathrm{Unif}[k]$
   2. Train encoders: $\mathcal{L}_{\mathrm{InfoNCE}}(\{s_t^i, s_{t+j}^i, a_{t:t+k-1}^i\}_i; \phi_{\theta_\phi}, \psi_{\theta_\psi})$
   3. Update value: $V_{\theta_V}(s_t, \phi_{\theta_\phi}(s_t, g))$
   4. $z_t^\star \leftarrow \phi_{\theta_\phi}(s_t, s_{t+k})$ (embed subgoal)
   5. Train high-level: fit $\pi^{h}_{\theta_h}(\textcolor{blue}{z_t^\star} \mid s_t, g)$ (AWR)
   6. Train low-level: fit $\pi^{\ell}_{\theta_\ell}(a_t \mid s_t, \textcolor{blue}{z_t^\star})$ (AWR)

Algorithm 2: HIQL + CARL (changes in blue)
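The sampling step shared by both algorithms can be sketched as follows: draw a start index $t$, take the $k$ actions that follow, and return the end state $s_{t+k}$ together with an intermediate state $s_{t+j}$. This sketch assumes that $\mathrm{Unif}[k]$ denotes the uniform distribution over $\{1, \dots, k\}$; the function name and the toy trajectory are hypothetical.

```python
import random


def sample_segment(states, actions, k, rng):
    """Sample (s_t, a_{t:t+k-1}, s_{t+j}, s_{t+k}) from one trajectory.

    states has one more entry than actions; j is drawn uniformly from
    {1, ..., k} (an assumed reading of Unif[k]).
    """
    if len(states) != len(actions) + 1:
        raise ValueError("expected len(states) == len(actions) + 1")
    if len(actions) < k:
        raise ValueError("trajectory is shorter than the skill horizon k")
    t = rng.randrange(len(actions) - k + 1)  # start index so that t + k is valid
    j = rng.randint(1, k)                    # intermediate goal offset, inclusive
    return states[t], tuple(actions[t:t + k]), states[t + j], states[t + k]


# Hypothetical trajectory: integer states 0..10, action i leads from state i to i + 1.
rng = random.Random(0)
states = list(range(11))
actions = [f"a{i}" for i in range(10)]
s_t, a_seq, s_mid, s_end = sample_segment(states, actions, k=3, rng=rng)
```

In this reading, the encoder update pairs $(s_t, s_{t+j})$ with the full action sequence, while the subgoal embedding $z_t^\star$ uses $(s_t, s_{t+k})$.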

### 4.2 Integration with Hierarchical Offline RL

Aligning state-goal pairs with action sequences that achieve the transition between them induces a behaviorally structured representation: state–goal pairs that can be solved by similar $k$-step action sequences are embedded nearby, regardless of where they occur in the state space. This representation structure is well suited as an input to the low-level policy in subgoal-based HRL algorithms, where a policy can reuse the same skill in different contexts rather than relearning it separately in each region.

Following this intuition, we integrate CARL into HIQL (Park et al., [2023](https://arxiv.org/html/2605.26371#bib.bib5)), a recent and competitive hierarchical offline RL algorithm, by co-training its subgoal representation with CARL’s objective. This introduces CARL’s skill reuse benefits into the input space of the low-level policy and value function, and the output space of the high-level policy. These changes are highlighted in blue in Algorithm [2](https://arxiv.org/html/2605.26371#algorithm2). We provide further details on the specific sampling and training procedures that we used in Appendix [A](https://arxiv.org/html/2605.26371#A1), specific hyperparameter changes we made in Appendix [B](https://arxiv.org/html/2605.26371#A2), and present experiments analyzing these design choices in Appendix [C](https://arxiv.org/html/2605.26371#A3).

We also modify the HGCBC (Hierarchical Goal-Conditioned Behavior Cloning) algorithm used as a baseline in Park et al. ([2023](https://arxiv.org/html/2605.26371#bib.bib5)) to assess the generality of CARL’s benefits to HRL algorithms. Similarly to HIQL, we change the low-level policy inputs and high-level policy outputs to use CARL’s representation for the subgoal space. Section [5](https://arxiv.org/html/2605.26371#S5) provides strong empirical evidence for our generality claim: HIQL+CARL achieves win rates of 20/22 on state-based and 13/14 on image-based tasks against HIQL, and HGCBC+CARL achieves 21/21 on state-based tasks against HGCBC.

## 5 Experiments

Our experiments analyze both the embedding structure produced by CARL and its effect on downstream performance when integrated into an existing HRL method\. The following sections seek to address the questions below\.

1. Q1. Does CARL’s representation structure enable skills to be reused?
2. Q2. How does CARL benefit HRL methods in downstream tasks?
3. Q3. What are the important components for capturing behavioral similarity?
4. Q4. How does CARL organize different skills in its representation structure?

### 5.1 Skill Reuse in Toy Environments

(Q1) Skill Reuse. To test how CARL enables skill reuse, we design a toy environment consisting of five identical grid-world rooms (Figure [2](https://arxiv.org/html/2605.26371#S5.F2)). These rooms differ only in their global $(x, y)$ coordinates, which form the proprioceptive state for the agent.

We test zero-shot generalization in two settings. First, a goal-conditioned policy trained in a single room must solve the same task in held-out rooms. We compare HIQL to HIQL augmented with a CARL encoder trained on all rooms. Since CARL aliases state-goal pairs reachable by the same action sequences, the policy should transfer through the shared representation. Our experiments confirm this: HIQL+CARL solves all held-out rooms, while HIQL fails to generalize. Second, we scale to 20 identical grid-world rooms, training on 4 and testing on the remaining 16. HIQL+CARL generalizes to 12 unseen rooms versus 8 for HIQL alone, further demonstrating that explicitly modeling skill reusability improves transfer.

![Refer to caption](https://arxiv.org/html/2605.26371v1/x2.png)

Figure 2: Examples of environments used to benchmark CARL, including a diagnostic grid world as well as environments from the OGBench suite.

### 5.2 Evaluating Performance Benefits for HRL

(Q2) HRL Integration. Here, we examine how CARL benefits HRL by benchmarking on OGBench tasks. We analyze results for HIQL+CARL and HGCBC+CARL, as described in Section [4.2](https://arxiv.org/html/2605.26371#S4.SS2).

#### 5.2.1 Experimental setup

We evaluate on two task categories from OGBench\. The first is the locomotion suite, where various embodiments \(point\-mass, ant, humanoid\) have to navigate throughout a maze\. These tasks test learning high\-level planning and low\-level locomotion skills from offline data\. We use the navigate datasets for these tasks, collected with a noisy expert policy that wanders the maze by achieving randomly sampled goals\. This strategy yields broad coverage with repeating low\-level structure, providing opportunities for skill reuse and leveraging CARL’s core capability\.

The second category is the manipulation suite, where a 6\-DoF UR5e arm performs tasks like stacking cubes, arranging tabletop scenes, or solving puzzles, testing object manipulation and combinatorial generalization\. We test on the play datasets, collected by open loop planners with temporally correlated noise\. These planners rely on repeated behavior primitives, which we expect CARL to take advantage of\.

These environments feature continuous control and span a range of complexity, from the low-dimensional point maze to the high-dimensional humanoid, allowing us to assess how CARL scales with state and action space complexity. The robotics domains introduce additional challenges through multi-entity interactions. Together, these environments test CARL across a diverse range of control difficulties. We defer to the OGBench paper for additional environment and dataset specifications (Park et al., [2025a](https://arxiv.org/html/2605.26371#bib.bib4)).

#### 5.2.2 OGBench State-Based Performance

We examine the effect of CARL on HRL algorithms in Table [1](https://arxiv.org/html/2605.26371#S5.T1) by comparing HIQL and HGCBC with their augmented variants in downstream OGBench performance. Notably, HIQL+CARL improves on HIQL by at least 10 percentage points on giant maze variants, 17 on the smaller cube environments, 28 on the smaller puzzle environments, and 30 on the scene task. HGCBC+CARL tends to outperform HGCBC by around 10 percentage points on most navigation environments. This provides strong evidence that CARL’s representation benefits HRL in downstream tasks.

CARL’s gains are more limited on antmaze\-teleport, antsoccer\-medium, and the harder cube and puzzle tasks\. The underlying causes remain unclear, but we hypothesize that these environments pose distinct challenges: stochastic dynamics \(portals in antmaze\-teleport\), long horizons with entity\-centric generalization \(antsoccer\-medium, cube\), and combinatorial generalization over large configuration spaces \(puzzle\)\. Addressing these will likely require advances in the representation structure, such as representations that are invariant to entity permutations or that explicitly support compositional reasoning at the level of the high\-level policy\.

| Task | (a) HIQL+CARL | (a) HIQL | (a) HGCBC+CARL | (a) HGCBC | (b) CARL | (b) Single-Action CARL | (b) Multi-Action IDM | (b) Single-Action IDM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pointmaze-medium | 84.4 ± 3.1 | 79 ± 4.3 | 13.6 ± 1.8 | 0.0 ± 0.0 | 84.4 ± 3.1 | 86.6 ± 3.8 | 67.6 ± 4.0 | 74.1 ± 5.4 |
| pointmaze-large | 75.4 ± 3.1 | 58 ± 4.3 | 19.5 ± 3.8 | 0.4 ± 0.3 | 75.4 ± 3.1 | 74.7 ± 8.0 | 58.1 ± 9.5 | 62.6 ± 7.1 |
| pointmaze-giant | 64.5 ± 15.8 | 46 ± 7.6 | 0.0 ± 0.0 | 0.0 ± 0.0 | 64.5 ± 15.8 | 53.6 ± 13.2 | 22.7 ± 7.3 | 18.2 ± 9.2 |
| pointmaze-teleport | 31.4 ± 14.0 | 18 ± 3.3 | 29.7 ± 1.8 | 16.7 ± 1.2 | 31.4 ± 14.0 | 25.0 ± 8.0 | 18.0 ± 4.7 | 19.4 ± 6.4 |
| antmaze-medium | 97.9 ± 0.9 | 96 ± 0.9 | 73.1 ± 1.5 | 59.8 ± 1.6 | 97.9 ± 0.9 | 95.5 ± 2.4 | 96.9 ± 1.7 | 96.1 ± 1.2 |
| antmaze-large | 91.9 ± 2.4 | 91 ± 1.7 | 64.9 ± 0.9 | 59.3 ± 1.9 | 91.9 ± 2.4 | 86.4 ± 6.1 | 91.5 ± 3.5 | 91.5 ± 1.2 |
| antmaze-giant | 75.2 ± 3.8 | 65 ± 4.3 | 18.4 ± 1.3 | 8.5 ± 0.8 | 75.2 ± 3.8 | 47.1 ± 7.3 | 70.0 ± 4.3 | 72.3 ± 3.5 |
| antmaze-teleport | 41.4 ± 2.1 | 42 ± 2.6 | 38.5 ± 1.3 | 36.9 ± 1.9 | 41.4 ± 2.1 | 42.5 ± 4.5 | 46.6 ± 2.4 | 47.0 ± 4.5 |
| humanoidmaze-medium | 90.5 ± 2.4 | 89 ± 1.7 | 44.4 ± 1.3 | 32.4 ± 0.9 | 90.5 ± 2.4 | 88.9 ± 4.5 | 86.5 ± 3.1 | 87.1 ± 2.8 |
| humanoidmaze-large | 58.3 ± 3.8 | 49 ± 3.3 | 31.0 ± 0.8 | 22.6 ± 0.8 | 58.3 ± 3.8 | 51.1 ± 3.3 | 45.9 ± 3.3 | 50.0 ± 4.0 |
| humanoidmaze-giant | 27.2 ± 3.1 | 12 ± 3.3 | 20.0 ± 1.1 | 9.5 ± 1.3 | 27.2 ± 3.1 | 18.1 ± 3.8 | 17.0 ± 5.2 | 16.3 ± 3.1 |
| antsoccer-medium | 12.8 ± 2.4 | 13 ± 1.7 | 7.0 ± 0.4 | 5.9 ± 0.7 | 12.8 ± 2.4 | 5.6 ± 1.7 | 10.9 ± 2.8 | 12.1 ± 2.6 |
| antsoccer-arena | 63.7 ± 3.1 | 58 ± 1.7 | 24.9 ± 1.6 | 12.5 ± 0.8 | 63.7 ± 3.1 | 56.0 ± 7.3 | 56.9 ± 4.0 | 63.0 ± 5.2 |
| cube-single | 32.8 ± 3.1 | 15 ± 2.6 | 6.9 ± 0.7 | 4.4 ± 0.8 | 32.8 ± 3.1 | 30.5 ± 6.9 | 25.9 ± 4.3 | 22.4 ± 3.5 |
| cube-double | 23.4 ± 4.5 | 6 ± 1.7 | 1.3 ± 0.3 | 1.1 ± 0.3 | 23.4 ± 4.5 | 18.9 ± 4.0 | 5.4 ± 2.6 | 5.4 ± 1.2 |
| cube-triple | 15.2 ± 3.5 | 3 ± 0.9 | 2.1 ± 0.6 | 0.4 ± 0.3 | 15.2 ± 3.5 | 24.4 ± 4.0 | 6.4 ± 2.1 | 9.0 ± 4.0 |
| cube-quadruple | 0.1 ± 0.2 | 0 ± 0.0 | 0.1 ± 0.1 | 0.0 ± 0.0 | 0.1 ± 0.2 | 1.8 ± 1.7 | 0.4 ± 0.7 | 0.3 ± 0.5 |
| puzzle-3x3 | 45.5 ± 7.3 | 12 ± 1.7 | 6.0 ± 0.9 | 4.1 ± 0.3 | 45.5 ± 7.3 | 15.0 ± 4.3 | 27.1 ± 5.2 | 28.1 ± 3.8 |
| puzzle-4x4 | 35.2 ± 7.1 | 7 ± 1.7 | 5.9 ± 0.9 | 0.3 ± 0.2 | 35.2 ± 7.1 | 22.5 ± 4.0 | 14.2 ± 3.1 | 13.7 ± 2.4 |
| puzzle-4x5 | 4.2 ± 2.4 | 4 ± 0.9 | 1.6 ± 0.6 | 0.4 ± 0.2 | 4.2 ± 2.4 | 5.0 ± 1.4 | 3.0 ± 1.9 | 3.1 ± 0.9 |
| puzzle-4x6 | 3.6 ± 2.1 | 3 ± 0.9 | 0.6 ± 0.3 | 0.3 ± 0.2 | 3.6 ± 2.1 | 3.9 ± 1.7 | 2.4 ± 1.4 | 2.3 ± 1.7 |
| scene | 70.5 ± 5.2 | 38 ± 2.6 | 15.8 ± 1.0 | 8.5 ± 1.1 | 70.5 ± 5.2 | 69.8 ± 6.4 | 55.7 ± 5.7 | 64.3 ± 3.1 |
| win-rate | 20/22 | 2/22 | 21/21 | 0/21 | 16/22 | 5/22 | 0/22 | 1/22 |

Table 1: Success rates (%) on state-based OGBench tasks, reported as mean ± 95% CI over 8 seeds. (a) HRL Comparison contrasts HIQL and HGCBC with their CARL-augmented variants, where we bold values within 95% of the better method in each pair. (b) CARL Ablations compares variants of CARL combined with HIQL, where we bold values within 95% of the best ablation. The final row reports per-algorithm win-rates excluding ties, with the highest bolded.

| Task | HIQL+CARL | HIQL |
|---|---|---|
| visual-antmaze-medium | **97.3 ± 2.7** | **93 ± 6.4** |
| visual-antmaze-large | **85.5 ± 5.6** | 53 ± 14.3 |
| visual-antmaze-giant | **43.0 ± 6.2** | 6 ± 6.4 |
| visual-antmaze-teleport | **45.3 ± 2.7** | 37 ± 3.2 |
| visual-humanoidmaze-medium | **2.3 ± 2.4** | 0 ± 0.0 |
| visual-humanoidmaze-large | **0.0 ± 0.0** | **0 ± 0.0** |
| visual-humanoidmaze-giant | **0.0 ± 0.0** | **0 ± 0.0** |
| visual-cube-single | 87.8 ± 8.1 | **89 ± 0.0** |
| visual-cube-double | **41.8 ± 9.4** | 39 ± 3.2 |
| visual-cube-triple | **24.8 ± 2.7** | 21 ± 0.0 |
| visual-cube-quadruple | **14.8 ± 4.5** | 14 ± 1.6 |
| visual-puzzle-3x3 | **75.5 ± 4.6** | 73 ± 12.7 |
| visual-puzzle-4x4 | **83.5 ± 7.0** | 60 ± 65.2 |
| visual-puzzle-4x5 | **18.0 ± 3.7** | 13 ± 14.3 |
| visual-puzzle-4x6 | **15.8 ± 8.0** | 9 ± 9.5 |
| visual-scene | **54.3 ± 2.1** | 49 ± 6.4 |
| win-rate | **13/14** | 1/14 |

Table 2: HIQL+CARL vs HIQL on visual environments. Mean success rates (%) ± 95% CI over 4 seeds. Values within 95% of the best per row are bolded; for win-rates, ties are excluded and the highest is bolded.

In robotics domains, CARL improves HIQL substantially more than it improves HGCBC. We attribute this to the difference in how each algorithm extracts its low-level policy: HIQL uses AWR, which selectively reweights toward high-advantage actions, while HGCBC relies on unweighted behavioral cloning. Combining our method (to discover where a skill can be reused) with AWR (to discover which skills are optimal) allows us to reuse skills that are more likely to achieve the global task, and thus results in better task performance. While it remains unclear why robotics tasks in particular benefit disproportionately, the pattern suggests a complementary relationship between extracting high-quality skills and knowing where to use them.

#### 5.2.3 OGBench Pixel-Based Performance

Visual observations represent a higher-dimensional and more relevant setting for real-world control, making them an important test of CARL’s applicability beyond compact state representations. Thus, we benchmark our HIQL+CARL variant on OGBench’s visual suite without further hyperparameter tuning, and present results in Table [2](https://arxiv.org/html/2605.26371#S5.T2). We find significant gains on longer-horizon antmaze tasks (over 30 percentage points in large and giant). Robotics domains also see an improvement of at least 5 percentage points in scene and the larger puzzle tasks, with gains of over 20 points on puzzle-4x4. However, CARL does not always improve performance. Visual humanoid control is still too difficult even with CARL’s representation, and cube tasks also benefit less than other robotics domains.
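As a sanity check on the win-rate row of Table 2, the following minimal sketch recounts wins from the reported per-task means. It assumes a "win" means a strictly higher mean success rate and that exact ties are dropped; the paper does not spell out the rule beyond "ties are excluded", and the function name `win_rates` is hypothetical.

```python
# Mean success rates (%) copied from Table 2 as (HIQL+CARL, HIQL).
table2_means = {
    "visual-antmaze-medium": (97.3, 93.0),
    "visual-antmaze-large": (85.5, 53.0),
    "visual-antmaze-giant": (43.0, 6.0),
    "visual-antmaze-teleport": (45.3, 37.0),
    "visual-humanoidmaze-medium": (2.3, 0.0),
    "visual-humanoidmaze-large": (0.0, 0.0),
    "visual-humanoidmaze-giant": (0.0, 0.0),
    "visual-cube-single": (87.8, 89.0),
    "visual-cube-double": (41.8, 39.0),
    "visual-cube-triple": (24.8, 21.0),
    "visual-cube-quadruple": (14.8, 14.0),
    "visual-puzzle-3x3": (75.5, 73.0),
    "visual-puzzle-4x4": (83.5, 60.0),
    "visual-puzzle-4x5": (18.0, 13.0),
    "visual-puzzle-4x6": (15.8, 9.0),
    "visual-scene": (54.3, 49.0),
}


def win_rates(means):
    """Count wins per method by mean success rate, excluding exact ties.

    Assumption: a win is a strictly higher mean; confidence intervals are ignored.
    """
    wins_first = sum(a > b for a, b in means.values())
    wins_second = sum(b > a for a, b in means.values())
    contested = wins_first + wins_second  # tasks that are not ties
    return wins_first, wins_second, contested


carl_wins, hiql_wins, contested = win_rates(table2_means)
print(f"HIQL+CARL {carl_wins}/{contested}, HIQL {hiql_wins}/{contested}")
# prints: HIQL+CARL 13/14, HIQL 1/14
```

Under this reading, the two tied humanoidmaze tasks (both at 0.0) account for the denominator of 14 rather than 16.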

### 5.3 Ablating CARL’s Design for Capturing Behavioral Similarity

(Q3) Component Ablation. In this section, we analyze how different parts of our representation learning approach affect its ability to capture behavioral similarity. Our full method optimizes $\mathcal{L}_{\text{InfoNCE}}$ (Eq. [1](https://arxiv.org/html/2605.26371#S4.E1)) with the action sequence $\mathbf{a_k}$ as the contrastive target. The three ablations modify this along one or both of two axes. **Single-Action CARL** replaces $\mathbf{a_k}$ with the first action $a_{k,1}$, keeping the contrastive objective. **Multi-Action Prediction** (listed as Multi-Action IDM in Table 1) keeps $\mathbf{a_k}$ but replaces the contrastive loss with direct regression, $\|\xi\big(\phi(s,g_k)\big)-\mathbf{a_k}\|_2^2$, where $\xi$ is a network head that takes in representations of state-goal pairs produced by $\phi$. **Single-Action Prediction** (Single-Action IDM in Table 1) applies both changes.
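The two ablation axes can be illustrated with a small NumPy sketch: one helper picks the target (full action sequence or first action only), and the other computes the squared-error regression loss used by the prediction variants. The helper names, the flattening of the action sequence into one vector, and the batch averaging are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np


def select_target(actions, single_action):
    """Choose the target for a batch of k-step action sequences.

    actions: array of shape (B, k, action_dim).
    single_action=True keeps only the first action a_{k,1};
    otherwise the whole sequence is flattened into one vector (an assumption).
    """
    if single_action:
        return actions[:, 0, :]
    return actions.reshape(actions.shape[0], -1)


def prediction_loss(predicted, target):
    """Squared L2 regression loss ||xi(phi(s, g_k)) - target||_2^2, averaged over the batch.

    `predicted` stands in for the output of the head xi applied to phi(s, g_k).
    """
    return float(np.mean(np.sum((predicted - target) ** 2, axis=-1)))


# Toy batch: 4 sequences of k=5 actions, each action 2-dimensional.
actions = np.ones((4, 5, 2))
multi_target = select_target(actions, single_action=False)   # shape (4, 10)
single_target = select_target(actions, single_action=True)   # shape (4, 2)
loss = prediction_loss(np.zeros_like(multi_target), multi_target)
```

The contrastive variants would feed the same targets into an action encoder and an InfoNCE objective instead of `prediction_loss`.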

To compare these ablations, we use UMAP (McInnes et al., [2020](https://arxiv.org/html/2605.26371#bib.bib55)) to visualize how well each method organizes its latent structure by behavioral similarity. More concretely, we construct state-goal pairs corresponding to motion in cardinal directions for the pointmaze environment from OGBench. We use pointmaze because its simple dynamics make it easy to reason about the ideal clustering structure: pairs should cluster by the direction of motion they induce.

![Refer to caption](https://arxiv.org/html/2605.26371v1/x3.png)

Figure 3: UMAP visualization of learned embeddings colored by behavioral mode. When embeddings are trained using CARL, state-goal pairs that require similar action sequences cluster tightly, even when the raw states differ substantially. Other ablations show more overlap and looser clusters.

Figure [3](https://arxiv.org/html/2605.26371#S5.F3) shows that only CARL’s combination of action-sequence modeling and a contrastive objective produces tight, disjoint clusters suitable for downstream skill reuse. The ablations yield scattered, overlapping clusters, with the exception of Multi-Action Prediction, whose clusters are looser but still disjoint enough to capture skill reuse meaningfully. This suggests that modeling action sequences is important for capturing behavioral similarity and supporting skill reuse.

We also evaluate downstream performance on state-based OGBench environments in Table [1](https://arxiv.org/html/2605.26371#S5.T1) to contextualize these differences in representation geometry. CARL generally outperforms the other variants, emphasizing the importance of both action-sequence modeling and a contrastive objective. Single-Action CARL comes closest in performance, while the prediction methods perform comparably to each other. This suggests that the contrastive objective is an effective way to translate behavioral similarity into useful downstream representations.

![Refer to caption](https://arxiv.org/html/2605.26371v1/x4.png)

Figure 4: Visualizations of nearest-neighbor trajectories under CARL and a random encoder for two reference skills (walking backwards and standing up); although these neighbors originate from across the maze, we shift them to share the reference’s starting position to highlight their behavioral similarity. The grey trajectory is the reference; blue, yellow, and red are its nearest neighbors.

### 5.4 Visualizing CARL’s Representation Structure

(Q4) Skill Structure. We have already observed how CARL’s geometry tightly clusters state-goal pairs by behavioral similarity. In this section, we examine whether this result extends to higher-dimensional and less structured settings by studying our representations in the humanoid maze environment.

To probe CARL’s clustering structure, we select nearest neighbors of two reference state-goal pairs, one corresponding to walking backwards and one to standing up, and visualize their rollouts. As shown in Figure [4](https://arxiv.org/html/2605.26371#S5.F4), the nearest-neighbor trajectories (blue, yellow, and red) are behaviorally coherent with the grey reference trajectory, whereas the random encoder’s neighbors are not. Together with Figure [3](https://arxiv.org/html/2605.26371#S5.F3), this demonstrates that CARL’s representation organizes less structured, complex skills by their behavioral similarity, even in scaled-up environments.

## 6 Limitations and Opportunities for Future Work

### 6.1 Limitations

##### Skill extraction horizon\.

In this work, the notion of a skill depends on the choice of extraction horizon $k$. We analyze CARL’s sensitivity to $k$ in Appendix [C.4](https://arxiv.org/html/2605.26371#A3.SS4) and find that the optimal value is environment-dependent. A more principled approach would remove this dependence, discovering relevant local dynamics structure at various horizons to solve a given task.

##### Dataset coverage\.

Our representation depends on the diversity, coverage, and quality of the offline dataset. Data from a random policy lacks the behaviors needed to learn useful representations, and sparse coverage can cause the collapse of distinct state-goal pairs. Appendix [C.5](https://arxiv.org/html/2605.26371#A3.SS5) characterizes this sensitivity and shows CARL does not amplify it relative to HIQL. Addressing these limitations in an online setting will require intelligent data collection or a more robust objective.

### 6.2 Future Work

##### Skill discovery\.

The behavior-centric view suggests a connection between representation learning and skill discovery. Future work could explicitly extract, label, or compose the clusters in CARL’s representation structure, extending prior work (Lin et al., [2022](https://arxiv.org/html/2605.26371#bib.bib63)) to yield a unified framework for offline skill discovery and hierarchical control.

##### Action abstraction\.

Our InfoNCE objective aligns state\-goal pairs with their short\-horizon action sequences, suggesting a duality in how skills can be defined: by the contexts where they are used or by the behaviors that achieve them\. Future work could explore this duality by treating action\-sequence abstractions, defined by the state\-goal pairs they achieve, as skills in their own right\.

##### High\-level policy representations\.

CARL improves the low\-level policy, but the high\-level policy may be an equally important bottleneck\. A natural next step is to develop representations for high\-level decision\-making: ones that support better planning, guide exploration, or operate over more abstract action primitives\.

##### Online skill discovery\.

CARL currently operates offline, but the behavioral similarity objective could be extended to the online setting\. More broadly, addressing the dataset coverage limitations discussed earlier could yield sample\-efficiency gains and widen the applicability of hierarchical RL\.

## 7 Closing Remarks

We introduced CARL, a representation learning approach for goal-conditioned offline RL that explicitly captures reusable short-horizon behaviors. Our central claim is that many RL problems contain *local dynamical structure*, lending themselves to a separation between local and global information. We demonstrated that building a notion of local reusability into the low-level policy of a hierarchy unlocks performance benefits through skill reuse. By learning embeddings that cluster state–goal pairs according to the short-horizon behaviors they admit, our method provides a concrete mechanism for skill reuse that improves existing hierarchical RL frameworks.

## References

- R. Agarwal, M. C. Machado, P. S. Castro, and M. G. Bellemare (2021) Contrastive behavioral similarity embeddings for generalization in reinforcement learning. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=qda7-sVg84) Cited by: [§2.2](https://arxiv.org/html/2605.26371#S2.SS2.p1.1), [§2.3](https://arxiv.org/html/2605.26371#S2.SS3.p1.1).
- A. Ajay, A. Kumar, P. Agrawal, S. Levine, and O. Nachum (2021) OPAL: offline primitive discovery for accelerating offline reinforcement learning. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=V69LGwJ0lIN) Cited by: [§2.2](https://arxiv.org/html/2605.26371#S2.SS2.p1.1).
- P. S. Castro (2020) Scalable methods for computing state similarity in deterministic Markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 10069–10076. Cited by: [§2.2](https://arxiv.org/html/2605.26371#S2.SS2.p1.1).
- P. Dayan and G. E. Hinton (1992) Feudal reinforcement learning. In Advances in Neural Information Processing Systems, S. Hanson, J. Cowan, and C. Giles (Eds.), Vol. 5. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/1992/file/d14220ee66aeec73c49038385428ec4c-Paper.pdf) Cited by: [§2.1](https://arxiv.org/html/2605.26371#S2.SS1.p1.1).
- T. G. Dietterich (2000) Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Int. Res. 13 (1), pp. 227–303. External Links: ISSN 1076-9757 Cited by: [§2.1](https://arxiv.org/html/2605.26371#S2.SS1.p1.1).
- L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu (2018) IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 1407–1416. External Links: [Link](https://proceedings.mlr.press/v80/espeholt18a.html) Cited by: [Table 3](https://arxiv.org/html/2605.26371#A2.T3.10.10.17.2).
- B. Eysenbach, R. R. Salakhutdinov, and S. Levine (2019) Search on the replay buffer: bridging planning and reinforcement learning. Advances in Neural Information Processing Systems 32. Cited by: [§1](https://arxiv.org/html/2605.26371#S1.p1.1).
- B. Eysenbach, T. Zhang, S. Levine, and R. R. Salakhutdinov (2022) Contrastive learning as goal-conditioned reinforcement learning. Advances in Neural Information Processing Systems 35, pp. 35603–35620. Cited by: [§2.3](https://arxiv.org/html/2605.26371#S2.SS3.p1.1).
- N. Ferns, P. Panangaden, and D. Precup (2004) Metrics for finite Markov decision processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI ’04, Arlington, Virginia, USA, pp. 162–169. External Links: ISBN 0974903906 Cited by: [§3.2](https://arxiv.org/html/2605.26371#S3.SS2.p2.1).
- D. Hafner, K. Lee, I. Fischer, and P. Abbeel (2022) Deep hierarchical planning from pixels. Advances in Neural Information Processing Systems 35, pp. 26091–26104. Cited by: [§1](https://arxiv.org/html/2605.26371#S1.p1.1), [§2.1](https://arxiv.org/html/2605.26371#S2.SS1.p2.1).
- P. Hansen-Estruch, A. Zhang, A. Nair, P. Yin, and S. Levine (2022) Bisimulation makes analogies in goal-conditioned reinforcement learning. In International Conference on Machine Learning, pp. 8407–8426. Cited by: [§2.2](https://arxiv.org/html/2605.26371#S2.SS2.p1.1).
- D. Hendrycks and K. Gimpel (2023) Gaussian error linear units (GELUs). External Links: 1606.08415, [Link](https://arxiv.org/abs/1606.08415) Cited by: [Table 3](https://arxiv.org/html/2605.26371#A2.T3.10.10.20.2).
- R. Islam, M. Tomar, A. Lamb, Y. Efroni, H. Zang, A. Didolkar, D. Misra, X. Li, H. Van Seijen, R. T. Des Combes, and J. Langford (2023) Principled offline RL in the presence of rich exogenous information. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. Cited by: [§2.2](https://arxiv.org/html/2605.26371#S2.SS2.p1.1).
- D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.). External Links: [Link](http://arxiv.org/abs/1412.6980) Cited by: [Table 3](https://arxiv.org/html/2605.26371#A2.T3.10.10.13.2).
- M. Laskin, A. Srinivas, and P. Abbeel (2020) CURL: contrastive unsupervised representations for reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 5639–5650. External Links: [Link](https://proceedings.mlr.press/v119/laskin20a.html) Cited by: [§2.3](https://arxiv.org/html/2605.26371#S2.SS3.p1.1).
- A. Levy, R. Platt, and K. Saenko (2019) Hierarchical reinforcement learning with hindsight. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=ryzECoAcY7) Cited by: [§1](https://arxiv.org/html/2605.26371#S1.p1.1), [§2.1](https://arxiv.org/html/2605.26371#S2.SS1.p2.1).
- X. Lin, C. Qi, Y. Zhang, Z. Huang, K. Fragkiadaki, Y. Li, C. Gan, and D. Held (2022) Planning with spatial-temporal abstraction from point clouds for deformable object manipulation. In 6th Annual Conference on Robot Learning. External Links: [Link](https://openreview.net/forum?id=tyxyBj2w4vw) Cited by: [§6.2](https://arxiv.org/html/2605.26371#S6.SS2.SSS0.Px1.p1.1).
- L. McInnes, J. Healy, and J. Melville (2020) UMAP: uniform manifold approximation and projection for dimension reduction. External Links: 1802.03426, [Link](https://arxiv.org/abs/1802.03426) Cited by: [§5.3](https://arxiv.org/html/2605.26371#S5.SS3.p2.1).
- O. Nachum, S. S. Gu, H. Lee, and S. Levine (2018) Data-efficient hierarchical reinforcement learning. Advances in Neural Information Processing Systems 31. Cited by: [§2.1](https://arxiv.org/html/2605.26371#S2.SS1.p2.1).
- S. Park, K. Frans, B. Eysenbach, and S. Levine (2025a) OGBench: benchmarking offline goal-conditioned RL. In International Conference on Learning Representations, Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (Eds.), Vol. 2025, pp. 94937–94982. External Links: [Link](https://proceedings.iclr.cc/paper_files/paper/2025/file/ecd92623ac899357312aaa8915853699-Paper-Conference.pdf) Cited by: [§B.1](https://arxiv.org/html/2605.26371#A2.SS1.p1.1), [Appendix B](https://arxiv.org/html/2605.26371#A2.p1.1), [§5.2.1](https://arxiv.org/html/2605.26371#S5.SS2.SSS1.p3.1).
- S. Park, D. Ghosh, B. Eysenbach, and S. Levine (2023) HIQL: offline goal-conditioned RL with latent states as actions. Advances in Neural Information Processing Systems 36, pp. 34866–34891. Cited by: [§B.1](https://arxiv.org/html/2605.26371#A2.SS1.p1.1), [§C.3](https://arxiv.org/html/2605.26371#A3.SS3.p1.1), [§1](https://arxiv.org/html/2605.26371#S1.p1.1), [§1](https://arxiv.org/html/2605.26371#S1.p3.1), [§2.1](https://arxiv.org/html/2605.26371#S2.SS1.p2.1), [§4.2](https://arxiv.org/html/2605.26371#S4.SS2.p2.1), [§4.2](https://arxiv.org/html/2605.26371#S4.SS2.p3.1).
- S. Park, D. Mann, and S. Levine (2025b) Dual goal representations. External Links: 2510.06714, [Link](https://arxiv.org/abs/2510.06714) Cited by: [§2.2](https://arxiv.org/html/2605.26371#S2.SS2.p1.1).
- R. Parr and S. Russell (1997) Reinforcement learning with hierarchies of machines. Advances in Neural Information Processing Systems 10. Cited by: [§2.1](https://arxiv.org/html/2605.26371#S2.SS1.p1.1).
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever (2021) Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, M. Meila and T. Zhang (Eds.), Proceedings of Machine Learning Research, Vol. 139, pp. 8748–8763. External Links: [Link](https://proceedings.mlr.press/v139/radford21a.html) Cited by: [§2.3](https://arxiv.org/html/2605.26371#S2.SS3.p1.1).
- M. Rudolph, C. Chuck, K. Black, M. Lvovsky, S. Niekum, and A. Zhang (2024) Learning action-based representations using invariance. Reinforcement Learning Journal 1, pp. 342–365. Cited by: [§2.2](https://arxiv.org/html/2605.26371#S2.SS2.p1.1), [§3.2](https://arxiv.org/html/2605.26371#S3.SS2.p2.1).
- R. S. Sutton, D. Precup, and S. Singh (1999) Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1), pp. 181–211. External Links: ISSN 0004-3702, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/S0004-3702%2899%2900052-1), [Link](https://www.sciencedirect.com/science/article/pii/S0004370299000521) Cited by: [§1](https://arxiv.org/html/2605.26371#S1.p1.1), [§2.1](https://arxiv.org/html/2605.26371#S2.SS1.p1.1).
- \[27\] mjswan: MuJoCo Simulation on Web Assembly with Neural Networks. External Links: [Link](https://github.com/ttktjmt/mjswan) Cited by: [§B.2](https://arxiv.org/html/2605.26371#A2.SS2.p1.1).
- A. van den Oord, Y. Li, and O. Vinyals (2019) Representation learning with contrastive predictive coding. External Links: 1807.03748, [Link](https://arxiv.org/abs/1807.03748) Cited by: [§2.3](https://arxiv.org/html/2605.26371#S2.SS3.p1.1).
- V. H. Wang, T. Wang, and J. Pajarinen (2025) Hierarchical reinforcement learning with uncertainty-guided diffusional subgoals. In Forty-second International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=1YOYA2zN1j) Cited by: [§1](https://arxiv.org/html/2605.26371#S1.p1.1).
- A. Zhang, R. T. McAllister, R. Calandra, Y. Gal, and S. Levine (2021) Learning invariant representations for reinforcement learning without reconstruction. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=-2FCwDKRREu) Cited by: [§2.2](https://arxiv.org/html/2605.26371#S2.SS2.p1.1), [§3.2](https://arxiv.org/html/2605.26371#S3.SS2.p2.1).
- T. Zhang, S. Guo, T. Tan, X. Hu, and F. Chen (2020) Generating adjacency-constrained subgoals in hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 21579–21590. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/f5f3b8d720f34ebebceb7765e447268b-Paper.pdf) Cited by: [§1](https://arxiv.org/html/2605.26371#S1.p1.1).

## Appendix A Training Details

##### InfoNCE Loss\.

In practice, we calculate our InfoNCE loss using the normalized dot product

$$f\!\left(\phi(s,g),\,\psi(\mathbf{a}_{k})\right)=\left\langle\frac{\phi(s,g)}{\lVert\phi(s,g)\rVert_{2}},\;\frac{\psi(\mathbf{a}_{k})}{\lVert\psi(\mathbf{a}_{k})\rVert_{2}}\right\rangle,$$
where the loss function is

$$\mathcal{L}_{\mathrm{InfoNCE}}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\!\left(f\!\left(\phi(s^{i},g_{k}^{i}),\,\psi(\mathbf{a}_{k}^{i})\right)/\tau\right)}{\sum_{j=1}^{B}\exp\!\left(f\!\left(\phi(s^{i},g_{k}^{i}),\,\psi(\mathbf{a}_{k}^{j})\right)/\tau\right)}.$$
Normalizing the InfoNCE score function prevents trivial solutions such as scaling the norms of embeddings, and bounds the scale of logits which makes the temperature parameter meaningful\.
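The loss above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the embeddings are taken as precomputed arrays standing in for $\phi(s^i, g_k^i)$ and $\psi(\mathbf{a}_k^j)$, and the small `eps` and the max-subtraction are numerical-stability additions that are not part of the formula.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit L2 norm (eps guards against division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce_loss(state_goal_emb, action_emb, tau=0.1):
    """InfoNCE with a normalized dot-product score.

    state_goal_emb: (B, d) array, row i stands in for phi(s^i, g_k^i).
    action_emb:     (B, d) array, row j stands in for psi(a_k^j).
    Row i of the logits scores pair i against every action sequence in the batch;
    the diagonal entries are the positives.
    """
    logits = l2_normalize(state_goal_emb) @ l2_normalize(action_emb).T / tau  # (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)  # stability only
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))


rng = np.random.default_rng(0)
phi = rng.normal(size=(8, 16))
psi = rng.normal(size=(8, 16))
loss = info_nce_loss(phi, psi, tau=0.1)
# Rescaling the embeddings leaves the loss (almost exactly) unchanged,
# which is the invariance that normalization is meant to provide.
loss_rescaled = info_nce_loss(3.0 * phi, 0.5 * psi, tau=0.1)
```

Because every score lies in $[-1, 1]$, the logits are bounded by $\pm 1/\tau$, so $\tau$ alone controls how sharply the softmax concentrates on the positive pair.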

##### Goal Sampling\.

Given a $k$-step triplet $(s_{t},\mathbf{a_{k}},s_{t+k})$ sampled from our dataset, we construct positive pairs by uniformly sampling goals from the range $s_{t+1}\ldots s_{t+k}$. In other words, we obtain positive pairs that consist of $(s_{t},s_{t+i})$ and $\mathbf{a_{k}}$, where $1\leq i\leq k$. This allows us to capture state-goal equivalences for all goals reachable within $k$ steps from a state, aligning more closely with our theoretical formulation. In addition, it allows us to capture richer temporal structure in the environment, since skills at different horizons may become useful for the high-level policy. The benefits of this strategy are further explored in Appendix [C.2](https://arxiv.org/html/2605.26371#A3.SS2).
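A minimal sketch of this goal-sampling scheme follows. The trajectory layout (one more state than actions), the function name, and the return signature are assumptions chosen for illustration; only the sampling rule itself (goal index uniform in $1 \ldots k$, paired with the full $k$-step action sequence) comes from the text.

```python
import numpy as np


def sample_positive_pair(states, actions, t, k, rng):
    """Sample one positive pair ((s_t, s_{t+i}), a_k) with i uniform in {1, ..., k}.

    states:  (T + 1, state_dim) array for a trajectory with T actions (assumed layout).
    actions: (T, action_dim) array.
    Requires t + k <= T so that the full k-step segment exists.
    """
    i = int(rng.integers(1, k + 1))   # upper bound is exclusive, so i is in 1..k
    goal = states[t + i]              # goal reachable within k steps of s_t
    action_seq = actions[t:t + k]     # the full k-step sequence, whatever i is
    return states[t], goal, action_seq, i


# Toy trajectory whose state and action values equal their time index.
states = np.arange(11, dtype=float).reshape(11, 1)
actions = np.arange(10, dtype=float).reshape(10, 1)
rng = np.random.default_rng(0)
s_t, goal, action_seq, i = sample_positive_pair(states, actions, t=2, k=5, rng=rng)
```

Note that the contrastive target stays the full sequence $\mathbf{a_k}$ even when the sampled goal is closer than $k$ steps away.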

##### Action Sequence Striding\.

For humanoid maze, we found that long action sequences and large action spaces presented computational challenges due to the memory required to store long sequences of high-dimensional actions. As a result, we introduced an action stride hyperparameter, which avoids encoding the entire action sequence. We empirically selected an action stride of 4 for humanoid tasks because this satisfied our computational budget and minimally changed the action sequence. We additionally note that striding minimally affects downstream performance, likely because the strided action sequence captures most of the structure of the full sequence.
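One plausible reading of action striding, sketched below, is plain subsampling: keep every `stride`-th action of the $k$-step sequence before it is passed to the action encoder. The paper does not state the exact subsampling rule, so starting at the first action and the helper name `stride_actions` are assumptions.

```python
import numpy as np


def stride_actions(action_seq, stride):
    """Keep every `stride`-th action, starting from the first one (assumed rule).

    action_seq: (k, action_dim) array; returns roughly k / stride actions.
    """
    return action_seq[::stride]


# With k = 100 and stride = 4 (the humanoidmaze settings in Table 3),
# the encoder input shrinks from 100 actions to 25.
k, action_dim = 100, 3  # action_dim is a placeholder value
action_seq = np.arange(k * action_dim, dtype=float).reshape(k, action_dim)
strided = stride_actions(action_seq, stride=4)
```

With a stride of 1, the setting used for the other environments, the sequence is returned unchanged.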

##### Action Sequence Truncation\.

An edge case arises when constructing state–goal pairs near the end of a trajectory, where the full $k$-step action sequence and goal are unavailable. In this case, we pair the state with the last available goal in the trajectory and pad the action sequence to length $k$ by repeating the final action.
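The truncation rule can be sketched as follows. The trajectory layout (one more state than actions) and the helper name are assumptions for illustration; the padding rule (repeat the final action until the sequence has length $k$) follows the description above.

```python
import numpy as np


def truncated_segment(states, actions, t, k):
    """Return (goal, action_seq) for a segment starting at t, padding near the end.

    states:  (T + 1, state_dim) array for a trajectory with T actions (assumed layout).
    actions: (T, action_dim) array; requires t < T so at least one action exists.
    """
    num_actions = actions.shape[0]
    end = min(t + k, num_actions)
    goal = states[end]               # last available goal when t + k runs past the end
    action_seq = actions[t:end]
    missing = k - action_seq.shape[0]
    if missing > 0:
        # Pad to length k by repeating the final available action.
        padding = np.repeat(action_seq[-1:], missing, axis=0)
        action_seq = np.concatenate([action_seq, padding], axis=0)
    return goal, action_seq


states = np.arange(11, dtype=float).reshape(11, 1)
actions = np.arange(10, dtype=float).reshape(10, 1)
goal_full, seq_full = truncated_segment(states, actions, t=2, k=5)    # no padding needed
goal_trunc, seq_trunc = truncated_segment(states, actions, t=8, k=5)  # 2 real actions, 3 padded
```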

##### Auxiliary Loss Weighting\.

When combining our method with HIQL and HGCBC, we introduce a hyperparameter $\lambda_{aux}$, which weights the policy and value losses against our added objective, $\mathcal{L}_{aux}$. Specifically, the final loss is calculated as

$$\mathcal{L}_{total}=(1-\lambda_{aux})(\mathcal{L}_{value}+\mathcal{L}_{high}+\mathcal{L}_{low})+\lambda_{aux}\mathcal{L}_{aux}.$$

Here, $\mathcal{L}_{low}$ refers to the low-level policy loss, $\mathcal{L}_{high}$ refers to the high-level policy loss, and $\mathcal{L}_{value}$ refers to the value function loss. $\mathcal{L}_{aux}$ refers to the auxiliary loss applied (which is $\mathcal{L}_{\text{InfoNCE}}$ for CARL). We report the values of $\lambda_{aux}$ we select empirically in Table [3](https://arxiv.org/html/2605.26371#A2.T3).
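As a small worked example of this weighting, the sketch below evaluates the total loss for made-up loss values. The function name and the numbers are purely illustrative; only the formula comes from the text.

```python
def total_loss(value_loss, high_loss, low_loss, aux_loss, lambda_aux):
    """Convex combination of the base losses and the auxiliary loss."""
    base = value_loss + high_loss + low_loss
    return (1.0 - lambda_aux) * base + lambda_aux * aux_loss


# Illustrative numbers only: base losses sum to 6.0, auxiliary loss is 10.0.
# With lambda_aux = 0.3 the total is 0.7 * 6.0 + 0.3 * 10.0 = 7.2.
example = total_loss(value_loss=1.0, high_loss=2.0, low_loss=3.0, aux_loss=10.0, lambda_aux=0.3)
```

Setting `lambda_aux` to 0 recovers the sum of the base losses, and setting it to 1 trains on the auxiliary objective alone.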

## Appendix B Implementation Details

We base our implementation of CARL on OGBench (Park et al., [2025a](https://arxiv.org/html/2605.26371#bib.bib4)). We run our experiments on an internal GPU cluster composed of Quadro RTX 5000 GPUs with around 16GB of memory. Similar to the times reported in Park et al. ([2025a](https://arxiv.org/html/2605.26371#bib.bib4)), each run typically takes around 4 hours (state-based tasks) or around 10 hours (pixel-based tasks) when running HIQL+CARL and HGCBC+CARL. We estimate that a total of around 10k GPU-hours were required for all experiments, with some additional hours for preliminary experiments.

| Hyperparameter | Value |
|---|---|
| Learning rate | 0.0003 |
| Optimizer | Adam (Kingma and Ba, [2015](https://arxiv.org/html/2605.26371#bib.bib58)) |
| # gradient steps | 1000000 (states), 500000 (pixels) |
| Minibatch size | 1024 (states), 256 (pixels) |
| MLP dimensions | (512, 512, 512) |
| Representation architecture (pixel-based) | Impala CNN (Espeholt et al., [2018](https://arxiv.org/html/2605.26371#bib.bib1)) |
| Image augmentation probability | 0.5 (pixel-based manipulation), 0 (others) |
| AWR Temperature | 3.0 |
| State-Goal MLP dimensions $\phi(s,g)$ | (512, 512, 512) |
| Action sequence MLP dimensions $\psi(\mathbf{a_k})$ | (256, 256) |
| Nonlinearity | GELU (Hendrycks and Gimpel, [2023](https://arxiv.org/html/2605.26371#bib.bib59)) |
| Target smoothing coefficient | 0.005 |
| Discount factor $\gamma$ | 0.995 ({antmaze, pointmaze}-giant, humanoidmaze), 0.99 (others) |
| Expectile $\kappa$ | 0.7 |
| Subgoal step | 100 (humanoidmaze), 25 (other locomotion), 10 (others) |
| Representation horizon $k$ | 100 (humanoidmaze), 25 (other locomotion), 10 (others) |
| Action stride | 4 (humanoidmaze), 1 (other) |
| Representation dimension | 100 |
| Single-Action CARL auxiliary loss weight $\lambda_{\text{aux}}$ | 0.1 (humanoidmaze), 0.5 (other locomotion), 0.1 (others) |
| CARL auxiliary loss weight $\lambda_{\text{aux}}$ | 0.1 (humanoidmaze), 0.3 (other locomotion), 0.7 (others) |
| Prediction auxiliary loss weight $\lambda_{\text{aux}}$ | 0.1 (humanoidmaze), 0.7 (other locomotion), 0.9 (others) |
| Single-Action CARL temperature $\tau$ | 0.1 (locomotion), 0.05 (other) |
| CARL temperature $\tau$ | 0.1 (locomotion), 0.05 (other) |
| Evaluation Episodes | 20 |

Table 3: Hyperparameters used for comparisons across CARL and the other algorithms used in this paper. "Prediction" refers to parameters used for both Single-Action and Multi-Action Prediction.

### B.1 Hyperparameters

We base our implementation of HIQL+CARL on the implementation of HIQL in the OGBench codebase (Park et al., [2025a](https://arxiv.org/html/2605.26371#bib.bib4)). To ensure fair comparisons, we keep most hyperparameters the same when integrating CARL with HIQL and HGCBC. One notable exception is the size of the representation dimension. HIQL and HGCBC originally use representation dimensions of 10, but we change this to 100 in their CARL variants. For fair comparison, we compare HIQL+CARL to HIQL with a bigger representation dimension in Table [6](https://arxiv.org/html/2605.26371#A3.T6). We find that HIQL’s performance does not change significantly overall, with effects varying by environment: increasing the dimension causes regressions on pointmaze tasks while substantially improving performance on puzzle-3x3. Therefore, we use the original hyperparameters from Park et al. ([2023](https://arxiv.org/html/2605.26371#bib.bib5)) in our main benchmark.

The auxiliary loss weights $\lambda_{aux}$ and temperatures $\tau$ were tuned for all algorithms through hyperparameter sweeps of equal effort.

For a full description of the hyperparameters used in our training, refer to Table[3](https://arxiv.org/html/2605.26371#A2.T3)\.

### B.2 Humanoid Visualization Details

For qualitative humanoid evaluations, we use a custom tool to visualize rollouts of trajectories associated with state-goal pairs in our dataset. We used Muwanx ([Tsujimoto](https://arxiv.org/html/2605.26371#bib.bib66); GitHub: [https://github.com/ttktjmt/muwanx](https://github.com/ttktjmt/muwanx)), a project combining MuJoCo, WebAssembly, Three.js, and ONNX to bring live MuJoCo simulation to web browsers. Building on this tool, we added various UI features and visualizations suited to our experiments.

To perform the evaluation, we first find all of the nearest neighbors for a reference state-goal pair. We embed every $k$-step segment in our dataset and bin them into sequential groups of 500. From these bins, we select the best matches to the reference state-goal pair, and then select the top 30. Because the state-goal pairs in the humanoidmaze dataset span a variety of global $(x,y)$ locations, we center them to visualize relative differences in joint behavior.
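The retrieval procedure can be sketched as below. Several details are assumptions, since the text does not fix them: similarity is taken to be cosine similarity, exactly one best match is kept per bin, and the function name is hypothetical. Random vectors stand in for the learned embeddings.

```python
import numpy as np


def binned_nearest_neighbors(embeddings, reference, bin_size=500, top_n=30):
    """Return dataset indices of the top_n matches, keeping one candidate per bin.

    embeddings: (N, d) array of segment embeddings in dataset order.
    reference:  (d,) embedding of the reference state-goal pair.
    Assumptions: cosine similarity as the score, one best match per sequential bin.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    ref = reference / np.linalg.norm(reference)
    sims = unit @ ref                                   # cosine similarity per segment
    candidates = []
    for start in range(0, len(sims), bin_size):
        local_best = int(np.argmax(sims[start:start + bin_size]))
        candidates.append(start + local_best)           # best match inside this bin
    candidates = np.array(candidates)
    order = np.argsort(-sims[candidates])[:top_n]       # rank the per-bin winners
    return candidates[order]


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20000, 8))                # placeholder embeddings
reference = embeddings[1234].copy()
neighbors = binned_nearest_neighbors(embeddings, reference)
```

Keeping one candidate per sequential bin spreads the retrieved segments across the dataset, so the top matches are not all overlapping windows from the same stretch of one trajectory.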

In Figure [4](https://arxiv.org/html/2605.26371#S5.F4), we inspect nearest neighbors for different encoders by comparing $k$-step trajectories to a reference behavior. Although it is clear in videos of these 30 matches that representation learning methods cluster better than random encoders, snapshots of trajectories show only part of the picture. Therefore, we choose trajectories representative of the behavior seen in the videos we release on our website: [https://sites.google.com/view/behavior-rep/home](https://sites.google.com/view/behavior-rep/home).

## Appendix C Additional Experiments

### C.1 Pre-training vs Co-training in HIQL

| Task | Co-train | Pre-train |
| --- | --- | --- |
| pointmaze-medium | 84.4 ± 3.1 | 75.0 ± 6.7 |
| pointmaze-large | 75.4 ± 3.1 | 67.0 ± 6.6 |
| pointmaze-giant | 64.5 ± 15.8 | 48.5 ± 7.6 |
| pointmaze-teleport | 31.4 ± 14.0 | 18.0 ± 5.2 |
| antmaze-medium | 97.9 ± 0.9 | 97.2 ± 2.3 |
| antmaze-large | 91.9 ± 2.4 | 92.8 ± 2.4 |
| antmaze-giant | 75.2 ± 3.8 | 68.2 ± 5.0 |
| antmaze-teleport | 41.4 ± 2.1 | 50.5 ± 2.0 |
| humanoidmaze-medium | 90.5 ± 2.4 | 66.8 ± 4.0 |
| humanoidmaze-large | 58.3 ± 3.8 | 17.2 ± 1.4 |
| humanoidmaze-giant | 27.2 ± 3.1 | 4.0 ± 2.0 |
| antsoccer-medium | 12.8 ± 2.4 | 10.8 ± 4.2 |
| antsoccer-arena | 63.7 ± 3.1 | 58.5 ± 4.7 |
| cube-single | 32.8 ± 3.1 | 20.5 ± 6.0 |
| cube-double | 23.4 ± 4.5 | 8.2 ± 3.1 |
| scene | 70.5 ± 5.2 | 49.5 ± 7.2 |

Table 4: Success rates (%) comparing Co-train (jointly trained representations) against Pre-train (frozen representations). Results reported as mean ± 95% confidence interval; values within 95% of the best per row are bolded.

We explored two ways of integrating CARL into HIQL: pretraining, where the subgoal representation is learned with CARL and then finetuned with low-level policy gradients from HIQL, and co-training, where CARL’s objective is optimized jointly with HIQL’s value and policy losses. Pretraining keeps CARL isolated from HIQL’s value learning: the encoder shapes the low-level policy’s input but does not interact with the value function. Co-training instead shares the subgoal representation across the low-level policy and value function, fully integrating skill reuse into the algorithm. Pretraining alone improved performance on most state-based tasks but regressed on humanoidmaze tasks. Co-training, by contrast, improved results across all tasks (Table [4](https://arxiv.org/html/2605.26371#A3.T4)), with the largest gains on longer-horizon tasks. We attribute this to the shared subgoal representation benefiting both the policy and value function, and view it as further evidence for the importance of integrated skill-reuse abstractions in HRL.
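As a rough illustration of co-training, the joint objective can be read as HIQL’s losses plus a weighted CARL term. The sketch below assumes a plain weighted sum with the auxiliary weight $\lambda_{aux}$ from Appendix B; the function name, the argument names, and the split of HIQL’s objective into three scalar terms are hypothetical and are not taken from the OGBench codebase.

```python
def co_training_loss(value_loss, high_policy_loss, low_policy_loss,
                     carl_loss, lambda_aux):
    """Combine HIQL's losses with the CARL auxiliary term into one objective.

    Assumption: the terms are summed, with CARL scaled by lambda_aux.
    """
    hiql_loss = value_loss + high_policy_loss + low_policy_loss
    return hiql_loss + lambda_aux * carl_loss
```

Under this reading, gradients from both the HIQL terms and the CARL term reach the shared subgoal representation in a single update, which is what distinguishes co-training from learning the representation in a separate earlier stage.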

### C.2 Surface Sampling vs Interior Sampling

| Task | Interior | Surface |
| --- | --- | --- |
| pointmaze-medium | 84.4 ± 3.1 | 71.4 ± 5.9 |
| pointmaze-large | 75.4 ± 3.1 | 9.9 ± 9.7 |
| pointmaze-giant | 64.5 ± 15.8 | 7.2 ± 5.9 |
| pointmaze-teleport | 31.4 ± 14.0 | 26.7 ± 6.1 |
| antmaze-medium | 97.9 ± 0.9 | 95.0 ± 4.7 |
| antmaze-large | 91.9 ± 2.4 | 92.4 ± 3.1 |
| antmaze-giant | 75.2 ± 3.8 | 71.8 ± 6.4 |
| antmaze-teleport | 41.4 ± 2.1 | 39.7 ± 5.7 |
| humanoidmaze-medium | 90.5 ± 2.4 | 89.3 ± 3.1 |
| humanoidmaze-large | 58.3 ± 3.8 | 52.6 ± 5.7 |
| humanoidmaze-giant | 27.2 ± 3.1 | 21.5 ± 3.3 |
| antsoccer-medium | 12.8 ± 2.4 | 11.4 ± 2.4 |
| antsoccer-arena | 63.7 ± 3.1 | 56.9 ± 4.0 |
| cube-single | 32.8 ± 3.1 | 28.5 ± 3.8 |
| cube-double | 23.4 ± 4.5 | 15.7 ± 4.3 |
| cube-triple | 15.2 ± 3.5 | 9.9 ± 4.0 |
| cube-quadruple | 0.1 ± 0.2 | 0.3 ± 0.5 |
| puzzle-3x3 | 45.5 ± 7.3 | 47.9 ± 8.0 |
| puzzle-4x4 | 35.2 ± 7.1 | 38.2 ± 5.0 |
| puzzle-4x5 | 4.2 ± 2.4 | 6.5 ± 2.1 |
| puzzle-4x6 | 3.6 ± 2.1 | 4.0 ± 1.4 |
| scene | 70.5 ± 5.2 | 71.6 ± 3.5 |
| win-rate | **15/22** | 7/22 |

Table 5: Success rates (%) comparing CARL representations trained with Interior and Surface sampling techniques. Results report mean ± 95% confidence interval over 8 seeds. Win-rates for each sampling technique are reported, with the best win-rate bolded.

Given a tuple $s_t, \mathbf{a_k}, s_{t+k}$, one simple sampling technique is to form positive examples by pairing $(s_t, s_{t+k})$ with $\mathbf{a_k}$. We call this surface sampling, because it samples state-goal pairs that lie on the surface of a $k$-step walk from $s_t$. However, we can also form positive pairs from $(s_t, s_{t+i})$, where $i \leq k$, because these shorter state pairs are still achieved with a portion of the longer action sequence. This also benefits the low-level policy, which may have to reason about state-goal pairs that are fewer than $k$ steps apart. We call this interior sampling, because it samples state-goal pairs that lie in the interior of the $k$-step walk from $s_t$. Empirically, we find that interior sampling performs better on average, obtaining a win-rate of 15/22 as shown in Table [5](https://arxiv.org/html/2605.26371#A3.T5).
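The difference between the two sampling schemes can be sketched on a single trajectory. This is an illustrative sketch, not the authors' data loader: it assumes $i$ is drawn uniformly from $\{1, \dots, k\}$ for interior sampling, and the function name `sample_positive` and its arguments are hypothetical.

```python
import random

def sample_positive(states, actions, t, k, interior=True, rng=random):
    """Return ((s_t, goal), action_sequence) as one positive pair.

    states has one more entry than actions; actions[j] moves states[j]
    to states[j + 1].
    """
    if t + k >= len(states):
        raise ValueError("segment runs past the end of the trajectory")
    action_seq = actions[t:t + k]  # the full k-step action sequence a_k
    # Surface: the goal is exactly k steps away.
    # Interior: the goal is i <= k steps away (i uniform is an assumption).
    i = rng.randint(1, k) if interior else k
    return (states[t], states[t + i]), action_seq
```

In both cases the state-goal pair is matched with the same $k$-step action sequence; only the choice of goal along the walk changes.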

### C.3 Training HIQL with larger representation dimensions

HIQL Variants

| Task | HIQL+CARL | HIQL (rep_dim=100) | HIQL |
| --- | --- | --- | --- |
| pointmaze-medium | 84.4 ± 3.1 | 65.5 ± 7.3 | 79 ± 4.3 |
| pointmaze-large | 75.4 ± 3.1 | 60.1 ± 10.2 | 58 ± 4.3 |
| pointmaze-giant | 64.5 ± 15.8 | 21.7 ± 6.9 | 46 ± 7.6 |
| pointmaze-teleport | 31.4 ± 14.0 | 16.0 ± 4.5 | 18 ± 3.3 |
| antmaze-medium | 97.9 ± 0.9 | 96.5 ± 1.7 | 96 ± 0.9 |
| antmaze-large | 91.9 ± 2.4 | 89.9 ± 3.3 | 91 ± 1.7 |
| antmaze-giant | 75.2 ± 3.8 | 68.8 ± 5.4 | 65 ± 4.3 |
| antmaze-teleport | 41.4 ± 2.1 | 45.6 ± 6.6 | 42 ± 2.6 |
| humanoidmaze-medium | 90.5 ± 2.4 | 87.6 ± 4.5 | 89 ± 1.7 |
| humanoidmaze-large | 58.3 ± 3.8 | 52.9 ± 5.2 | 49 ± 3.3 |
| humanoidmaze-giant | 27.2 ± 3.1 | 22.8 ± 4.5 | 12 ± 3.3 |
| antsoccer-medium | 12.8 ± 2.4 | 11.7 ± 2.8 | 13 ± 1.7 |
| antsoccer-arena | 63.7 ± 3.1 | 57.9 ± 3.1 | 58 ± 1.7 |
| cube-single | 32.8 ± 3.1 | 19.9 ± 4.0 | 15 ± 2.6 |
| cube-double | 23.4 ± 4.5 | 4.5 ± 2.4 | 6 ± 1.7 |
| cube-triple | 15.2 ± 3.5 | 3.3 ± 1.2 | 3 ± 0.9 |
| cube-quadruple | 0.1 ± 0.2 | 0.3 ± 0.5 | 0 ± 0.0 |
| puzzle-3x3 | 45.5 ± 7.3 | 63.4 ± 8.3 | 12 ± 1.7 |
| puzzle-4x4 | 35.2 ± 7.1 | 4.3 ± 2.6 | 7 ± 1.7 |
| puzzle-4x5 | 4.2 ± 2.4 | 5.2 ± 1.9 | 4 ± 0.9 |
| puzzle-4x6 | 3.6 ± 2.1 | 2.5 ± 1.7 | 3 ± 0.9 |
| scene | 70.5 ± 5.2 | 41.0 ± 2.6 | 38 ± 2.6 |
| win-rate | **17/22** | 4/22 | 1/22 |

HGCBC Variants

| Task | HGCBC+CARL | HGCBC (rep_dim=100) | HGCBC |
| --- | --- | --- | --- |
| pointmaze-medium | 13.6 ± 1.8 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| pointmaze-large | 19.5 ± 3.8 | 0.3 ± 0.2 | 0.4 ± 0.3 |
| pointmaze-giant | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| pointmaze-teleport | 29.7 ± 1.8 | 12.6 ± 1.1 | 16.7 ± 1.2 |
| antmaze-medium | 73.1 ± 1.5 | 60.7 ± 2.2 | 59.8 ± 1.6 |
| antmaze-large | 64.9 ± 0.9 | 48.0 ± 0.9 | 59.3 ± 1.9 |
| antmaze-giant | 18.4 ± 1.3 | 4.4 ± 1.1 | 8.5 ± 0.8 |
| antmaze-teleport | 38.5 ± 1.3 | 34.6 ± 2.1 | 36.9 ± 1.9 |
| humanoidmaze-medium | 44.4 ± 1.3 | 40.8 ± 1.8 | 32.4 ± 0.9 |
| humanoidmaze-large | 31.0 ± 0.8 | 19.8 ± 1.7 | 22.6 ± 0.8 |
| humanoidmaze-giant | 20.0 ± 1.1 | 12.0 ± 1.3 | 9.5 ± 1.3 |
| antsoccer-medium | 7.0 ± 0.4 | 4.9 ± 1.0 | 5.9 ± 0.7 |
| antsoccer-arena | 24.9 ± 1.6 | 15.2 ± 1.2 | 12.5 ± 0.8 |
| cube-single | 6.9 ± 0.7 | 5.6 ± 1.1 | 4.4 ± 0.8 |
| cube-double | 1.3 ± 0.3 | 1.3 ± 0.5 | 1.1 ± 0.3 |
| cube-triple | 2.1 ± 0.6 | 0.0 ± 0.0 | 0.4 ± 0.3 |
| cube-quadruple | 0.1 ± 0.1 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| puzzle-3x3 | 6.0 ± 0.9 | 4.0 ± 0.8 | 4.1 ± 0.3 |
| puzzle-4x4 | 5.9 ± 0.9 | 0.3 ± 0.2 | 0.3 ± 0.2 |
| puzzle-4x5 | 1.6 ± 0.6 | 1.1 ± 0.5 | 0.4 ± 0.2 |
| puzzle-4x6 | 0.6 ± 0.3 | 0.6 ± 0.2 | 0.3 ± 0.2 |
| scene | 15.8 ± 1.0 | 7.7 ± 1.3 | 8.5 ± 1.1 |
| win-rate | **19/19** | 0/19 | 0/19 |

Table 6: Success rates (%) comparing CARL-augmented HRL algorithms against their 100-dimensional representation variants and default variants. Results report mean ± 95% confidence interval over 8 seeds, with the performance within 95% of best bolded. For win-rate calculations, we ignore ties and bold the highest win-rate.

We use a different representation dimension than Park et al. ([2023](https://arxiv.org/html/2605.26371#bib.bib5)), who do not ablate this choice. To verify that our gains do not come from the change in representation dimension alone, Table [6](https://arxiv.org/html/2605.26371#A3.T6) compares our CARL-augmented HRL algorithms against both their default versions and variants with a 100-dimensional representation. Increasing the representation dimension alone yields mixed results: HIQL (rep_dim=100) shows degraded performance on most pointmaze environments and improves only on puzzle-3x3 and cube-single. This confirms that the increased capacity alone cannot explain our performance gains; CARL’s representation structure is essential for the consistent improvements we see across nearly every task.

### C.4 K-step Sensitivity Analysis

| Environment | k=5 | k=10 | k=25 | k=50 | k=75 |
| --- | --- | --- | --- | --- | --- |
| pointmaze-medium | 76.50 ± 10.8 | **80.75 ± 3.2** | **84.00 ± 0.7** | **81.00 ± 3.4** | 72.75 ± 4.9 |
| pointmaze-large | 45.75 ± 17.7 | 61.00 ± 13.2 | 70.00 ± 6.8 | **88.25 ± 5.5** | **88.75 ± 4.8** |
| pointmaze-giant | 22.50 ± 15.4 | 47.50 ± 14.1 | 62.50 ± 13.3 | 67.75 ± 12.4 | **72.25 ± 9.5** |
| pointmaze-teleport | 24.00 ± 8.5 | 26.50 ± 4.0 | 28.50 ± 7.8 | 27.75 ± 6.8 | **36.25 ± 6.4** |
| antmaze-medium | 89.75 ± 0.8 | **95.75 ± 1.4** | **96.50 ± 1.1** | 90.75 ± 2.5 | 90.75 ± 3.4 |
| antmaze-large | 51.75 ± 3.2 | 74.75 ± 1.4 | **86.50 ± 6.5** | **86.00 ± 3.5** | 68.25 ± 1.7 |
| antmaze-giant | 3.75 ± 2.1 | 38.75 ± 6.6 | **73.00 ± 2.5** | 57.75 ± 4.7 | 26.25 ± 6.4 |
| antmaze-teleport | 36.25 ± 4.5 | 41.50 ± 6.1 | 40.75 ± 2.7 | **48.00 ± 3.1** | 43.25 ± 5.8 |
| antsoccer-medium | 6.25 ± 2.8 | 4.75 ± 2.5 | **12.25 ± 1.7** | **12.25 ± 1.1** | 6.50 ± 0.5 |
| antsoccer-arena | **58.75 ± 6.2** | 53.25 ± 3.6 | **58.75 ± 5.6** | **60.00 ± 6.3** | **59.00 ± 4.0** |

Table 7: Sensitivity to horizon length $k$ on locomotion environments. Success rates (%) averaged over 8 seeds with 95% confidence intervals reported. Results within 95% of the best are bolded.

| Environment | k=25 | k=75 | k=100 | k=125 | k=150 |
| --- | --- | --- | --- | --- | --- |
| humanoidmaze-medium | **88.00 ± 4.4** | **85.00 ± 1.7** | **87.00 ± 3.0** | **86.67 ± 5.4** | **87.33 ± 4.6** |
| humanoidmaze-large | 44.67 ± 5.6 | **48.67 ± 2.7** | **48.50 ± 4.8** | **49.33 ± 6.2** | **50.00 ± 5.2** |
| humanoidmaze-giant | 18.00 ± 4.7 | **26.67 ± 2.4** | 22.75 ± 3.9 | **28.00 ± 5.9** | 21.67 ± 0.5 |

Table 8: Sensitivity to horizon length $k$ on humanoidmaze environments. Success rates (%) averaged over 8 seeds with 95% confidence intervals reported. Results within 95% of the best are bolded.

| Environment | k=3 | k=5 | k=10 | k=15 | k=30 |
| --- | --- | --- | --- | --- | --- |
| cube-single | 27.50 ± 7.8 | 29.25 ± 3.8 | **36.00 ± 3.1** | 31.00 ± 1.0 | 30.75 ± 5.2 |
| cube-double | 11.50 ± 7.0 | 15.75 ± 3.0 | **21.00 ± 5.2** | **21.00 ± 5.1** | 9.75 ± 0.4 |
| cube-triple | 1.25 ± 0.8 | 3.25 ± 2.2 | 5.75 ± 2.4 | 8.75 ± 1.4 | **10.50 ± 4.0** |
| cube-quadruple | 0.00 ± 0.0 | 0.00 ± 0.0 | **0.25 ± 0.4** | 0.00 ± 0.0 | **0.25 ± 0.4** |
| scene | **74.75 ± 6.5** | 66.75 ± 1.3 | 70.75 ± 1.7 | 71.00 ± 7.5 | 50.25 ± 3.6 |
| puzzle-3x3 | 8.50 ± 2.2 | 11.75 ± 3.0 | 31.75 ± 5.5 | **50.50 ± 10.7** | 40.00 ± 12.0 |
| puzzle-4x4 | 31.00 ± 1.2 | 32.00 ± 3.6 | 34.00 ± 1.2 | **38.00 ± 2.6** | 29.25 ± 2.8 |
| puzzle-4x5 | 6.00 ± 1.7 | **6.75 ± 1.6** | 3.75 ± 0.4 | 4.00 ± 2.0 | 2.75 ± 0.8 |
| puzzle-4x6 | **4.75 ± 2.5** | 4.25 ± 1.3 | 3.25 ± 1.3 | 3.25 ± 1.6 | 1.75 ± 1.3 |

Table 9: Sensitivity to horizon length $k$ on manipulation environments. Success rates (%) averaged over 8 seeds with 95% confidence intervals reported. Results within 95% of the best are bolded.

We provide a sensitivity analysis for our $k$-step hyperparameter in Table [7](https://arxiv.org/html/2605.26371#A3.T7) for locomotion environments (except humanoidmaze), Table [8](https://arxiv.org/html/2605.26371#A3.T8) for humanoidmaze, and Table [9](https://arxiv.org/html/2605.26371#A3.T9) for manipulation environments. For some environments, we observe better performance with larger $k$ (pointmaze-large, pointmaze-giant, and humanoidmaze-large). We also observe the opposite trend, where performance drops off once $k$ becomes too large (puzzle-3x3, cube-double, and scene). Other environments are largely robust to the chosen value of the horizon $k$.
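For readers comparing values of $k$ within a row, the bolding rule stated in the captions of Tables 7 to 9 can be read as a threshold on the row’s best mean. The sketch below assumes the comparison is inclusive and uses only the means, not the confidence intervals; the function name is hypothetical.

```python
def within_95_percent_of_best(row):
    """Return one flag per entry: True if the mean is at least 95% of the row's best."""
    best = max(row)
    return [value >= 0.95 * best for value in row]

# Means from the pointmaze-medium row of Table 7 (k = 5, 10, 25, 50, 75).
flags = within_95_percent_of_best([76.50, 80.75, 84.00, 81.00, 72.75])
```

For this row the threshold is 0.95 × 84.00 = 79.8, so the entries for k = 10, 25, and 50 are flagged, which agrees with the bolding in Table 7.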

### C.5 Data Sensitivity Analysis

| Method | Environment | 0% Coverage | 25% Coverage | 50% Coverage | 75% Coverage |
| --- | --- | --- | --- | --- | --- |
| CARL + HIQL | pointmaze-medium | 10.7 ± 30.5 | 76.3 ± 12.9 | 80.7 ± 3.0 | **85.3 ± 14.6** |
| CARL + HIQL | antmaze-medium | 11.0 ± 24.1 | **95.7 ± 1.3** | **99.7 ± 1.3** | **98.0 ± 5.2** |
| CARL + HIQL | humanoidmaze-medium | 40.0 ± 9.0 | **92.7 ± 11.2** | **88.3 ± 5.2** | **89.3 ± 1.3** |
| HIQL | pointmaze-medium | 12.7 ± 26.7 | 62.7 ± 28.4 | **72.0 ± 25.0** | **69.7 ± 12.0** |
| HIQL | antmaze-medium | 0.0 ± 0.0 | **93.0 ± 6.5** | **96.7 ± 3.9** | **95.0 ± 4.3** |
| HIQL | humanoidmaze-medium | 33.0 ± 19.8 | **89.0 ± 13.3** | **85.0 ± 9.9** | **89.3 ± 1.3** |

Table 10: Robustness to reduced dataset coverage. Success rates (%) are reported as mean ± 95% confidence interval over 3 seeds. Results within 95% of the best in each row are bolded.

| Method | Environment | 25% Removed | 50% Removed | 75% Removed | 100% Removed |
| --- | --- | --- | --- | --- | --- |
| CARL + HIQL | pointmaze-medium | **81.0 ± 1.7** | **81.0 ± 2.6** | 59.7 ± 18.1 | 19.7 ± 1.3 |
| CARL + HIQL | antmaze-medium | **99.3 ± 1.3** | **96.3 ± 7.3** | **96.7 ± 3.0** | 18.0 ± 0.0 |
| CARL + HIQL | humanoidmaze-medium | **90.3 ± 7.3** | **89.0 ± 4.3** | **86.7 ± 1.3** | 16.0 ± 5.2 |
| HIQL | pointmaze-medium | **66.0 ± 5.2** | **63.7 ± 16.8** | 36.3 ± 11.6 | 19.0 ± 4.3 |
| HIQL | antmaze-medium | **96.0 ± 2.6** | **95.3 ± 8.6** | **95.3 ± 1.3** | 15.0 ± 4.3 |
| HIQL | humanoidmaze-medium | **88.3 ± 5.2** | **87.0 ± 9.9** | **84.0 ± 6.5** | 21.0 ± 4.3 |

Table 11: Robustness to dataset imbalance induced by removing action sequences in the left half of the maze that move downwards. Success rates (%) are reported as mean ± 95% confidence interval over 3 seeds. Results within 95% of the best in each row are bolded.

In this section, we analyze the sensitivity of CARL to two types of data imbalance. First, to test performance under low coverage, we remove portions of the data from the left half of the maze. Specifically, we retain 0, 25, 50, and 75% of the data in the left half of the maze, measure the success rate of a policy learned with CARL + HIQL under each condition, and compare it to that of HIQL. We report results in Table [10](https://arxiv.org/html/2605.26371#A3.T10) and find that CARL + HIQL shows a drop in performance similar to that of HIQL as coverage decreases. This suggests that CARL improves the performance of HIQL while making data sensitivity no worse than the baseline.

Second, we test performance under dataset imbalance. We remove a percentage of the action sequences in the left half of the maze that move downwards, biasing the action sequence distribution. We report results for this experiment in Table [11](https://arxiv.org/html/2605.26371#A3.T11). As in the first case, we find that CARL + HIQL’s drop in performance as dataset imbalance increases closely tracks that of HIQL. The similar robustness trends between CARL + HIQL and HIQL suggest that dataset quality impacts performance primarily through the core HRL components rather than the representation. While data diversity and coverage should in principle affect CARL’s performance, these results indicate that CARL does not exacerbate existing data sensitivity.
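The imbalance experiment can be illustrated with a small filtering routine. This is only a sketch under stated assumptions: a segment counts as being in the left half if its start satisfies x < x_mid, and as moving downwards if its net y displacement is negative. Neither criterion is specified in the text, and the function name and segment format are hypothetical.

```python
import random

def remove_downward_left(segments, x_mid, fraction, seed=0):
    """Drop `fraction` of the segments that start in the left half and move down.

    Each segment is a (start_xy, end_xy) pair of (x, y) tuples.
    """
    def is_target(seg):
        (x0, y0), (_, y1) = seg
        return x0 < x_mid and y1 < y0  # assumed: left half, net downward motion

    targets = [i for i, seg in enumerate(segments) if is_target(seg)]
    rng = random.Random(seed)
    n_drop = round(fraction * len(targets))
    dropped = set(rng.sample(targets, n_drop))
    # Segments outside the targeted region and direction are always kept.
    return [seg for i, seg in enumerate(segments) if i not in dropped]
```

Setting `fraction` to 0.25, 0.5, 0.75, or 1.0 would correspond to the four removal levels in Table 11.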
