R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning
Summary
Proposes R2R2, a regularization method for self-predictive learning in reinforcement learning to mitigate overfitting under high update-to-data ratios, achieving significant improvements on continuous control tasks.
# Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning
Source: [https://arxiv.org/html/2605.14026](https://arxiv.org/html/2605.14026)
###### Abstract
For reinforcement learning in data-scarce domains like real-world robotics, intensive data reuse enhances efficiency but induces overfitting. While prior works focus on critic bias, representation-level instability in Self-Predictive Learning (SPL) under high Update-to-Data (UTD) regimes remains underexplored. To bridge this gap, we propose Robust Representation via Redundancy Reduction (R2R2), a regularization method within SPL. We theoretically identify that standard zero-centering conflicts with SPL's spectral properties and design a non-centered objective accordingly. We verify R2R2 on SPL-native algorithms like TD7. Furthermore, to demonstrate its orthogonality to prior advancements, we extend the state-of-the-art SimbaV2, which originally lacks SPL, by integrating a tailored SPL module, termed SimbaV2-SPL. Experiments across 11 continuous control tasks confirm that R2R2 effectively mitigates overfitting; specifically, at a UTD ratio of 20, it improves TD7 by $\sim$22% and provides additional gains on top of SimbaV2-SPL, which itself establishes a new state-of-the-art. The code can be found at: [https://github.com/songsang7/R2R2](https://github.com/songsang7/R2R2)
Machine Learning, ICML
## 1 Introduction
Figure 1: Performance comparison across UTD ratios 1, 10, and 20. Our method demonstrates robustness, with minimal loss or even gains in high UTD regimes. Notably, our proposed SimbaV2-SPL outperforms the current state-of-the-art, SimbaV2 (Lee et al., [2025b](https://arxiv.org/html/2605.14026#bib.bib5)), and R2R2 achieves further performance gains on top of this enhanced baseline.

Historically, reinforcement learning (RL) has evolved to improve sample efficiency. Off-policy methods (Mnih et al., [2015](https://arxiv.org/html/2605.14026#bib.bib16); Lillicrap et al., [2016](https://arxiv.org/html/2605.14026#bib.bib1); Fujimoto et al., [2018](https://arxiv.org/html/2605.14026#bib.bib9); Haarnoja et al., [2018a](https://arxiv.org/html/2605.14026#bib.bib3), [b](https://arxiv.org/html/2605.14026#bib.bib4)) have achieved this by reusing past transitions via an experience replay buffer, while model-based RL methods (Sutton, [1990](https://arxiv.org/html/2605.14026#bib.bib18); Janner et al., [2019](https://arxiv.org/html/2605.14026#bib.bib17); Ha and Schmidhuber, [2018](https://arxiv.org/html/2605.14026#bib.bib19)) have done so by generating synthetic experiences from a learned model. Subsequently, Self-Predictive Learning (SPL) (Schwarzer et al., [2021](https://arxiv.org/html/2605.14026#bib.bib7); van den Oord et al., [2018](https://arxiv.org/html/2605.14026#bib.bib33)) has emerged as a promising framework, employing an auxiliary task to predict the latent representation of the next state conditioned on the current state and action. SPL maximizes data efficiency by providing additional information about environmental dynamics for actor-critic training. Parallel to these advancements, another approach for improving sample efficiency is to intensively reuse collected experiences by increasing the Update-to-Data (UTD) ratio, defined as the number of gradient updates per environment interaction.

To enable such intensive updates while mitigating overfitting, most prior works operate within the model-free paradigm, primarily focusing on addressing value estimation bias. Prominent methods such as REDQ (Chen et al., [2021](https://arxiv.org/html/2605.14026#bib.bib20)) and CrossQ (Bhatt et al., [2024](https://arxiv.org/html/2605.14026#bib.bib21)) have successfully stabilized training by using ensembles or normalization.
However, these value\-centric advancements primarily target the critic’s output and do not explicitly address the representation\-level overfitting\. Consequently, the intersection of high UTD training and SPL remains underexplored\. We argue that addressing representation\-level instability is an orthogonal challenge to mitigating value bias; while high UTD regimes induce overfitting across all components, existing value\-centric strategies are insufficient to prevent the representational degradation within the SPL encoder and latent dynamics model\.
To address this, we propose Robust Representation via Redundancy Reduction (R2R2), a regularization approach incorporating redundancy reduction principles (Barlow and others, [1961](https://arxiv.org/html/2605.14026#bib.bib24); Zbontar et al., [2021](https://arxiv.org/html/2605.14026#bib.bib23); Bardes et al., [2022](https://arxiv.org/html/2605.14026#bib.bib14)) to stabilize SPL performance across varying update frequencies, ranging from standard to high UTD settings ($\text{UTD}\approx 20$). Fundamentally, we diverge from standard redundancy reduction methods (Zbontar et al., [2021](https://arxiv.org/html/2605.14026#bib.bib23); Bardes et al., [2022](https://arxiv.org/html/2605.14026#bib.bib14)) in computer vision by avoiding zero-centering. Building on the theoretical insight that SPL performs spectral decomposition of transition dynamics (Tang et al., [2023](https://arxiv.org/html/2605.14026#bib.bib15)), we identify that centering mathematically eliminates the dominant eigenvector corresponding to the global dynamics information. Consequently, R2R2 employs a non-centered regularization scheme, preserving this information while ensuring robust feature learning even with intensive experience reuse. Crucially, since our approach targets the regularization of the encoder rather than the value function, our representation-level intervention is orthogonal to, and thus naturally compatible with, existing value-centric high UTD strategies.

To verify the efficacy and universality of R2R2, we first applied it to SPL-native algorithms such as TD7 (Fujimoto et al., [2023](https://arxiv.org/html/2605.14026#bib.bib10)). On standard benchmarks, our method effectively mitigates performance degradation at high UTD ($\text{UTD}=20$), and improves the aggregate normalized score of TD7 by approximately 22%. This advantage is even more pronounced in DMC-Hard (Lee et al., [2025a](https://arxiv.org/html/2605.14026#bib.bib22)), a challenging subset of the standard DMC suite (Tassa et al., [2018](https://arxiv.org/html/2605.14026#bib.bib27), [2020](https://arxiv.org/html/2605.14026#bib.bib41)), where R2R2 significantly boosts the baseline performance (1.02 → 1.32). Furthermore, to demonstrate its orthogonality to architectural advancements, we targeted the state-of-the-art SimbaV2 (Lee et al., [2025b](https://arxiv.org/html/2605.14026#bib.bib5)). Since our approach necessitates an SPL framework, we first integrated a tailored SPL module into this architecture, termed SimbaV2-SPL. This integration alone establishes a new state-of-the-art performance on continuous control benchmarks. Crucially, applying R2R2 to this enhanced baseline yields further gains, reaching a normalized score of 1.38, thereby confirming that R2R2 provides complementary benefits even to the strongest architectures.
In summary, the contributions of this paper are:
- (Method) We propose R2R2, a regularization approach grounded in our theoretical analysis. By leveraging redundancy reduction principles, our method explicitly mitigates representation-level instability caused by intensive experience reuse in high UTD regimes.
- (Theory) We provide a theoretical analysis identifying a fundamental conflict between the spectral decomposition properties of SPL and feature centralization (zero-centering). We demonstrate that zero-centering eliminates the dominant eigenmode representing global dynamics.
- (Architecture) We construct SimbaV2-SPL by integrating a tailored SPL framework into the state-of-the-art SimbaV2. Distinct from our proposed regularization, this architectural extension bridges the gap between latent dynamics modeling and modern model-free backbones.
- (Performance) We demonstrate that R2R2 generally improves various backbone algorithms, particularly under high UTD regimes. Furthermore, our proposed architecture, SimbaV2-SPL, alone establishes a new state-of-the-art performance on continuous control benchmarks, and on top of this enhanced baseline, SimbaV2-SPL + R2R2 achieves additional performance gains.
## 2 Related Works
### 2.1 Update-to-Data Ratio
Standard off-policy algorithms (Mnih et al., [2015](https://arxiv.org/html/2605.14026#bib.bib16); Lillicrap et al., [2016](https://arxiv.org/html/2605.14026#bib.bib1); Fujimoto et al., [2018](https://arxiv.org/html/2605.14026#bib.bib9); Haarnoja et al., [2018a](https://arxiv.org/html/2605.14026#bib.bib3), [b](https://arxiv.org/html/2605.14026#bib.bib4)) typically limit the Update-to-Data (UTD) ratio to 1 to avoid overfitting caused by intensive data reuse. However, recent trends favor high UTD ratios (e.g., $\text{UTD}\approx 20$). To mitigate this overfitting, Dyna-style approaches augment training with synthetic transitions (Sutton, [1990](https://arxiv.org/html/2605.14026#bib.bib18); Janner et al., [2019](https://arxiv.org/html/2605.14026#bib.bib17); Voelcker et al., [2025](https://arxiv.org/html/2605.14026#bib.bib35)), while model-free methods utilize randomized ensembles (Chen et al., [2021](https://arxiv.org/html/2605.14026#bib.bib20)) or successfully incorporate Batch Normalization (Bhatt et al., [2024](https://arxiv.org/html/2605.14026#bib.bib21); Ioffe and Szegedy, [2015](https://arxiv.org/html/2605.14026#bib.bib38)) to mitigate bias. More recently, attention has shifted toward architectural innovations; SimBa (Lee et al., [2025a](https://arxiv.org/html/2605.14026#bib.bib22)) and BRO (Nauman et al., [2024](https://arxiv.org/html/2605.14026#bib.bib36)) demonstrate that regularization techniques like Layer Normalization (Ba et al., [2016](https://arxiv.org/html/2605.14026#bib.bib34)) and Dropout (Srivastava et al., [2014](https://arxiv.org/html/2605.14026#bib.bib37)) can rival model-based efficiency. Building on this, SimbaV2 (Lee et al., [2025b](https://arxiv.org/html/2605.14026#bib.bib5)) further scaled network capacity and refined normalization, establishing a new state-of-the-art.
Despite these advancements, prior works primarily target instability through value function ensembles or architectural constraints\. In contrast, the specific issue of representation\-level degradation presents an orthogonal challenge that has received relatively limited attention\.
### 2.2 Self-Supervised Learning
Self-Supervised Learning (SSL) has evolved from contrastive methods like SimCLR (Chen et al., [2020](https://arxiv.org/html/2605.14026#bib.bib28)) to non-contrastive asymmetric approaches such as BYOL (Grill et al., [2020](https://arxiv.org/html/2605.14026#bib.bib13)) and SimSiam (Chen and He, [2021](https://arxiv.org/html/2605.14026#bib.bib12)). Distinct from these architectural solutions, Redundancy Reduction (Barlow and others, [1961](https://arxiv.org/html/2605.14026#bib.bib24))-based methods, including Barlow Twins (Zbontar et al., [2021](https://arxiv.org/html/2605.14026#bib.bib23)) and VICReg (Bardes et al., [2022](https://arxiv.org/html/2605.14026#bib.bib14)), prevent collapse by explicitly decorrelating feature dimensions. We leverage these principles to learn robust state representations in overfitting-prone RL scenarios.
### 2.3 Self-Predictive Learning
To enhance sample efficiency, prior works utilized auxiliary objectives ranging from reconstruction (Yarats et al., [2021](https://arxiv.org/html/2605.14026#bib.bib32); Jaderberg et al., [2017](https://arxiv.org/html/2605.14026#bib.bib31)) to contrastive learning (van den Oord et al., [2018](https://arxiv.org/html/2605.14026#bib.bib33); Laskin et al., [2020](https://arxiv.org/html/2605.14026#bib.bib30)). Moving beyond these, Self-Predictive Learning (SPL) focuses on capturing latent temporal dynamics rather than static features. Prominent implementations include SPR (Schwarzer et al., [2021](https://arxiv.org/html/2605.14026#bib.bib7)), which adapts BYOL (Grill et al., [2020](https://arxiv.org/html/2605.14026#bib.bib13)), and TD7 (Fujimoto et al., [2023](https://arxiv.org/html/2605.14026#bib.bib10)), which employs a SimSiam (Chen and He, [2021](https://arxiv.org/html/2605.14026#bib.bib12))-style framework. Recently, theoretical studies (Tang et al., [2023](https://arxiv.org/html/2605.14026#bib.bib15); Ni et al., [2024](https://arxiv.org/html/2605.14026#bib.bib11)) have further generalized the SPL framework, offering insights into its spectral properties.
## 3 Preliminaries
### 3.1 Reinforcement Learning Problem
We address the reinforcement learning problem within the framework of a Markov Decision Process (MDP), formalized as a tuple $(\mathcal{S},\mathcal{A},P,R,\gamma)$ (Sutton and Barto, [2018](https://arxiv.org/html/2605.14026#bib.bib6)). The goal of an agent is to learn a policy $\pi$ that maximizes the expected discounted cumulative return. In the context of SPL, we focus on learning an informative latent representation $\phi:\mathcal{S}\rightarrow\mathbb{R}^{k}$ that encapsulates essential information about the transition dynamics $P$, which serves as a basis for sample-efficient learning.
### 3.2 Spectral Perspective on Self-Predictive Learning
SPL typically involves training an encoder $\phi$ alongside a latent transition predictor $\mathcal{T}$. The objective is to minimize the discrepancy between the predicted and actual future representations. Formally, for a transition tuple $(s_{t},a_{t},s_{t+1})$, the SPL loss ($\mathcal{L}_{\text{SPL}}$) is given by:

$$\mathcal{L}_{\text{SPL}}=\mathbb{E}_{(s_{t},a_{t},s_{t+1})\sim\mathcal{D}}\left[\|\mathcal{T}(\phi(s_{t}),a_{t})-\text{sg}(\phi(s_{t+1}))\|_{2}^{2}\right], \tag{1}$$

where $\text{sg}(\cdot)$ indicates the stop-gradient operation, a critical component for training stability.
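As a rough illustration of Eq. (1), the following NumPy sketch evaluates the SPL loss on a random batch. The linear encoder `W_enc`, the linear predictor `W_pred`, and all dimensions are hypothetical stand-ins chosen for brevity, not the networks used in the paper. NumPy has no autograd, so the stop-gradient is only indicated by a comment: the target is treated as a constant.

```python
import numpy as np

def spl_loss(z_pred, z_next):
    """Batch mean of the squared L2 distance between predicted and target latents (Eq. 1)."""
    diff = z_pred - z_next
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Hypothetical toy setup: linear encoder and linear latent predictor.
rng = np.random.default_rng(0)
state_dim, action_dim, k, N = 6, 2, 4, 32
W_enc = rng.normal(size=(state_dim, k))
W_pred = rng.normal(size=(k + action_dim, k))

s = rng.normal(size=(N, state_dim))
a = rng.normal(size=(N, action_dim))
s_next = rng.normal(size=(N, state_dim))

z = s @ W_enc                                      # phi(s_t)
z_next = s_next @ W_enc                            # sg(phi(s_{t+1})): constant target
z_pred = np.concatenate([z, a], axis=1) @ W_pred   # T(phi(s_t), a_t)
loss = spl_loss(z_pred, z_next)
```

In an autograd framework, the stop-gradient would correspond to detaching `z_next` from the computation graph before computing the loss.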
Connection to Spectral Decomposition. Recent theoretical advancements have established a link between the learning dynamics of SPL and the spectral decomposition of the state transition matrix. Consider the representation matrix $\Phi\in\mathbb{R}^{|\mathcal{S}|\times k}$, where each row corresponds to the embedding $\phi(s)$ of a state $s$. Specifically, Tang et al. ([2023](https://arxiv.org/html/2605.14026#bib.bib15)) demonstrate that minimizing the SPL objective implicitly maximizes a trace functional involving the transition matrix $P^{\pi}$:

$$\max_{\Phi}\ \text{Tr}\left((\Phi^{\top}P^{\pi}\Phi)^{\top}(\Phi^{\top}P^{\pi}\Phi)\right)\quad\text{s.t.}\quad\Phi^{\top}\Phi=I_{k}. \tag{2}$$

The optimal representation $\Phi^{*}$ derived from this objective spans the subspace defined by the top-$k$ right eigenvectors of $P^{\pi}$. A fundamental property of a Markov chain with a row-stochastic transition matrix $P$ is that the eigenvalue $1$ always exists. In particular, the constant vector $\mathbf{1}$ is a right eigenvector satisfying $P\mathbf{1}=\mathbf{1}$, reflecting conservation of probability mass. This constant eigenvector aligns with the global bias. This observation implies that preserving the constant mode is needed for capturing the global dynamics.
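The eigenvalue property can be checked numerically. The sketch below builds an arbitrary random row-stochastic matrix (an illustrative assumption, not a transition matrix from the paper) and confirms that the all-ones vector is a right eigenvector with eigenvalue 1, which is also the largest eigenvalue in magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 5

# Random non-negative matrix, normalized so that every row sums to 1.
P = rng.random((n_states, n_states))
P = P / P.sum(axis=1, keepdims=True)

ones = np.ones(n_states)
P_times_ones = P @ ones            # equals the row sums, i.e. the ones vector

eigvals = np.linalg.eigvals(P)
spectral_radius = np.max(np.abs(eigvals))   # 1 for a row-stochastic matrix
```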
### 3.3 Redundancy Reduction in Self-Supervised Learning
To learn robust representations without collapse, redundancy reduction methods impose explicit constraints on the statistical properties of the embeddings. Among these approaches, VICReg (Bardes et al., [2022](https://arxiv.org/html/2605.14026#bib.bib14)) is particularly effective as it decouples the learning objective into three independent regularization terms: Variance, Invariance, and Covariance.

For a formal definition, let $Z\in\mathbb{R}^{N\times d}$ denote a batch of feature vectors, where $N$ is the batch size and $d$ is the feature dimension. We denote $z_{b,\cdot}$ as the $b$-th sample vector and $z_{\cdot,j}$ as the vector of the $j$-th dimension across the batch. The components are defined as follows:

- Variance ($\mathcal{L}_{\text{Var}}$): Prevents representation collapse by keeping the standard deviation of each feature dimension above a threshold $v_{th}$, computed along the batch dimension:

  $$\mathcal{L}_{\text{Var}}(Z)=\frac{1}{d}\sum_{j=1}^{d}\max\left(0,\,v_{th}-\sqrt{\text{Var}(z_{\cdot,j})}\right), \tag{3}$$

  where $\text{Var}(z_{\cdot,j})$ is the variance of the $j$-th dimension.
- Invariance ($\mathcal{L}_{\text{Inv}}$): Minimizes the mean squared error between the representations of two views ($Z^{A},Z^{B}$) to learn invariant features against augmentations:

  $$\mathcal{L}_{\text{Inv}}(Z^{A},Z^{B})=\frac{1}{N}\sum_{b=1}^{N}\|z^{A}_{b,\cdot}-z^{B}_{b,\cdot}\|_{2}^{2}. \tag{4}$$

- Covariance ($\mathcal{L}_{\text{Cov}}$): Decorrelates the feature dimensions by penalizing the off-diagonal coefficients of the covariance matrix. Crucially, the covariance matrix $\text{Cov}(Z)$ is typically centered:

  $$\mathcal{L}_{\text{Cov}}(Z)=\frac{1}{d}\sum_{i\neq j}\left(\frac{1}{N-1}\sum_{b=1}^{N}(z_{b,i}-\mu_{i})(z_{b,j}-\mu_{j})\right)^{2},\quad\text{where }\mu_{k}=\mathbb{E}_{b}[z_{b,k}]. \tag{5}$$

The total objective is a weighted sum: $\mathcal{L}_{\text{VIC}}=\lambda_{\text{Inv}}\mathcal{L}_{\text{Inv}}+\lambda_{\text{Var}}\mathcal{L}_{\text{Var}}+\lambda_{\text{Cov}}\mathcal{L}_{\text{Cov}}$, where each $\lambda$ denotes a balancing coefficient. Notably, the standard $\mathcal{L}_{\text{Cov}}$ relies on zero-centering, thereby treating the feature mean as a bias to be eliminated.
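The three terms in Eqs. (3) to (5) can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference VICReg code: the function names are hypothetical, and the use of the unbiased variance estimator (`ddof=1`) in the variance term is an assumption, since Eq. (3) does not specify the estimator.

```python
import numpy as np

def variance_loss(Z, v_th=1.0):
    """Eq. (3): hinge on the per-dimension standard deviation over the batch."""
    std = np.sqrt(Z.var(axis=0, ddof=1))   # ddof=1 is an assumption
    return float(np.mean(np.maximum(0.0, v_th - std)))

def invariance_loss(Z_a, Z_b):
    """Eq. (4): batch mean of the squared L2 distance between two views."""
    return float(np.mean(np.sum((Z_a - Z_b) ** 2, axis=1)))

def covariance_loss(Z):
    """Eq. (5): squared off-diagonal entries of the centered covariance, scaled by 1/d."""
    N, d = Z.shape
    Zc = Z - Z.mean(axis=0, keepdims=True)   # zero-centering step
    cov = Zc.T @ Zc / (N - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / d)

rng = np.random.default_rng(0)
Z = rng.normal(size=(64, 8))
# Because of the centering step, a constant offset leaves the covariance loss unchanged.
same_after_shift = np.isclose(covariance_loss(Z), covariance_loss(Z + 5.0))
```

The last line makes the closing remark of this section concrete: the centered covariance term is blind to the feature mean.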
## 4 Method
In this section, we introduce our approach for robust representation learning in high UTD regimes. We start by analyzing the spectral conflict that arises from applying conventional zero-centering to latent dynamics modeling (Sec. [4.1](https://arxiv.org/html/2605.14026#S4.SS1)). This analysis reveals that standard redundancy reduction, which enforces feature centering, is unsuitable for capturing environmental dynamics. Guided by this insight, we propose objective refinements, namely omitting the projector and removing the centering constraint (Sec. [4.2](https://arxiv.org/html/2605.14026#S4.SS2)). Furthermore, to extend our method to the state-of-the-art SimbaV2 (Lee et al., [2025b](https://arxiv.org/html/2605.14026#bib.bib5)), which originally lacks an SPL component, we introduce an augmented architecture termed SimbaV2-SPL (Sec. [4.3](https://arxiv.org/html/2605.14026#S4.SS3)).
### 4.1 Conflict between Zero-Centering and SPL
To formally demonstrate the conflict between SPL and zero\-centering, we first define the batch\-wise mean subtraction operation in matrix form and identify its property\.
###### Lemma 1 (Centering Matrix).
For a batch of size $N$, the zero-centering operation can be represented as a linear transformation by the centering matrix $H=I_{N}-\frac{1}{N}\mathbf{1}\mathbf{1}^{\top}$, where $I_{N}$ is the identity matrix and $\mathbf{1}$ is a column vector of ones. For any constant vector $\mathbf{c}=c\mathbf{1}$ (where $c\in\mathbb{R}$), the orthogonality condition holds:

$$H\mathbf{c}=\mathbf{0}. \tag{6}$$
The detailed proof is provided in Appendix[A\.1](https://arxiv.org/html/2605.14026#A1.SS1)\.
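The appendix proof is not reproduced in this section. For orientation, the identity follows from a standard one-line expansion, using only the definition of $H$ and $\mathbf{1}^{\top}\mathbf{1}=N$:

```latex
H\mathbf{c}
  = \left(I_{N}-\tfrac{1}{N}\mathbf{1}\mathbf{1}^{\top}\right)c\,\mathbf{1}
  = c\,\mathbf{1}-\tfrac{c}{N}\,\mathbf{1}\left(\mathbf{1}^{\top}\mathbf{1}\right)
  = c\,\mathbf{1}-\tfrac{c}{N}\,N\,\mathbf{1}
  = \mathbf{0}.
```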
This lemma implies that the centering operation removes any signal comprised of a constant component. Building on this, we show that centering explicitly eliminates the dominant eigenmode from the representation matrix $\Phi$, which is critical information captured by SPL.
###### Proposition 2 (Elimination of Constant Eigenmode).
Let $\Phi^{*}$ be the optimal representation spanning the principal subspace of the row-stochastic transition matrix $P^{\pi}$. The zero-centering operation $H$ eliminates the projection of $\Phi^{*}$ onto the constant vector $u_{1}$ (corresponding to the dominant mode), i.e.,

$$\|H\Phi^{*}_{\text{proj},u_{1}}\|_{2}=0. \tag{7}$$
The detailed proof is provided in Appendix[A\.2](https://arxiv.org/html/2605.14026#A1.SS2)\.
Proposition [2](https://arxiv.org/html/2605.14026#Thmtheorem2) suggests that even if the representation $\Phi$ successfully aligns with $u_{1}$ to capture the global information of transition dynamics, the application of zero-centering mathematically erases this information.
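A small numerical check illustrates the effect described by Lemma 1 and Proposition 2. The feature matrix below is a synthetic assumption: a constant component (the same row repeated for every sample) plus a zero-mean varying component. Applying the centering matrix `H` removes the constant part and leaves only the varying part.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 16, 3

# Centering matrix H = I - (1/N) 1 1^T from Lemma 1.
H = np.eye(N) - np.ones((N, N)) / N

# Synthetic features: constant component plus zero-mean varying component.
const_part = np.ones((N, 1)) @ np.array([[2.0, -1.0, 0.5]])
varying_part = rng.normal(size=(N, d))
varying_part -= varying_part.mean(axis=0, keepdims=True)
Z = const_part + varying_part

Z_centered = H @ Z   # identical to subtracting the batch mean from each column
```

Here `Z_centered` matches `varying_part`, so any information carried along the all-ones direction is no longer visible to a loss computed on the centered features.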
While the neural network’s bias parameters could theoretically recover this eliminated component, relying on such implicit adaptation imposes an optimization disadvantage\. To avoid this structural inefficiency, we adopt a non\-centered regularization scheme that explicitly preserves the spectral information of SPL\.
### 4.2 Dominant-Mode Preserving Regularization
Figure 2: Overview of R2R2. The figure illustrates our proposed framework where direct regularization is applied to the latent representation $z_{t}$. This explicitly stabilizes feature learning.

Guided by the analysis in Sec. [4.1](https://arxiv.org/html/2605.14026#S4.SS1), we propose a modified redundancy reduction scheme that explicitly excludes mean subtraction to preserve the spectral information of SPL. Regarding the SPL loss, Eq. ([1](https://arxiv.org/html/2605.14026#S3.E1)), we posit that an explicit redundancy reduction scheme based on VICReg (Bardes et al., [2022](https://arxiv.org/html/2605.14026#bib.bib14)) is better suited to preserve the superior properties of the original SPL than Barlow Twins (Zbontar et al., [2021](https://arxiv.org/html/2605.14026#bib.bib23)). Consequently, we introduce the following modifications to the original VICReg formulation:
Non-centered Redundancy Reduction. Instead of the standard covariance loss, we employ a non-centered inner product form to explicitly avoid eliminating the global information. We formally denote this objective as the Redundancy Reduction loss ($\mathcal{L}_{\text{RR}}$). It is computed using the non-centered correlation matrix in Eq. ([8](https://arxiv.org/html/2605.14026#S4.E8)) and normalized by the total number of off-diagonal elements, $d(d-1)$, to ensure that the magnitude remains invariant to the feature dimension size.

$$\mathcal{L}_{\text{RR}}=\frac{1}{d(d-1)}\sum_{i\neq j}\left([C(Z)]_{ij}\right)^{2},\quad\text{where }[C(Z)]_{ij}=\frac{1}{N-1}\sum_{b=1}^{N}z_{b,i}z_{b,j}. \tag{8}$$

Direct Regularization. We apply regularization directly to the output of the encoder without an additional projector module. This design choice is grounded in our theoretical analysis, which provides a precise understanding of SPL dynamics, indicating that the auxiliary projector is redundant.
Based on these modifications, our final objective, $\mathcal{L}_{\text{R2R2}}$, explicitly enforces redundancy reduction, SPL dynamics, and feature uniformity (Eq. ([9](https://arxiv.org/html/2605.14026#S4.E9))):

$$\mathcal{L}_{\text{R2R2}}(Z)=\mathcal{L}_{\text{SPL}}(Z)+\lambda_{\text{RR}}\mathcal{L}_{\text{RR}}(Z)+\lambda_{\text{Var}}\mathcal{L}_{\text{Var}}(Z). \tag{9}$$
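The objective in Eqs. (8) and (9) can be sketched as follows. This is a NumPy illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the variance term uses the unbiased estimator (`ddof=1`), and the default coefficients (0.01) and threshold (1) are the values reported in the Training paragraph of Sec. 5. No gradients are computed here.

```python
import numpy as np

def rr_loss(Z):
    """Eq. (8): squared off-diagonal entries of the non-centered matrix C(Z)."""
    N, d = Z.shape
    C = Z.T @ Z / (N - 1)                   # no mean subtraction
    off_diag = C - np.diag(np.diag(C))
    return float(np.sum(off_diag ** 2) / (d * (d - 1)))

def variance_loss(Z, v_th=1.0):
    """Eq. (3): hinge on the per-dimension standard deviation (ddof=1 is an assumption)."""
    std = np.sqrt(Z.var(axis=0, ddof=1))
    return float(np.mean(np.maximum(0.0, v_th - std)))

def r2r2_loss(z_pred, z_next, Z, lam_rr=0.01, lam_var=0.01, v_th=1.0):
    """Eq. (9): SPL loss plus the weighted RR and variance regularizers on Z."""
    spl = float(np.mean(np.sum((z_pred - z_next) ** 2, axis=1)))
    return spl + lam_rr * rr_loss(Z) + lam_var * variance_loss(Z, v_th)

rng = np.random.default_rng(0)
Z = rng.normal(size=(64, 4))
shifted = Z + 5.0   # same features with a large constant offset
```

Unlike a centered covariance penalty, `rr_loss` responds to the feature mean: `rr_loss(shifted)` is much larger than `rr_loss(Z)`, so the mean is part of what the regularizer sees rather than something subtracted away beforehand.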
Algorithm 1: Training Procedure of R2R2

- Input: UTD ratio $G$, batch size $N$, regularization coefficients $\lambda_{\text{RR}},\lambda_{\text{Var}}$
- Initialize: encoder $\phi$, predictor $\mathcal{T}$, actor $\pi$, critic $Q$, replay buffer $\mathcal{D}$
- for each environment step $t$ do
  - Observe state $s_{t}$, select action $a_{t}\sim\pi(\phi(s_{t}))$
  - Execute $a_{t}$, observe reward $r_{t}$, next state $s_{t+1}$
  - Store transition $(s_{t},a_{t},r_{t},s_{t+1})$ in $\mathcal{D}$
  - ▷ High UTD Update Loop
  - for $u=1$ to $G$ do
    - Sample batch $B\sim\mathcal{D}$
    - ▷ 1. Self-Predictive Learning Block
    - Encode states: $Z\leftarrow\phi(s),\ Z^{\prime}\leftarrow\text{sg}(\phi(s^{\prime}))$
    - $\mathcal{L}_{\text{SPL}}\leftarrow$ SPL loss using $\mathcal{T}(Z,a)$ and $Z^{\prime}$ (Eq. ([1](https://arxiv.org/html/2605.14026#S3.E1)))
    - $\mathcal{L}_{\text{RR}}\leftarrow$ RR loss on $Z$ (Eq. ([8](https://arxiv.org/html/2605.14026#S4.E8)))
    - $\mathcal{L}_{\text{Var}}\leftarrow$ Variance loss on $Z$ (Eq. ([3](https://arxiv.org/html/2605.14026#S3.E3)))
    - $\mathcal{L}_{\text{R2R2}}\leftarrow\mathcal{L}_{\text{SPL}}+\lambda_{\text{RR}}\mathcal{L}_{\text{RR}}+\lambda_{\text{Var}}\mathcal{L}_{\text{Var}}$ (Eq. ([9](https://arxiv.org/html/2605.14026#S4.E9)))
    - Update encoder $\phi$ and predictor $\mathcal{T}$ using $\nabla\mathcal{L}_{\text{R2R2}}$
    - ▷ 2. Reinforcement Learning Block (Base Algo.)
    - Update actor $\pi$ and critic $Q$, using latent state $Z$ (following base algorithm)
  - end for
- end for
### 4.3 Architecture: SimbaV2-SPL
Figure 3: SimbaV2-SPL. We augment the backbone with a tailored SPL module (encoder $\phi$, predictor $\mathcal{T}$). The Actor and Critic networks are adapted to align with the SimbaV2 architecture, ensuring seamless integration of latent representations.

Our proposed regularization scheme is designed to be algorithm-agnostic, assuming only the existence of SPL, thereby offering universal applicability. To demonstrate this, we introduce SimbaV2-SPL, which integrates the tailored SPL framework into SimbaV2 (Lee et al., [2025b](https://arxiv.org/html/2605.14026#bib.bib5)), the current state-of-the-art model in continuous control benchmarks. Since SimbaV2 is originally designed as a purely model-free architecture, it lacks an explicit mechanism to learn environmental dynamics. We address this by augmenting the architecture with an additional encoder and a transition predictor, trained specifically under the SPL framework. Specifically, rather than employing a generic encoder architecture typical of prior methods, such as TD7 (Fujimoto et al., [2023](https://arxiv.org/html/2605.14026#bib.bib10)), we align our design with the distinct architectural philosophy of SimbaV2. Crucially, to ensure the preservation of raw information, particularly high-frequency details, that might be lost during encoding, we maintain the original state as a parallel input to the actor-critic networks, rather than entirely replacing it with the latent representation $z$. To integrate this input while adhering to SimbaV2's specific constraints, the state (or state-action pair) is first linearly projected and subsequently normalized using an $L_{2}$ norm. This processed input is then concatenated with the latent representation $z$. This integrated architecture, termed SimbaV2-SPL, serves as our enhanced baseline and is illustrated in Fig. [3](https://arxiv.org/html/2605.14026#S4.F3).
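The input path described above (linear projection, $L_2$ normalization, concatenation with $z$) can be sketched as follows. Everything specific here is an assumption made for illustration: the function names, the dimensions, the presence of a bias term, the small `eps` added for numerical safety, and the order of concatenation are not specified in the text.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit L2 norm (eps guards against division by zero)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / (norm + eps)

def build_actor_input(state, z, W_proj, b_proj):
    """Project the raw state, L2-normalize it, then concatenate with the latent z."""
    projected = state @ W_proj + b_proj
    return np.concatenate([l2_normalize(projected), z], axis=-1)

# Hypothetical sizes for a toy batch.
rng = np.random.default_rng(0)
N, state_dim, proj_dim, k = 8, 17, 32, 16
state = rng.normal(size=(N, state_dim))
z = rng.normal(size=(N, k))                 # stands in for phi(state)
W_proj = rng.normal(size=(state_dim, proj_dim))
b_proj = np.zeros(proj_dim)

x = build_actor_input(state, z, W_proj, b_proj)   # shape (N, proj_dim + k)
```

For the critic, the same sketch would apply with the state-action pair in place of `state`.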
## 5 Experiments
In this section, we validate the effectiveness and universality of our proposed method. We design our experiments to verify six key aspects: (1) robustness in high UTD regimes, (2) independence from algorithmic specifics, (3) distinctness from architectural normalization (Layer Normalization; Ba et al., [2016](https://arxiv.org/html/2605.14026#bib.bib34)), (4) complementarity with state-of-the-art architectures, (5) the validity of our design choices via ablation studies, and (6) the spectral analysis of the learned representations through the singular value spectrum.

Environments. We utilize a total of 11 environments. From OpenAI Gym MuJoCo (Brockman et al., [2016](https://arxiv.org/html/2605.14026#bib.bib26); Todorov et al., [2012](https://arxiv.org/html/2605.14026#bib.bib25)), we select 4 environments: Ant-v5, Walker2d-v5, Hopper-v5, and Humanoid-v5. From the DeepMind Control Suite (DMC) (Tassa et al., [2018](https://arxiv.org/html/2605.14026#bib.bib27), [2020](https://arxiv.org/html/2605.14026#bib.bib41)), we use the subset known as DMC-Hard (Lee et al., [2025a](https://arxiv.org/html/2605.14026#bib.bib22)), which comprises 7 environments: Humanoid-{Walk, Run, Stand} and Dog-{Trot, Walk, Stand, Run}.

Metric. To capture sample efficiency, we measure the average return over the final 20% of training steps, following Machado et al. ([2018](https://arxiv.org/html/2605.14026#bib.bib29)). To allow for aggregation, we compute normalized scores relative to the corresponding baseline trained at $\text{UTD}=1$ (see Appendix [D](https://arxiv.org/html/2605.14026#A4) for mathematical details). We report aggregated results here, while individual task performance is provided in Appendix [E](https://arxiv.org/html/2605.14026#A5).
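A rough sketch of this metric is given below. The exact normalization is defined in Appendix D, which is not part of this section, so the plain ratio to the UTD=1 baseline is an assumption, as are the function names and the premise that evaluation returns are logged at evenly spaced training steps.

```python
import numpy as np

def final_window_score(returns, frac=0.2):
    """Average return over the final `frac` of evenly spaced evaluation points."""
    returns = np.asarray(returns, dtype=float)
    n_tail = max(1, int(np.ceil(frac * len(returns))))
    return float(returns[-n_tail:].mean())

def normalized_score(returns, baseline_returns_utd1):
    """Ratio of the final-window score to that of the UTD=1 baseline (assumed form)."""
    return final_window_score(returns) / final_window_score(baseline_returns_utd1)

# Toy curves: ten evaluation points each, so the final 20% is the last two points.
baseline_curve = list(range(1, 11))              # last two points: 9, 10
method_curve = [2 * r for r in baseline_curve]   # twice the baseline return everywhere
score = normalized_score(method_curve, baseline_curve)
```

Under this assumed form, a score of 1.0 means parity with the same algorithm trained at UTD=1.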
Baselines\.To rigorously evaluate the efficacy of our proposed regularization across varying levels of algorithmic complexity, we utilize the following baselines:
- TD7 (Fujimoto et al., [2023](https://arxiv.org/html/2605.14026#bib.bib10)): A representative SPL-native algorithm, selected to demonstrate robustness within a standard latent-dynamics framework.
- Minimalist $\phi$ (Ni et al., [2024](https://arxiv.org/html/2605.14026#bib.bib11)): A simplified SPL variant, chosen to verify that our performance gains are fundamental to the SPL objective rather than artifacts of complex auxiliary mechanisms.
- TD7 + Layer Normalization (TD7+LN) (Ba et al., [2016](https://arxiv.org/html/2605.14026#bib.bib34)): This variant incorporates Layer Normalization within the encoder to demonstrate that the performance improvements of R2R2 are distinct from and independent of standard architectural normalization.
- SimbaV2 (Lee et al., [2025b](https://arxiv.org/html/2605.14026#bib.bib5)): The current state-of-the-art (SOTA) architecture in continuous control. Although it inherently lacks an SPL framework, we select it as the strongest available backbone. By integrating a tailored SPL module (termed SimbaV2-SPL), we aim to demonstrate that our method is orthogonal to architectural advancements and provides complementary gains.
We omit purely value\-centric high\-UTD algorithms, such as CrossQ\(Bhattet al\.,[2024](https://arxiv.org/html/2605.14026#bib.bib21)\), from direct comparison for two reasons\. First, as discussed in Sec\.[2](https://arxiv.org/html/2605.14026#S2), these methods address Q\-function bias, which presents an orthogonal challenge to our representation\-focused contribution\. Second, since SimbaV2 has already demonstrated superior performance over these approaches, we select it as the representative state\-of\-the\-art baseline\.
Training\. Training is conducted for a fixed budget of 500k decision steps\. Notably, we fix the regularization coefficients $\lambda_{\text{Var}}$ and $\lambda_{\text{RR}}$ to $0.01$, and the variance threshold $v_{th}$ to $1$, across all experiments without per\-task tuning\. All other hyperparameters follow the default settings of the base algorithms\. For more details, see Appendix [B](https://arxiv.org/html/2605.14026#A2)\.
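The shared settings above can be collected in a small configuration object\. This is only a sketch: the class and field names are hypothetical, and the update count assumes that every decision step triggers exactly UTD gradient updates, ignoring any warm\-up or random\-exploration phase the base algorithms may use\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class R2R2Setup:
    """Shared experiment settings quoted in the Training paragraph (names are hypothetical)."""
    decision_steps: int = 500_000
    lambda_var: float = 0.01
    lambda_rr: float = 0.01
    variance_threshold: float = 1.0

    def gradient_updates(self, utd: int) -> int:
        """Total updates, assuming each decision step triggers `utd` gradient updates."""
        return self.decision_steps * utd


setup = R2R2Setup()
for utd in (1, 10, 20):
    print(f"UTD={utd:>2}: {setup.gradient_updates(utd):,} gradient updates")
```

Under these assumptions, the UTD=20 runs perform twenty times as many gradient updates on the same 500k transitions, which is the intensive\-reuse regime the experiments target\.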
Table 1: Aggregated normalized performance comparison\. We report the aggregated mean scores with 95% confidence intervals \[Lower, Upper\] across the Gym MuJoCo \(Brockman et al\., [2016](https://arxiv.org/html/2605.14026#bib.bib26); Todorov et al\., [2012](https://arxiv.org/html/2605.14026#bib.bib25)\) and DMC\-Hard \(Tassa et al\., [2018](https://arxiv.org/html/2605.14026#bib.bib27), [2020](https://arxiv.org/html/2605.14026#bib.bib41)\) benchmarks\. The Total column represents the aggregate score across all 11 environments\. For SimbaV2 \(Lee et al\., [2025b](https://arxiv.org/html/2605.14026#bib.bib5)\), we explicitly show the performance gain from SPL and our proposed regularization method separately\.







Figure 4: Aggregated score curves of TD7 and R2R2\. Solid lines and shaded regions represent the mean and 95% confidence intervals, respectively\. Our approach significantly outperforms the baseline at UTD=20 while maintaining comparable performance at UTD=1\.

### 5\.1Robustness in High UTD Regimes
Our primary hypothesis is that adding our regularization term to SPL\-based algorithms enhances their robustness, mitigating potential degradation while unlocking further performance gains under high UTD settings\. As shown in the TD7 \(Fujimoto et al\., [2023](https://arxiv.org/html/2605.14026#bib.bib10)\) section of Table [1](https://arxiv.org/html/2605.14026#S5.T1) and Fig\. [4](https://arxiv.org/html/2605.14026#S5.F4)\-\(TD7\), R2R2 demonstrates remarkable robustness in the high UTD regime\. While the TD7 baseline maintains a normalized score of 1\.02 at UTD=20, our method significantly boosts this to 1\.24, achieving a 22% relative improvement\. This gain is particularly pronounced in the complex DMC\-Hard benchmark \(1\.02 → 1\.32\), indicating that the benefits of our regularization are magnified in challenging environments\.
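For reference, the relative improvements follow directly from the rounded normalized scores quoted above \(Total first, then DMC\-Hard\):

$$
\frac{1.24 - 1.02}{1.02} \approx 0.216 \;(\approx 22\%), \qquad \frac{1.32 - 1.02}{1.02} \approx 0.294 \;(\approx 29\%).
$$

Because these are computed from scores rounded to two decimals, the last digit may differ slightly from values computed on unrounded results\.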
### 5\.2Independence from Algorithmic Specifics
We aim to demonstrate that the gains observed in TD7 \(Fujimoto et al\., [2023](https://arxiv.org/html/2605.14026#bib.bib10)\) are not artifacts of its complex auxiliary techniques but are fundamental to the SPL framework\. As presented in the Minimalist $\phi$ \(Ni et al\., [2024](https://arxiv.org/html/2605.14026#bib.bib11)\) section of Table [1](https://arxiv.org/html/2605.14026#S5.T1) and Fig\. [4](https://arxiv.org/html/2605.14026#S5.F4)\-\(Minimalist $\phi$\), the gains persist: even in the Minimalist $\phi$ setting, which lacks the sophisticated auxiliary mechanics of TD7, our method improves the Total aggregated mean from 5\.28 to 6\.20 at UTD=20\. Notably, in the Gym MuJoCo \(Brockman et al\., [2016](https://arxiv.org/html/2605.14026#bib.bib26); Todorov et al\., [2012](https://arxiv.org/html/2605.14026#bib.bib25)\) benchmark, where the baseline suffers severe collapse \(dropping to 0\.41\), R2R2 limits the drop, securing a score of 0\.57\. This suggests that our regularization enhances the robustness of the latent dynamics learning process, regardless of the algorithmic architecture\. Additionally, at UTD=1, combining the minimalist baseline with R2R2 still yields a clear improvement in performance \(1\.00 → 2\.74\), indicating that our approach enriches the learned features even without intensive updates\.
We note, however, that because the minimalist baseline attains very low absolute returns due to its minimal structure, the normalized improvement ratio can appear inflated when the reference score is small\. To avoid overinterpreting this effect, we also provide the raw scores in Appendix [E](https://arxiv.org/html/2605.14026#A5)\.
### 5\.3Distinct Mechanism from Layer Normalization
A critical question is whether our approach operates via a mechanism distinct from standard architectural normalization\. As presented in the TD7\+LN \(Fujimoto et al\., [2023](https://arxiv.org/html/2605.14026#bib.bib10); Ba et al\., [2016](https://arxiv.org/html/2605.14026#bib.bib34)\) section of Table [1](https://arxiv.org/html/2605.14026#S5.T1) and Fig\. [4](https://arxiv.org/html/2605.14026#S5.F4)\-\(TD7\+LN\), architectural normalization alone is insufficient for high UTD robustness\. The TD7\+LN baseline, despite being a strong architectural variant, suffers from performance degradation at UTD=20, with the Total aggregated mean dropping to 0\.88 \(falling below its UTD=1 performance\)\. In contrast, applying R2R2 on top of LN not only recovers from this degradation but further unlocks performance gains, reaching a score of 1\.10\. This recovery suggests that R2R2 addresses a distinct form of representational collapse that Layer Normalization cannot resolve, supporting the orthogonality of the two approaches\.
### 5\.4Complementarity with Modern Architectures
We evaluated the compatibility of our method, R2R2, with SimbaV2 \(Lee et al\., [2025b](https://arxiv.org/html/2605.14026#bib.bib5)\), a state\-of\-the\-art architecture featuring extensive normalization\. As shown in Table [1](https://arxiv.org/html/2605.14026#S5.T1) and Fig\. [4](https://arxiv.org/html/2605.14026#S5.F4)\-\(SimbaV2\), integrating the tailored SPL framework alone \(\+SPL\) already surpasses the baseline, establishing a new SOTA score of 1\.34 at UTD=20\. Notably, the addition of our regularization \(\+SPL\+R2R2\) provides further gains, achieving a peak performance of 1\.38\.
While the margin of improvement \(\+0\.04\) may appear relatively modest compared to other baselines, we hypothesize that this is primarily due to a performance ceiling effect\. Given that the SimbaV2\(\+SPL\) backbone is already exceptionally strong across several environments, the room for further absolute gains is fundamentally limited\. Nevertheless, R2R2 provides a distinct, complementary benefit by preventing representational collapse\. As substantiated by our subsequent Effective Rank analysis \(Section [5\.7](https://arxiv.org/html/2605.14026#S5.SS7)\), preserving this structural integrity contributes to further performance gains even in highly saturated regimes\.
### 5\.5Ablation Study
We conduct two ablation studies on the Dog\-Trot environment from DMC \(Tassa et al\., [2018](https://arxiv.org/html/2605.14026#bib.bib27), [2020](https://arxiv.org/html/2605.14026#bib.bib41)\) to comprehensively validate our design choices\. We chose this high\-dimensional environment as it represents one of the most challenging tasks\.


Figure 5: Ablation Analysis\. \(Left\) Performance comparison regarding the zero\-centering constraint\. The results demonstrate that enforcing zero\-centering leads to performance degradation, confirming the importance of preserving global information\. \(Right\) Component\-wise analysis verifying the contribution of individual loss terms to the final performance\.

Impact of Zero\-Centering Constraint\. As shown in Fig\. [5](https://arxiv.org/html/2605.14026#S5.F5)\-\(Left\), comparing our framework with a zero\-centered variant reveals that enforcing zero\-centering degrades performance\. This constraint strips away principal components, forcing the neural network to inefficiently reconstruct lost information at each step\. In contrast, our non\-centered design, $\mathcal{L}_{\text{RR}}$, yields better performance by preserving these essential features\.
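The effect of zero\-centering on the spectrum can be seen on a toy example\. The sketch below builds a synthetic feature batch with a large shared offset \(hypothetical data, not latents from any of the evaluated agents\) and compares how much spectral energy the leading singular direction carries before and after subtracting the batch mean\.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent batch: a large shared offset plus small per-sample variation.
batch, dim = 256, 8
offset = np.full(dim, 3.0)
z = offset + 0.1 * rng.standard_normal((batch, dim))

# Zero-centering subtracts the batch mean from every feature dimension.
z_centered = z - z.mean(axis=0, keepdims=True)

s_raw = np.linalg.svd(z, compute_uv=False)
s_centered = np.linalg.svd(z_centered, compute_uv=False)

# Fraction of spectral energy carried by the leading singular direction.
top_share_raw = s_raw[0] ** 2 / np.sum(s_raw ** 2)
top_share_centered = s_centered[0] ** 2 / np.sum(s_centered ** 2)

print(f"raw: {top_share_raw:.3f}, centered: {top_share_centered:.3f}")
```

In this toy setting the dominant direction of the raw features is the shared mean, and centering removes it entirely, leaving only the small residual variation\. This mirrors, in a simplified form, the loss of the principal spectral mode that the ablation attributes to the zero\-centering constraint\.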
Contribution of Regularization Terms\. We assess the individual roles of $\mathcal{L}_{\text{Var}}$ and $\mathcal{L}_{\text{RR}}$ by ablating each term at UTD=20\. Fig\. [5](https://arxiv.org/html/2605.14026#S5.F5)\-\(Right\) confirms that the two terms are complementary\. While both ablations lead to performance degradation, the absence of $\mathcal{L}_{\text{RR}}$ causes a more severe drop\. This validates that preserving variance and decorrelating feature dimensions are both critical for achieving robust learning in high UTD settings\.
### 5\.6Spectral Analysis
To investigate how R2R2 shapes the learned representation, we analyze the singular value spectrum using TD7 \(Fujimoto et al\., [2023](https://arxiv.org/html/2605.14026#bib.bib10)\) as the base algorithm, focusing on Humanoid\-Stand, as it exhibits one of the most pronounced divergences between methods, thereby facilitating clear observation\. Ideal representations exhibit a power\-law decay: a steep initial drop indicates efficient compression of global dynamics, while a heavy tail ensures the preservation of fine\-grained local variations without subspace collapse\. Fig\. [6](https://arxiv.org/html/2605.14026#S5.F6) confirms that our regularization optimizes the latent structure towards this ideal profile\.


Figure 6: The singular value spectrum\. \(Left\) At UTD=1, R2R2 shows a steeper initial decay\. \(Right\) At UTD=20, the baseline suffers from spectral cutoff in the tail indices\.

Spectral Concentration at UTD=1\. In the standard regime, R2R2 shows a steeper initial decay compared to the baseline, resulting in a lower Effective Rank \(ER ≈ 65\.0 vs\. 76\.5\)\. This indicates spectral concentration, where the model filters out redundant signals to compress task\-relevant information into a compact set of principal components\.
Structural Integrity at UTD=20\. Conversely, in the high UTD regime, the baseline suffers from a sharp spectral cutoff, with singular values in the tail indices dropping rapidly to near\-zero\. This suggests a partial collapse where the model loses the capacity to capture fine\-grained features\. In contrast, R2R2 maintains a heavy\-tailed distribution, confirming that our method effectively prevents subspace collapse, ensuring the full utilization of the latent space to represent diverse dynamics\.
### 5\.7Effective Rank Monitoring
To further clarify the representation instability discussed in previous sections, we monitor the evolution of the Effective Rank \(ER\) throughout the training process\. To highlight the divergence most clearly, we conduct this analysis under the extreme setting of UTD=20 on the Humanoid\-Run environment, where our method demonstrated substantial performance gains\.
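This section does not restate how ER is computed\. A common entropy\-based definition, assumed here purely for illustration, takes the exponential of the Shannon entropy of the normalized singular values of a feature matrix\. The function name, the batch shapes, and the synthetic data below are hypothetical\.

```python
import numpy as np


def effective_rank(features, eps=1e-12):
    """exp(Shannon entropy) of the L1-normalized singular values of `features`."""
    s = np.linalg.svd(np.asarray(features, dtype=np.float64), compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero directions before taking the log
    return float(np.exp(-np.sum(p * np.log(p))))


rng = np.random.default_rng(0)

# Synthetic feature batches of shape (samples, latent_dim).
full = rng.standard_normal((512, 64))  # energy spread over all 64 dimensions
collapsed = rng.standard_normal((512, 4)) @ rng.standard_normal((4, 64))  # rank 4

er_full = effective_rank(full)
er_collapsed = effective_rank(collapsed)
print(f"full: {er_full:.1f}, collapsed: {er_collapsed:.1f}")
```

Under this definition ER is bounded above by the latent dimension and falls as singular values in the tail shrink toward zero, so a progressive drop in ER during training is a direct signature of the subspace collapse discussed above\. Logging it periodically on a batch of encoder outputs would yield curves of the kind monitored here\.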


Figure 7: Effective Rank \(ER\) over training steps\. Evaluated on Humanoid\-Run at UTD=20\. \(Left\) Comparison among the TD7 baseline, R2R2, and R2R2 with the zero\-centering constraint\. \(Right\) Comparison on the SimbaV2\+SPL backbone with and without R2R2, demonstrating complementarity\.

Fig\. [7](https://arxiv.org/html/2605.14026#S5.F7)\-\(Left\) tracks the ER for the TD7 baseline and our variants\. While the unregularized baseline suffers a progressive loss of dimensionality, R2R2 successfully maintains a stable and high ER\. Furthermore, when the zero\-centering constraint is applied alongside our regularization, the ER drops progressively, closely mirroring the collapse trajectory of the baseline\. This confirms that preserving non\-centered features is crucial for preventing subspace collapse\.
To substantiate the complementarity of our method with modern architectures, we also analyze the ER on the state\-of\-the\-art SimbaV2\+SPL backbone\. As shown in Fig\. [7](https://arxiv.org/html/2605.14026#S5.F7)\-\(Right\), even though SimbaV2 is equipped with extensive architectural normalizations, the integration of R2R2 still yields a noticeably higher and more stable ER throughout the learning process\. This empirical evidence indicates that R2R2 and architectural improvements contribute to representation learning in different ways: while SimbaV2 provides a strong structural capacity, R2R2 explicitly defends against representational collapse, acting as a complementary component for robust learning\.
### 5\.8Training Time
We measure the wall\-clock training time \(sec/step\) on Humanoid\-v5 at UTD=20\. This setting represents a worst\-case scenario for computational cost: the high observation dimension maximizes the overhead of our method, while the frequent updates ensure this cost dominates the environment simulation time\. We conduct all measurements on an Intel i7\-9800X CPU and an NVIDIA RTX 2080 Ti GPU\. Note that we implemented all SimbaV2 \(Lee et al\., [2025b](https://arxiv.org/html/2605.14026#bib.bib5)\) variants on top of the official JAX \(Bradbury et al\., [2018](https://arxiv.org/html/2605.14026#bib.bib40)\) implementation, whereas TD7 \(Fujimoto et al\., [2023](https://arxiv.org/html/2605.14026#bib.bib10)\) utilizes PyTorch \(Paszke et al\., [2019](https://arxiv.org/html/2605.14026#bib.bib39)\)\.
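The timing protocol itself is not spelled out here, so the following is only a sketch of one reasonable way to obtain a sec/step figure\. The helper name, the `sync` hook, and the dummy workload are hypothetical stand\-ins for a real agent's combined environment step and gradient updates\.

```python
import time


def seconds_per_step(step_fn, num_steps=100, warmup=10, sync=None):
    """Average wall-clock seconds per call to `step_fn`.

    `sync` is an optional callable that blocks until queued accelerator work
    has finished (hypothetical hook; frameworks that dispatch work
    asynchronously need one for the timing to be meaningful).
    """
    for _ in range(warmup):  # exclude compilation and cache warm-up
        step_fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / num_steps


# Stand-in for "one environment step followed by 20 gradient updates".
def dummy_step(utd=20):
    for _ in range(utd):
        sum(i * i for i in range(1_000))


sec = seconds_per_step(dummy_step, num_steps=20, warmup=2)
print(f"{sec:.6f} sec/step")
```

With a GPU backend, `sync` could wrap a call such as `torch.cuda.synchronize()` in PyTorch or `block_until_ready()` on an output array in JAX\. Because the two code bases use different frameworks, per\-step times are most meaningful when compared within the same framework\.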
Table 2: Measurements of training time\. Average wall\-clock training time \(sec/step\) measured with UTD=20\.
## 6Conclusion
In this work, we identified a fundamental conflict where standard zero\-centering undermines SPL by eliminating the principal spectral mode\. To resolve this, we proposed R2R2, a non\-centered redundancy reduction term designed to preserve this information\. Our evaluation validates R2R2 across four dimensions: 1\) Robustness: It mitigates high\-UTD degradation, significantly boosting the TD7 \(Fujimoto et al\., [2023](https://arxiv.org/html/2605.14026#bib.bib10)\) baseline\. 2\) Independence: Confirmed via Minimalist $\phi$ \(Ni et al\., [2024](https://arxiv.org/html/2605.14026#bib.bib11)\), verifying efficacy independent of auxiliary algorithmic components\. 3\) Distinctness: It addresses instability issues that Layer Normalization \(Ba et al\., [2016](https://arxiv.org/html/2605.14026#bib.bib34)\) cannot\. 4\) Complementarity: We constructed a new baseline, termed SimbaV2\-SPL, by enhancing the current SOTA, SimbaV2 \(Lee et al\., [2025b](https://arxiv.org/html/2605.14026#bib.bib5)\)\. Notably, R2R2 achieves additional gains even on top of this robust architecture \(SimbaV2\-SPL \+R2R2\)\. Furthermore, ablation studies and spectral analysis provide additional verification of our design choices and theoretical soundness\.
## 7Limitations and Future Work
While our method effectively mitigates degradation at high UTD, it does not fully eliminate instability in extreme regimes\. Another limitation is that, although we performed partial validation on pixel\-based visual RL during the review process, we have not yet carried out a full\-scale evaluation in that domain\. Since the current study mainly focuses on low\-dimensional states, extending and thoroughly validating the method in visual RL, where feature statistics are more complex, remains an important direction for future work\.
## Acknowledgements
This work was supported by Samsung Electro\-Mechanics\.
This work was supported by the Institute of Information & communications Technology Planning & Evaluation \(IITP\) grant funded by the Korea government \(MSIT\) \[No\. RS\-2021\-II211343, Artificial Intelligence Graduate School Program \(Seoul National University\)\]\.
This work was supported by the National Research Foundation of Korea \(NRF\) grant funded by the Korea government \(MSIT\) \(No\. 2022R1A3B1077720, 2022R1A5A7083908\)\.
This work was supported by the BK21 FOUR program of the Education and Research Program for Future ICT Pioneers, Seoul National University in 2026\.
## Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\.
## References
- R\. Agarwal, M\. Schwarzer, P\. S\. Castro, A\. C\. Courville, and M\. G\. Bellemare \(2021\) Deep reinforcement learning at the edge of the statistical precipice\. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6\-14, 2021, virtual, M\. Ranzato, A\. Beygelzimer, Y\. N\. Dauphin, P\. Liang, and J\. W\. Vaughan \(Eds\.\), pp\. 29304–29320\. [Link](https://proceedings.neurips.cc/paper/2021/hash/f514cec81cb148559cf475e7426eed5e-Abstract.html)
- L\. J\. Ba, J\. R\. Kiros, and G\. E\. Hinton \(2016\) Layer normalization\. CoRR abs/1607\.06450\. [Link](http://arxiv.org/abs/1607.06450)
- A\. Bardes, J\. Ponce, and Y\. LeCun \(2022\) VICReg: variance\-invariance\-covariance regularization for self\-supervised learning\. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25\-29, 2022\. [Link](https://openreview.net/forum?id=xm6YD62D1Ub)
- H\. B\. Barlow et al\. \(1961\) Possible principles underlying the transformation of sensory messages\. Sensory communication 1\(01\), pp\. 217–233\.
- A\. Bhatt, D\. Palenicek, B\. Belousov, M\. Argus, A\. Amiranashvili, T\. Brox, and J\. Peters \(2024\) CrossQ: batch normalization in deep reinforcement learning for greater sample efficiency and simplicity\. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024\. [Link](https://openreview.net/forum?id=PczQtTsTIX)
- J\. Bradbury, R\. Frostig, P\. Hawkins, M\. J\. Johnson, C\. Leary, D\. Maclaurin, G\. Necula, A\. Paszke, J\. VanderPlas, S\. Wanderman\-Milne, and Q\. Zhang \(2018\) JAX: composable transformations of Python\+NumPy programs\. [http://github\.com/jax\-ml/jax](http://github.com/jax-ml/jax), version 0\.3\.13\.
- G\. Brockman, V\. Cheung, L\. Pettersson, J\. Schneider, J\. Schulman, J\. Tang, and W\. Zaremba \(2016\) OpenAI Gym\. CoRR abs/1606\.01540\. [Link](http://arxiv.org/abs/1606.01540)
- T\. Chen, S\. Kornblith, M\. Norouzi, and G\. E\. Hinton \(2020\) A simple framework for contrastive learning of visual representations\. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13\-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol\. 119, pp\. 1597–1607\. [Link](http://proceedings.mlr.press/v119/chen20j.html)
- X\. Chen and K\. He \(2021\) Exploring simple siamese representation learning\. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19\-25, 2021, pp\. 15750–15758\. [Link](https://openaccess.thecvf.com/content/CVPR2021/html/Chen%5C_Exploring%5C_Simple%5C_Siamese%5C_Representation%5C_Learning%5C_CVPR%5C_2021%5C_paper.html), [Document](https://dx.doi.org/10.1109/CVPR46437.2021.01549)
- X\. Chen, C\. Wang, Z\. Zhou, and K\. W\. Ross \(2021\) Randomized ensembled double q\-learning: learning fast without a model\. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3\-7, 2021\. [Link](https://openreview.net/forum?id=AY8zfZm0tDd)
- S\. Fujimoto, W\. Chang, E\. J\. Smith, S\. Gu, D\. Precup, and D\. Meger \(2023\) For SALE: state\-action representation learning for deep reinforcement learning\. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023, A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\)\. [Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/c20ac0df6c213db6d3a930fe9c7296c8-Abstract-Conference.html)
- S\. Fujimoto, H\. van Hoof, and D\. Meger \(2018\) Addressing function approximation error in actor\-critic methods\. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10\-15, 2018, J\. G\. Dy and A\. Krause \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 80, pp\. 1582–1591\. [Link](http://proceedings.mlr.press/v80/fujimoto18a.html)
- J\. Grill, F\. Strub, F\. Altché, C\. Tallec, P\. H\. Richemond, E\. Buchatskaya, C\. Doersch, B\. Á\. Pires, Z\. Guo, M\. G\. Azar, B\. Piot, K\. Kavukcuoglu, R\. Munos, and M\. Valko \(2020\) Bootstrap your own latent \- A new approach to self\-supervised learning\. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6\-12, 2020, virtual, H\. Larochelle, M\. Ranzato, R\. Hadsell, M\. Balcan, and H\. Lin \(Eds\.\)\. [Link](https://proceedings.neurips.cc/paper/2020/hash/f3ada80d5c4ee70142b17b8192b2958e-Abstract.html)
- D\. Ha and J\. Schmidhuber \(2018\) Recurrent world models facilitate policy evolution\. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3\-8, 2018, Montréal, Canada, S\. Bengio, H\. M\. Wallach, H\. Larochelle, K\. Grauman, N\. Cesa\-Bianchi, and R\. Garnett \(Eds\.\), pp\. 2455–2467\. [Link](https://proceedings.neurips.cc/paper/2018/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html)
- T\. Haarnoja, A\. Zhou, P\. Abbeel, and S\. Levine \(2018a\) Soft actor\-critic: off\-policy maximum entropy deep reinforcement learning with a stochastic actor\. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10\-15, 2018, J\. G\. Dy and A\. Krause \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 80, pp\. 1856–1865\. [Link](http://proceedings.mlr.press/v80/haarnoja18b.html)
- T\. Haarnoja, A\. Zhou, K\. Hartikainen, G\. Tucker, S\. Ha, J\. Tan, V\. Kumar, H\. Zhu, A\. Gupta, P\. Abbeel, and S\. Levine \(2018b\) Soft actor\-critic algorithms and applications\. CoRR abs/1812\.05905\. [Link](http://arxiv.org/abs/1812.05905)
- S\. Ioffe and C\. Szegedy \(2015\) Batch normalization: accelerating deep network training by reducing internal covariate shift\. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6\-11 July 2015, F\. R\. Bach and D\. M\. Blei \(Eds\.\), JMLR Workshop and Conference Proceedings, Vol\. 37, pp\. 448–456\. [Link](http://proceedings.mlr.press/v37/ioffe15.html)
- M\. Jaderberg, V\. Mnih, W\. M\. Czarnecki, T\. Schaul, J\. Z\. Leibo, D\. Silver, and K\. Kavukcuoglu \(2017\) Reinforcement learning with unsupervised auxiliary tasks\. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24\-26, 2017, Conference Track Proceedings\. [Link](https://openreview.net/forum?id=SJ6yPD5xg)
- M\. Janner, J\. Fu, M\. Zhang, and S\. Levine \(2019\) When to trust your model: model\-based policy optimization\. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8\-14, 2019, Vancouver, BC, Canada, H\. M\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d’Alché\-Buc, E\. B\. Fox, and R\. Garnett \(Eds\.\), pp\. 12498–12509\. [Link](https://proceedings.neurips.cc/paper/2019/hash/5faf461eff3099671ad63c6f3f094f7f-Abstract.html)
- M\. Laskin, A\. Srinivas, and P\. Abbeel \(2020\) CURL: contrastive unsupervised representations for reinforcement learning\. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13\-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol\. 119, pp\. 5639–5650\. [Link](http://proceedings.mlr.press/v119/laskin20a.html)
- H\. Lee, D\. Hwang, D\. Kim, H\. Kim, J\. J\. Tai, K\. Subramanian, P\. R\. Wurman, J\. Choo, P\. Stone, and T\. Seno \(2025a\) SimBa: simplicity bias for scaling up parameters in deep reinforcement learning\. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025\. [Link](https://openreview.net/forum?id=jXLiDKsuDo)
- H\. Lee, Y\. Lee, T\. Seno, D\. Kim, P\. Stone, and J\. Choo \(2025b\) Hyperspherical normalization for scalable deep reinforcement learning\. In Forty\-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13\-19, 2025\. [Link](https://openreview.net/forum?id=kfYxyvCYQ4)
- T\. P\. Lillicrap, J\. J\. Hunt, A\. Pritzel, N\. Heess, T\. Erez, Y\. Tassa, D\. Silver, and D\. Wierstra \(2016\) Continuous control with deep reinforcement learning\. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2\-4, 2016, Conference Track Proceedings, Y\. Bengio and Y\. LeCun \(Eds\.\)\. [Link](http://arxiv.org/abs/1509.02971)
- M\. C\. Machado, M\. G\. Bellemare, E\. Talvitie, J\. Veness, M\. J\. Hausknecht, and M\. Bowling \(2018\) Revisiting the arcade learning environment: evaluation protocols and open problems for general agents \(extended abstract\)\. In Proceedings of the Twenty\-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13\-19, 2018, Stockholm, Sweden, J\. Lang \(Ed\.\), pp\. 5573–5577\. [Link](https://doi.org/10.24963/ijcai.2018/787), [Document](https://dx.doi.org/10.24963/IJCAI.2018/787)
- V\. Mnih, K\. Kavukcuoglu, D\. Silver, A\. A\. Rusu, J\. Veness, M\. G\. Bellemare, A\. Graves, M\. A\. Riedmiller, A\. Fidjeland, G\. Ostrovski, S\. Petersen, C\. Beattie, A\. Sadik, I\. Antonoglou, H\. King, D\. Kumaran, D\. Wierstra, S\. Legg, and D\. Hassabis \(2015\) Human\-level control through deep reinforcement learning\. Nature 518\(7540\), pp\. 529–533\. [Link](https://doi.org/10.1038/nature14236), [Document](https://dx.doi.org/10.1038/NATURE14236)
- M\. Nauman, M\. Ostaszewski, K\. Jankowski, P\. Milos, and M\. Cygan \(2024\) Bigger, regularized, optimistic: scaling for compute and sample efficient continuous control\. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024, A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\)\. [Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/cd3b5d2ed967e906af24b33d6a356cac-Abstract-Conference.html)
- T\. Ni, B\. Eysenbach, E\. Seyedsalehi, M\. Ma, C\. Gehring, A\. Mahajan, and P\. Bacon \(2024\) Bridging state and history representations: understanding self\-predictive RL\. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024\. [Link](https://openreview.net/forum?id=ms0VgzSGF2)
- A\. Paszke, S\. Gross, F\. Massa, A\. Lerer, J\. Bradbury, G\. Chanan, T\. Killeen, Z\. Lin, N\. Gimelshein, L\. Antiga, A\. Desmaison, A\. Köpf, E\. Z\. Yang, Z\. DeVito, M\. Raison, A\. Tejani, S\. Chilamkurthy, B\. Steiner, L\. Fang, J\. Bai, and S\. Chintala \(2019\) PyTorch: an imperative style, high\-performance deep learning library\. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8\-14, 2019, Vancouver, BC, Canada, H\. M\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d’Alché\-Buc, E\. B\. Fox, and R\. Garnett \(Eds\.\), pp\. 8024–8035\. [Link](https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html)
- M\. Schwarzer, A\. Anand, R\. Goel, R\. D\. Hjelm, A\. C\. Courville, and P\. Bachman \(2021\) Data\-efficient reinforcement learning with self\-predictive representations\. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3\-7, 2021\. [Link](https://openreview.net/forum?id=uCQfPZwRaUu)
- N\. Srivastava, G\. E\. Hinton, A\. Krizhevsky, I\. Sutskever, and R\. Salakhutdinov \(2014\) Dropout: a simple way to prevent neural networks from overfitting\. J\. Mach\. Learn\. Res\. 15\(1\), pp\. 1929–1958\. [Link](https://dl.acm.org/doi/10.5555/2627435.2670313), [Document](https://dx.doi.org/10.5555/2627435.2670313)
- R\. S\. Sutton and A\. G\. Barto \(2018\) Reinforcement learning \- an introduction, 2nd edition\. MIT Press\. [Link](http://www.incompleteideas.net/book/the-book-2nd.html)
- R\. S\. Sutton \(1990\) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming\. In Machine Learning, Proceedings of the Seventh International Conference on Machine Learning, Austin, Texas, USA, June 21\-23, 1990, B\. W\. Porter and R\. J\. Mooney \(Eds\.\), pp\. 216–224\. [Link](https://doi.org/10.1016/b978-1-55860-141-3.50030-4), [Document](https://dx.doi.org/10.1016/B978-1-55860-141-3.50030-4)
- Y\. Tang, Z\. D\. Guo, P\. H\. Richemond, B\. Á\. Pires, Y\. Chandak, R\. Munos, M\. Rowland, M\. G\. Azar, C\. L\. Lan, C\. Lyle, A\. György, S\. Thakoor, W\. Dabney, B\. Piot, D\. Calandriello, and M\. Valko \(2023\)Understanding self\-predictive learning for reinforcement learning\.InInternational Conference on Machine Learning, ICML 2023, 23\-29 July 2023, Honolulu, Hawaii, USA,A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\),Proceedings of Machine Learning Research, Vol\.202,pp\. 33632–33656\.External Links:[Link](https://proceedings.mlr.press/v202/tang23d.html)Cited by:[§A\.2](https://arxiv.org/html/2605.14026#A1.SS2.1.p1.1),[§1](https://arxiv.org/html/2605.14026#S1.p3.1),[§2\.3](https://arxiv.org/html/2605.14026#S2.SS3.p1.1),[§3\.2](https://arxiv.org/html/2605.14026#S3.SS2.p3.4)\.
- Y\. Tassa, Y\. Doron, A\. Muldal, T\. Erez, Y\. Li, D\. de Las Casas, D\. Budden, A\. Abdolmaleki, J\. Merel, A\. Lefrancq, T\. P\. Lillicrap, and M\. A\. Riedmiller \(2018\)DeepMind Control Suite\.CoRRabs/1801\.00690\.External Links:[Link](http://arxiv.org/abs/1801.00690),1801\.00690Cited by:[§B\.1](https://arxiv.org/html/2605.14026#A2.SS1.p1.1),[Appendix B](https://arxiv.org/html/2605.14026#A2.p1.1),[Table 5](https://arxiv.org/html/2605.14026#A3.T5),[Table 5](https://arxiv.org/html/2605.14026#A3.T5.4.2.1),[§1](https://arxiv.org/html/2605.14026#S1.p4.1),[§5\.5](https://arxiv.org/html/2605.14026#S5.SS5.p1.1),[Table 1](https://arxiv.org/html/2605.14026#S5.T1),[Table 1](https://arxiv.org/html/2605.14026#S5.T1.5.2.1),[§5](https://arxiv.org/html/2605.14026#S5.p2.1)\.
- Y\. Tassa, S\. Tunyasuvunakool, A\. Muldal, Y\. Doron, S\. Liu, S\. Bohez, J\. Merel, T\. Erez, T\. P\. Lillicrap, and N\. Heess \(2020\)Dm\_control: software and tasks for continuous control\.CoRRabs/2006\.12983\.External Links:[Link](https://arxiv.org/abs/2006.12983),2006\.12983Cited by:[§B\.1](https://arxiv.org/html/2605.14026#A2.SS1.p1.1),[Appendix B](https://arxiv.org/html/2605.14026#A2.p1.1),[Table 5](https://arxiv.org/html/2605.14026#A3.T5),[Table 5](https://arxiv.org/html/2605.14026#A3.T5.4.2.1),[§1](https://arxiv.org/html/2605.14026#S1.p4.1),[§5\.5](https://arxiv.org/html/2605.14026#S5.SS5.p1.1),[Table 1](https://arxiv.org/html/2605.14026#S5.T1),[Table 1](https://arxiv.org/html/2605.14026#S5.T1.5.2.1),[§5](https://arxiv.org/html/2605.14026#S5.p2.1)\.
- E\. Todorov, T\. Erez, and Y\. Tassa \(2012\)MuJoCo: A physics engine for model\-based control\.In2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2012, Vilamoura, Algarve, Portugal, October 7\-12, 2012,pp\. 5026–5033\.External Links:[Link](https://doi.org/10.1109/IROS.2012.6386109),[Document](https://dx.doi.org/10.1109/IROS.2012.6386109)Cited by:[§B\.1](https://arxiv.org/html/2605.14026#A2.SS1.p1.1),[Appendix B](https://arxiv.org/html/2605.14026#A2.p1.1),[Table 5](https://arxiv.org/html/2605.14026#A3.T5),[Table 5](https://arxiv.org/html/2605.14026#A3.T5.4.2.1),[§5\.2](https://arxiv.org/html/2605.14026#S5.SS2.p1.6),[Table 1](https://arxiv.org/html/2605.14026#S5.T1),[Table 1](https://arxiv.org/html/2605.14026#S5.T1.5.2.1),[§5](https://arxiv.org/html/2605.14026#S5.p2.1)\.
- A\. van den Oord, Y\. Li, and O\. Vinyals \(2018\)Representation learning with contrastive predictive coding\.CoRRabs/1807\.03748\.External Links:[Link](http://arxiv.org/abs/1807.03748),1807\.03748Cited by:[§1](https://arxiv.org/html/2605.14026#S1.p1.1),[§2\.3](https://arxiv.org/html/2605.14026#S2.SS3.p1.1)\.
- C\. Voelcker, M\. Hussing, E\. Eaton, A\. Farahmand, and I\. Gilitschenski \(2025\)MAD\-TD: model\-augmented data stabilizes high update ratio RL\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=6RtRsg8ZV1)Cited by:[§2\.1](https://arxiv.org/html/2605.14026#S2.SS1.p1.1)\.
- D\. Yarats, A\. Zhang, I\. Kostrikov, B\. Amos, J\. Pineau, and R\. Fergus \(2021\)Improving sample efficiency in model\-free reinforcement learning from images\.InThirty\-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty\-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2\-9, 2021,pp\. 10674–10681\.External Links:[Link](https://doi.org/10.1609/aaai.v35i12.17276),[Document](https://dx.doi.org/10.1609/AAAI.V35I12.17276)Cited by:[§2\.3](https://arxiv.org/html/2605.14026#S2.SS3.p1.1)\.
- J\. Zbontar, L\. Jing, I\. Misra, Y\. LeCun, and S\. Deny \(2021\)Barlow Twins: self\-supervised learning via redundancy reduction\.InProceedings of the 38th International Conference on Machine Learning, ICML 2021, 18\-24 July 2021, Virtual Event,M\. Meila and T\. Zhang \(Eds\.\),Proceedings of Machine Learning Research, Vol\.139,pp\. 12310–12320\.External Links:[Link](http://proceedings.mlr.press/v139/zbontar21a.html)Cited by:[§1](https://arxiv.org/html/2605.14026#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.14026#S2.SS2.p1.1),[§4\.2](https://arxiv.org/html/2605.14026#S4.SS2.p1.1)\.
## Appendix A Proofs
In this section, we provide detailed proofs for the conflict between zero\-centering and the spectral properties of Self\-Predictive Learning \(SPL\), as discussed in Section[4\.1](https://arxiv.org/html/2605.14026#S4.SS1)\.
### A.1 Proof of Lemma [1](https://arxiv.org/html/2605.14026#Thmtheorem1)
###### Proof\.
Recall that the centering matrix is defined as $H = I_N - \frac{1}{N}\mathbf{1}\mathbf{1}^{\top}$, where $I_N$ is the identity matrix of size $N$ and $\mathbf{1} \in \mathbb{R}^{N}$ is a column vector of ones. Let $\mathbf{c} = c\mathbf{1}$ be an arbitrary constant vector with a scalar $c \in \mathbb{R}$.

We compute the matrix-vector product $H\mathbf{c}$ as follows:

$$
\begin{aligned}
H\mathbf{c} &= \left(I_N - \frac{1}{N}\mathbf{1}\mathbf{1}^{\top}\right)(c\mathbf{1}) \\
&= c\mathbf{1} - \frac{c}{N}\mathbf{1}(\mathbf{1}^{\top}\mathbf{1}).
\end{aligned}
\tag{10}
$$

Since $\mathbf{1}$ is a vector of ones, the dot product $\mathbf{1}^{\top}\mathbf{1}$ sums to the dimension $N$:

$$
\mathbf{1}^{\top}\mathbf{1} = \sum_{i=1}^{N} 1 \cdot 1 = N.
\tag{11}
$$

Substituting this back into the equation:

$$
\begin{aligned}
H\mathbf{c} &= c\mathbf{1} - \frac{c}{N}\mathbf{1}(N) \\
&= c\mathbf{1} - c\mathbf{1} \\
&= \mathbf{0}.
\end{aligned}
\tag{12}
$$

Thus, applying the centering matrix $H$ to any constant vector results in the zero vector. This formally demonstrates that the centering operation removes the constant component (DC component) from any signal. ∎
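The identity $H\mathbf{c} = \mathbf{0}$ can be checked numerically. The following is a minimal sketch using exact rational arithmetic from the Python standard library; the dimension, the scalar $c = 7/2$, and the helper names `centering_matrix` and `matvec` are illustrative assumptions, not part of the paper's code.

```python
from fractions import Fraction


def centering_matrix(n):
    """Build H = I_n - (1/n) * 1 1^T with exact rational entries."""
    return [
        [Fraction(int(i == j)) - Fraction(1, n) for j in range(n)]
        for i in range(n)
    ]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


n = 5
H = centering_matrix(n)

# A constant vector c * 1 with an arbitrary illustrative scalar c = 7/2.
constant_vector = [Fraction(7, 2)] * n
centered_constant = matvec(H, constant_vector)

# A non-constant vector keeps only its deviations from the mean (mean is 4).
signal = [Fraction(v) for v in (1, 2, 3, 4, 10)]
centered_signal = matvec(H, signal)

print(centered_constant)  # every entry is exactly zero
print(centered_signal)    # deviations from the mean: -3, -2, -1, 0, 6
```

The constant vector is mapped to zero, while the non-constant vector loses only its mean, which is the behavior Lemma 1 describes.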
### A.2 Proof of Proposition [2](https://arxiv.org/html/2605.14026#Thmtheorem2)
###### Proof of Proposition [2](https://arxiv.org/html/2605.14026#Thmtheorem2).
We build upon the theoretical framework of Tang et al. (Tang et al., [2023](https://arxiv.org/html/2605.14026#bib.bib15)). Their analysis implies that maximizing the SPL objective implicitly maximizes

$$
\mathrm{Tr}\!\left((\Phi^{\top}P^{\pi}\Phi)^{\top}(\Phi^{\top}P^{\pi}\Phi)\right)\quad\text{subject to}\quad\Phi^{\top}\Phi = I_K,
\tag{13}
$$

and therefore the optimal representation $\Phi^{*}$ spans a principal subspace corresponding to the top-$K$ eigenmodes of $P^{\pi}$.

To make the argument precise, we will use the following lemma (Lemma [3](https://arxiv.org/html/2605.14026#Thmtheorem3)), which characterizes the leading eigenvalue and eigenvector of a row-stochastic transition matrix. In particular, it allows us to choose the principal eigenvalue as $\lambda_1 = 1$ with corresponding eigenvector proportional to the constant vector $\mathbf{1}$.
###### Lemma 3 (Spectral radius of a row-stochastic matrix).
Let $A \in \mathbb{R}^{n \times n}$ be *row-stochastic*, i.e., $A_{ij} \geq 0$ for all $i, j$ and $\sum_{j=1}^{n} A_{ij} = 1$ for all $i$. Then:
1. $1$ is an eigenvalue of $A$ with right eigenvector $\mathbf{1}$, i.e., $A\mathbf{1} = \mathbf{1}$.
2. Every eigenvalue $\lambda$ of $A$ satisfies $|\lambda| \leq 1$. In particular, no eigenvalue satisfies $\lambda > 1$.
###### Proof\.
(1) Since each row of $A$ sums to $1$, for every $i$,

$$
(A\mathbf{1})_i = \sum_{j=1}^{n} A_{ij} \cdot 1 = 1,
\tag{14}
$$

hence $A\mathbf{1} = \mathbf{1}$.

(2) Let $x \neq 0$ and define $M := \|x\|_{\infty} = \max_j |x_j|$. For each coordinate,

$$
|(Ax)_i| = \left|\sum_{j=1}^{n} A_{ij}x_j\right| \leq \sum_{j=1}^{n} A_{ij}|x_j| \leq \sum_{j=1}^{n} A_{ij}M = M,
\tag{15}
$$

where we used $A_{ij} \geq 0$ and $\sum_j A_{ij} = 1$. Thus $\|Ax\|_{\infty} \leq \|x\|_{\infty}$. If $Ax = \lambda x$, then

$$
|\lambda|\,\|x\|_{\infty} = \|\lambda x\|_{\infty} = \|Ax\|_{\infty} \leq \|x\|_{\infty},
\tag{16}
$$

so $|\lambda| \leq 1$. In particular, no eigenvalue satisfies $\lambda > 1$. ∎
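The two facts used in this proof, $A\mathbf{1} = \mathbf{1}$ and $\|Ax\|_{\infty} \leq \|x\|_{\infty}$, can be spot-checked on a randomly generated row-stochastic matrix. This is a sketch under assumptions: the matrix size, the random seed, and the helper names are arbitrary illustrative choices, and the check covers only the norm inequality behind the eigenvalue bound rather than computing eigenvalues directly.

```python
import random
from fractions import Fraction


def random_row_stochastic(n, rng):
    """Draw non-negative integer weights per row and normalize each row to sum to 1."""
    rows = []
    for _ in range(n):
        weights = [rng.randint(0, 9) for _ in range(n)]
        weights[rng.randrange(n)] += 1  # guarantee a strictly positive row sum
        total = sum(weights)
        rows.append([Fraction(w, total) for w in weights])
    return rows


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def inf_norm(vector):
    """Infinity norm: the largest absolute coordinate."""
    return max(abs(x) for x in vector)


rng = random.Random(0)
n = 6
A = random_row_stochastic(n, rng)

# Part (1): the all-ones vector is a right eigenvector with eigenvalue 1.
ones = [Fraction(1)] * n
fixes_ones = matvec(A, ones) == ones

# Part (2): A never increases the infinity norm, which forces |lambda| <= 1.
trials = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(100)]
contraction_holds = all(inf_norm(matvec(A, x)) <= inf_norm(x) for x in trials)

print(fixes_ones, contraction_holds)
```

Because every row of `A` is a convex combination, each coordinate of `A x` is a weighted average of the coordinates of `x`, which is why the inequality holds for every trial vector.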
Now we complete the proof of Proposition [2](https://arxiv.org/html/2605.14026#Thmtheorem2). Let $\{(\lambda_i, u_i)\}$ be eigenpairs of $P^{\pi}$, ordered by $|\lambda_1| \geq |\lambda_2| \geq \cdots$. Since $P^{\pi}$ is row-stochastic, by Lemma [3](https://arxiv.org/html/2605.14026#Thmtheorem3) we have $\lambda_1 = 1$ and we may choose the corresponding right eigenvector as the constant vector $u_1 = \alpha\mathbf{1}$ for some $\alpha \neq 0$.

Define the orthogonal projector onto $\mathrm{span}(u_1)$ by

$$
\Pi_{u_1} := \frac{u_1 u_1^{\top}}{u_1^{\top} u_1}.
\tag{17}
$$

Using this, define the component of the learned representation $\Phi^{*}$ along $u_1$ as

$$
\Phi^{*}_{\mathrm{proj},u_1} := \Pi_{u_1}\Phi^{*}.
\tag{18}
$$

Let $H$ denote the zero-centering matrix (Lemma [1](https://arxiv.org/html/2605.14026#Thmtheorem1)), so that $H\mathbf{1} = \mathbf{0}$. Since $u_1 = \alpha\mathbf{1}$, we have $Hu_1 = \alpha H\mathbf{1} = \mathbf{0}$, and hence

$$
H\Pi_{u_1} = H\frac{u_1 u_1^{\top}}{u_1^{\top} u_1} = \frac{(Hu_1)u_1^{\top}}{u_1^{\top} u_1} = 0.
\tag{19}
$$

Therefore,

$$
H\Phi^{*}_{\mathrm{proj},u_1} = H(\Pi_{u_1}\Phi^{*}) = (H\Pi_{u_1})\Phi^{*} = 0,
$$

and consequently,

$$
\|H\Phi^{*}_{\mathrm{proj},u_1}\|_2 = 0.
\tag{20}
$$

This proves that the zero-centering operation mathematically annihilates the component of the representation corresponding to the global dynamics (the constant eigenmode), thereby leading to a loss of spectral information. ∎
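The chain $H\Pi_{u_1}\Phi^{*} = 0$ can be illustrated on a small example. The sketch below assumes a hypothetical 4-state, 2-feature matrix `Phi` standing in for $\Phi^{*}$ and an arbitrary scaling $\alpha = -2$; neither comes from the paper, and the helper names are illustrative.

```python
from fractions import Fraction


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def centering_matrix(n):
    """Build H = I_n - (1/n) * 1 1^T with exact rational entries."""
    return [
        [Fraction(int(i == j)) - Fraction(1, n) for j in range(n)]
        for i in range(n)
    ]


def projector_onto(u):
    """Orthogonal projector u u^T / (u^T u) onto span(u), as in Eq. (17)."""
    norm_sq = sum(x * x for x in u)
    return [[ui * uj / norm_sq for uj in u] for ui in u]


n, alpha = 4, Fraction(-2)       # alpha is an arbitrary non-zero scaling
u1 = [alpha] * n                 # constant eigenvector u_1 = alpha * 1
Pi = projector_onto(u1)
H = centering_matrix(n)

# Hypothetical representation: 4 states (rows), 2 features (columns).
Phi = [[Fraction(v) for v in row] for row in ((1, 0), (3, 2), (5, 2), (7, 4))]

Phi_proj = matmul(Pi, Phi)       # component along u_1: each column's mean
H_Phi_proj = matmul(H, Phi_proj) # centering removes it entirely

print(Phi_proj)
print(H_Phi_proj)
```

Projecting onto $\mathrm{span}(u_1)$ replaces each feature column by its mean (here 4 and 2), independent of $\alpha$, and centering then maps that component to the zero matrix.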
## Appendix B Hyperparameters
For baseline algorithms, including TD7 (Fujimoto et al., [2023](https://arxiv.org/html/2605.14026#bib.bib10)), Minimalist $\phi$ (Ni et al., [2024](https://arxiv.org/html/2605.14026#bib.bib11)), and SimbaV2 (Lee et al., [2025b](https://arxiv.org/html/2605.14026#bib.bib5)), we adopt the default configurations provided in their respective original implementations, unless explicitly overridden by the common settings listed below. This ensures a fair comparison across all methods. All parameters are fixed across all 11 environments (4 Gym MuJoCo (Brockman et al., [2016](https://arxiv.org/html/2605.14026#bib.bib26); Todorov et al., [2012](https://arxiv.org/html/2605.14026#bib.bib25)) and 7 DMC-Hard (Tassa et al., [2018](https://arxiv.org/html/2605.14026#bib.bib27), [2020](https://arxiv.org/html/2605.14026#bib.bib41); Lee et al., [2025a](https://arxiv.org/html/2605.14026#bib.bib22)) tasks).
### B.1 Common Hyperparameters
Table [3](https://arxiv.org/html/2605.14026#A2.T3) lists the shared hyperparameters applied uniformly across all agents to ensure consistent evaluation conditions. Note that for Gym MuJoCo (Brockman et al., [2016](https://arxiv.org/html/2605.14026#bib.bib26); Todorov et al., [2012](https://arxiv.org/html/2605.14026#bib.bib25)), decision steps are equivalent to environment steps. In contrast, for DMC-Hard (Tassa et al., [2018](https://arxiv.org/html/2605.14026#bib.bib27), [2020](https://arxiv.org/html/2605.14026#bib.bib41)) tasks, which use an action repeat of 2, our training budget of 500,000 decision steps corresponds to 1,000,000 environment steps.
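The relation between decision steps and environment steps is a single multiplication by the action repeat. The following sketch makes the arithmetic explicit; the function name `environment_steps` is a hypothetical helper, not part of the released code.

```python
def environment_steps(decision_steps, action_repeat):
    """Each decision (agent action) is held for `action_repeat` simulator steps."""
    return decision_steps * action_repeat


TOTAL_DECISION_STEPS = 500_000  # training budget stated above

gym_env_steps = environment_steps(TOTAL_DECISION_STEPS, action_repeat=1)
dmc_env_steps = environment_steps(TOTAL_DECISION_STEPS, action_repeat=2)

print(gym_env_steps, dmc_env_steps)
```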
### B.2 SimbaV2-SPL Specific Hyperparameters
For the backbone architecture and general optimization details, we strictly adhere to the original configurations of SimbaV2 (Lee et al., [2025b](https://arxiv.org/html/2605.14026#bib.bib5)). Table [4](https://arxiv.org/html/2605.14026#A2.T4) lists only the additional parameters introduced for the tailored SPL framework.
Table 3: Common Hyperparameters for Training.

| Parameter | Value |
| --- | --- |
| **General Settings** | |
| Total Decision Steps | 500,000 |
| Action Repeat (Gym) | 1 |
| Action Repeat (DMC) | 2 |
| Number of Seeds | 5 |
| **Regularization** | |
| $\lambda_{\text{RR}}$ | 0.01 |
| $\lambda_{\text{Var}}$ | 0.01 |
| $v_{th}$ | 1.0 |

Table 4: SimbaV2-SPL specific Hyperparameters.

| Parameter | Value |
| --- | --- |
| **Network Architecture** | |
| Encoder ($\phi$) Width | 128 |
| Encoder ($\phi$) Depth | 1 |
| Predictor ($\mathcal{T}$) Width | 256 |
| Predictor ($\mathcal{T}$) Depth | 3 |
| **Optimization & Init.** | |
| SPL Learning Rate (Init) | $3 \times 10^{-4}$ |
| SPL Learning Rate (End) | $5 \times 10^{-5}$ |
| Encoder C-Shift | 3.0 |
## Appendix C Environments
Table 5: Environment specifications. We list the observation and action dimensions for the Gym MuJoCo (Brockman et al., [2016](https://arxiv.org/html/2605.14026#bib.bib26); Todorov et al., [2012](https://arxiv.org/html/2605.14026#bib.bib25)) and DMC-Hard (Tassa et al., [2018](https://arxiv.org/html/2605.14026#bib.bib27), [2020](https://arxiv.org/html/2605.14026#bib.bib41)) benchmarks.
## Appendix D Evaluation Metric Details
In this section, we provide the mathematical definition of the metrics used in our evaluation. Let $R(i,k)$ denote the evaluation return at the $k$-th checkpoint for the $i$-th random seed. Given a total of $K = 50$ uniformly distributed checkpoints, we focus on the final 20% of training, which corresponds to the last $M = 10$ checkpoints.

Average Return. The temporal mean return $S(i)$ for a specific seed $i$ is defined as:

$$
S(i) = \frac{1}{M}\sum_{k=K-M+1}^{K} R(i,k)
\tag{21}
$$

Normalized Score. To aggregate results across environments with varying reward scales and to robustly handle potentially negative returns, we compute the normalized score $\hat{S}(i)$. We define the normalization relative to the aggregate performance of the baseline algorithm. The normalized score is calculated as:

$$
\hat{S}(i) = 1 + \frac{S(i) - \mathbb{E}_j\left[S_{\text{base}}(j)\right]}{\left|\mathbb{E}_j\left[S_{\text{base}}(j)\right]\right|}
\tag{22}
$$

where $\mathbb{E}_j[\cdot]$ denotes the empirical expectation (average) over the baseline seeds $j \in \{1, \dots, N\}$. Here, the subscript "base" refers to the baseline algorithm trained at $\text{UTD} = 1$. Consequently, $S_{\text{base}}(j)$ represents the average return of the baseline's $j$-th seed, computed using the same protocol as in Eq. ([21](https://arxiv.org/html/2605.14026#A4.E21)):

$$
S_{\text{base}}(j) = \frac{1}{M}\sum_{k=K-M+1}^{K} R_{\text{base}}(j,k)
\tag{23}
$$

where $R_{\text{base}}(j,k)$ denotes the evaluation return of the baseline algorithm (trained at $\text{UTD} = 1$) at the $k$-th checkpoint for the $j$-th seed.
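Eqs. (21) to (23) translate directly into a few lines of code. The sketch below is illustrative only: the function names and the toy return curves are hypothetical, and negative returns are chosen to show why the denominator uses an absolute value.

```python
from statistics import mean


def average_return(checkpoint_returns, last_m=10):
    """Eq. (21): mean return over the last `last_m` checkpoints of one seed."""
    return mean(checkpoint_returns[-last_m:])


def normalized_score(checkpoint_returns, baseline_runs, last_m=10):
    """Eq. (22): score relative to the seed-averaged baseline return (Eq. 23)."""
    baseline_mean = mean(average_return(run, last_m) for run in baseline_runs)
    seed_return = average_return(checkpoint_returns, last_m)
    return 1 + (seed_return - baseline_mean) / abs(baseline_mean)


# Hypothetical curves with K = 50 checkpoints; only the last M = 10 matter.
baseline_runs = [
    [0] * 40 + [-100] * 10,  # baseline seed 1 at UTD = 1
    [0] * 40 + [-300] * 10,  # baseline seed 2 at UTD = 1
]
method_run = [0] * 40 + [-150] * 10

method_score = normalized_score(method_run, baseline_runs)
baseline_scores = [normalized_score(run, baseline_runs) for run in baseline_runs]

print(method_score)     # 1.25: 50 return units above a baseline mean of -200
print(baseline_scores)  # baseline seeds average to exactly 1
```

With a baseline mean of $-200$, a return of $-150$ is an improvement, and the absolute value in the denominator keeps the score above 1 as intended. The formula is undefined when the baseline mean is exactly zero.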
## Appendix E Detailed Experimental Results
This section presents the complete set of experimental results\. Before detailing the full tables and figures, we address a key statistical observation regarding the performance metrics\.
Metric. To ensure a rigorous evaluation, we report the Interquartile Mean (IQM) (Agarwal et al., [2021](https://arxiv.org/html/2605.14026#bib.bib42)) alongside the standard mean. As shown in the subsequent tables, while R2R2 consistently outperforms the baselines in IQM, the performance gap is notably wider in the standard mean. This discrepancy suggests that our performance gains stem primarily from improved robustness. The mean is sensitive to outliers, including catastrophic failures where agents suffer from representation collapse in high UTD regimes. Since IQM discards the bottom 25% of the data distribution (along with the top 25%), it effectively masks these instability issues inherent in the baselines. In contrast, by incorporating our proposed regularization term into the loss function, R2R2 mitigates these failures and significantly improves the performance of “worst-case” seeds. Consequently, the larger gain in Mean (which reflects prevented failures) compared to IQM (which ignores them) confirms that our regularization acts as a crucial “safety net,” enhancing the overall reliability of the algorithm.
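The gap between Mean and IQM can be reproduced on toy data. The sketch below uses a simple IQM that drops the lowest and highest quarter of sorted scores; the scores themselves are hypothetical (normalized scores in percent), chosen so that two baseline runs collapse while the remaining runs match.

```python
from statistics import mean


def interquartile_mean(scores):
    """Mean of the middle 50% of scores (drops the lowest and highest 25%)."""
    ordered = sorted(scores)
    cut = len(ordered) // 4
    return mean(ordered[cut:len(ordered) - cut])


# Hypothetical normalized scores (in percent) for 8 runs each.
baseline = [0, 10, 90, 100, 100, 100, 110, 110]      # two collapsed runs
regularized = [80, 90, 90, 100, 100, 100, 110, 110]  # failures prevented

mean_gap = mean(regularized) - mean(baseline)
iqm_gap = interquartile_mean(regularized) - interquartile_mean(baseline)

print(mean_gap)  # 20.0: the mean sees the two collapsed runs
print(iqm_gap)   # 0.0: IQM trims them away
```

In this constructed case the two methods are indistinguishable under IQM, while the mean differs by 20 points, which mirrors the argument that a larger gain in Mean than in IQM points to fewer catastrophic runs.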
Common Observations across Baselines. As shown in all tables below, R2R2 yields scores comparable to or higher than the baseline across most tasks at $\text{UTD} = 1$. However, we observe a performance drop in some high-dimensional environments like Humanoid-{Walk, Run}. This reflects a characteristic trade-off: the regularization terms ($\mathcal{L}_{\text{Var}}$ and $\mathcal{L}_{\text{RR}}$) actively enforce feature diversity, preventing premature convergence. While this may slightly delay policy specialization in standard regimes ($\text{UTD} = 1$), it acts as a critical stabilizer in data-intensive settings. Indeed, at $\text{UTD} = 20$, R2R2 surpasses the baseline even in these tasks, confirming that the benefits of regularization outweigh the initial cost.
### E.1 TD7 baseline
Table 6: Raw score performance comparison on TD7 (Base vs. Ours) at $\text{UTD} = 1$, $\text{UTD} = 10$, and $\text{UTD} = 20$. We report mean $\pm$ std.
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x18.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x19.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x20.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x21.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x22.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x23.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x24.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x25.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x26.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x27.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x28.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x29.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x30.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x31.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x32.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x33.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x34.png)

















Figure 8: Learning curves for TD7 baseline.

Table 7: Performance comparison on TD7 (Base vs. Ours) at $\text{UTD} = 1$, $\text{UTD} = 10$, and $\text{UTD} = 20$. We report normalized mean $\pm$ std. The final rows report the aggregate Mean and IQM with 95% CIs.
### E.2 Minimalist $\phi$ baseline
Table 8: Raw score performance comparison on Minimalist $\phi$ (Base vs. Ours) at $\text{UTD} = 1$, $\text{UTD} = 10$, and $\text{UTD} = 20$. We report mean $\pm$ std.
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x52.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x53.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x54.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x55.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x56.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x57.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x58.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x59.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x60.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x61.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x62.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x63.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x64.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x65.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x66.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x67.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x68.png)

















Figure 9: Learning curves for Minimalist $\phi$ baseline.

Table 9: Performance comparison on Minimalist $\phi$ (Base vs. Ours) at $\text{UTD} = 1$, $\text{UTD} = 10$, and $\text{UTD} = 20$. We report normalized mean $\pm$ std. The final rows report the aggregate Mean and IQM with 95% CIs.
### E.3 TD7+Layer Normalization (LN) baseline
Table 10: Raw score performance comparison on TD7+LayerNorm (Fujimoto et al., [2023](https://arxiv.org/html/2605.14026#bib.bib10); Ba et al., [2016](https://arxiv.org/html/2605.14026#bib.bib34)) (Base vs. Ours) at $\text{UTD} = 1$, $\text{UTD} = 10$, and $\text{UTD} = 20$. We report mean $\pm$ std.
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x86.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x87.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x88.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x89.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x90.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x91.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x92.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x93.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x94.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x95.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x96.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x97.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x98.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x99.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x100.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x101.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x102.png)

















Figure 10: Learning curves for TD7+LN baseline.

Table 11: Performance comparison on TD7 + LayerNorm (Base vs. Ours) at $\text{UTD} = 1$, $\text{UTD} = 10$, and $\text{UTD} = 20$. We report normalized mean $\pm$ std. The final rows report the aggregate Mean and IQM with 95% CIs.
### E.4 SimbaV2 baseline
Table 12: Raw score performance comparison on SimbaV2 (Lee et al., [2025b](https://arxiv.org/html/2605.14026#bib.bib5)), SimbaV2-SPL, and +R2R2 at $\text{UTD} = 1$, $\text{UTD} = 10$, and $\text{UTD} = 20$. We report mean $\pm$ std.
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x120.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x121.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x122.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x123.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x124.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x125.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x126.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x127.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x128.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x129.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x130.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x131.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x132.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x133.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x134.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x135.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.14026v1/x136.png)

















Figure 11: Learning curves for SimbaV2 baseline.

Table 13: Performance comparison on SimbaV2 (Lee et al., [2025b](https://arxiv.org/html/2605.14026#bib.bib5)), SimbaV2-SPL, and +R2R2 at $\text{UTD} = 1$, $\text{UTD} = 10$, and $\text{UTD} = 20$. We report normalized mean $\pm$ std. The final rows report the aggregate Mean and IQM with 95% CIs.