Representation Learning Enables Scalable Multitask Deep Reinforcement Learning
Summary
This paper argues that representation learning, not model-based planning, is the key to scalable multitask deep reinforcement learning. It introduces MR.Q, a simple model-free algorithm with auxiliary predictive objectives that outperforms prior world-model-based methods across diverse continuous control tasks.
# Representation Learning Enables Scalable Multitask Deep Reinforcement Learning
Source: [https://arxiv.org/html/2606.05555](https://arxiv.org/html/2606.05555)
Johan Obando\-Ceron¹,², Lu Li¹,², Scott Fujimoto³, Pierre\-Luc Bacon¹,², Aaron Courville¹,²,⁴, Pablo Samuel Castro¹,²,⁵
¹Mila – Québec AI Institute, ²Université de Montréal, ³McGill University, ⁴CIFAR AI Chair, ⁵Google DeepMind
jobando0730@gmail\.com, scott\.fujimoto@mail\.mcgill\.ca, \{lu\.li, pierre\-luc\.bacon, courvila, pablo\-samuel\.castro\}@mila\.quebec
###### Abstract
Scaling reinforcement learning \(RL\) to diverse multitask settings remains a central challenge\. While recent advances in model\-based RL achieve strong performance, they rely on planning and complex training pipelines, making it unclear which components are essential for scalability\. We revisit this question and argue that the primary driver of scalable multitask RL is not model\-based control, but *representation learning*\. In particular, we show that combining predictive, model\-based representations with high\-capacity value function approximation is sufficient to achieve strong performance, even without planning\. We evaluate a simple model\-free algorithm, MR\.Q, which integrates auxiliary predictive objectives into a scalable actor\-critic architecture\. This approach outperforms a recent world\-model\-based method and a range of deep RL baselines across a diverse suite of multitask continuous control tasks, while significantly reducing computational overhead and improving wall\-clock efficiency\. We observe consistent improvements with increased model capacity and show through ablations that predictive representation learning is critical for performance\. Our code is available at [ScaleMRL](https://github.com/johanobandoc/ScaleMRL.git)\.
“What we observe isn’t nature itself, but nature exposed to our method of questioning\.”¹
— Werner Heisenberg
¹ In RL, what an agent “sees” depends on its representation\. Our results suggest that improving representations can be more important, and significantly more efficient, than modeling environment dynamics and planning\.
## 1 Introduction
Deep reinforcement learning \(RL\) has achieved remarkable success across a wide range of domains, including games, robotics, and control \(Akkaya et al\., [2019](https://arxiv.org/html/2606.05555#bib.bib26); Mnih et al\., [2013](https://arxiv.org/html/2606.05555#bib.bib96); Schwarzer et al\., [2023](https://arxiv.org/html/2606.05555#bib.bib15)\)\. However, much of this progress remains confined to single\-task settings, where agents are trained and evaluated on narrowly defined environments, often requiring hundreds of millions of environment interactions to converge\. In contrast, recent advances in machine learning, particularly in language and vision, demonstrate that scaling models across diverse data distributions enables generalization, transfer, and robustness through shared representations \(Wang et al\., [2022](https://arxiv.org/html/2606.05555#bib.bib99); Alayrac et al\., [2022](https://arxiv.org/html/2606.05555#bib.bib100); Kojima et al\., [2022](https://arxiv.org/html/2606.05555#bib.bib102); Subramanian et al\., [2023](https://arxiv.org/html/2606.05555#bib.bib101); Zhou et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib98); Reed et al\., [2022](https://arxiv.org/html/2606.05555#bib.bib112); Wiedemer et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib103)\)\. Extending these principles to online deep RL remains an open challenge\. Unlike supervised settings, RL involves non\-stationary data, bootstrapped targets, and long\-horizon credit assignment, which introduce optimization instabilities that manifest as representation collapse, loss of plasticity, and unstable value estimation\. These instabilities compound the sample costs of learning and ultimately hinder progress in multitask settings \(Kumar et al\., [2021](https://arxiv.org/html/2606.05555#bib.bib104); Nikishin et al\., [2022](https://arxiv.org/html/2606.05555#bib.bib20); Sokar et al\., [2023](https://arxiv.org/html/2606.05555#bib.bib46); Nauman et al\., [2024](https://arxiv.org/html/2606.05555#bib.bib61); Tang and Berseth, [2024](https://arxiv.org/html/2606.05555#bib.bib105); Castanyer et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib29)\)\.
Multitask RL \(MTRL\) seeks to train a single agent over a distribution of tasks, but doing so across increasingly diverse task distributions introduces instability, task interference, and underutilization of model capacity \(Teh et al\., [2017](https://arxiv.org/html/2606.05555#bib.bib106); Yu et al\., [2020a](https://arxiv.org/html/2606.05555#bib.bib107); D’Eramo et al\., [2020](https://arxiv.org/html/2606.05555#bib.bib110); Kong et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib111)\)\. Recent work by Nauman et al\. \([2025](https://arxiv.org/html/2606.05555#bib.bib97)\) demonstrates that substantially increasing value function capacity, paired with categorical value parameterization and explicit regularization, leads to significant multitask gains\. Yet scaling model size alone does not solve the problem: without the right training objectives and representation learning mechanisms, larger models simply require more data to stabilize \(Taiga et al\., [2023](https://arxiv.org/html/2606.05555#bib.bib108); Farebrother et al\., [2024](https://arxiv.org/html/2606.05555#bib.bib109)\)\. This points to representation quality as a central axis of progress, since better representations have been shown to reduce TD variance, accelerate learning, and stabilize training across tasks \(Castro et al\., [2021](https://arxiv.org/html/2606.05555#bib.bib34); Schwarzer et al\., [2021](https://arxiv.org/html/2606.05555#bib.bib35); Fujimoto et al\., [2023](https://arxiv.org/html/2606.05555#bib.bib13); Cetin et al\., [2023](https://arxiv.org/html/2606.05555#bib.bib115); Echchahed and Castro, [2025](https://arxiv.org/html/2606.05555#bib.bib114); Obando\-Ceron et al\., [2026a](https://arxiv.org/html/2606.05555#bib.bib50)\)\.
Model\-based RL methods pursue this goal by leveraging predictive objectives, specifically by learning latent dynamics models, to provide dense supervision that shapes representations beyond what TD learning alone can achieve\. This richer learning signal is a key driver behind recent model\-based advances \(Hafner et al\., [2020b](https://arxiv.org/html/2606.05555#bib.bib68), [2025a](https://arxiv.org/html/2606.05555#bib.bib69); Hansen et al\., [2024](https://arxiv.org/html/2606.05555#bib.bib11), [2026](https://arxiv.org/html/2606.05555#bib.bib49); Fujimoto et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib19)\)\. Recent large\-scale systems further combine predictive representation learning, large shared architectures, and planning to achieve strong multitask performance \(Xu et al\., [2023](https://arxiv.org/html/2606.05555#bib.bib124); Georgiev et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib51); Hafner et al\., [2025a](https://arxiv.org/html/2606.05555#bib.bib69); Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\)\. Yet because these approaches bundle multiple components together, isolating the source of their gains remains difficult\. Moreover, planning itself introduces computational overhead, hyperparameter sensitivity, and compounding model errors, ultimately eroding the very efficiency gains these methods aim to provide \(Zhang et al\., [2021b](https://arxiv.org/html/2606.05555#bib.bib123); Talvitie, [2014](https://arxiv.org/html/2606.05555#bib.bib121); Rajeswaran et al\., [2017](https://arxiv.org/html/2606.05555#bib.bib118); Clavera et al\., [2018](https://arxiv.org/html/2606.05555#bib.bib119); Chua et al\., [2018](https://arxiv.org/html/2606.05555#bib.bib120); Voelcker et al\., [2022](https://arxiv.org/html/2606.05555#bib.bib122)\)\.
We hypothesize that much of the benefit attributed to model\-based control in fact arises from the representations these methods learn, and that predictive objectives alone are sufficient to achieve competitive sample efficiency at scale \(Jaderberg et al\., [2017](https://arxiv.org/html/2606.05555#bib.bib127); Gelada et al\., [2019](https://arxiv.org/html/2606.05555#bib.bib126); Lee et al\., [2020](https://arxiv.org/html/2606.05555#bib.bib125); Anand et al\., [2022](https://arxiv.org/html/2606.05555#bib.bib128)\)\. To test this hypothesis, we study MR\.Q \(Fujimoto et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib19)\), a purely model\-free agent that integrates predictive objectives into TD learning\. MR\.Q is a natural probe for this question as it isolates the representational benefits of predictive learning from the confounds of planning, allowing us to test whether richer supervision alone drives sample efficiency gains\.
While MR\.Q was originally proposed for single\-task settings, we extend its evaluation to the multitask regime\. However, previous MTRL benchmarks evaluate at 100M or more environment steps \(Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\), obscuring whether methods are genuinely sample\-efficient or simply benefit from prolonged training\. To address this, we consider a version of the benchmark that evaluates agents at 10M environment steps, where sample efficiency gains are most visible\.
Across a suite of continuous control benchmarks, MR\.Q outperforms a recent world\-model\-based method \(Newt \(Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\)\) while achieving substantially improved wall\-clock and sample efficiency, and it benefits from scaling in both model size and data availability\. In addition, MR\.Q exhibits stronger transfer to unseen tasks than Newt, suggesting that representations learned through multitask training yield substantially better zero\-shot initialization and faster adaptation during few\-shot finetuning\. Ablations further confirm that predictive objectives are critical, with performance degrading significantly when they are removed, even at large model scales\. Overall, these results support a representation\-centric view of deep RL scaling, where the quality of learned representations plays a central role in enabling effective scalable multitask learning\.
## 2 Preliminaries
#### Problem setting\.
We consider a multitask RL \(MTRL\) setting in which tasks $\tau \sim p(\tau)$ are sampled from a task distribution\. Each task induces a Markov decision process \(MDP\) $\mathcal{M}_{\tau} = (\mathcal{S}, \mathcal{A}, \mathcal{T}_{\tau}, \mathcal{R}_{\tau}, \gamma)$, where we assume a shared action space $\mathcal{A}$ and \(typically\) a shared state space $\mathcal{S}$ across tasks, while transition dynamics and rewards may vary with $\tau$\. At each time step $t$, the agent observes $s_{t} \in \mathcal{S}$, takes action $a_{t} \in \mathcal{A}$, and receives reward $r_{t} \sim \mathcal{R}_{\tau}(s_{t}, a_{t})$, transitioning to $s_{t+1} \sim \mathcal{T}_{\tau}(\cdot \mid s_{t}, a_{t})$\. The objective is to learn a single policy $\pi(a \mid s, \tau)$ that maximizes the expected discounted return across tasks, formulated as $\mathbb{E}_{\tau \sim p(\tau), \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$\. Similar to Hansen et al\. \([2026](https://arxiv.org/html/2606.05555#bib.bib49)\), when task information is available \(e\.g\., task identifiers or language instructions\), we condition the policy and value functions on a learned embedding $e(\tau)$\. Otherwise, the problem reduces to a partially observable MDP, where task identity must be inferred from interaction\. We assume an off\-policy setting, where experience is stored in a replay buffer $\mathcal{D}$ containing tuples $(s_{t}, a_{t}, r_{t}, d_{t}, s_{t+1}, \tau)$, with $d_{t} \in \{0, 1\}$ indicating episode termination\.
We adopt an off\-policy actor–critic architecture \(Konda and Tsitsiklis, [1999](https://arxiv.org/html/2606.05555#bib.bib132); Fujimoto et al\., [2018](https://arxiv.org/html/2606.05555#bib.bib16)\), where a parametric policy \(actor\) $\pi_{\psi}(a \mid s, \tau)$ is trained to maximize expected return, while a value function \(critic\) $Q_{\theta}(s, a, \tau)$ estimates the expected return of state–action pairs\. The critic is optimized via temporal\-difference \(TD\) learning using targets constructed from a slowly updated target network, while the actor is trained to maximize the critic’s value estimates\. In practice, we employ twin critics $Q_{\theta_{1}}, Q_{\theta_{2}}$ to mitigate overestimation bias, as in prior work on off\-policy RL \(Fujimoto et al\., [2018](https://arxiv.org/html/2606.05555#bib.bib16); Haarnoja et al\., [2018](https://arxiv.org/html/2606.05555#bib.bib45)\)\.
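To make the twin-critic target concrete, the following minimal Python sketch computes a one-step TD target from the minimum of two target-critic estimates, in the style of TD3. The function name `td_target`, the one-step form, and the example numbers are assumptions made for illustration; the target actually used by MR.Q may differ in its details.

```python
def td_target(reward, done, q1_next, q2_next, gamma=0.99):
    """One-step TD target with clipped double-Q (TD3-style sketch).

    reward:  r_t
    done:    d_t in {0, 1}; 1 means the episode terminated at this step
    q1_next: target-critic 1 estimate at (s_{t+1}, a_{t+1})
    q2_next: target-critic 2 estimate at (s_{t+1}, a_{t+1})
    """
    # Taking the minimum of the two estimates counteracts overestimation bias.
    q_next = min(q1_next, q2_next)
    # No bootstrapping past a terminal transition.
    return reward + gamma * (1.0 - done) * q_next


# Illustrative transitions: one non-terminal, one terminal.
y_continue = td_target(reward=1.0, done=0, q1_next=10.0, q2_next=8.0, gamma=0.5)
y_terminal = td_target(reward=1.0, done=1, q1_next=10.0, q2_next=8.0, gamma=0.5)
```

With these inputs the non-terminal target bootstraps from the smaller estimate (8.0), while the terminal target reduces to the reward alone.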
#### Predictive Information Representations\.
Representation learning is central to deep RL, particularly in high\-dimensional and multitask settings where stability and generalization depend on the structure of learned features \(Agarwal et al\., [2021](https://arxiv.org/html/2606.05555#bib.bib133); Echchahed and Castro, [2025](https://arxiv.org/html/2606.05555#bib.bib114)\)\. Because supervision from temporal\-difference learning is often weak and non\-stationary, predictive auxiliary objectives are commonly used to stabilize optimization and encourage latent representations to capture environment dynamics and temporal structure beyond reward signals \(Nikishin et al\., [2022](https://arxiv.org/html/2606.05555#bib.bib20); Hafner et al\., [2020b](https://arxiv.org/html/2606.05555#bib.bib68); Hansen et al\., [2024](https://arxiv.org/html/2606.05555#bib.bib11)\)\. We consider an off\-policy actor–critic operating on learned latent representations\. Observations \(and optionally task information\) are encoded as $z_{t} = \phi_{\xi}(s_{t}, \tau)$, and both the policy $\pi_{\psi}(a \mid z)$ and twin critics $Q_{\theta_{1}}, Q_{\theta_{2}}$ operate in latent space\. Critics are trained via temporal\-difference learning with target networks, while the policy maximizes value estimates\. To improve representation quality, we augment training with predictive modeling in latent space: models of dynamics, reward, and termination predict $(z_{t+1}, r_{t}, d_{t})$ from $(z_{t}, a_{t})$ \(Fujimoto et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib19)\), and their gradients are backpropagated through the encoder $\phi_{\xi}$\. This encourages representations that are predictive of environment dynamics and task\-relevant signals\.
Crucially, no planning is performed; the learned models are used solely to shape the representation, isolating the benefits of predictive learning without the computational overhead and instability of explicit model\-based control\.
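As a rough illustration of how such predictive objectives combine into a single training signal, the sketch below sums a latent-dynamics loss, a reward loss, and a termination loss for one transition. The specific loss forms (squared error and binary cross-entropy), the weights `w_dyn`, `w_rew`, `w_term`, and all function names are assumptions chosen for simplicity; they are not taken from the MR.Q implementation, which may use different losses.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce(prob, label, eps=1e-8):
    """Binary cross-entropy for one predicted termination probability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def predictive_loss(z_next_pred, z_next_target, r_pred, r, d_prob, d,
                    w_dyn=1.0, w_rew=1.0, w_term=1.0):
    """Weighted sum of latent-dynamics, reward, and termination losses.

    In training, this scalar would be differentiated with respect to the
    encoder parameters so that prediction errors shape the representation.
    """
    dyn = mse(z_next_pred, z_next_target)  # predict z_{t+1} from (z_t, a_t)
    rew = (r_pred - r) ** 2                # predict r_t
    term = bce(d_prob, d)                  # predict d_t
    return w_dyn * dyn + w_rew * rew + w_term * term


# Illustrative values for a single non-terminal transition.
loss = predictive_loss(
    z_next_pred=[0.0, 1.0], z_next_target=[0.0, 0.0],
    r_pred=0.5, r=1.0, d_prob=0.5, d=0,
)
```

The key design point mirrored here is that the predictions are never rolled out for planning; the loss exists only to provide gradients for the encoder.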
## 3 Scaling deep RL through Representation Learning
Figure 1: Representation quality drives scaling performance in model\-free RL\. We compare standard PPO with a variant augmented with model\-based representations \(\+ MB\. Representations\) across four network sizes \(Small, Medium, Large, X\-Large\) on HalfCheetah and Humanoid\.
A central challenge in deep RL is how to scale agents across tasks, model capacity, and data\. Recent progress has been largely driven by model\-based approaches, where agents learn predictive world models and leverage planning to improve decision\-making \(Hansen et al\., [2024](https://arxiv.org/html/2606.05555#bib.bib11); Hafner et al\., [2025a](https://arxiv.org/html/2606.05555#bib.bib69)\)\. Methods such as Dreamer and TD\-MPC2 demonstrate that combining predictive modeling with large\-capacity function approximators can substantially improve performance in both single\-task and multitask settings\. At larger scales, systems such as Newt \(Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\) extend this paradigm to hundreds of tasks by training shared world models across diverse continuous\-control domains, demonstrating strong multitask performance and transfer\. However, these gains come with significant computational and algorithmic overhead\. Model\-based agents must jointly learn dynamics, reward, and value functions while additionally performing latent rollouts or planning during training or inference\. This increases wall\-clock cost, memory usage, and implementation complexity, while also introducing additional sources of instability as model errors compound over imagined trajectories \(Talvitie, [2014](https://arxiv.org/html/2606.05555#bib.bib121); Janner et al\., [2019](https://arxiv.org/html/2606.05555#bib.bib141)\)\. These challenges become particularly pronounced in multitask settings, where a single world model must capture diverse and potentially conflicting dynamics across environments\. Additional discussion of related work is provided in [App\. C](https://arxiv.org/html/2606.05555#A3)\.
At the same time, recent work suggests that some benefits commonly attributed to model\-based RL may instead arise from the representations induced by predictive learning \(Schwarzer et al\., [2021](https://arxiv.org/html/2606.05555#bib.bib35); Ghugare et al\., [2023](https://arxiv.org/html/2606.05555#bib.bib5); Zhao et al\., [2023](https://arxiv.org/html/2606.05555#bib.bib4)\)\. In particular, methods such as MR\.Q show that model\-free agents augmented with auxiliary predictive objectives can achieve strong performance across diverse tasks without explicit planning \(Fujimoto et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib19)\)\.
To isolate this effect, we study a controlled single\-task setting where planning and multitask interference are absent \([Fig\. 1](https://arxiv.org/html/2606.05555#S3.F1)\)\. We compare standard PPO \(Schulman et al\., [2017](https://arxiv.org/html/2606.05555#bib.bib33)\) with a variant augmented with predictive model\-based representations \(\+ MB\. Representations\) across four network sizes on HalfCheetah and Humanoid, two environments of increasing complexity and dimensionality\. Without predictive representations, scaling model capacity yields little to no benefit: on HalfCheetah, larger PPO models can even underperform smaller ones, while on Humanoid, performance remains nearly flat across all model sizes\. With predictive representations, however, PPO consistently outperforms standard PPO at every network size and offers increased robustness to varied capacity\. These results hint that representation quality may be an important bottleneck when scaling deep RL systems\. Predictive objectives provide an additional supervisory signal that appears to help larger models make more effective use of increased capacity, whereas reward\-only supervision often struggles to do so\.
This finding has direct implications for multitask RL, where scaling shared architectures is critical for learning transferable representations across diverse tasks \(Nauman et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib97); Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\), and where additional challenges \(task interference, non\-stationarity, and distributional shift\) may make representation quality an even more severe bottleneck\. This motivates the central question of this work: *Can model\-free RL match the scalability and generalization of world\-model approaches in multitask settings by focusing on representation learning alone?*
## 4 Multitask Model\-Free RL with Structured Representations
In this section, we evaluate whether model\-free RL augmented with predictive representation learning can match recent world\-model approaches in multitask settings\. We show that MR\.Q consistently matches or surpasses the large\-scale world\-model baseline Newt across diverse multitask domains without relying on planning or latent rollouts\. We further analyze how predictive representation learning impacts representation geometry and optimization stability\.
World models provide useful inductive biases through predictive supervision and structured latent representations \(Ha and Schmidhuber, [2018](https://arxiv.org/html/2606.05555#bib.bib134); Hafner et al\., [2020a](https://arxiv.org/html/2606.05555#bib.bib135); Gelada et al\., [2019](https://arxiv.org/html/2606.05555#bib.bib126); Schwarzer et al\., [2021](https://arxiv.org/html/2606.05555#bib.bib35)\)\. However, many of their benefits may arise from the learned representations rather than planning itself\. This motivates model\-free approaches that incorporate predictive representation learning while preserving the simplicity, efficiency, and scalability of model\-free RL\.
#### Baselines and Evaluation Protocol\.
We compare against a strong model\-based baseline, Newt \(Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\), in a multitask setting under fixed interaction budgets\. Our primary evaluation is conducted in a low\-data regime of 10M environment steps, where sample efficiency is critical, in contrast to prior work that typically evaluates at 100M environment steps \(Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\)\. To assess scalability, we additionally include selected longer runs\. We report aggregate learning curves to evaluate sample efficiency, as well as final performance at the end of training, averaging results over five seeds and reporting 95% confidence intervals \(CIs\) across tasks and runs\. Our results show that equipping TD3 \(Fujimoto et al\., [2018](https://arxiv.org/html/2606.05555#bib.bib16)\) with predictive representation learning objectives \(MR\.Q \(Fujimoto et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib19)\)\) enables model\-free methods to match or surpass model\-based approaches\. All experiments follow the multitask language\-conditioned training protocol introduced in Newt \(Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\); see [App\. E](https://arxiv.org/html/2606.05555#A5) and [App\. G](https://arxiv.org/html/2606.05555#A7) for MR\.Q and training details\.
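The reporting step of this protocol can be sketched as follows. The text does not state how the 95% CIs are computed, so this example assumes a Student's t interval over five per-seed scores; a bootstrap interval would be an equally plausible choice. The function name `mean_and_ci95` and the scores are hypothetical and are not results from the paper.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (n = 5).
T_CRIT_5_SEEDS = 2.776


def mean_and_ci95(scores):
    """Return (mean, lower, upper) for final scores from exactly five seeds."""
    if len(scores) != 5:
        raise ValueError("the hard-coded critical value assumes five seeds")
    mean = statistics.mean(scores)
    # Standard error uses the sample standard deviation (n - 1 denominator).
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    half_width = T_CRIT_5_SEEDS * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical normalized scores, one per seed (not results from the paper).
mean, lo, hi = mean_and_ci95([0.60, 0.62, 0.58, 0.61, 0.59])
```

With only five seeds the t critical value (2.776) is noticeably wider than the normal approximation (1.96), which is why the small-sample form is used in this sketch.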
Figure 2: Per\-domain aggregate performance across all 10 MMBench domains\. Average normalized score of MR\.Q \(solid, teal\) versus Newt \(dashed, red\) on state\-based multitask benchmarks from MMBench \(Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\), spanning continuous control, manipulation, locomotion, and discrete game domains\. MR\.Q, a model\-free agent with model\-based representation learning, consistently matches or surpasses the model\-based Newt baseline in sample efficiency and final performance across all domains\. Shaded regions denote 95% CIs across five seeds\.
#### Learning Across Tasks\.
We consider a multitask setting where a single agent is trained jointly across a diverse set of environments that share observation and action spaces but differ in dynamics and reward functions, following prior work on multitask RL \(Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\)\. Training is performed by interleaving experience from multiple tasks under a shared set of parameters\. This setup enables knowledge sharing across tasks, but introduces several challenges: \(i\) non\-stationarity, as the data distribution shifts with the task mixture; \(ii\) interference, as shared representations must support multiple, potentially conflicting objectives; and \(iii\) optimization difficulty, as gradients from different tasks may not align\. These challenges make representation learning a central bottleneck for scaling RL systems, and they are particularly relevant for evaluating predictive representation learning\. If predictive objectives improve latent structure, temporal consistency, and feature reuse across tasks, they may alleviate optimization instability and reduce interference even in the absence of explicit planning\. [Fig\. 2](https://arxiv.org/html/2606.05555#S4.F2) shows that MR\.Q consistently improves both sample efficiency and final performance across diverse multitask domains, suggesting that predictive representation learning alone can substantially improve cross\-task generalization and optimization stability\. Additional per\-task learning curves are provided in [App\. K](https://arxiv.org/html/2606.05555#A11), and detailed descriptions of the multitask suites and training protocol are given in [App\. D](https://arxiv.org/html/2606.05555#A4)\. Unless otherwise specified, results are averaged over five seeds\.
#### Training for Longer\.
Figure 3: Extended training performance \(up to 50M environment steps\)\. MR\.Q sustains strong performance at scale, surpassing Newt, indicating that gains from structured representations persist beyond the low\-data regime\.
While our primary evaluation focuses on the low\-data regime, it is important to assess whether the observed gains persist at larger interaction budgets\. To this end, we evaluate MR\.Q in extended training settings, scaling up to 50M environment steps\. This allows us to analyze asymptotic performance and determine whether improvements in representation learning continue to provide benefits beyond the initial learning phase\. [Fig\. 3](https://arxiv.org/html/2606.05555#S4.F3) shows that MR\.Q maintains strong performance as the number of interactions increases, matching or surpassing model\-based approaches such as Newt in multitask settings\. This suggests that the advantages of structured representation learning are not limited to sample efficiency but also translate to improved scalability, reinforcing the view that a model\-free method equipped with structured representations provides a scalable alternative to model\-based approaches, achieving strong performance across both low\- and high\-data regimes\.
#### Visual Observations\.
While most experiments are conducted in the state\-based setting, we additionally evaluate performance under high\-dimensional visual inputs, following prior multitask benchmarks \(Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\)\. We use a pretrained DINOv2 encoder \(Oquab et al\., [2024](https://arxiv.org/html/2606.05555#bib.bib136)\) to extract features from pixels, which are then used by the policy and value networks\. This setup removes the burden of learning representations from scratch, allowing us to isolate the role of downstream representation adaptation\. Despite strong pretrained features, representation learning remains a key bottleneck in the multitask regime\. The agent must adapt shared embeddings across diverse tasks, introducing interference and instability in value learning\. As shown in [Fig\. 4](https://arxiv.org/html/2606.05555#S4.F4), MR\.Q consistently outperforms Newt across all domains, achieving higher sample efficiency and final performance\. These results highlight that the benefits of structured representation learning extend beyond low\-dimensional settings\. Even with powerful pretrained encoders, predictive objectives remain important for learning representations that support effective multitask learning under high\-dimensional inputs\.
Figure 4: Pixel\-based multitask learning curves across five domains\. Average normalized score of MR\.Q \(solid\) and Newt \(dashed\) using visual observations with a frozen DINOv2 encoder\. MR\.Q consistently achieves higher sample efficiency and final performance, demonstrating that its predictive auxiliary objectives yield better task\-relevant representations in the high\-dimensional input regime\. Shaded regions denote 95% CIs across five seeds\.
### 4\.1 Analyses
To isolate the impact of model\-based representation learning and assess the structural integrity of the learned representations, we compare MR\.Q against an encoder\-free baseline \(TD3\)\. In this ablation, the encoder is removed entirely: the actor receives a direct concatenation of the raw low\-dimensional state and the 512\-dimensional language instruction embedding as input, while the critic additionally receives the raw action\.
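The input construction for this ablation can be sketched as below. Only the 512-dimensional instruction embedding and the fact that the critic also receives the action come from the text; the concatenation order, the helper names, and the example state and action sizes are assumptions for illustration.

```python
LANG_EMBED_DIM = 512  # size of the language instruction embedding stated above


def actor_input(state, lang_embedding):
    """Encoder-free actor input: raw state concatenated with the instruction embedding."""
    if len(lang_embedding) != LANG_EMBED_DIM:
        raise ValueError("expected a 512-dimensional instruction embedding")
    return list(state) + list(lang_embedding)


def critic_input(state, lang_embedding, action):
    """Encoder-free critic input: the actor input plus the raw action."""
    return actor_input(state, lang_embedding) + list(action)


# Hypothetical sizes: a 17-dimensional state and a 6-dimensional action.
state = [0.0] * 17
lang = [0.0] * LANG_EMBED_DIM
action = [0.0] * 6
actor_x = actor_input(state, lang)
critic_x = critic_input(state, lang, action)
```

Under these example sizes the actor sees 529 features and the critic 535, so the baseline's networks consume raw inputs directly instead of a learned latent.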
Figure 5: Performance comparison across benchmark suites\. Per\-domain aggregate performance for MR\.Q, the encoder\-free baseline \(TD3\), and Newt across four MMBench domains\.
#### Performance Comparison\.
We evaluate the performance of MR\.Q alongside the encoder\-free baseline \(TD3\) and Newt, as shown in [Fig\. 5](https://arxiv.org/html/2606.05555#S4.F5)\. MR\.Q outperforms the encoder\-free baseline in three out of four domains and achieves comparable results in the remaining one, indicating overall superior performance and sample efficiency\. Interestingly, even the encoder\-free baseline consistently matches or outperforms Newt across all four domains\. This indicates that a well\-tuned model\-free architecture utilizing raw low\-dimensional states and language instruction embeddings constitutes a highly competitive baseline while offering superior computational efficiency compared to a model\-based RL approach\.
Notably, this finding highlights the inherent robustness of model\-free RL in multitask regimes, suggesting that explicit world\-modeling may not be a strict prerequisite for handling multitask RL\. Beyond aggregate performance, the encoder\-free baseline facilitates a diagnostic evaluation of how learned representations modulate underlying learning dynamics when compared against MR\.Q\. We analyze these effects across three key dimensions; the findings are summarized in [Fig\. 7](https://arxiv.org/html/2606.05555#S4.F7) and discussed in detail below\. Results are averaged over five seeds, and shaded areas represent 95% CIs\.
#### Representation Geometry\.
Figure 6: PCA visualization of multitask latent representations\. Two\-dimensional PCA projections of latent features extracted from multitask checkpoints trained on DMControl\-Ext \(left\) and MuJoCo \(right\)\. Each point corresponds to an observation colored by task identity\. MR\.Q learns structured and well\-separated task representations with substantially higher effective dimensionality \(95%\-d\), whereas removing predictive representation learning leads to collapsed and lower\-rank embeddings\.
We evaluate feature capacity by measuring the SRank \(Kumar et al\., [2021](https://arxiv.org/html/2606.05555#bib.bib104)\) of the state representations\. While the encoder\-free baseline TD3 uses a larger input dimensionality \($512 + d_{obs}$\) than the $512$\-dimensional latent space of MR\.Q, its SRank is significantly lower\. This suggests that raw observations result in a redundant feature space when processed without specific inductive biases\. In contrast, the representation learning in MR\.Q enforces a high\-rank manifold that has better representational capacity\. We additionally perform Principal Component Analysis \(PCA\) on the latent features at the end of 10M training steps to quantify the variance distribution\. We measure effective dimensionality by calculating the number of principal components required to explain 95% of the variance \($95\%$\-d\)\. Consistent with the SRank collapse, removing the encoder causes a severe representational bottleneck: across the DMControl\-Ext and MuJoCo suites, the $95\%$\-d drops from $89$ and $66$ down to merely $21$ and $15$, respectively\. As depicted in [Fig\. 6](https://arxiv.org/html/2606.05555#S4.F6), these results validate the necessity of representation learning in preserving the expressive capacity required to scale across diverse tasks\.
Colors denote different tasks \(12 tasks for DMControl\-Ext and 6 tasks for MuJoCo\); task labels are omitted in the main figure for readability, with fully annotated visualizations provided in[App\. I](https://arxiv.org/html/2606.05555#A9)\.
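As a rough illustration of the two capacity measures discussed above, the following NumPy sketch computes an SRank-style statistic and the 95%-d effective dimensionality from a matrix of latent features. The function names, the threshold `delta = 0.01`, and the random stand-in features are assumptions made for illustration; the paper's exact settings and evaluation code may differ.

```python
import numpy as np


def srank(features: np.ndarray, delta: float = 0.01) -> int:
    """Smallest k such that the top-k singular values hold a (1 - delta)
    share of the total singular-value mass (assumed delta = 0.01)."""
    singular_values = np.linalg.svd(features, compute_uv=False)  # descending order
    cumulative = np.cumsum(singular_values) / np.sum(singular_values)
    # First index where the cumulative share reaches 1 - delta, as a count.
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)


def effective_dim(features: np.ndarray, variance_threshold: float = 0.95) -> int:
    """Number of principal components needed to explain
    `variance_threshold` of the total variance (the "95%-d" statistic)."""
    centered = features - features.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return int(np.searchsorted(np.cumsum(explained), variance_threshold) + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-in for a batch of 512-dimensional latent features.
    latents = rng.normal(size=(2048, 512))
    print("SRank:", srank(latents))
    print("95%-d:", effective_dim(latents))
```

Both statistics are read off the singular-value spectrum, so a collapsed representation, where a few directions carry most of the mass, yields small values for both.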
#### Training Dynamics\.
To study how representation quality impacts optimization dynamics, we monitor the fraction of dormant neurons \(Sokar et al\., [2023](https://arxiv.org/html/2606.05555#bib.bib46); Liu et al\., [2025b](https://arxiv.org/html/2606.05555#bib.bib82)\), which measures the proportion of inactive units in the network\. Dormant neurons indicate underutilized capacity and reduced plasticity, both of which are particularly harmful in multitask settings where agents must continually adapt to diverse and shifting objectives\. MR\.Q consistently exhibits a substantially lower fraction of dormant units than the encoder\-free baseline, especially in the critic network where the gap becomes pronounced throughout training\. In contrast, removing predictive representation learning leads to widespread critic dormancy, suggesting that the critic fails to effectively utilize the available network capacity\. This degradation is accompanied by higher value losses, indicating that collapsed or poorly structured representations make value learning significantly more difficult under multitask non\-stationarity\.
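The dormant-neuron fraction monitored above can be sketched for a single layer as follows. The normalization follows the general form of the score in Sokar et al. (2023), where each unit's mean absolute activation is divided by the layer-wide average; the function name and the threshold `tau = 0.1` are assumptions for illustration, not the settings used in the paper.

```python
import numpy as np


def dormant_fraction(activations: np.ndarray, tau: float = 0.1) -> float:
    """Fraction of tau-dormant units in one layer.

    activations: array of shape (batch, units) holding post-activation outputs.
    A unit counts as dormant when its normalized score is <= tau (assumed value).
    """
    mean_abs = np.abs(activations).mean(axis=0)  # average |h_i(x)| per unit
    layer_mean = mean_abs.mean()                 # average over units in the layer
    if layer_mean == 0.0:
        return 1.0  # every unit is silent on this batch
    scores = mean_abs / layer_mean
    return float(np.mean(scores <= tau))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical ReLU layer outputs for a batch of 256 inputs and 64 units.
    pre_activations = rng.normal(size=(256, 64))
    print("dormant fraction:", dormant_fraction(np.maximum(pre_activations, 0.0)))
```

A network-level number can then be obtained by pooling the dormant counts over layers, which is one plausible way to arrive at the per-network fractions reported for the actor and critic.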
Figure 7: Empirical analyses for the effect of representation learning\. Comparison of MR\.Q against an encoder\-free baseline \(TD3\)\. From left to right: aggregate return across task sets, state representation SRank, value loss, and dormant neuron fractions in the actor and critic\.
Overall, these results suggest that predictive representation learning not only improves representation geometry, but also preserves optimization stability and network plasticity during training \(Mayor et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib32)\)\. By maintaining expressive and active latent features, MR\.Q enables the critic to make more effective use of model capacity, helping stabilize value learning across diverse multitask domains\. However, competitive performance on fixed multitask benchmarks alone does not fully characterize scalability\. In practice, scalable RL systems must continue to improve with increased task diversity, model capacity, data, and computation while remaining computationally efficient\. In [Sec\. 5](https://arxiv.org/html/2606.05555#S5), we therefore investigate how model\-free RL equipped with model\-based representations behaves across these scaling axes in large multitask settings\.
## 5Evaluation at Scale
A central question is whether model\-free methods can scale as effectively as model\-based approaches in multitask RL\. We study this across multiple scaling axes, including task diversity, model capacity, data, update frequency, and computational efficiency\.
#### Towards General Multitask RL Agents\.
To evaluate scalability, we train MR\.Q on a large combined benchmark of 200 tasks spanning multiple domains in a unified setting\. This setting stress\-tests whether structured, model\-free representations can scale to the diversity required of general\-purpose multitask agents, where a single model must simultaneously solve locomotion, manipulation, navigation, and arcade tasks\. [Fig\. 8](https://arxiv.org/html/2606.05555#S5.F8) \(left\) reveals that MR\.Q exhibits substantially higher *sample efficiency* throughout training: at 2M environment steps it achieves a normalized score of 0\.11 versus 0\.08 for Newt \(\+37% relative\), and maintains a consistent lead of 5–8% across the range\. This early advantage is practically significant in large\-scale settings, where each additional interaction is costly\. From a representation learning perspective, this suggests that model\-free agents with structured latent spaces can match the representational expressiveness of world\-model\-based methods at scale, while requiring fewer environment interactions to do so\.



Figure 8: \(Left\) Large\-scale multitask training across 200 tasks\. Normalized score throughout training on a combined benchmark of tasks spanning multiple domains\. MR\.Q consistently outperforms Newt during training, while both methods converge to similar final performance\. Data and model scaling in multitask RL\. \(Middle\) Data scaling: performance as a function of training data for different dataset sizes\. \(Right\) Model scaling: performance across model sizes\. MR\.Q exhibits stable scaling across both axes, while Newt shows sensitivity to reduced data and smaller models\.
#### Model and Data Scaling\.
We analyze how multitask performance scales with data and model capacity\. [Fig\. 8](https://arxiv.org/html/2606.05555#S5.F8) \(middle\) shows performance as a function of available training data\. Both methods improve with increased data, but exhibit different scaling behaviors\. MR\.Q shows consistent gains across data regimes, maintaining strong performance even with reduced data\. In contrast, Newt is more sensitive to data availability, with larger performance degradation in low\-data settings\. [Fig\. 8](https://arxiv.org/html/2606.05555#S5.F8) \(right\) shows scaling with model size\. MR\.Q exhibits smooth and predictable improvements as capacity increases, indicating effective utilization of additional parameters\. Newt, however, shows weaker scaling, with smaller gains and higher sensitivity to model size\. These results suggest that scaling performance is determined not only by access to more data or larger models, but also by how effectively additional capacity is utilized\. MR\.Q exhibits more stable scaling behavior across both axes, allowing it to better leverage increased data and model capacity\.
Figure 9: Few\-shot finetuning on held\-out tasks\. Average normalized score across 28 unseen tasks during finetuning steps from a 10M\-step multitask checkpoint\. MR\.Q achieves 50% higher zero\-shot performance and a ∼13% advantage throughout training\.
#### Scaling with Update\-to\-Data Ratio\.
We analyze how performance scales as a function of the update\-to\-data \(UTD\) ratio, which controls the number of gradient updates performed per environment interaction\. Increasing UTD effectively increases the amount of computation applied to a fixed dataset, probing how efficiently a method can extract information from available data\. [Fig\. 11](https://arxiv.org/html/2606.05555#A8.F11) shows performance as a function of environment steps for different UTD values\. As shown in [Fig\. 11](https://arxiv.org/html/2606.05555#A8.F11) \(left\), MR\.Q benefits consistently from increasing UTD, with higher update regimes leading to improved performance across training\. This indicates that the agent can effectively use additional gradient updates without destabilizing learning\. In contrast, Newt \([Fig\. 11](https://arxiv.org/html/2606.05555#A8.F11), right\) exhibits weaker scaling; performance improves slowly and shows diminishing returns at higher UTD\.
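To make the role of the UTD ratio concrete, the toy loop below performs `utd` gradient updates after every environment step once a small warm-up buffer has been filled. It is only a schematic under stated assumptions: the transition is a random placeholder, the "update" is a counter rather than a real optimizer step, and the names `train`, `warmup`, and `batch_size` are hypothetical and not taken from the paper's code. It also assumes an integer UTD.

```python
import random
from collections import deque


def train(num_env_steps: int, utd: int, batch_size: int = 4, warmup: int = 8) -> dict:
    """Count environment steps and gradient updates for a given integer UTD ratio."""
    buffer = deque(maxlen=10_000)
    rng = random.Random(0)
    counts = {"env_steps": 0, "gradient_updates": 0}
    for _ in range(num_env_steps):
        # Placeholder transition; a real agent would store (s, a, r, s', done).
        buffer.append((rng.random(), rng.random()))
        counts["env_steps"] += 1
        if len(buffer) < warmup:
            continue  # not enough data to sample a batch yet
        for _ in range(utd):
            batch = rng.sample(list(buffer), batch_size)
            # A real agent would take one gradient step on `batch` here.
            counts["gradient_updates"] += 1
    return counts


if __name__ == "__main__":
    for utd in (1, 2, 4):
        print(utd, train(num_env_steps=100, utd=utd))
```

With the interaction budget held fixed, raising `utd` multiplies the compute spent per collected transition, which is the axis varied in the UTD experiments.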
#### Few\-shot Finetuning\.
To evaluate transfer to unseen tasks, we hold out a set of 28 tasks spanning multiple domains and finetune each individually using online RL, initializing from a checkpoint trained for 10M environment steps on the remaining 200 tasks\. We compare MR\.Q against Newt under an identical finetuning budget of 200k steps\. MR\.Q provides a substantially stronger zero\-shot initialization before any finetuning as shown in [Fig\. 9](https://arxiv.org/html/2606.05555#S5.F9)\. MR\.Q achieves an average normalized score of 0\.13 versus 0\.09 for Newt, a 50% relative advantage, indicating that multitask pretraining with MR\.Q yields more transferable representations\. This advantage is preserved throughout adaptation: at 100k steps MR\.Q scores 0\.55 versus 0\.48 for Newt \(\+12\.8%\), and at 200k steps 0\.62 versus 0\.55 \(\+12\.9%\)\. At the individual task level, MR\.Q outperforms Newt on 17 of 28 held\-out tasks \(61%\) at the end of finetuning\. These results suggest that MR\.Q’s structured, model\-free representations learned during multitask pretraining transfer more effectively to novel tasks, enabling both a better starting point and faster convergence during adaptation\.
#### Computational Impact\.
We evaluate wall\-clock efficiency by measuring performance as a function of training time\. While standard deep RL evaluations emphasize sample efficiency, methods with similar interaction budgets can differ substantially in time\-to\-performance\. Model\-based approaches incur additional overhead from learning dynamics models and performing latent rollouts, which slows down training despite their strong sample efficiency\. In contrast, MR\.Q avoids explicit planning and simulation, learning structured representations directly from data\. As a result, MR\.Q achieves faster improvement in performance per unit of time, reaching higher returns significantly earlier than model\-based baselines, as shown in [Fig\. 10](https://arxiv.org/html/2606.05555#S5.F10)\. This highlights that gains in sample efficiency for world\-model approaches often come at the cost of increased computational overhead\. These differences have important practical implications, as higher compute requirements translate into longer training times, increased energy consumption, and reduced accessibility\.
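One simple way to summarize time-to-performance from logged curves is the first wall-clock time at which a run reaches a target score. The sketch below is one possible formulation under stated assumptions; the function name, the linear interpolation between logged points, and the synthetic curves are illustrative and do not reproduce the paper's measurements.

```python
import numpy as np


def time_to_score(hours, scores, target: float) -> float:
    """First wall-clock time (in hours) at which `scores` reaches `target`.

    Linearly interpolates between the last logged point below the target and
    the first at or above it; returns inf if the target is never reached.
    """
    hours = np.asarray(hours, dtype=float)
    scores = np.asarray(scores, dtype=float)
    reached = np.nonzero(scores >= target)[0]
    if reached.size == 0:
        return float("inf")
    i = int(reached[0])
    if i == 0:
        return float(hours[0])
    frac = (target - scores[i - 1]) / (scores[i] - scores[i - 1])
    return float(hours[i - 1] + frac * (hours[i] - hours[i - 1]))


if __name__ == "__main__":
    # Synthetic curves for two hypothetical methods, not results from the paper.
    fast = time_to_score([0, 1, 2, 3], [0.0, 0.3, 0.5, 0.6], target=0.5)
    slow = time_to_score([0, 2, 4, 6], [0.0, 0.3, 0.5, 0.6], target=0.5)
    print("hours to reach 0.5:", fast, slow)
```

Two methods with the same score per environment step can differ widely on this statistic when one of them spends more compute per step, which is the distinction between sample efficiency and wall-clock efficiency drawn above.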





Figure 10: Wall\-clock efficiency\. Normalized score as a function of wall\-clock training time \(hours\) on five MMBench domains\. MR\.Q consistently reaches higher returns earlier than Newt, a model\-based baseline that incurs substantial overhead from world\-model learning and latent rollout generation\. Shaded regions denote 95% CIs\. All runs use a fixed budget of 10M environment steps\.
## 6Lessons and Opportunities
Scaling deep RL to large and diverse multitask settings remains a central challenge\. In this work, we studied the role of model\-based representations in the multitask RL setting and showed that a simple model\-free approach augmented with predictive objectives can match or surpass a recent large\-scale world\-model baseline while substantially reducing computational overhead and improving wall\-clock efficiency \(see [Fig\. 10](https://arxiv.org/html/2606.05555#S5.F10)\)\. Our analyses further demonstrate that predictive representation learning improves representation geometry, stabilizes optimization, and enables more effective utilization of model capacity as systems scale \([Sec\. 5](https://arxiv.org/html/2606.05555#S5)\)\.
More broadly, our results highlight the importance of developing sample\- and compute\-efficient multitask RL algorithms that can learn effectively across diverse tasks under a realistic interaction budget\. In contrast to prior work that evaluates at substantially larger scales, our primary experiments are conducted in a challenging 10M interaction regime, where efficiency and representation quality become critical bottlenecks\. This is particularly important in real\-world settings, where interaction data is costly\. Despite this restricted budget, our approach consistently matches or surpasses large\-scale world\-model baselines across multiple axes, including task diversity, model scaling, transfer, and wall\-clock efficiency\. These findings suggest that scalable multitask RL may depend not only on larger model size or interaction budgets, but also on learning effective representations\.
#### Limitations and Future Work\.
Our study focuses primarily on continuous\-control multitask benchmarks, and it remains unclear how these findings extend to more diverse domains such as long\-horizon environments\. In addition, although our analyses suggest that predictive objectives improve representation quality and scaling behavior, the mechanisms underlying these improvements are not yet fully understood\. A more principled theoretical understanding could help guide the design of future sample\-efficient multitask agents\. An important direction for future work is to further investigate the relationship between predictive representation learning and planning\. While MR\.Q demonstrates that strong multitask performance can emerge without explicit planning, hybrid approaches that combine scalable model\-free representation learning with latent planning or imagination\-based rollouts may provide complementary benefits \(Chang et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib153)\)\. More broadly, understanding how representation learning, planning, and scaling interact in large multitask systems remains an important open problem for deep RL\.
## Acknowledgments
The authors would like to thank Sami Nur Islam, Walter Mayor Toro and Gopeshh Subbaraj for valuable discussions during the preparation of this work\. We would also like to give special thanks to Ghada Sokar for providing valuable feedback on an early draft of the paper\.
The research was enabled in part by computational resources provided by the Digital Research Alliance of Canada \([https://alliancecan\.ca](https://alliancecan.ca/)\) and Mila \([https://mila\.quebec](https://mila.quebec/)\)\. Pablo Samuel Castro acknowledges funding from an NSERC Discovery Grant\. We acknowledge funding support from Google and CIFAR AI\. We would also like to thank the Python community \(Van Rossum and Drake Jr, [1995](https://arxiv.org/html/2606.05555#bib.bib73); Oliphant, [2007](https://arxiv.org/html/2606.05555#bib.bib52)\) for developing tools that enabled this work, including NumPy \(Harris et al\., [2020](https://arxiv.org/html/2606.05555#bib.bib74)\), Matplotlib \(Hunter, [2007](https://arxiv.org/html/2606.05555#bib.bib75)\), Jupyter \(Kluyver et al\., [2016](https://arxiv.org/html/2606.05555#bib.bib78)\), and Pandas \(McKinney, [2013](https://arxiv.org/html/2606.05555#bib.bib79)\)\.
## References
- Learning generalizable representations for reinforcement learning via adaptive meta\-learner of behavioral similarities\.InProceedings of the International Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2606.05555#S2.SS0.SSS0.Px2.p1.6)\.
- I\. Akkaya, M\. Andrychowicz, M\. Chociej, M\. Litwin, B\. McGrew, A\. Petron, A\. Paino, M\. Plappert, G\. Powell, R\. Ribas,et al\.\(2019\)Solving rubik’s cube with a robot hand\.arXiv preprint arXiv:1910\.07113\.Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p1.1)\.
- J\. Alayrac, J\. Donahue, P\. Luc, A\. Miech, I\. Barr, Y\. Hasson, K\. Lenc, A\. Mensch, K\. Millican, M\. Reynolds, R\. Ring, E\. Rutherford, S\. Cabi, T\. Han, Z\. Gong, S\. Samangooei, M\. Monteiro, J\. L\. Menick, S\. Borgeaud, A\. Brock, A\. Nematzadeh, S\. Sharifzadeh, M\. Bińkowski, R\. Barreira, O\. Vinyals, A\. Zisserman, and K\. Simonyan \(2022\)Flamingo: a visual language model for few\-shot learning\.InAdvances in Neural Information Processing Systems,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),Vol\.35,pp\. 23716–23736\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p1.1)\.
- A\. Anand, J\. C\. Walker, Y\. Li, E\. Vértes, J\. Schrittwieser, S\. Ozair, T\. Weber, and J\. B\. Hamrick \(2022\)Procedural generalization by planning with self\-supervised world models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=FmBegXJToY)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p4.1)\.
- M\. Andrychowicz, A\. Raichuk, P\. Stańczyk, M\. Orsini, S\. Girgin, R\. Marinier, L\. Hussenot, M\. Geist, O\. Pietquin, M\. Michalski, S\. Gelly, and O\. Bachem \(2021\)What matters for on\-policy deep actor\-critic methods? a large\-scale study\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=nIAxjsniDzg)Cited by:[Appendix J](https://arxiv.org/html/2606.05555#A10.p2.1)\.
- F\. Bai, H\. Zhang, T\. Tao, Z\. Wu, Y\. Wang, and B\. Xu \(2023\)Picor: multi\-task deep reinforcement learning with policy correction\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.37,pp\. 6728–6736\.Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px3.p1.1)\.
- M\. G\. Bellemare, Y\. Naddaf, J\. Veness, and M\. Bowling \(2013\)The arcade learning environment: an evaluation platform for general agents\.J\. Artif\. Int\. Res\.47\(1\),pp\. 253–279\.External Links:ISSN 1076\-9757Cited by:[Appendix D](https://arxiv.org/html/2606.05555#A4.SS0.SSS0.Px8.p1.1)\.
- G\. Brockman, V\. Cheung, L\. Pettersson, J\. Schneider, J\. Schulman, J\. Tang, and W\. Zaremba \(2016\)Openai gym\.arXiv preprint arXiv:1606\.01540\.Cited by:[Appendix D](https://arxiv.org/html/2606.05555#A4.SS0.SSS0.Px6.p1.1)\.
- R\. C\. Castanyer, J\. Obando\-Ceron, L\. Li, P\. Bacon, G\. Berseth, A\. Courville, and P\. S\. Castro \(2025\)Stable gradients for stable learning at scale in deep reinforcement learning\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=Vqj65VeDOu)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p1.1)\.
- P\. S\. Castro, T\. Kastner, P\. Panangaden, and M\. Rowland \(2021\)MICo: improved representations via sampling\-based state similarity for markov decision processes\.Advances in Neural Information Processing Systems34,pp\. 30113–30126\.Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p2.1)\.
- J\. S\. O\. Ceron, J\. G\. M\. Araújo, A\. Courville, and P\. S\. Castro \(2024a\)On the consistency of hyper\-parameter selection in value\-based deep reinforcement learning\.InReinforcement Learning Conference,External Links:[Link](https://openreview.net/forum?id=szUyvvwoZB)Cited by:[Appendix J](https://arxiv.org/html/2606.05555#A10.p2.1)\.
- J\. S\. O\. Ceron, A\. Courville, and P\. S\. Castro \(2024b\)In value\-based deep reinforcement learning, a pruned network is a good network\.InInternational Conference on Machine Learning,pp\. 38495–38519\.Cited by:[Appendix J](https://arxiv.org/html/2606.05555#A10.p2.1)\.
- J\. S\. O\. Ceron, G\. Sokar, T\. Willi, C\. Lyle, J\. Farebrother, J\. N\. Foerster, G\. K\. Dziugaite, D\. Precup, and P\. S\. Castro \(2024c\)Mixtures of experts unlock parameter scaling for deep rl\.InInternational Conference on Machine Learning,pp\. 38520–38540\.Cited by:[Appendix J](https://arxiv.org/html/2606.05555#A10.p2.1)\.
- E\. Cetin, B\. P\. Chamberlain, M\. M\. Bronstein, and J\. J\. Hunt \(2023\)Hyperbolic deep reinforcement learning\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=TfBHFLgv77)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p2.1)\.
- W\. Chang, M\. Henaff, B\. Amos, G\. Dudek, and S\. Fujimoto \(2026\)The surprising difficulty of search in model\-based reinforcement learning\.arXiv preprint arXiv:2601\.21306\.Cited by:[§6](https://arxiv.org/html/2606.05555#S6.SS0.SSS0.Px1.p1.1)\.
- K\. Chua, R\. Calandra, R\. McAllister, and S\. Levine \(2018\)Deep reinforcement learning in a handful of trials using probabilistic dynamics models\.Advances in Neural Information Processing Systems31\.Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p3.1)\.
- I\. Clavera, J\. Rothfuss, J\. Schulman, Y\. Fujita, T\. Asfour, and P\. Abbeel \(2018\)Model\-based reinforcement learning via meta\-policy optimization\.InProceedings of The 2nd Conference on Robot Learning,A\. Billard, A\. Dragan, J\. Peters, and J\. Morimoto \(Eds\.\),Proceedings of Machine Learning Research, Vol\.87,pp\. 617–629\.External Links:[Link](https://proceedings.mlr.press/v87/clavera18a.html)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p3.1)\.
- C\. D’Eramo, D\. Tateo, A\. Bonarini, M\. Restelli, and J\. Peters \(2020\)Sharing knowledge in multi\-task deep reinforcement learning\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=rkgpv2VFvr)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p2.1)\.
- A\. Echchahed and P\. S\. Castro \(2025\)A survey of state representation learning for deep reinforcement learning\.Transactions on Machine Learning Research\.Note:Survey CertificationExternal Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=gOk34vUHtz)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p2.1),[§2](https://arxiv.org/html/2606.05555#S2.SS0.SSS0.Px2.p1.6)\.
- J\. Farebrother and P\. S\. Castro \(2024\)Cale: continuous arcade learning environment\.Advances in Neural Information Processing Systems37,pp\. 134927–134946\.Cited by:[Appendix D](https://arxiv.org/html/2606.05555#A4.SS0.SSS0.Px8.p1.1)\.
- J\. Farebrother, J\. Orbay, Q\. Vuong, A\. Ali Taiga, Y\. Chebotar, T\. Xiao, A\. Irpan, S\. Levine, P\. S\. Castro, A\. Faust, A\. Kumar, and R\. Agarwal \(2024\)Stop regressing: training value functions via classification for scalable deep RL\.InProceedings of the 41st International Conference on Machine Learning,R\. Salakhutdinov, Z\. Kolter, K\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\),Proceedings of Machine Learning Research, Vol\.235,pp\. 13049–13071\.External Links:[Link](https://proceedings.mlr.press/v235/farebrother24a.html)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p2.1)\.
- S\. Fujimoto, W\. Chang, E\. Smith, S\. S\. Gu, D\. Precup, and D\. Meger \(2023\)For sale: state\-action representation learning for deep reinforcement learning\.Advances in neural information processing systems36,pp\. 61573–61624\.Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p2.1)\.
- S\. Fujimoto, P\. D’Oro, A\. Zhang, Y\. Tian, and M\. Rabbat \(2025\)Towards general\-purpose model\-free reinforcement learning\.InThe Thirteenth International Conference on Learning Representations \(ICLR\),External Links:[Link](https://openreview.net/forum?id=R1hIXdST22)Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px2.p1.1),[Appendix E](https://arxiv.org/html/2606.05555#A5.p2.6),[Appendix E](https://arxiv.org/html/2606.05555#A5.p3.1),[Appendix F](https://arxiv.org/html/2606.05555#A6.p6.2),[§1](https://arxiv.org/html/2606.05555#S1.p3.1),[§1](https://arxiv.org/html/2606.05555#S1.p4.1),[§2](https://arxiv.org/html/2606.05555#S2.SS0.SSS0.Px2.p1.6),[§3](https://arxiv.org/html/2606.05555#S3.p2.1),[§4](https://arxiv.org/html/2606.05555#S4.SS0.SSS0.Px1.p1.1)\.
- S\. Fujimoto, H\. Hoof, and D\. Meger \(2018\)Addressing function approximation error in actor\-critic methods\.InInternational conference on machine learning,pp\. 1587–1596\.Cited by:[Appendix E](https://arxiv.org/html/2606.05555#A5.p1.1),[§2](https://arxiv.org/html/2606.05555#S2.SS0.SSS0.Px1.p1.19),[§4](https://arxiv.org/html/2606.05555#S4.SS0.SSS0.Px1.p1.1)\.
- S\. Fujimoto, D\. Meger, D\. Precup, O\. Nachum, and S\. S\. Gu \(2022\)Why should i trust you, bellman? the bellman error is a poor replacement for value error\.InInternational Conference on Machine Learning,pp\. 6918–6943\.Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p1.1)\.
- C\. Gelada, S\. Kumar, J\. Buckman, O\. Nachum, and M\. G\. Bellemare \(2019\)Deepmdp: learning continuous latent space models for representation learning\.InInternational conference on machine learning,pp\. 2170–2179\.Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.05555#S1.p4.1),[§4](https://arxiv.org/html/2606.05555#S4.p2.1)\.
- I\. Georgiev, V\. Giridhar, N\. Hansen, and A\. Garg \(2025\)PWM: policy learning with multi\-task world models\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=hOELrZfg0J)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p3.1)\.
- R\. Ghugare, H\. Bharadhwaj, B\. Eysenbach, S\. Levine, and R\. Salakhutdinov \(2023\)Simplifying model\-based RL: learning representations, latent\-space models, and policies with one objective\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=MQcmfgRxf7a)Cited by:[§3](https://arxiv.org/html/2606.05555#S3.p2.1)\.
- D\. Ha and J\. Schmidhuber \(2018\)World models\.arXiv preprint arXiv:1803\.10122\.Cited by:[§4](https://arxiv.org/html/2606.05555#S4.p2.1)\.
- T\. Haarnoja, A\. Zhou, P\. Abbeel, and S\. Levine \(2018\)Soft actor\-critic: off\-policy maximum entropy deep reinforcement learning with a stochastic actor\.InInternational conference on machine learning,pp\. 1861–1870\.Cited by:[§2](https://arxiv.org/html/2606.05555#S2.SS0.SSS0.Px1.p1.19)\.
- D\. Hafner, T\. Lillicrap, J\. Ba, and M\. Norouzi \(2020a\)Dream to control: learning behaviors by latent imagination\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=S1lOTC4tDS)Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p2.1),[Appendix E](https://arxiv.org/html/2606.05555#A5.p3.1),[§4](https://arxiv.org/html/2606.05555#S4.p2.1)\.
- D\. Hafner, T\. Lillicrap, I\. Fischer, R\. Villegas, D\. Ha, H\. Lee, and J\. Davidson \(2019\)Learning latent dynamics for planning from pixels\.InProceedings of the 36th International Conference on Machine Learning,K\. Chaudhuri and R\. Salakhutdinov \(Eds\.\),Proceedings of Machine Learning Research, Vol\.97,pp\. 2555–2565\.External Links:[Link](https://proceedings.mlr.press/v97/hafner19a.html)Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p2.1)\.
- D\. Hafner, T\. P\. Lillicrap, M\. Norouzi, and J\. Ba \(2020b\)Mastering atari with discrete world models\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p3.1),[§2](https://arxiv.org/html/2606.05555#S2.SS0.SSS0.Px2.p1.6)\.
- D\. Hafner, J\. Pasukonis, J\. Ba, and T\. Lillicrap \(2025a\)Mastering diverse control tasks through world models\.Nature,pp\. 1–7\.Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p2.1),[§1](https://arxiv.org/html/2606.05555#S1.p3.1),[§3](https://arxiv.org/html/2606.05555#S3.p1.1)\.
- D\. Hafner, W\. Yan, and T\. Lillicrap \(2025b\)Training agents inside of scalable world models\.arXiv preprint arXiv:2509\.24527\.Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p2.1),[Appendix E](https://arxiv.org/html/2606.05555#A5.p3.1)\.
- N\. A\. Hansen, H\. Su, and X\. Wang \(2022\)Temporal difference learning for model predictive control\.InProceedings of the 39th International Conference on Machine Learning,K\. Chaudhuri, S\. Jegelka, L\. Song, C\. Szepesvari, G\. Niu, and S\. Sabato \(Eds\.\),Proceedings of Machine Learning Research, Vol\.162,pp\. 8387–8406\.External Links:[Link](https://proceedings.mlr.press/v162/hansen22a.html)Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p2.1)\.
- N\. Hansen, H\. Su, and X\. Wang \(2024\)TD\-MPC2: scalable, robust world models for continuous control\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Oxh5CstDJU)Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p2.1),[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px3.p2.1),[Appendix D](https://arxiv.org/html/2606.05555#A4.SS0.SSS0.Px2.p1.1),[Appendix E](https://arxiv.org/html/2606.05555#A5.p3.1),[Appendix F](https://arxiv.org/html/2606.05555#A6.p1.1),[§1](https://arxiv.org/html/2606.05555#S1.p3.1),[§2](https://arxiv.org/html/2606.05555#S2.SS0.SSS0.Px2.p1.6),[§3](https://arxiv.org/html/2606.05555#S3.p1.1)\.
- N\. Hansen, H\. Su, and X\. Wang \(2026\)Learning massively multitask world models for continuous control\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=MPabX9LEds)Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p2.1),[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px3.p2.1),[Appendix D](https://arxiv.org/html/2606.05555#A4.p1.1),[Appendix E](https://arxiv.org/html/2606.05555#A5.p3.1),[Appendix F](https://arxiv.org/html/2606.05555#A6.p1.1),[Appendix G](https://arxiv.org/html/2606.05555#A7.p1.1),[Appendix G](https://arxiv.org/html/2606.05555#A7.p3.1),[§1](https://arxiv.org/html/2606.05555#S1.p3.1),[§1](https://arxiv.org/html/2606.05555#S1.p5.1),[§1](https://arxiv.org/html/2606.05555#S1.p6.1),[§2](https://arxiv.org/html/2606.05555#S2.SS0.SSS0.Px1.p1.19),[§3](https://arxiv.org/html/2606.05555#S3.p1.1),[§3](https://arxiv.org/html/2606.05555#S3.p4.1),[Figure 2](https://arxiv.org/html/2606.05555#S4.F2),[§4](https://arxiv.org/html/2606.05555#S4.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.05555#S4.SS0.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2606.05555#S4.SS0.SSS0.Px4.p1.1)\.
- C\. R\. Harris, K\. J\. Millman, S\. J\. Van Der Walt, R\. Gommers, P\. Virtanen, D\. Cournapeau, E\. Wieser, J\. Taylor, S\. Berg, N\. J\. Smith,et al\.\(2020\)Array programming with numpy\.Nature585\(7825\),pp\. 357–362\.Cited by:[Acknowledgments](https://arxiv.org/html/2606.05555#Sx1.p2.1)\.
- J\. D\. Hunter \(2007\)Matplotlib: a 2d graphics environment\.Computing in science & engineering9\(03\),pp\. 90–95\.Cited by:[Acknowledgments](https://arxiv.org/html/2606.05555#Sx1.p2.1)\.
- M\. Jaderberg, V\. Mnih, W\. M\. Czarnecki, T\. Schaul, J\. Z\. Leibo, D\. Silver, and K\. Kavukcuoglu \(2017\)Reinforcement learning with unsupervised auxiliary tasks\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=SJ6yPD5xg)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p4.1)\.
- M\. Janner, J\. Fu, M\. Zhang, and S\. Levine \(2019\)When to trust your model: model\-based policy optimization\.InAdvances in Neural Information Processing Systems,H\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d'Alché\-Buc, E\. Fox, and R\. Garnett \(Eds\.\),Vol\.32\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/5faf461eff3099671ad63c6f3f094f7f-Paper.pdf)Cited by:[§3](https://arxiv.org/html/2606.05555#S3.p1.1)\.
- Ł\. Kaiser, M\. Babaeizadeh, P\. Miłos, B\. Osiński, R\. H\. Campbell, K\. Czechowski, D\. Erhan, C\. Finn, P\. Kozakowski, S\. Levine, A\. Mohiuddin, R\. Sepassi, G\. Tucker, and H\. Michalewski \(2020\)Model based reinforcement learning for atari\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=S1xCPJHtDB)Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p2.1)\.
- H\. Kannan, D\. Hafner, C\. Finn, and D\. Erhan \(2021\)Robodesk: a multi\-task reinforcement learning benchmark\.Cited by:[Appendix D](https://arxiv.org/html/2606.05555#A4.SS0.SSS0.Px7.p1.1)\.
- T\. Kluyver, B\. Ragan\-Kelley, F\. Pérez, B\. Granger, M\. Bussonnier, J\. Frederic, K\. Kelley, J\. Hamrick, J\. Grout, S\. Corlay, P\. Ivanov, D\. Avila, S\. Abdalla, C\. Willing, and Jupyter Development Team \(2016\)Jupyter Notebooks—a publishing format for reproducible computational workflows\.InIOS Press,pp\. 87–90\.External Links:[Document](https://dx.doi.org/10.3233/978-1-61499-649-1-87)Cited by:[Acknowledgments](https://arxiv.org/html/2606.05555#Sx1.p2.1)\.
- T\. Kojima, S\. S\. Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa \(2022\)Large language models are zero\-shot reasoners\.Advances in neural information processing systems35,pp\. 22199–22213\.Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p1.1)\.
- V\. Konda and J\. Tsitsiklis \(1999\)Actor\-critic algorithms\.Advances in neural information processing systems12\.Cited by:[§2](https://arxiv.org/html/2606.05555#S2.SS0.SSS0.Px1.p1.19)\.
- Y\. Kong, G\. Ma, Q\. Zhao, H\. Wang, L\. Shen, X\. Wang, and D\. Tao \(2025\)Mastering massive multi\-task reinforcement learning via mixture\-of\-expert decision transformer\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=qUcUyqP1UA)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p2.1)\.
- J\. E\. Kooi, Z\. Yang, and V\. Francois\-Lavet \(2026\)Hadamax encoding: elevating performance in model\-free atari\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=iRQM8Ehgl9)Cited by:[Appendix J](https://arxiv.org/html/2606.05555#A10.p2.1)\.
- A\. Kumar, R\. Agarwal, D\. Ghosh, and S\. Levine \(2021\)Implicit under\-parameterization inhibits data\-efficient deep reinforcement learning\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=O9bnihsFfXU)Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.05555#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.05555#S4.SS1.SSS0.Px2.p1.8)\.
- M\. Laskin, A\. Srinivas, and P\. Abbeel \(2020\)Curl: contrastive unsupervised representations for reinforcement learning\.InInternational conference on machine learning,pp\. 5639–5650\.Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p1.1)\.
- K\. Lee, I\. Fischer, A\. Liu, Y\. Guo, H\. Lee, J\. Canny, and S\. Guadarrama \(2020\)Predictive information accelerates learning in rl\.Advances in Neural Information Processing Systems33,pp\. 11890–11901\.Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p4.1)\.
- J\. Liu, J\. S\. O\. Ceron, A\. Courville, and L\. Pan \(2025a\)Neuroplastic expansion in deep reinforcement learning\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=20qZK2T7fa)Cited by:[Appendix J](https://arxiv.org/html/2606.05555#A10.p2.1)\.
- J\. Liu, Z\. Wu, J\. Obando\-Ceron, P\. S\. Castro, A\. Courville, and L\. Pan \(2025b\)Measure gradients, not activations\! enhancing neuronal activity in deep reinforcement learning\.arXiv preprint arXiv:2505\.24061\.Cited by:[§4\.1](https://arxiv.org/html/2606.05555#S4.SS1.SSS0.Px3.p1.1)\.
- W\. Mayor, J\. Obando\-Ceron, A\. Courville, and P\. S\. Castro \(2025\)The impact of on\-policy parallelized data collection on deep reinforcement learning networks\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=cnqyzuZhSo)Cited by:[§4\.1](https://arxiv.org/html/2606.05555#S4.SS1.SSS0.Px3.p2.1)\.
- W\. McKinney \(2013\)Python for data analysis: data wrangling with pandas, NumPy, and IPython\.1 edition,O’Reilly Media\.Note:PaperbackExternal Links:ISBN 9789351100065,[Link](http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20%5C&path=ASIN/1449319793)Cited by:[Acknowledgments](https://arxiv.org/html/2606.05555#Sx1.p2.1)\.
- V\. Mnih, K\. Kavukcuoglu, D\. Silver, A\. Graves, I\. Antonoglou, D\. Wierstra, and M\. Riedmiller \(2013\)Playing atari with deep reinforcement learning\.arXiv preprint arXiv:1312\.5602\.Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p1.1)\.
- M\. Nauman, M\. Bortkiewicz, P\. Miłoś, T\. Trzciński, M\. Ostaszewski, and M\. Cygan \(2024\)Overestimation, overfitting, and plasticity in actor\-critic: the bitter lesson of reinforcement learning\.InProceedings of the 41st International Conference on Machine Learning,pp\. 37342–37364\.Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p1.1)\.
- M\. Nauman, M\. Cygan, C\. Sferrazza, A\. Kumar, and P\. Abbeel \(2025\)Bigger, regularized, categorical: high\-capacity value functions are efficient multi\-task learners\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=zhOUfuOIzA)Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.05555#S1.p2.1),[§3](https://arxiv.org/html/2606.05555#S3.p4.1)\.
- T\. Ni, B\. Eysenbach, E\. SeyedSalehi, M\. Ma, C\. Gehring, A\. Mahajan, and P\. Bacon \(2024\)Bridging state and history representations: understanding self\-predictive rl\.InThe Twelfth International Conference on Learning Representations,Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px2.p1.1)\.
- E\. Nikishin, M\. Schwarzer, P\. D’Oro, P\. Bacon, and A\. Courville \(2022\)The primacy bias in deep reinforcement learning\.InInternational conference on machine learning,pp\. 16828–16847\.Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.05555#S1.p1.1),[§2](https://arxiv.org/html/2606.05555#S2.SS0.SSS0.Px2.p1.6)\.
- J\. Obando Ceron, M\. Bellemare, and P\. S\. Castro \(2023\)Small batch deep reinforcement learning\.Advances in Neural Information Processing Systems36,pp\. 26003–26024\.Cited by:[Appendix J](https://arxiv.org/html/2606.05555#A10.p2.1)\.
- J\. Obando\-Ceron, W\. Mayor, S\. Lavoie, S\. Fujimoto, A\. Courville, and P\. S\. Castro \(2026a\)Simplicial embeddings improve sample efficiency in actor–critic agents\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=mCpq1GCKxA)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p2.1)\.
- J\. Obando\-Ceron, W\. Mayor, S\. Lavoie, S\. Fujimoto, A\. Courville, and P\. S\. Castro \(2026b\)Simplicial embeddings improve sample efficiency in actor–critic agents\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=mCpq1GCKxA)Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p1.1)\.
- J\. S\. Obando\-Ceron and P\. S\. Castro \(2021\)Revisiting rainbow: promoting more insightful and inclusive deep reinforcement learning research\.InProceedings of the 38th International Conference on Machine Learning,Proceedings of Machine Learning Research\.Cited by:[Appendix J](https://arxiv.org/html/2606.05555#A10.p2.1)\.
- T\. E\. Oliphant \(2007\)Python for scientific computing\.Computing in Science & Engineering9\(3\),pp\. 10–20\.External Links:[Document](https://dx.doi.org/10.1109/MCSE.2007.58)Cited by:[Acknowledgments](https://arxiv.org/html/2606.05555#Sx1.p2.1)\.
- M\. Oquab, T\. Darcet, T\. Moutakanni, H\. V\. Vo, M\. Szafraniec, V\. Khalidov, P\. Fernandez, D\. HAZIZA, F\. Massa, A\. El\-Nouby, M\. Assran, N\. Ballas, W\. Galuba, R\. Howes, P\. Huang, S\. Li, I\. Misra, M\. Rabbat, V\. Sharma, G\. Synnaeve, H\. Xu, H\. Jegou, J\. Mairal, P\. Labatut, A\. Joulin, and P\. Bojanowski \(2024\)DINOv2: learning robust visual features without supervision\.Transactions on Machine Learning Research\.Note:Featured CertificationExternal Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=a68SUt6zFt)Cited by:[Appendix G](https://arxiv.org/html/2606.05555#A7.p3.1),[§4](https://arxiv.org/html/2606.05555#S4.SS0.SSS0.Px4.p1.1)\.
- S\. Park, K\. Frans, B\. Eysenbach, and S\. Levine \(2025\)OGBench: benchmarking offline goal\-conditioned rl\.InThe Thirteenth International Conference on Learning Representations,Cited by:[Appendix D](https://arxiv.org/html/2606.05555#A4.SS0.SSS0.Px9.p1.1)\.
- A\. S\. Pasand, J\. Obando\-Ceron, A\. Courville, P\. Bashivan, and P\. S\. Castro \(2026\)Stable deep reinforcement learning via isotropic gaussian representations\.arXiv preprint arXiv:2602\.19373\.Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p1.1)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, G\. Krueger, and I\. Sutskever \(2021\)Learning transferable visual models from natural language supervision\.InProceedings of the 38th International Conference on Machine Learning,M\. Meila and T\. Zhang \(Eds\.\),Proceedings of Machine Learning Research, Vol\.139,pp\. 8748–8763\.External Links:[Link](https://proceedings.mlr.press/v139/radford21a.html)Cited by:[Appendix G](https://arxiv.org/html/2606.05555#A7.p1.1)\.
- A\. Rajeswaran, S\. Ghotra, B\. Ravindran, and S\. Levine \(2017\)EPOpt: learning robust neural network policies using model ensembles\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=SyWvgP5el)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p3.1)\.
- S\. Reed, K\. Zolna, E\. Parisotto, S\. G\. Colmenarejo, A\. Novikov, G\. Barth\-maron, M\. Giménez, Y\. Sulsky, J\. Kay, J\. T\. Springenberg, T\. Eccles, J\. Bruce, A\. Razavi, A\. Edwards, N\. Heess, Y\. Chen, R\. Hadsell, O\. Vinyals, M\. Bordbar, and N\. de Freitas \(2022\)A generalist agent\.Transactions on Machine Learning Research\.Note:Featured Certification, Outstanding CertificationExternal Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=1ikK0kHjvj)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p1.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§3](https://arxiv.org/html/2606.05555#S3.p3.1)\.
- M\. Schwarzer, A\. Anand, R\. Goel, R\. D\. Hjelm, A\. Courville, and P\. Bachman \(2021\)Data\-efficient reinforcement learning with self\-predictive representations\.InThe Ninth International Conference on Learning Representations \(ICLR\),Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2606.05555#S1.p2.1),[§3](https://arxiv.org/html/2606.05555#S3.p2.1),[§4](https://arxiv.org/html/2606.05555#S4.p2.1)\.
- M\. Schwarzer, J\. S\. O\. Ceron, A\. Courville, M\. G\. Bellemare, R\. Agarwal, and P\. S\. Castro \(2023\)Bigger, better, faster: human\-level atari with human\-level efficiency\.InInternational Conference on Machine Learning,pp\. 30365–30380\.Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2606.05555#S1.p1.1)\.
- S\. Sodhani, A\. Zhang, and J\. Pineau \(2021\)Multi\-task reinforcement learning with context\-based representations\.InInternational conference on machine learning,pp\. 9767–9779\.Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px3.p1.1)\.
- G\. Sokar, R\. Agarwal, P\. S\. Castro, and U\. Evci \(2023\)The dormant neuron phenomenon in deep reinforcement learning\.InInternational Conference on Machine Learning,pp\. 32145–32168\.Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.05555#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.05555#S4.SS1.SSS0.Px3.p1.1)\.
- G\. Sokar, J\. S\. O\. Ceron, A\. Courville, H\. Larochelle, and P\. S\. Castro \(2025\)Don’t flatten, tokenize\! unlocking the key to softmoe’s efficacy in deep RL\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=8oCrlOaYcc)Cited by:[Appendix J](https://arxiv.org/html/2606.05555#A10.p2.1)\.
- S\. Subramanian, P\. Harrington, K\. Keutzer, W\. Bhimji, D\. Morozov, M\. W\. Mahoney, and A\. Gholami \(2023\)Towards foundation models for scientific machine learning: characterizing scaling and transfer behavior\.Advances in Neural Information Processing Systems36,pp\. 71242–71262\.Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p1.1)\.
- A\. A\. Taiga, R\. Agarwal, J\. Farebrother, A\. Courville, and M\. G\. Bellemare \(2023\)Investigating multi\-task pretraining and generalization in reinforcement learning\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=sSt9fROSZRO)Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.05555#S1.p2.1)\.
- E\. Talvitie \(2014\)Model regularization for stable sample rollouts\.InProceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence,UAI’14,Arlington, Virginia, USA,pp\. 780–789\.External Links:ISBN 9780974903910Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p3.1),[§3](https://arxiv.org/html/2606.05555#S3.p1.1)\.
- H\. Tang and G\. Berseth \(2024\)Improving deep reinforcement learning by reducing the chain effect of value and policy churn\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=cQoAgPBARc)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p1.1)\.
- S\. Tao, F\. Xiang, A\. Shukla, Y\. Qin, X\. Hinrichsen, X\. Yuan, C\. Bao, X\. Lin, Y\. Liu, T\. Chan, Y\. Gao, X\. Li, T\. Mu, N\. Xiao, A\. Gurha, V\. N\. Rajesh, Y\. W\. Choi, Y\. Chen, Z\. Huang, R\. Calandra, R\. Chen, S\. Luo, and H\. Su \(2025\)ManiSkill3: gpu parallelized robotics simulation and rendering for generalizable embodied ai\.Robotics: Science and Systems\.Cited by:[Appendix D](https://arxiv.org/html/2606.05555#A4.SS0.SSS0.Px4.p1.1)\.
- Y\. Tassa, Y\. Doron, A\. Muldal, T\. Erez, Y\. Li, D\. d\. L\. Casas, D\. Budden, A\. Abdolmaleki, J\. Merel, A\. Lefrancq,et al\.\(2018\)Deepmind control suite\.arXiv preprint arXiv:1801\.00690\.Cited by:[Appendix D](https://arxiv.org/html/2606.05555#A4.SS0.SSS0.Px2.p1.1)\.
- Y\. Teh, V\. Bapst, W\. M\. Czarnecki, J\. Quan, J\. Kirkpatrick, R\. Hadsell, N\. Heess, and R\. Pascanu \(2017\)Distral: robust multitask reinforcement learning\.Advances in neural information processing systems30\.Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.05555#S1.p2.1)\.
- E\. Todorov, T\. Erez, and Y\. Tassa \(2012\)Mujoco: a physics engine for model\-based control\.In2012 IEEE/RSJ international conference on intelligent robots and systems,pp\. 5026–5033\.Cited by:[Appendix D](https://arxiv.org/html/2606.05555#A4.SS0.SSS0.Px1.p1.1)\.
- G\. Van Rossum and F\. L\. Drake Jr \(1995\)Python reference manual\.Centrum voor Wiskunde en Informatica Amsterdam\.Cited by:[Acknowledgments](https://arxiv.org/html/2606.05555#Sx1.p2.1)\.
- C\. A\. Voelcker, V\. Liao, A\. Garg, and A\. Farahmand \(2022\)Value gradient weighted model\-based reinforcement learning\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=4-D6CZkRXxI)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p3.1)\.
- T\. Wang, A\. Roberts, D\. Hesslow, T\. L\. Scao, H\. W\. Chung, I\. Beltagy, J\. Launay, and C\. Raffel \(2022\)What language model architecture and pretraining objective works best for zero\-shot generalization?\.InProceedings of the 39th International Conference on Machine Learning,K\. Chaudhuri, S\. Jegelka, L\. Song, C\. Szepesvari, G\. Niu, and S\. Sabato \(Eds\.\),Proceedings of Machine Learning Research, Vol\.162,pp\. 22964–22984\.External Links:[Link](https://proceedings.mlr.press/v162/wang22u.html)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p1.1)\.
- M\. Watter, J\. Springenberg, J\. Boedecker, and M\. Riedmiller \(2015\)Embed to control: a locally linear latent dynamics model for control from raw images\.Advances in neural information processing systems28\.Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px2.p1.1)\.
- T\. Wiedemer, Y\. Li, P\. Vicol, S\. S\. Gu, N\. Matarese, K\. Swersky, B\. Kim, P\. Jaini, and R\. Geirhos \(2026\)Video models are zero\-shot learners and reasoners\.External Links:[Link](https://openreview.net/forum?id=MCWypEBtlF)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p1.1)\.
- Y\. Xu, N\. Hansen, Z\. Wang, Y\. Chan, H\. Su, and Z\. Tu \(2023\)On the feasibility of cross\-task transfer with model\-based reinforcement learning\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=KB1sc5pNKFv)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p3.1)\.
- D\. Yarats, R\. Fergus, A\. Lazaric, and L\. Pinto \(2022\)Mastering visual continuous control: improved data\-augmented reinforcement learning\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=_SJ-_yyes8)Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p1.1)\.
- D\. Yarats, I\. Kostrikov, and R\. Fergus \(2021\)Image augmentation is all you need: regularizing deep reinforcement learning from pixels\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=GY6-6sTvGaf)Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p1.1)\.
- T\. Yu, S\. Kumar, A\. Gupta, S\. Levine, K\. Hausman, and C\. Finn \(2020a\)Gradient surgery for multi\-task learning\.Advances in neural information processing systems33,pp\. 5824–5836\.Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.05555#S1.p2.1)\.
- T\. Yu, D\. Quillen, Z\. He, R\. Julian, K\. Hausman, C\. Finn, and S\. Levine \(2020b\)Meta\-world: a benchmark and evaluation for multi\-task and meta reinforcement learning\.InConference on robot learning,pp\. 1094–1100\.Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px3.p1.1),[Appendix D](https://arxiv.org/html/2606.05555#A4.SS0.SSS0.Px3.p1.1)\.
- A\. Zhang, R\. T\. McAllister, R\. Calandra, Y\. Gal, and S\. Levine \(2021a\)Learning invariant representations for reinforcement learning without reconstruction\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=-2FCwDKRREu)Cited by:[Appendix C](https://arxiv.org/html/2606.05555#A3.SS0.SSS0.Px1.p1.1)\.
- B\. Zhang, R\. Rajan, L\. Pineda, N\. Lambert, A\. Biedenkapp, K\. Chua, F\. Hutter, and R\. Calandra \(2021b\)On the importance of hyperparameter optimization for model\-based reinforcement learning\.InProceedings of The 24th International Conference on Artificial Intelligence and Statistics,A\. Banerjee and K\. Fukumizu \(Eds\.\),Proceedings of Machine Learning Research, Vol\.130,pp\. 4015–4023\.External Links:[Link](https://proceedings.mlr.press/v130/zhang21n.html)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p3.1)\.
- Y\. Zhao, W\. Zhao, R\. Boney, J\. Kannala, and J\. Pajarinen \(2023\)Simplified temporal consistency reinforcement learning\.InProceedings of the 40th International Conference on Machine Learning,A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\),Proceedings of Machine Learning Research, Vol\.202,pp\. 42227–42246\.External Links:[Link](https://proceedings.mlr.press/v202/zhao23k.html)Cited by:[§3](https://arxiv.org/html/2606.05555#S3.p2.1)\.
- Y\. Zhou, J\. Shen, and Y\. Cheng \(2025\)Weak to strong generalization for large language models with multi\-capabilities\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=N1vYivuSKq)Cited by:[§1](https://arxiv.org/html/2606.05555#S1.p1.1)\.
## Appendix A The Use of Large Language Models
In this paper, LLMs were used only to polish the writing of certain paragraphs in order to improve clarity and grammar\. The key ideas, theoretical analysis, method design, figures, and experimental results are entirely the result of the human authors’ contributions\.
## Appendix B Impact statement
This paper presents work whose goal is to advance the field of Machine Learning, and Reinforcement Learning in particular\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\.
## Appendix C Related Work
#### Representation Learning and World Models in RL\.
Representation learning is a central challenge in deep RL, where learned features must support value estimation, policy optimization, generalization, and stable learning under non\-stationary data distributions\. A large body of work studies how auxiliary objectives, contrastive learning, reconstruction, bisimulation metrics, and predictive modeling can improve learned representations \[Gelada et al\., [2019](https://arxiv.org/html/2606.05555#bib.bib126), Laskin et al\., [2020](https://arxiv.org/html/2606.05555#bib.bib36), Yarats et al\., [2021](https://arxiv.org/html/2606.05555#bib.bib59), [2022](https://arxiv.org/html/2606.05555#bib.bib60), Zhang et al\., [2021a](https://arxiv.org/html/2606.05555#bib.bib58)\]\. Recent analyses further show that poor representations can lead to feature collapse, dormant neurons, reduced plasticity, and unstable value learning \[Kumar et al\., [2021](https://arxiv.org/html/2606.05555#bib.bib104), Fujimoto et al\., [2022](https://arxiv.org/html/2606.05555#bib.bib130), Nikishin et al\., [2022](https://arxiv.org/html/2606.05555#bib.bib20), Sokar et al\., [2023](https://arxiv.org/html/2606.05555#bib.bib46), Obando\-Ceron et al\., [2026b](https://arxiv.org/html/2606.05555#bib.bib113), Pasand et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib53)\]\.
Predictive objectives are also central to modern world\-model approaches\. Methods such as PlaNet, Dreamer, DreamerV3, TD\-MPC, and TD\-MPC2 learn latent dynamics models that support imagined rollouts or latent trajectory optimization for control \[Hafner et al\., [2019](https://arxiv.org/html/2606.05555#bib.bib57), Kaiser et al\., [2020](https://arxiv.org/html/2606.05555#bib.bib144), Hafner et al\., [2020a](https://arxiv.org/html/2606.05555#bib.bib135), [2025a](https://arxiv.org/html/2606.05555#bib.bib69), Hansen et al\., [2022](https://arxiv.org/html/2606.05555#bib.bib56), [2024](https://arxiv.org/html/2606.05555#bib.bib11)\]\. These methods demonstrate strong performance and scalability across continuous\-control domains, while recent large\-scale systems such as Dreamer 4 and Newt extend these principles to multitask settings \[Hafner et al\., [2025b](https://arxiv.org/html/2606.05555#bib.bib143), Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\]\. However, these approaches require jointly learning world models, value functions, and planning components, introducing substantial computational overhead and optimization complexity\.
#### Model\-Free RL with Predictive Representations\.
Several recent works suggest that predictive representation learning can improve RL even without explicit planning\. Methods such as SPR \[Schwarzer et al\., [2021](https://arxiv.org/html/2606.05555#bib.bib35)\], BBF \[Schwarzer et al\., [2023](https://arxiv.org/html/2606.05555#bib.bib15)\], and MR\.Q \[Fujimoto et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib19)\] augment model\-free RL with auxiliary predictive objectives that encourage temporal consistency and latent structure\. Similar ideas have also been explored through self\-predictive representations and latent dynamics supervision \[Ni et al\., [2024](https://arxiv.org/html/2606.05555#bib.bib148), Watter et al\., [2015](https://arxiv.org/html/2606.05555#bib.bib131)\]\. In these approaches, predictive models are used to shape the representation rather than to generate imagined rollouts or perform trajectory optimization\.
Our work builds on this line of research and studies whether predictive representation learning alone can recover many of the scalability and generalization benefits commonly associated with world\-model methods\. Unlike Dreamer, TD\-MPC2, or Newt, our approach does not use latent planning or imagination for policy improvement\. Instead, predictive objectives are used exclusively as auxiliary supervision for representation learning, allowing us to isolate the role of predictive representations from explicit model\-based control\.
#### Multitask Reinforcement Learning\.
Multitask RL aims to train a single agent across multiple environments while enabling transfer and representation sharing across tasks \[Teh et al\., [2017](https://arxiv.org/html/2606.05555#bib.bib106), Yu et al\., [2020b](https://arxiv.org/html/2606.05555#bib.bib151), Sodhani et al\., [2021](https://arxiv.org/html/2606.05555#bib.bib55)\]\. Scaling RL to multitask settings introduces significant optimization challenges, including non\-stationarity, gradient interference, negative transfer, and under\-utilization of model capacity \[Yu et al\., [2020a](https://arxiv.org/html/2606.05555#bib.bib107), Taiga et al\., [2023](https://arxiv.org/html/2606.05555#bib.bib108), Bai et al\., [2023](https://arxiv.org/html/2606.05555#bib.bib54), Nauman et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib97)\]\. These issues become increasingly severe as task diversity and model scale grow\.
Recent large\-scale multitask systems such as TD\-MPC2 and Newt suggest that world models can scale effectively across many tasks and embodiments when trained using large shared architectures and task conditioning \[Hansen et al\., [2024](https://arxiv.org/html/2606.05555#bib.bib11), [2026](https://arxiv.org/html/2606.05555#bib.bib49)\]\. In contrast, our work demonstrates that a simpler model\-free agent equipped with predictive representations can also scale effectively across multitask domains while substantially improving computational efficiency\. Our findings therefore highlight representation learning itself as a key ingredient for scalable multitask deep RL\.
## Appendix D Tasks Description
For all experiments, we utilize the multitask suites introduced in MMBench \[Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\]\. This benchmark encompasses 10 distinct domains and a total of 200 diverse continuous control tasks, spanning robotic manipulation, locomotion, navigation, arcade games, and classic control\. A brief overview of each domain is provided below\. Full task specifications and configuration details can be found in the original MMBench benchmark \[Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\]\.
#### MuJoCo\.
MuJoCo \[Todorov et al\., [2012](https://arxiv.org/html/2606.05555#bib.bib150)\] serves as a standard benchmark for continuous control in reinforcement learning\. It comprises a variety of simulated robotic locomotion tasks, ranging from lower\-dimensional kinematic problems \(e\.g\., HalfCheetah\) to complex, high\-dimensional control challenges involving severe contact dynamics \(e\.g\., Ant\)\. Following MMBench, we use the v4 environment configurations and disable early termination conditions to ensure consistency across all evaluated task domains\.
#### DMControl and DMControl Extended\.
The DeepMind Control \(DMControl\) suite \[Tassa et al\., [2018](https://arxiv.org/html/2606.05555#bib.bib149)\] provides a standardized set of physics\-based simulation environments, with a fixed episode length of 500 and no termination conditions\. DMControl Extended is an extended task set based on the original DMControl that includes 11 custom tasks previously proposed by Hansen et al\. \[[2024](https://arxiv.org/html/2606.05555#bib.bib11)\]\.
#### MetaWorld\.
MetaWorld \[Yu et al\., [2020b](https://arxiv.org/html/2606.05555#bib.bib151)\] is a benchmark designed for multitask and meta\-reinforcement learning, focusing exclusively on simulated robotic manipulation tasks\. This domain consists of 50 diverse manipulation tasks that share a unified observation and action space\. Due to a known simulation issue, the Shelf Place task is excluded, yielding a final set of 49 tasks for this domain\.
#### ManiSkill\.
ManiSkill3 \[Tao et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib155)\] is a comprehensive physics\-based benchmark focused on complex robotic control\. This domain encompasses a diverse array of tasks and robotic morphologies, spanning tabletop manipulation, quadruped locomotion, whole\-body humanoid control, and mobile manipulation\. Additionally, it includes reimplementations of widely adopted control environments from the MuJoCo and DMControl suites\.
#### Pygame\.
Pygame consists of 22 tasks spanning 14 unique arcade\-style environments\. These tasks exhibit significant heterogeneity in their core objectives, episode horizons, state\-action dimensionalities, and underlying reward structures\. MMBench enforces a fixed episode length across all tasks and disables early termination conditions\.
#### Box2D\.
The Box2D suite \[Brockman et al\., [2016](https://arxiv.org/html/2606.05555#bib.bib157)\] utilizes a 2D physics engine to simulate rigid body dynamics\. It encompasses a well\-known set of classic control, navigation, and locomotion tasks, such as LunarLander\. While the Box2D tasks were originally designed for low\-dimensional state observations, MMBench modernizes the implementation by introducing support for high\-dimensional visual observations\.
#### RoboDesk\.
RoboDesk \[Kannan et al\., [2021](https://arxiv.org/html/2606.05555#bib.bib156)\] is a specialized suite of robotic manipulation tasks designed explicitly for multitask reinforcement learning research\. The benchmark features 9 distinct object manipulation tasks situated within a single, unified desk\-themed environment, where all tasks share a common observation and action space\.
#### Atari\.
Based on the Arcade Learning Environment \(ALE\) \[Bellemare et al\., [2013](https://arxiv.org/html/2606.05555#bib.bib44)\], the Atari domain serves as a rigorous testbed for RL algorithms across a wide spectrum of simulated classic Atari 2600 games\. More recently, Farebrother and Castro \[[2024](https://arxiv.org/html/2606.05555#bib.bib154)\] proposed a non\-linear continuous\-to\-discrete action transformation that extends support to algorithms operating within continuous action spaces\. MMBench utilizes this continuous variant of the Atari domain\.
#### OGBench\.
OGBench \[Park et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib152)\] is a benchmark tailored for evaluating goal\-conditioned RL and offline RL\. Because it was not originally designed for standard online RL, MMBench adapts these environments by introducing redefined dense reward functions and ensuring all necessary task information \(e\.g\., the goal position\) is fully integrated into the observation space\.
## Appendix E MR\.Q algorithm: Model\-based Representations for Q\-learning
TD3 \[Fujimoto et al\., [2018](https://arxiv.org/html/2606.05555#bib.bib16)\] is a model\-free off\-policy actor–critic algorithm for continuous control that improves stability through twin critics, delayed policy updates, and target policy smoothing\. In its standard form, TD3 operates directly on environment observations without learning an explicit latent representation encoder\.
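As a rough illustration of two of the TD3 ingredients named above \(twin critics and target policy smoothing\), the following minimal sketch computes a bootstrapped target for a single transition with a scalar action\. The function name `td3_target`, its arguments, and the default noise values are illustrative assumptions, not taken from the paper or its code\.

```python
import random


def clip(x, low, high):
    """Clamp a scalar to the interval [low, high]."""
    return max(low, min(high, x))


def td3_target(reward, done, next_state, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0,
               rng=random):
    """TD3-style target for one transition (hypothetical helper, scalar action).

    target_actor, target_q1 and target_q2 are plain callables standing in
    for the target networks; all defaults are assumptions for illustration.
    """
    # Target policy smoothing: perturb the target action with clipped noise.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = clip(target_actor(next_state) + noise, -max_action, max_action)
    # Twin critics: use the smaller of the two target estimates.
    next_q = min(target_q1(next_state, next_action),
                 target_q2(next_state, next_action))
    # No bootstrapping past terminal transitions (done == 1).
    return reward + gamma * (1.0 - done) * next_q
```

Delayed policy updates are not shown; they amount to updating the actor less frequently than the critics\.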
MR\.Q \[Fujimoto et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib19)\] extends TD3 by introducing a learned encoder together with auxiliary predictive objectives for representation learning\. Observations are first encoded into a latent representation $z_t = \phi_\xi(s_t, \tau)$ using a learned encoder $\phi_\xi$\. The actor and twin critics then operate directly in latent space\. In addition to standard temporal\-difference learning, MR\.Q trains auxiliary latent models to predict future latent representations, rewards, and termination signals from $(z_t, a_t)$\. The dynamics model predicts the next latent state $\hat{z}_{t+1}$, while auxiliary heads predict rewards $\hat{r}_t$ and episode termination $\hat{d}_t$\. These objectives are optimized using supervised losses and backpropagated through the shared encoder\.
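The structure of these auxiliary objectives can be sketched for a single transition as follows\. This is a minimal illustration under strong assumptions: all components are linear maps, every term is a plain squared error, the second encoder input $\tau$ is omitted, and the names `linear`, `squared_error`, and `aux_loss` are hypothetical\. The exact loss forms and target handling in MR\.Q may differ\.

```python
def linear(weights, x):
    """Apply a weight matrix (given as a list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def squared_error(pred, target):
    """Sum of squared differences between two vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def aux_loss(params, s, a, r, d, s_next):
    """Auxiliary predictive loss for one transition (illustrative only).

    params holds hypothetical linear maps: "enc" (encoder), "dyn" (latent
    dynamics), "rew" (reward head) and "done" (termination head).
    """
    z = linear(params["enc"], s)            # z_t from the shared encoder
    z_next = linear(params["enc"], s_next)  # regression target for z_{t+1}
    za = z + a                              # concatenate latent and action
    z_next_hat = linear(params["dyn"], za)  # predicted next latent
    r_hat = linear(params["rew"], za)[0]    # predicted reward
    d_hat = linear(params["done"], za)[0]   # predicted termination signal
    # All three terms depend on z, so their gradients reach the encoder.
    return squared_error(z_next_hat, z_next) + (r_hat - r) ** 2 + (d_hat - d) ** 2
```

Nothing in this sketch generates rollouts or plans; the predictions exist only to produce a training signal for the encoder\.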
Importantly, the learned latent models are used exclusively for representation shaping\. Unlike model\-based RL methods such as Dreamer \[Hafner et al\., [2020a](https://arxiv.org/html/2606.05555#bib.bib135), [2025b](https://arxiv.org/html/2606.05555#bib.bib143)\], TD\-MPC2 \[Hansen et al\., [2024](https://arxiv.org/html/2606.05555#bib.bib11)\], or Newt \[Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\], MR\.Q \[Fujimoto et al\., [2025](https://arxiv.org/html/2606.05555#bib.bib19)\] does not perform latent rollouts, trajectory imagination, or planning\. The predictive objectives instead provide dense auxiliary supervision that encourages representations to capture temporal structure, while preserving the simplicity and efficiency of model\-free RL\.
## Appendix F Newt algorithm
Newt \[Hansen et al\., [2026](https://arxiv.org/html/2606.05555#bib.bib49)\] builds upon TD\-MPC2 \[Hansen et al\., [2024](https://arxiv.org/html/2606.05555#bib.bib11)\], a model\-based RL framework that combines latent world models with trajectory optimization for control\. The central idea is to learn a compact latent dynamics model that supports both value estimation and planning directly in latent space\.
Given an observation $s_t$, TD-MPC2 first encodes it into a latent representation

$$z_t = h_\theta(s_t),$$

where $h_\theta$ is a learned encoder. A latent dynamics model then predicts future latent states conditioned on actions:

$$\hat{z}_{t+1} = f_\theta(z_t, a_t).$$

Additional prediction heads estimate rewards and state values:

$$\hat{r}_t = r_\theta(z_t, a_t), \qquad \hat{V}_t = V_\theta(z_t).$$
The world model is trained using supervised consistency objectives across imagined latent rollouts. TD-MPC2 optimizes a multi-step latent prediction objective of the form

$$\mathcal{L}_{\text{model}} = \sum_{t,k} \Big( \|z_{t+k} - \hat{z}_{t+k}\|^2 + \|r_{t+k} - \hat{r}_{t+k}\|^2 + \|V_{t+k} - \hat{V}_{t+k}\|^2 \Big),$$

where latent states are recursively imagined through the learned dynamics model. Unlike reconstruction-based world models, TD-MPC2 operates entirely in latent space without pixel reconstruction, improving scalability and computational efficiency.
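As a rough illustration of the multi-step objective above, the following Python sketch unrolls a latent dynamics model from an encoded starting latent and accumulates squared errors on latents, rewards, and values. The names `multistep_model_loss`, `dynamics`, `reward_head`, and `value_head`, as well as the plain squared-error terms, are assumptions made for illustration. They mirror the formula as written in this appendix and are not taken from the TD-MPC2 or Newt code, which may parameterize or weight these terms differently.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def multistep_model_loss(
    z_targets: Sequence[Vector],     # encoded latents z_t, ..., z_{t+K}
    actions: Sequence[Vector],       # actions a_t, ..., a_{t+K-1}
    rewards: Sequence[float],        # observed rewards r_t, ..., r_{t+K-1}
    value_targets: Sequence[float],  # value targets for steps t, ..., t+K-1
    dynamics: Callable[[Vector, Vector], Vector],
    reward_head: Callable[[Vector, Vector], float],
    value_head: Callable[[Vector], float],
) -> float:
    """Sum of squared latent, reward, and value errors over one imagined rollout."""
    z_hat = list(z_targets[0])  # the rollout starts from the encoded latent
    loss = 0.0
    for k, action in enumerate(actions):
        # Reward and value are predicted from the imagined latent at step k.
        loss += (rewards[k] - reward_head(z_hat, action)) ** 2
        loss += (value_targets[k] - value_head(z_hat)) ** 2
        # The next latent is imagined recursively, never re-encoded.
        z_hat = dynamics(z_hat, action)
        loss += sq_dist(z_targets[k + 1], z_hat)
    return loss


# Hypothetical toy model: additive dynamics, reward equal to the latent sum.
def toy_dynamics(z: Vector, a: Vector) -> Vector:
    return [zi + ai for zi, ai in zip(z, a)]


def toy_reward(z: Vector, a: Vector) -> float:
    return sum(z)


def toy_value(z: Vector) -> float:
    return 0.0
```

Because the latent is fed back through `dynamics` at every step, errors compound along the rollout, which is what distinguishes this objective from a one-step prediction loss.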
A key difference from standard actor–critic methods is that TD-MPC2 performs explicit planning using the learned latent model. At decision time, candidate action sequences $a_{t:t+H}$ are optimized using model predictive control (MPC) by maximizing predicted future returns over imagined latent trajectories:

$$\max_{a_{t:t+H}} \sum_{k=0}^{H} \gamma^{k} \hat{r}_{t+k} + \gamma^{H+1} \hat{V}_{t+H+1}.$$

This planning procedure repeatedly rolls out trajectories inside the learned world model and selects actions according to the highest predicted return.
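The planning objective above can be sketched with the simplest possible trajectory optimizer, random shooting: sample candidate action sequences, score each by its predicted discounted return plus a terminal value, and keep the best. Random shooting is a stand-in chosen for brevity, not the optimizer used by TD-MPC2 or Newt, and the names `predicted_return` and `plan_random_shooting`, the action range of [-1, 1], and the callables passed in are all hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def predicted_return(
    z: Vector,
    action_seq: Sequence[Vector],  # a_t, ..., a_{t+H}, i.e. H + 1 actions
    dynamics: Callable[[Vector, Vector], Vector],
    reward_head: Callable[[Vector, Vector], float],
    value_head: Callable[[Vector], float],
    gamma: float,
) -> float:
    """Discounted predicted rewards over the sequence plus a bootstrapped value."""
    total, discount = 0.0, 1.0
    for action in action_seq:
        total += discount * reward_head(z, action)
        z = dynamics(z, action)  # imagine the next latent state
        discount *= gamma
    # After H + 1 steps, discount equals gamma^(H+1) and z is the final latent.
    return total + discount * value_head(z)


def plan_random_shooting(
    z0: Vector,
    dynamics: Callable[[Vector, Vector], Vector],
    reward_head: Callable[[Vector, Vector], float],
    value_head: Callable[[Vector], float],
    horizon: int,
    gamma: float,
    num_candidates: int,
    action_dim: int,
    rng: random.Random,
) -> Tuple[List[Vector], float]:
    """Return the sampled action sequence with the highest predicted return."""
    best_seq: List[Vector] = []
    best_ret = float("-inf")
    for _ in range(num_candidates):
        # Assumption: actions are bounded in [-1, 1] per dimension.
        seq = [
            [rng.uniform(-1.0, 1.0) for _ in range(action_dim)]
            for _ in range(horizon + 1)
        ]
        ret = predicted_return(z0, seq, dynamics, reward_head, value_head, gamma)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq, best_ret
```

In a receding-horizon loop only the first action of the returned sequence would be executed before replanning. Every candidate requires a full model rollout, so the planning cost at decision time grows with both `num_candidates` and `horizon`.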
Newt extends these principles to massively multitask settings by training a single language-conditioned world model jointly across hundreds of tasks and embodiments. The resulting system jointly optimizes latent dynamics learning, value estimation, reward prediction, policy learning, and trajectory optimization within a shared multitask architecture.
In contrast, MR.Q [Fujimoto et al., [2025](https://arxiv.org/html/2606.05555#bib.bib19)] uses predictive latent modeling exclusively for representation learning rather than planning. Similar to TD-MPC2, observations are encoded into latent representations and auxiliary models predict future latent states, rewards, and termination signals. However, MR.Q does not perform latent rollouts for control or trajectory optimization. The predictive objectives are instead used solely as auxiliary supervision to shape the latent representation:

$$\mathcal{L}_{\text{MR.Q}} = \mathcal{L}_{\text{TD}} + \lambda \mathcal{L}_{\text{predictive}},$$

where $\mathcal{L}_{\text{predictive}}$ includes latent dynamics, reward, and termination prediction losses. Policy improvement remains entirely model-free and is performed through standard actor–critic optimization rather than planning.
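To make the combined objective concrete, here is a minimal per-transition sketch in Python. It assumes a TD3-style clipped double-Q target for the TD term and plain squared errors for the three predictive terms. These loss forms, the function names, and the example value of `lam` are illustrative assumptions; the MR.Q implementation may use different loss functions, targets, and weights.

```python
from typing import Sequence


def td_target(reward: float, done: float, gamma: float,
              q1_next: float, q2_next: float) -> float:
    """TD3-style target: bootstrap from the smaller of the two target critics."""
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)


def td_loss(q1: float, q2: float, target: float) -> float:
    """Squared TD error summed over the twin critics."""
    return (q1 - target) ** 2 + (q2 - target) ** 2


def predictive_loss(z_next_pred: Sequence[float], z_next_target: Sequence[float],
                    r_pred: float, r: float, d_pred: float, d: float) -> float:
    """Latent dynamics, reward, and termination errors (squared, as an assumption)."""
    dyn = sum((p - t) ** 2 for p, t in zip(z_next_pred, z_next_target))
    dyn /= len(z_next_target)
    return dyn + (r_pred - r) ** 2 + (d_pred - d) ** 2


def mrq_loss(td: float, pred: float, lam: float) -> float:
    """Combined objective: L_TD + lambda * L_predictive."""
    return td + lam * pred


# Example with made-up numbers for a single transition.
target = td_target(reward=1.0, done=0.0, gamma=0.9, q1_next=2.0, q2_next=3.0)
total = mrq_loss(
    td=td_loss(q1=2.5, q2=3.0, target=target),
    pred=predictive_loss([1.0, 2.0], [1.0, 4.0], r_pred=0.5, r=0.5, d_pred=0.0, d=0.0),
    lam=0.1,  # hypothetical weight
)
```

Note that no term here involves rolling the dynamics model forward to choose actions: the predictive heads only contribute gradients to the shared encoder.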
## Appendix G Training Protocol
For all experiments, we follow the multitask language-conditioned training protocol introduced in MMBench [Hansen et al., [2026](https://arxiv.org/html/2606.05555#bib.bib49)]. A single shared agent is trained jointly across tasks spanning multiple domains and embodiments using a unified multitask architecture. Task identity is provided through language instruction embeddings [Radford et al., [2021](https://arxiv.org/html/2606.05555#bib.bib137)], allowing the policy and value functions to condition behavior on the current task while sharing representations across environments. Following the official benchmark implementation, language embeddings are concatenated with state or latent features and used as additional conditioning signals throughout training.
Training is performed in an off\-policy setting using replay buffers that store transitions collected across all tasks\. During training, minibatches are sampled uniformly from the shared replay buffer and used to jointly optimize the actor, critics, and auxiliary predictive objectives\. Unless otherwise specified, all results are averaged over five random seeds\.
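The shared-buffer setup described above can be sketched as a fixed-capacity ring buffer that stores transitions from every task together and samples minibatches uniformly. The `Transition` fields, the `SharedReplayBuffer` class, and sampling with replacement are assumptions for illustration and are not taken from the benchmark code.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Transition:
    task_id: int                   # which task produced this transition
    state: Tuple[float, ...]
    action: Tuple[float, ...]
    reward: float
    next_state: Tuple[float, ...]
    done: bool


class SharedReplayBuffer:
    """Single buffer shared by all tasks; the oldest entries are overwritten first."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self._capacity = capacity
        self._storage: List[Transition] = []
        self._next = 0  # index of the slot to overwrite once the buffer is full
        self._rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        if len(self._storage) < self._capacity:
            self._storage.append(transition)
        else:
            self._storage[self._next] = transition
        self._next = (self._next + 1) % self._capacity

    def sample(self, batch_size: int) -> List[Transition]:
        """Uniform sampling (with replacement) over all stored transitions."""
        n = len(self._storage)
        return [self._storage[self._rng.randrange(n)] for _ in range(batch_size)]

    def __len__(self) -> int:
        return len(self._storage)
```

With uniform sampling, the task mixture of each minibatch follows the task mixture of the stored data, so tasks that contribute more transitions are sampled proportionally more often.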
For visual-observation experiments, we follow prior work [Hansen et al., [2026](https://arxiv.org/html/2606.05555#bib.bib49), Oquab et al., [2024](https://arxiv.org/html/2606.05555#bib.bib136)] and use a frozen DINOv2 encoder [Oquab et al., [2024](https://arxiv.org/html/2606.05555#bib.bib136)] to extract image representations from raw pixels. These pretrained visual features provide strong semantic representations and stabilize training in the high-dimensional input regime, allowing the downstream RL algorithm to focus on multitask adaptation and control rather than learning visual representations from scratch.
Our primary evaluation focuses on a challenging low\-data regime of 10M environment interactions, substantially smaller than the budgets commonly used in prior large\-scale multitask world\-model systems\. Additional experiments evaluate longer training horizons, model scaling, transfer, and update\-to\-data \(UTD\) scaling\. Evaluation follows the normalized\-score protocol introduced in MMBench, aggregating performance across tasks within each benchmark suite\.
## Appendix H Scaling with UTD
The update\-to\-data ratio \(UTD\) controls the number of gradient updates performed per environment interaction and serves as an important scaling axis for evaluating data reuse efficiency\. Increasing UTD effectively increases the amount of optimization performed on a fixed dataset, testing whether a method can efficiently extract information from available experience without destabilizing learning\.
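As a small illustration of what the UTD ratio controls, the loop below performs a fixed number of gradient updates after every environment step. The function `run_training`, its callbacks, and the `warmup_steps` parameter are hypothetical, and an integer UTD is assumed for simplicity.

```python
from typing import Callable


def run_training(
    num_env_steps: int,
    utd_ratio: int,
    env_step: Callable[[], None],         # collects one transition
    gradient_update: Callable[[], None],  # one optimizer step on a sampled batch
    warmup_steps: int = 0,                # assumed data-collection-only phase
) -> int:
    """Interleave data collection with `utd_ratio` updates per environment step."""
    updates = 0
    for step in range(num_env_steps):
        env_step()
        if step < warmup_steps:
            continue  # no updates until enough data has been collected
        for _ in range(utd_ratio):
            gradient_update()
            updates += 1
    return updates
```

At a fixed interaction budget, raising `utd_ratio` multiplies the number of gradient updates (and thus compute) without adding any new data, which is why it probes how well a method reuses replayed experience.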
[Fig. 11](https://arxiv.org/html/2606.05555#A8.F11) compares scaling behavior across different UTD values. MR.Q consistently benefits from larger UTD regimes, achieving improved performance as additional gradient updates are applied per interaction. In contrast, Newt exhibits weaker gains and greater sensitivity to increased update frequency. These results suggest that predictive representation learning enables more stable and effective reuse of replay data, allowing model-free methods to better exploit additional computation under fixed interaction budgets.
Figure 11: Scaling with UTD. Normalized score across five multitask suites. MR.Q benefits more from higher UTD than Newt, indicating better data reuse.
## Appendix I PCA Analyses
To further analyze the geometry of the learned multitask representations, we visualize latent features using Principal Component Analysis \(PCA\)\. We project latent representations extracted from trained checkpoints onto their top two principal components and color points according to task identity\.
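The projection step can be sketched in plain Python as follows: center the latent features, form their covariance matrix, and extract the top two principal directions by power iteration with deflation. This is a dependency-free illustration on synthetic data. The function names are hypothetical, and in practice a numerical library would normally be used instead. Coloring points by task identity is left out.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def center(X: Matrix) -> Matrix:
    """Subtract the per-dimension mean from every row."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - mean[j] for j in range(d)] for row in X]


def covariance(Xc: Matrix) -> Matrix:
    """Sample covariance matrix of already-centered data."""
    n, d = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(d)]
            for i in range(d)]


def matvec(C: Matrix, v: List[float]) -> List[float]:
    return [sum(c * x for c, x in zip(row, v)) for row in C]


def top_eigenpair(C: Matrix, iters: int = 500, seed: int = 0) -> Tuple[List[float], float]:
    """Dominant eigenvector and eigenvalue of a symmetric PSD matrix."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in C]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = matvec(C, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # v lies in the null space; keep the current direction
        v = [x / norm for x in w]
    eigval = sum(a * b for a, b in zip(v, matvec(C, v)))
    return v, eigval


def pca_2d(X: Matrix) -> Matrix:
    """Project rows of X onto their top two principal components."""
    Xc = center(X)
    C = covariance(Xc)
    v1, l1 = top_eigenpair(C, seed=0)
    # Deflation: remove the first component so the second becomes dominant.
    d = len(C)
    C2 = [[C[i][j] - l1 * v1[i] * v1[j] for j in range(d)] for i in range(d)]
    v2, _ = top_eigenpair(C2, seed=1)
    return [[sum(a * b for a, b in zip(row, v1)),
             sum(a * b for a, b in zip(row, v2))] for row in Xc]
```

The sign of each principal direction is arbitrary, so two runs can produce mirrored plots that carry the same information.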
Across both DMControl\-Ext and MuJoCo suites, MR\.Q learns substantially more structured and separated latent representations than the encoder\-free baseline\. Predictive representation learning produces higher\-rank embeddings with improved task separation and greater effective dimensionality, whereas removing representation learning leads to collapsed feature spaces with substantially reduced variance across dimensions\. These results complement the quantitative analyses presented in the main paper\. Together with the SRank measurements and dormant\-neuron analyses, the PCA visualizations suggest that predictive auxiliary objectives improve representation diversity and preserve expressive capacity in large multitask settings\.
Figure 12: PCA visualization on DMControl-Ext. Two-dimensional PCA projections of multitask latent representations learned by MR.Q and the encoder-free baseline (TD3). Predictive representation learning produces substantially more structured and separated task representations.

Figure 13: PCA visualization on MuJoCo. Latent representations learned by MR.Q exhibit higher diversity and improved task separation compared to the encoder-free baseline (TD3), indicating more expressive multitask representations.
## Appendix J Compute Resources
All experiments were conducted on NVIDIA A100 GPUs using distributed Slurm\-based compute clusters\. Most multitask experiments were trained on a single GPU with approximately 24–48 GB of memory\. Depending on the benchmark and model size, training required approximately 12–60 hours per run\. Results are averaged over five seeds\.
Beyond reducing training time, the computational efficiency of model-free agents equipped with predictive model-based representations has practical implications for how multitask RL systems are developed and studied. By avoiding explicit planning and latent rollout generation, our approach lowers the cost of experimentation and enables faster iteration cycles during development and finetuning. This can make large-scale multitask RL more accessible under limited compute budgets [Obando-Ceron and Castro, [2021](https://arxiv.org/html/2606.05555#bib.bib1)], allowing researchers to explore architectures [Ceron et al., [2024c](https://arxiv.org/html/2606.05555#bib.bib62), [b](https://arxiv.org/html/2606.05555#bib.bib63), Sokar et al., [2025](https://arxiv.org/html/2606.05555#bib.bib83), Liu et al., [2025a](https://arxiv.org/html/2606.05555#bib.bib90), Kooi et al., [2026](https://arxiv.org/html/2606.05555#bib.bib3)], hyperparameters [Andrychowicz et al., [2021](https://arxiv.org/html/2606.05555#bib.bib2), Obando Ceron et al., [2023](https://arxiv.org/html/2606.05555#bib.bib87), Ceron et al., [2024a](https://arxiv.org/html/2606.05555#bib.bib84)], and adaptation strategies without repeatedly incurring the cost of expensive model-based training pipelines.
These efficiency gains may create opportunities to scale multitask RL beyond the model sizes and experimental regimes commonly explored today\. Since additional compute is not spent on planning procedures, resources can instead be allocated toward larger networks, broader task distributions, or more extensive scaling studies\.
## Appendix K Per-task learning curves
In addition to the aggregate results presented in the main paper, we provide per\-task learning curves for all benchmark suites\. These plots offer a more fine\-grained view of training dynamics across individual environments\.
Figure 14: Atari per-game learning performance. MR.Q, a model-free agent augmented with predictive model-based representations, consistently matches or surpasses the world-model-based approach Newt across Atari tasks. Shaded regions denote 95% confidence intervals (CIs).

Figure 15: Box2D per-game learning performance. MR.Q, a model-free agent augmented with predictive model-based representations, consistently matches or surpasses the world-model-based approach Newt across Box2D tasks. Shaded regions denote 95% confidence intervals (CIs).

Figure 16: DMControl per-game learning performance. MR.Q, a model-free agent augmented with predictive model-based representations, consistently matches or surpasses the world-model-based approach Newt across DMControl tasks. Shaded regions denote 95% confidence intervals (CIs).

Figure 17: DMControl-Ext per-game learning performance. MR.Q, a model-free agent augmented with predictive model-based representations, consistently matches or surpasses the world-model-based approach Newt across DMControl-Ext tasks. Shaded regions denote 95% confidence intervals (CIs).

Figure 18: ManiSkill per-game learning performance. MR.Q, a model-free agent augmented with predictive model-based representations, consistently matches or surpasses the world-model-based approach Newt across ManiSkill tasks. Shaded regions denote 95% confidence intervals (CIs).

Figure 19: MetaWorld per-game learning performance. MR.Q, a model-free agent augmented with predictive model-based representations, consistently matches or surpasses the world-model-based approach Newt across MetaWorld tasks. Shaded regions denote 95% confidence intervals (CIs).

Figure 20: MuJoCo per-game learning performance. MR.Q, a model-free agent augmented with predictive model-based representations, consistently matches or surpasses the world-model-based approach Newt across MuJoCo tasks. Shaded regions denote 95% confidence intervals (CIs).

Figure 21: OGBench per-game learning performance. MR.Q, a model-free agent augmented with predictive model-based representations, consistently matches or surpasses the world-model-based approach Newt across OGBench tasks. Shaded regions denote 95% confidence intervals (CIs).

Figure 22: PyGame per-game learning performance. MR.Q, a model-free agent augmented with predictive model-based representations, consistently matches or surpasses the world-model-based approach Newt across PyGame tasks. Shaded regions denote 95% confidence intervals (CIs).

Figure 23: RoboDesk per-game learning performance. MR.Q, a model-free agent augmented with predictive model-based representations, consistently matches or surpasses the world-model-based approach Newt across RoboDesk tasks. Shaded regions denote 95% confidence intervals (CIs).
## Appendix L Finetuning: Per-task learning curves
To evaluate transfer to unseen tasks, we finetune pretrained multitask checkpoints on held\-out environments using online RL\. All experiments are initialized from the same multitask checkpoint and finetuned under identical interaction budgets\. These experiments evaluate whether the representations learned during multitask pretraining transfer effectively to novel tasks and support rapid adaptation under limited additional experience\. See[Sec\. 5](https://arxiv.org/html/2606.05555#S5)for more details\.
Figure 24: Per-task finetuning performance on held-out environments. Learning curves during online finetuning from pretrained multitask checkpoints. MR.Q consistently achieves stronger zero-shot initialization and faster adaptation across the majority of held-out tasks, indicating improved transfer and representation reuse. Shaded regions denote 95% confidence intervals (CIs).