Towards Scalable Multi-Task Reinforcement Learning with Large Decision Models
Summary
This paper introduces LDM-v0, a large decision model trained offline on trajectories from thousands of diverse reinforcement learning environments, demonstrating that a single transformer policy can match the performance of task-specific policies across robotics, autonomous driving, inventory management, cybersecurity, trading, and video games.
Cached at: 06/25/26, 05:08 AM
# Towards Scalable Multi-Task Reinforcement Learning with Large Decision Models
Source: [https://arxiv.org/html/2606.24962](https://arxiv.org/html/2606.24962)
###### Abstract
Recent progress in large\-scale sequence modeling has shown that a single model can learn useful representations across highly diverse data distributions\. Inspired by these advances, we investigate whether a unified transformer policy can be trained across large collections of heterogeneous reinforcement learning environments\.
We introduce LDM\-v0, a Large Decision Model trained offline on trajectories collected from thousands of environments spanning multiple domains and modalities\. LDM\-v0 is a multi\-task, multi\-modal transformer policy conditioned on histories of observations, actions, rewards, and termination signals, and trained through supervised next\-action prediction over offline trajectories\. We describe the environment infrastructure, automated data generation pipeline, model architecture, and training methodology used to build LDM\-v0, and evaluate its performance across diverse environments\. We show that a single pretrained model matches the performance of independently trained task\-specific reference policies on approximately 1,000 environments including robotics, autonomous driving, inventory management, cybersecurity, trading, and video games\. These results demonstrate the feasibility of large\-scale offline pretraining across heterogeneous reinforcement learning environments using a single transformer policy\.
## 1 Introduction
Reinforcement Learning \(RL\) provides a general framework for sequential decision making and has achieved impressive results in domains such as games, robotics, resource optimization, and control\. Despite this progress, applying RL in real\-world settings remains difficult\. Modern RL systems often require extensive environment interaction, careful reward engineering, domain\-specific architectures, and substantial hyperparameter tuning\. As a consequence, many successful applications rely heavily on expert knowledge and task\-specific design choices\.
Offline RL and offline\-to\-online RL partially address these limitations by leveraging previously collected trajectories to reduce costly online interaction\. However, selecting and adapting a suitable RL algorithm for a new environment remains challenging \(Nie et al\., [2022](https://arxiv.org/html/2606.24962#bib.bib14)\)\. These difficulties motivate the development of more general and automated approaches to reinforcement learning\.
In parallel, large\-scale sequence models trained on diverse datasets have transformed natural language processing and, more recently, computer vision and multimodal learning\. In RL, sequence\-modeling approaches such as Decision Transformers \(Chen et al\., [2021](https://arxiv.org/html/2606.24962#bib.bib4)\) have shown that policies can be represented as autoregressive models over trajectories\. These results raise an important question: can multi\-domain offline RL trajectories be consolidated into a single scalable transformer policy while maintaining strong task performance across many domains? One key challenge is that multi\-domain RL ecosystems are fragmented and hard to unify\.
In this work, we explore this direction by building a unified multi\-domain RL infrastructure, using it to generate large\-scale trajectories, and training LDM\-v0, a Large Decision Model instantiated as a single transformer policy\. LDM\-v0 is a multi\-task and multi\-modal model that predicts the next action conditioned on past observations, actions, rewards, and termination signals together with the current observation\. Our primary objective is not to study out\-of\-distribution generalization, but rather to investigate whether a single transformer policy can jointly model diverse RL behaviors at scale\.
We present the environment infrastructure, large\-scale RL dataset generation pipeline, model architecture and training methodology used to train LDM\-v0, and evaluate its performance across a diverse collection of environments\. More broadly, we view LDM\-v0 as a step toward scalable pretrained reinforcement learning systems\.
## 2 Related Work
Transformer architectures \(Vaswani et al\., [2017](https://arxiv.org/html/2606.24962#bib.bib20)\) have recently become an important framework for reinforcement learning, motivated by their success in large\-scale sequence modeling \(Brown et al\., [2020](https://arxiv.org/html/2606.24962#bib.bib3)\)\. Sequence\-modeling approaches such as Decision Transformer \(Chen et al\., [2021](https://arxiv.org/html/2606.24962#bib.bib4)\) and Trajectory Transformer \(Janner et al\., [2021](https://arxiv.org/html/2606.24962#bib.bib10)\) demonstrated that RL policies can be formulated as autoregressive models over trajectories, while subsequent work has further explored the role of transformers in RL \(Agarwal et al\., [2023](https://arxiv.org/html/2606.24962#bib.bib1)\)\.
Our work is also related to meta\-reinforcement learning\. Traditional meta\-RL methods \(Finn et al\., [2017](https://arxiv.org/html/2606.24962#bib.bib6); Duan et al\., [2016](https://arxiv.org/html/2606.24962#bib.bib5); Beck et al\., [2023](https://arxiv.org/html/2606.24962#bib.bib2)\) aim to learn agents that can adapt rapidly across tasks, often through explicit task distributions, recurrent policies, or gradient\-based adaptation\. More recently, in\-context RL approaches have investigated whether transformers can learn adaptation strategies directly from trajectory context \(Laskin et al\., [2023](https://arxiv.org/html/2606.24962#bib.bib12); Lee et al\., [2023](https://arxiv.org/html/2606.24962#bib.bib13); Team et al\., [2023](https://arxiv.org/html/2606.24962#bib.bib19); Grigsby et al\., [2023](https://arxiv.org/html/2606.24962#bib.bib8); [2024](https://arxiv.org/html/2606.24962#bib.bib9); [Kumar et al\.](https://arxiv.org/html/2606.24962#bib.bib11); Sridhar et al\., [2024](https://arxiv.org/html/2606.24962#bib.bib18); Petrov et al\., [2024](https://arxiv.org/html/2606.24962#bib.bib15); Raparthy et al\., [2023](https://arxiv.org/html/2606.24962#bib.bib16)\)\. In particular, Lee et al\. \([2023](https://arxiv.org/html/2606.24962#bib.bib13)\) cast meta\-RL as supervised pretraining, where a transformer predicts actions conditioned on a query state and an in\-context dataset of prior interactions\. This perspective has also been connected theoretically to Bayesian posterior sampling \(Wang et al\., [2024](https://arxiv.org/html/2606.24962#bib.bib21)\)\.
LDM\-v0 is most closely inspired by Gato \(Reed et al\., [2022](https://arxiv.org/html/2606.24962#bib.bib17)\) and supervised in\-context RL pretraining \(Lee et al\., [2023](https://arxiv.org/html/2606.24962#bib.bib13)\)\. Unlike prior generalist transformer agents \(Reed et al\., [2022](https://arxiv.org/html/2606.24962#bib.bib17); Gallouédec et al\., [2024](https://arxiv.org/html/2606.24962#bib.bib7)\) that combine RL with language, robotics, or internet\-scale supervised data, LDM\-v0 focuses specifically on scalable reinforcement learning pretraining across highly heterogeneous RL ecosystems\. Our work emphasizes multi\-domain environment integration \(see Table [1](https://arxiv.org/html/2606.24962#S2.T1)\), large\-scale trajectory generation across thousands of Gym/Gymnasium\-compatible environments, and compact transition\-level sequence modeling\.
Table 1: Comparison of large\-scale transformer policies trained across multi\-domain reinforcement learning environments\. The table reports the number of reinforcement learning libraries used, the approximate number of tasks or environments covered, the input/output modalities considered, and the number of tokens used to encode each trajectory timestep\. For Gato, this number varies with the dimensionality of the observation and action spaces because different dimensions are tokenized separately; the paper reports more than 100 tokens for a single image observation\.
## 3 Method
The goal of LDM\-v0 is to train a single policy model across a large and heterogeneous collection of reinforcement learning environments\. Our approach combines automated reference\-policy supervision with a unified sequence\-modeling architecture: we first generate supervised policy data by training task\-specific RL agents and retaining high\-performing policies as references, then train a transformer policy to predict reference actions from interaction histories and current observations\.
### 3\.1 Automated Reference\-Policy Supervision
Generating high\-quality trajectories across diverse environments is challenging because no single RL algorithm or hyperparameter configuration performs well across all domains\. We therefore treat dataset construction as an AutoRL problem: for each environment family, the pipeline empirically ranks candidate algorithms and configurations, trains strong task\-specific reference policies, records their trajectories, and annotates collected observations with final\-policy actions\. LDM\-v0 is thus trained to imitate strong task\-specific policies rather than exploratory actions\. Low\-quality runs are removed using performance\-based filtering; implementation details are given in Section [4](https://arxiv.org/html/2606.24962#S4)\.
### 3\.2 LDM\-v0 Architecture
The primary design objective of LDM\-v0 is to support unified training across varied reinforcement learning environments, including domains with different observation modalities, action spaces, and temporal dynamics\. An overview of the architecture is shown in Fig\. [1](https://arxiv.org/html/2606.24962#S3.F1)\.
LDM\-v0 receives an interaction history and the current observation as input\. Observations, previous actions, rewards, and termination signals are encoded using modality\-specific encoders, merged into transition\-level embeddings, processed by a decoder\-only transformer backbone, and decoded into an action prediction\.
\[Figure 1 diagram: the interaction history with an environment \(past observations, past actions, past rewards, past termination/truncation flags\) and the current observation pass through the LDM\-v0 encoders \(observation encoder: continuous, \(multi\)discrete, image; action encoder: continuous, \(multi\)discrete; reward encoder; termination/truncation encoder\), a Llama transformer backbone trained from scratch, and a linear action decoder; the predicted action is compared with the reference action through a cross\-entropy loss\.\]

Figure 1: Architecture of LDM\-v0\. LDM\-v0 receives an interaction history and the current observation, encodes each modality, merges them into transition\-level embeddings \(containing an observation at timestep $t$ and action/reward/done at timestep $t-1$\), processes them with a Llama backbone, and decodes the predicted action\. During training, the prediction is supervised using strong task\-specific reference policies\.
### 3\.3 Tokenization and Embedding
LDM\-v0 converts multi\-modal environment interactions into a unified tokenized representation compatible with transformer sequence modeling\.
- Continuous inputs are $\mu$\-law encoded, discretized into 1024 bins, and mapped through learned embedding tables\.
- Discrete inputs are embedded through lookup tables\.
- Image observations are resized to a common resolution of $(64, 64, 3)$ and encoded using a convolutional encoder\.
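The continuous-input path in the list above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text only states that inputs are mu-law encoded and split into 1024 bins, so the companding constants `mu=100` and `m=256` (the values used by Gato), the clipping to [-1, 1], the uniform binning, and the embedding width of 16 are all assumptions.

```python
import numpy as np

NUM_BINS = 1024  # number of bins stated in the text


def mu_law_encode(x, mu=100.0, m=256.0):
    """Compress values with a mu-law curve (mu and m are assumed, Gato-style values)."""
    x = np.asarray(x, dtype=np.float64)
    return np.sign(x) * np.log(np.abs(x) * mu + 1.0) / np.log(m * mu + 1.0)


def discretize(x, num_bins=NUM_BINS):
    """Map continuous values to integer bin indices in [0, num_bins - 1]."""
    y = np.clip(mu_law_encode(x), -1.0, 1.0)  # assumed clipping range
    bins = np.floor((y + 1.0) / 2.0 * num_bins).astype(np.int64)
    return np.clip(bins, 0, num_bins - 1)


# Stand-in for a learned embedding table: one row per bin.
# It is random here; in a real model it would be a trained parameter.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(NUM_BINS, 16))

observation = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
bin_ids = discretize(observation)       # one integer token per dimension
embeddings = embedding_table[bin_ids]   # shape (5, 16)
```

The logarithmic compression gives small magnitudes finer resolution than large ones, which is the usual motivation for mu-law companding before uniform binning.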
The different dimensions of each observation are stacked into an observation embedding, using a maximum observation dimension of 128 \(padding is applied for smaller observations\)\. The same is done for multi\-dimensional actions, with a maximum action dimension of 28\. The observation/action/reward/termination embeddings are then aligned \(observation aligns with previous action/reward/termination signal; padded in case of first observation\), stacked and processed to a transition\-level meta\-token using a linear layer\.
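The alignment and packing step described above can be sketched in NumPy. The maximum dimensions (128 and 28) come from the text; everything else is an assumption made for illustration: the per-dimension embedding width `e`, zero padding, flattening the stacked embeddings before the linear layer, and the helper names `pad_dims`, `shift_previous`, and `pack_transitions`.

```python
import numpy as np

MAX_OBS_DIM = 128  # from the text
MAX_ACT_DIM = 28   # from the text


def pad_dims(x, max_dim):
    """Zero-pad axis 1 of a (T, k, e) array up to max_dim entries."""
    t, k, e = x.shape
    if k > max_dim:
        raise ValueError(f"{k} dimensions exceed the maximum of {max_dim}")
    out = np.zeros((t, max_dim, e), dtype=x.dtype)
    out[:, :k] = x
    return out


def shift_previous(x):
    """Row t receives x[t-1]; row 0 is padding because no previous step exists."""
    out = np.zeros_like(x)
    out[1:] = x[:-1]
    return out


def pack_transitions(obs, act, rew, done, weight):
    """Merge per-step embeddings into one vector (meta-token) per timestep."""
    obs = pad_dims(obs, MAX_OBS_DIM)                    # observation at step t
    act = shift_previous(pad_dims(act, MAX_ACT_DIM))    # action at step t-1
    rew = shift_previous(rew)                           # reward at step t-1
    done = shift_previous(done)                         # termination at step t-1
    stacked = np.concatenate([obs, act, rew, done], axis=1)
    flat = stacked.reshape(stacked.shape[0], -1)
    return flat @ weight                                # linear layer without bias


rng = np.random.default_rng(0)
T, e, d_model = 5, 4, 8
obs = rng.normal(size=(T, 17, e))   # 17-dimensional observation
act = rng.normal(size=(T, 6, e))    # 6-dimensional action
rew = rng.normal(size=(T, 1, e))
done = rng.normal(size=(T, 1, e))
weight = rng.normal(size=((MAX_OBS_DIM + MAX_ACT_DIM + 2) * e, d_model))
tokens = pack_transitions(obs, act, rew, done, weight)  # shape (T, d_model)
```

With this layout each timestep costs a single position in the transformer context, regardless of how many observation or action dimensions the environment has.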
This representation maps multi\-modal environment interactions into a shared latent sequence representation compatible with a single transformer backbone\. This compact transition\-level packing also reduces sequence length compared to per\-dimension tokenization approaches \(Reed et al\., [2022](https://arxiv.org/html/2606.24962#bib.bib17)\) and enables longer interaction histories within a fixed transformer context budget\.
### 3\.4 Backbone
LDM\-v0 uses a decoder\-only transformer backbone based on the Llama architecture\. The backbone processes transition\-level embeddings autoregressively and produces contextualized representations used for action prediction\.
The model is trained entirely from scratch on the dataset described in Section [4](https://arxiv.org/html/2606.24962#S4), as we did not observe measurable improvements from initializing with language\-model checkpoints in preliminary experiments\.
### 3\.5 Action Decoder and Training Objective
The transformer output is passed to a linear action decoder that predicts logits over discretized action bins for each action dimension\. For continuous actions, the decoder predicts the corresponding discretized bin; for discrete actions, it predicts the corresponding action category\.
LDM\-v0 is trained to predict reference actions from trajectories generated by RL agents trained independently on each environment \(described in Section [4\.2](https://arxiv.org/html/2606.24962#S4.SS2)\)\. Training is performed using a standard cross\-entropy loss over discretized actions\.
Formally, the model learns an autoregressive policy of the form:
$$a_T = \mathrm{LDM}\Big((o_i)_{i=1}^{T},\ (a_i, r_i, d_i)_{i=1}^{T-1}\Big),$$
where $o_i$, $a_i$, $r_i$, and $d_i$ respectively denote observations, actions, rewards, and termination indicators, and $T$ is the current timestep\. The model autoregressively predicts actions conditioned on the trajectory history retained within the context window\.
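The cross-entropy objective over discretized actions described in this section can be sketched as below. The shapes are assumptions: logits of shape `(batch, action_dims, bins)` with one softmax per action dimension, and a `valid` mask that excludes padded action dimensions from the average. The text does not say how padded dimensions are treated, so the mask is an illustrative choice.

```python
import numpy as np


def masked_action_cross_entropy(logits, targets, valid):
    """Mean cross-entropy over valid action dimensions.

    logits:  (batch, action_dims, bins) unnormalized scores
    targets: (batch, action_dims) reference bin or category indices
    valid:   (batch, action_dims) 1 for real dimensions, 0 for padding
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.int64)
    valid = np.asarray(valid, dtype=np.float64)
    # Numerically stable log-softmax over the bin axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Log-probability assigned to each reference action.
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return float(-(picked * valid).sum() / valid.sum())


logits = np.zeros((2, 3, 4))                 # uniform prediction over 4 bins
targets = np.array([[0, 1, 2], [3, 0, 0]])
valid = np.array([[1, 1, 0], [1, 0, 0]])     # padded dimensions are masked out
loss = masked_action_cross_entropy(logits, targets, valid)
```

For a uniform prediction over `K` bins the loss equals `log(K)`, which gives a quick sanity check for an untrained decoder.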
## 4 Experimental Setup
### 4\.1 Environments
Public RL environments provide a natural testbed for large\-scale heterogeneous policy training: they cover a wide range of control, optimization, and sequential decision\-making problems, and vary substantially in observation modalities, action spaces, temporal horizons, and reward structures\. We collect training environments from publicly available GitHub repositories implementing OpenAI Gym\- or Gymnasium\-compatible interfaces\. Although these environments share a common high\-level API, they often rely on different Python versions, package ecosystems, and simulator dependencies, which makes unified large\-scale training difficult in practice\.
To address this challenge, we developed an internal environment orchestration framework that encapsulates each environment library within isolated Docker containers and exposes a unified interaction API compatible with modern Gymnasium interfaces\. This infrastructure enables scalable and reproducible interaction with multi\-domain RL environments while preserving compatibility with legacy dependencies\.
The process of fetching, validating, containerizing, and integrating environment repositories into the framework was partly automated\. Using this pipeline, we collected 146 environment libraries corresponding to approximately 15,000 individual environments\.
The list of integrated libraries and their corresponding number of environments is summarized in Appendix [A](https://arxiv.org/html/2606.24962#A1)\. We note that the number of environments alone is not necessarily indicative of behavioral diversity, as some libraries contain many closely related tasks while others expose fewer but highly configurable environments\.
### 4\.2 Reference\-Policy Data Generation
We instantiate the automated reference\-policy supervision pipeline described in Section [3\.1](https://arxiv.org/html/2606.24962#S3.SS1) as follows\.
We define a fixed pool of candidate algorithm/configuration pairs drawn from Stable\-Baselines3 and SB3\-Contrib, including A2C, ARS, DDPG, DQN, PPO, QR\-DQN, SAC, TD3, TQC, and TRPO\. Candidates use either default library hyperparameters or predefined alternatives; the complete set is reported in Appendix [C](https://arxiv.org/html/2606.24962#A3)\. A candidate is evaluated only when compatible with the environment observation and action spaces, and no hyperparameters are tuned manually for individual environments\.
For each environment library, compatible candidates are first trained on a subset of five environments for 3 million transitions\. Each run is evaluated by the mean episode return over the last 10% of training episodes\. Candidate configurations are then ranked using pairwise comparisons: for each environment, a configuration receives one point over another only when its final performance is significantly higher according to a one\-tailed Welch’s t\-test\. Scores are summed across the five\-environment subset to obtain a library\-level ranking\.
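The pairwise ranking described above can be sketched with SciPy's Welch test (`scipy.stats.ttest_ind` with `equal_var=False` and `alternative="greater"`). Several details are assumptions, since the text does not state them: the significance level `alpha=0.05`, the use of the late-training episode returns as the two samples, and alphabetical tie-breaking. The function names and toy returns are hypothetical.

```python
from itertools import permutations

from scipy.stats import ttest_ind


def pairwise_points(returns_by_config, alpha=0.05):
    """Score configurations on one environment.

    returns_by_config maps a configuration name to its late-training episode
    returns. A configuration earns one point over another only if its returns
    are significantly higher under a one-tailed Welch's t-test.
    """
    points = {name: 0 for name in returns_by_config}
    for a, b in permutations(returns_by_config, 2):
        result = ttest_ind(
            returns_by_config[a],
            returns_by_config[b],
            equal_var=False,        # Welch's variant (unequal variances)
            alternative="greater",  # one-tailed: is a better than b?
        )
        if result.pvalue < alpha:
            points[a] += 1
    return points


def rank_configs(returns_by_env, alpha=0.05):
    """Sum points over environments and return (ranking, totals)."""
    totals = {}
    for per_config in returns_by_env.values():
        for name, pts in pairwise_points(per_config, alpha).items():
            totals[name] = totals.get(name, 0) + pts
    ranking = sorted(totals, key=lambda n: (-totals[n], n))
    return ranking, totals


# Toy returns for two environments of one library (made-up numbers).
toy = {
    "env_1": {
        "ppo": [10, 11, 9, 10, 12, 10, 11, 9],
        "a2c": [5, 6, 4, 5, 6, 5, 4, 6],
        "dqn": [5, 5, 6, 4, 5, 6, 4, 5],
    },
    "env_2": {
        "ppo": [20, 21, 19, 22, 20, 21, 19, 20],
        "a2c": [8, 9, 7, 8, 9, 8, 7, 9],
        "dqn": [8, 8, 9, 7, 8, 9, 7, 8],
    },
}
ranking, totals = rank_configs(toy)
```

Requiring significance before awarding a point means that two configurations with statistically indistinguishable returns do not affect each other's score.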
We then select the top\-$N$ algorithm/configuration pairs for each environment \($N=3$ in our experiments\), and train each for 3 million transitions\. Due to computational constraints, we cap the number of environments to 250 per library\. During training, all interaction trajectories are recorded, including observations, actions, rewards, and termination signals\.
After training, the final trained policy is replayed over all collected observations to generate reference action annotations\. Consequently, the supervision targets used to train LDM\-v0 correspond to actions produced by the final policy rather than the potentially exploratory actions originally taken during data collection\.
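The relabeling step can be illustrated with a small sketch. The record layout and the names `relabel_with_final_policy` and `final_policy` are hypothetical; with a Stable-Baselines3 model, the policy callable could wrap `model.predict(obs, deterministic=True)`, which returns an `(action, state)` pair.

```python
def relabel_with_final_policy(steps, final_policy):
    """Return copies of the recorded steps with an added 'reference_action'.

    The action originally taken (possibly exploratory) is kept as context,
    while the supervision target comes from the final trained policy.
    """
    return [
        {**step, "reference_action": final_policy(step["observation"])}
        for step in steps
    ]


# Toy recorded steps: the actions taken during training were exploratory.
recorded = [
    {"observation": -0.4, "action": 1, "reward": 0.0, "done": False},
    {"observation": 0.7, "action": 0, "reward": 1.0, "done": True},
]


def final_policy(observation):
    """Toy deterministic policy: act 1 when the observation is positive."""
    return int(observation > 0)


labelled = relabel_with_final_policy(recorded, final_policy)
```

In this toy example both recorded actions disagree with the final policy, so the supervision targets differ from the actions stored in the history.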
We apply several data curation steps to improve dataset quality:
- We remove trajectories from training runs that do not exhibit statistically significant performance improvement during learning \(assessed with a one\-tailed Welch’s t\-test on episode returns from the first and last 10% of training episodes\)\.
- We remove trajectories whose final policy performance falls below 95% of the best\-performing policy trained on the same environment\.
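The second curation rule above (the 95% criterion) can be sketched as a simple filter. The data layout and function name are hypothetical, and the sketch assumes that the best final return on each environment is positive; the text does not say how environments with negative returns are handled.

```python
def keep_high_performing(final_returns, ratio=0.95):
    """Keep runs whose final return reaches `ratio` times the best on that environment.

    final_returns maps environment id -> {run id: final mean episode return}.
    Assumes the best return per environment is positive (an assumption, see lead-in).
    """
    kept = {}
    for env_id, runs in final_returns.items():
        best = max(runs.values())
        kept[env_id] = sorted(
            run_id for run_id, value in runs.items() if value >= ratio * best
        )
    return kept


# Made-up final returns for two environments.
final_returns = {
    "env_a": {"ppo": 100.0, "sac": 96.0, "a2c": 90.0},
    "env_b": {"dqn": 20.0},
}
kept_runs = keep_high_performing(final_returns)
```

Because the threshold is relative to the best run on the same environment, an environment can contribute more than one agent, which is consistent with the dataset containing more agents than environments.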
The resulting dataset contains approximately 4,000 high\-performing agents spanning 3,000 distinct environments and a total of 9\.3 billion transitions annotated with reference policy actions\. A summary of the dataset is provided in Appendix [B](https://arxiv.org/html/2606.24962#A2)\.
### 4\.3 LDM\-v0 Training Details
Unless otherwise specified, the main LDM\-v0 model has 12 hidden layers, 12 attention heads, a hidden size of 768 and a context length of 2048 transitions\. The model has 308M parameters and is trained from scratch for six days using two nodes equipped with eight NVIDIA H200 GPUs each\.
We use the AdamW optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and a constant learning rate of $2.5 \times 10^{-4}$\. Training uses Distributed Data Parallel with a total batch size of 1024\.
The wall\-clock time required to train one reference policy for 3 million transitions varies substantially across environments, since the dominant cost is often environment simulation rather than policy optimization\. Reference\-policy data generation took 12 weeks in total on four servers, consisting of two nodes with 8 NVIDIA H200 GPUs each and two nodes with 4 NVIDIA RTX 4090 GPUs each, for a total of 608 CPU cores\. The generated dataset occupies 29 terabytes of storage\.
### 4\.4 Evaluation Protocol
For each environment, LDM\-v0 performance is averaged over two fresh rollouts of 10 episodes each\. The context is initialized empty before each rollout, and the interaction history is kept across episodes during deployment\. We use deterministic decoding at evaluation time\. The action\-space mask is applied to ensure that predicted actions are feasible, for example within the valid bounds for continuous actions or within the allowed range for discrete and multi\-discrete actions\. We then select the action corresponding to the maximum logit without sampling\. Each environment score is normalized relative to the corresponding task\-specific reference\-policy training trajectory\. The average episode return over the first 10% of the trajectory is treated as poor performance, corresponding to 0%, while the average over the last 10% is treated as reference performance, corresponding to 100%\.
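The score normalization and the masked greedy decoding described above can be sketched as follows. The function names are hypothetical, and the rounding rule used to pick the first and last 10% of episodes is an assumption.

```python
from statistics import mean


def normalized_score(episode_return, training_returns, fraction=0.10):
    """Express a return as a percentage between early and late training performance.

    The mean over the first `fraction` of the reference training run maps to 0%,
    and the mean over the last `fraction` maps to 100%.
    """
    n = max(1, int(len(training_returns) * fraction))
    low = mean(training_returns[:n])
    high = mean(training_returns[-n:])
    if high == low:
        raise ValueError("reference run shows no improvement; score undefined")
    return 100.0 * (episode_return - low) / (high - low)


def masked_argmax(logits, allowed):
    """Greedy decoding restricted to feasible actions (no sampling)."""
    candidates = [i for i, ok in enumerate(allowed) if ok]
    if not candidates:
        raise ValueError("no feasible action")
    return max(candidates, key=lambda i: logits[i])


# Toy reference run whose episode return grows from 0 to 99.
reference_run = list(range(100))
score = normalized_score(49.5, reference_run)                       # halfway
action = masked_argmax([0.1, 5.0, 2.0], [True, False, True])        # index 1 is infeasible
```

Scores above 100% are possible under this definition whenever the evaluated policy outperforms the late-training average of its reference run.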
## 5 Results
We evaluate whether a single pretrained LDM\-v0 model can achieve strong performance across heterogeneous training environments, and study how performance changes with model size\.
### 5\.1 Training Environments
Figure [2](https://arxiv.org/html/2606.24962#S5.F2) shows a performance\-threshold curve for LDM\-v0 against task\-specific reference agents on training environments\. All reported results are obtained using a single pretrained model with a single shared set of parameters across all environments\.
Performance varies substantially across libraries, but strong results are observed across many unrelated domains despite the use of a single shared parameterization\. LDM\-v0 achieves more than 80% of the reference policy performance on over 1,600 environments spanning a broad range of domains and matches reference\-policy performance on approximately 1,000 environments\.
In particular, LDM\-v0 demonstrates strong performance across a wide variety of practical sequential decision\-making problems\. Performance comparable to task\-specific high\-performing agents is observed in robotic manipulation and control \(gymnasium\_robotics, panda\-gym\), drone and UAV control \(gym\-copter, jsbgym\), autonomous driving simulation \(highway\-env\), electric motor control \(gym\_electricmotors\), smart\-grid and energy\-management tasks \(building\-energy\-storage\-simulation\), financial trading \(gym\_trading\_env\), inventory optimization \(gym\_inventory\), cybersecurity\-oriented environments \(cymnasium\), and plant or crop optimization tasks involving branching and growth control \(growspace\)\.
Beyond real\-world\-inspired domains, LDM\-v0 also achieves strong results on complex game\-like environments including Atari\-style tasks \(gym\_masked\_atari, gym\_super\_mario\_bros\), and procedurally generated environments such as procgen\.
These results indicate that a single shared parameterization can achieve competitive performance across diverse settings, and suggest that large sequence models may benefit from statistical regularities shared across subsets of environments despite substantial heterogeneity in observation modalities, action spaces, reward scales, and temporal horizons\.
Figure 2: Performance\-threshold curve of LDM\-v0 across training environments\. For each threshold on the x\-axis, the y\-axis reports the number of environments where LDM\-v0 achieves at least that percentage of the corresponding task\-specific reference\-policy return\. An annotation on the plot marks more than 1,600 environments at the 80% threshold\.
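A performance-threshold curve of the kind shown in Figure 2 can be computed from per-environment normalized scores as sketched below. The scores are made-up numbers for illustration, not results from the paper, and the function name is hypothetical.

```python
def threshold_curve(scores, thresholds):
    """For each threshold, count environments whose score reaches at least it."""
    return [sum(score >= t for score in scores) for t in thresholds]


# Made-up normalized scores (in percent of reference performance).
toy_scores = [120.0, 95.0, 80.0, 40.0, 10.0]
toy_thresholds = [0, 50, 80, 100]
counts = threshold_curve(toy_scores, toy_thresholds)
```

The curve is non-increasing by construction, so reading it at 80% and at 100% gives the two headline counts reported in this section.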
### 5\.2 Model Scaling Experiments
Figure [3](https://arxiv.org/html/2606.24962#S5.F3) shows the in\-distribution performance of LDM\-v0 as a function of model size\. We evaluate four model scales: 32M, 70M, 308M, and 736M parameters\.
All models use the same transformer context length of 2048 transitions\. The batch size is fixed to 1024 for all models except the 736M variant, which uses a batch size of 384 due to GPU memory limitations\.
For each model, we report the percentage of training environments where the pretrained model achieves at least 80% of reference performance as training progresses\.
We observe substantial performance improvements when increasing model size from 32M to 308M parameters, with larger models generally achieving stronger and more stable performance across training\. Performance gains appear to plateau between the 308M and 736M models, although additional experiments would be required to characterize scaling behavior more precisely\.
These results nevertheless suggest that heterogeneous offline RL sequence modeling can benefit from increased model capacity, similarly to trends observed in other large\-scale sequence modeling domains\.
Figure 3: Model size scaling law: in\-distribution performance as a function of transitions processed\.
## 6 Discussion and Future Work
We presented LDM\-v0, a large\-scale transformer policy trained through a unified offline reinforcement learning pipeline built on automated environment orchestration and large\-scale trajectory generation across highly diverse RL environments\. Our results demonstrate the feasibility of scalable heterogeneous RL pretraining using a single shared model across multiple domains and modalities\.
A primary direction for future work is scaling the size and diversity of the training dataset: due to computational limitations we trained only on roughly 19% of our current environment pool, which can be further expanded\.
An important limitation of the current work is that evaluations are primarily conducted on environments contained within the training distribution\. Understanding the extent to which LDM\-v0 acquires transferable decision\-making strategies, rather than learning predominantly environment\-specific behaviors, remains an important open question\. Future work will investigate both in\-context adaptation and offline/online finetuning approaches for improving generalization to unseen environments\.
More broadly, we hope this work motivates further research on scalable and automated reinforcement learning systems, including future large\-scale pretrained reinforcement learning models\.
## Appendix A Libraries
Table 2: GitHub libraries currently available in our Environments API\.
## Appendix B Datasets
Table 3: Datasets used to train LDM\-v0\.
## Appendix C Candidate algorithm/configuration pairs
Table 4: Candidate reinforcement learning algorithm/configuration pairs used for automated reference\-policy generation\. *None* indicates that no non\-default hyperparameters were used, i\.e\., the corresponding Stable\-Baselines3 or SB3\-Contrib library defaults are used\. Candidate configurations are used only when compatible with the environment observation and action spaces\.
## Acknowledgments
We thank the NeoInstinct developers for their valuable technical contributions to the development of our environment orchestration framework and to the collection and integration of experimental environments\. Their work was essential to the successful development and execution of this study\. We also thank our NeoInstinct colleagues for helpful discussions that contributed to the development of this work\.
This work was funded by NeoInstinct SA\.
## References
- Agarwal et al\. \(2023\) Pranav Agarwal, Aamer Abdul Rahman, Pierre\-Luc St\-Charles, Simon J\. D\. Prince, and Samira Ebrahimi Kahou\. Transformers in reinforcement learning: a survey\. *arXiv preprint arXiv:2307\.05979*, 2023\.
- Beck et al\. \(2023\) Jacob Beck, Risto Vuorio, Evan Zheran Liu, Zheng Xiong, Luisa Zintgraf, Chelsea Finn, and Shimon Whiteson\. A survey of meta\-reinforcement learning\. *arXiv preprint arXiv:2301\.08028*, 2023\.
- Brown et al\. \(2020\) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D\. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al\. Language models are few\-shot learners\. *Advances in Neural Information Processing Systems*, 33:1877–1901, 2020\.
- Chen et al\. \(2021\) Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch\. Decision transformer: Reinforcement learning via sequence modeling\. *Advances in Neural Information Processing Systems*, 34:15084–15097, 2021\.
- Duan et al\. \(2016\) Yan Duan, John Schulman, Xi Chen, Peter L\. Bartlett, Ilya Sutskever, and Pieter Abbeel\. RL$^2$: Fast reinforcement learning via slow reinforcement learning\. *arXiv preprint arXiv:1611\.02779*, 2016\.
- Finn et al\. \(2017\) Chelsea Finn, Pieter Abbeel, and Sergey Levine\. Model\-agnostic meta\-learning for fast adaptation of deep networks\. In *International Conference on Machine Learning*, pages 1126–1135\. PMLR, 2017\.
- Gallouédec et al\. \(2024\) Quentin Gallouédec, Edward Beeching, Clément Romac, and Emmanuel Dellandréa\. Jack of all trades, master of some, a multi\-purpose transformer agent\. *arXiv preprint arXiv:2402\.09844*, 2024\.
- Grigsby et al\. \(2023\) Jake Grigsby, Linxi Fan, and Yuke Zhu\. Amago: Scalable in\-context reinforcement learning for adaptive agents\. *arXiv preprint arXiv:2310\.09971*, 2023\.
- Grigsby et al\. \(2024\) Jake Grigsby, Justin Sasek, Samyak Parajuli, Daniel Adebi, Amy Zhang, and Yuke Zhu\. Amago\-2: Breaking the multi\-task barrier in meta\-reinforcement learning with transformers\. *Advances in Neural Information Processing Systems*, 37:87473–87508, 2024\.
- Janner et al\. \(2021\) Michael Janner, Qiyang Li, and Sergey Levine\. Offline reinforcement learning as one big sequence modeling problem\. *Advances in Neural Information Processing Systems*, 34:1273–1286, 2021\.
- \(11\) Akarsh Kumar, Chris Lu, Louis Kirsch, and Phillip Isola\. Learning in\-context decision making with synthetic MDPs\. In *Automated Reinforcement Learning: Exploring Meta\-Learning, AutoML, and LLMs*\.
- Laskin et al\. \(2023\) Michael Laskin, Luyu Wang, Junhyuk Oh, Emilio Parisotto, Stephen Spencer, Richie Steigerwald, D\. J\. Strouse, Steven Hansen, Angelos Filos, Ethan Brooks, et al\. In\-context reinforcement learning with algorithm distillation\. *International Conference on Learning Representations*, 2023\.
- Lee et al\. \(2023\) Jonathan N\. Lee, Annie Xie, Aldo Pacchiano, Yash Chandak, Chelsea Finn, Ofir Nachum, and Emma Brunskill\. Supervised pretraining can learn in\-context reinforcement learning\. *arXiv preprint arXiv:2306\.14892*, 2023\.
- Nie et al\. \(2022\) Allen Nie, Yannis Flet\-Berliac, Deon Jordan, William Steenbergen, and Emma Brunskill\. Data\-efficient pipeline for offline reinforcement learning with limited data\. *Advances in Neural Information Processing Systems*, 35:14810–14823, 2022\.
- Petrov et al\. \(2024\) Vladimir Petrov, Nikhil Vyas, and Lucas Janson\. Transformers can reinforcement learn to approximate Gittins index\. In *NeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning*, 2024\.
- Raparthy et al\. \(2023\) Sharath Chandra Raparthy, Eric Hambro, Robert Kirk, Mikael Henaff, and Roberta Raileanu\. Generalization to new sequential decision making tasks with in\-context learning\. *arXiv preprint arXiv:2312\.03801*, 2023\.
- Reed et al\. \(2022\) Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth\-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al\. A generalist agent\. *arXiv preprint arXiv:2205\.06175*, 2022\.
- Sridhar et al\. \(2024\) Kaustubh Sridhar, Souradeep Dutta, Dinesh Jayaraman, and Insup Lee\. REGENT: A retrieval\-augmented generalist agent that can act in\-context in new environments\. *arXiv preprint arXiv:2412\.04759*, 2024\.
- Team et al\. \(2023\) Adaptive Agent Team, Jakob Bauer, Kate Baumli, Satinder Baveja, Feryal Behbahani, Avishkar Bhoopchand, Nathalie Bradley\-Schmieg, Michael Chang, Natalie Clay, Adrian Collister, et al\. Human\-timescale adaptation in an open\-ended task space\. *arXiv preprint arXiv:2301\.07608*, 2023\.
- Vaswani et al\. \(2017\) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N\. Gomez, Łukasz Kaiser, and Illia Polosukhin\. Attention is all you need\. *Advances in Neural Information Processing Systems*, 30, 2017\.
- Wang et al\. \(2024\) Hanzhao Wang, Yu Pan, Fupeng Sun, Shang Liu, Kalyan Talluri, Guanting Chen, and Xiaocheng Li\. Understanding the training and generalization of pretrained transformer for sequential decision making\. *arXiv preprint arXiv:2405\.14219*, 2024\.