RL²: Fast reinforcement learning via slow reinforcement learning

OpenAI Blog Papers

Summary

RL² proposes encoding a fast reinforcement learning algorithm in the weights of a recurrent neural network, which are themselves learned through a slow, general-purpose RL algorithm, enabling agents to adapt to new tasks in just a few trials, much as animals do. The method performs well on both small-scale bandit problems and a large-scale vision-based navigation task.

# RL²: Fast reinforcement learning via slow reinforcement learning

Source: [https://openai.com/index/rl2/](https://openai.com/index/rl2/)

## Abstract

Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL², the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm. The RNN receives all information a typical RL algorithm would receive, including observations, actions, rewards, and termination flags; and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the "fast" RL algorithm on the current (previously unseen) MDP. We evaluate RL² experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-arm bandit problems and finite MDPs. After RL² is trained, its performance on new MDPs is close to human-designed algorithms with optimality guarantees. On the large-scale side, we test RL² on a vision-based navigation task and show that it scales up to high-dimensional problems.
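The mechanism in the abstract is straightforward to sketch: the policy is an RNN whose per-step input concatenates the current observation with the previous action, reward, and termination flag, and whose hidden state is reset only when a new MDP is sampled, not at episode boundaries. Below is a minimal illustrative sketch in PyTorch under those assumptions; the names (`RL2Policy`, `obs_dim`, `n_actions`, `hidden_dim`) are ours rather than the paper's, and the slow outer loop that trains these weights with a general-purpose policy-gradient method is omitted.

```python
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    """Minimal sketch of an RL^2-style recurrent policy.

    The GRU input at each step is (observation, one-hot previous action,
    previous reward, previous done flag). The hidden state persists across
    episode boundaries within one sampled MDP, so its activations can serve
    as the state of a learned "fast" RL algorithm.
    """

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 128):
        super().__init__()
        input_dim = obs_dim + n_actions + 2  # obs + prev action + reward + done
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, n_actions)
        self.n_actions = n_actions
        self.hidden_dim = hidden_dim

    def initial_state(self, batch_size: int) -> torch.Tensor:
        # Called once per sampled MDP, NOT once per episode.
        return torch.zeros(1, batch_size, self.hidden_dim)

    def forward(self, obs, prev_action, prev_reward, prev_done, hidden):
        # obs: (batch, obs_dim); prev_action: (batch,) int64
        # prev_reward, prev_done: (batch,) floats
        action_onehot = nn.functional.one_hot(prev_action, self.n_actions).float()
        x = torch.cat(
            [obs, action_onehot, prev_reward.unsqueeze(-1), prev_done.unsqueeze(-1)],
            dim=-1,
        ).unsqueeze(1)  # add a time dimension of length 1
        out, hidden = self.gru(x, hidden)
        logits = self.policy_head(out.squeeze(1))
        return torch.distributions.Categorical(logits=logits), hidden
```

Because the hidden state survives episode resets, the network can, for example, keep running statistics of which bandit arms paid off in earlier episodes and exploit them later; maximizing cumulative reward across all episodes of each sampled MDP is what pressures the weights to encode such an exploration strategy.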

Similar Articles

EasyVideoR1: Easier RL for Video Understanding

Hugging Face Daily Papers

EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks. It features offline preprocessing with tensor caching for a 1.47x throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.

Generalizing from simulation

OpenAI Blog

OpenAI describes the challenges of applying conventional RL to robotics tasks and introduces Hindsight Experience Replay (HER), a new RL algorithm that lets agents learn from binary rewards by relabeling failed trajectories as successes for the goals they actually reached, combined with domain randomization for sim-to-real transfer.

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

Hugging Face Daily Papers

RAD-2 presents a unified generator-discriminator framework for autonomous driving that combines diffusion-based trajectory generation with RL-optimized reranking, achieving a 56% reduction in collision rate compared to diffusion-based planners. The approach introduces techniques such as Temporally Consistent Group Relative Policy Optimization and a BEV-Warp simulation environment for efficient large-scale training.

Reinforcement learning with prediction-based rewards

OpenAI Blog

OpenAI introduces Random Network Distillation (RND), a prediction-based method for encouraging exploration in RL agents through curiosity, achieving human-level performance on Montezuma's Revenge without demonstrations or game state access.
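The RND bonus described above contrasts with RL²'s learned exploration and is simple to sketch: a fixed, randomly initialized target network embeds each observation, a predictor network is trained to match that embedding, and the prediction error serves as an intrinsic curiosity reward that stays high on rarely visited states. A minimal, hypothetical sketch (class and parameter names are ours, not from the OpenAI release):

```python
import torch
import torch.nn as nn

def make_embed_net(obs_dim: int, embed_dim: int = 64) -> nn.Module:
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))

class RNDBonus:
    """Sketch of a Random-Network-Distillation intrinsic reward.

    `target` is randomly initialized and frozen; `predictor` is trained to
    match its output. States the predictor has rarely seen yield large
    errors, which act as a curiosity bonus added to the extrinsic reward.
    """

    def __init__(self, obs_dim: int):
        self.target = make_embed_net(obs_dim)
        for p in self.target.parameters():
            p.requires_grad_(False)  # the target stays fixed forever
        self.predictor = make_embed_net(obs_dim)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=1e-4)

    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        # Per-state bonus: squared error between predictor and frozen target.
        with torch.no_grad():
            target_feat = self.target(obs)
        error = (self.predictor(obs) - target_feat).pow(2).mean(dim=-1)
        self.opt.zero_grad()
        error.mean().backward()  # training the predictor shrinks the bonus
        self.opt.step()          # on states the agent keeps revisiting
        return error.detach()
```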