RL²: Fast reinforcement learning via slow reinforcement learning

OpenAI Blog Papers

Summary

RL² proposes encoding a fast reinforcement learning algorithm in the weights of a recurrent neural network, which are themselves learned through a slow, general-purpose RL algorithm, enabling agents to adapt to new tasks in just a few trials, much as animals do. The method performs well on both small-scale bandit problems and a large-scale vision-based navigation task.

# RL²: Fast reinforcement learning via slow reinforcement learning

Source: [https://openai.com/index/rl2/](https://openai.com/index/rl2/)

## Abstract

Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL², the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm. The RNN receives all information a typical RL algorithm would receive, including observations, actions, rewards, and termination flags; and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the "fast" RL algorithm on the current (previously unseen) MDP. We evaluate RL² experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-arm bandit problems and finite MDPs. After RL² is trained, its performance on new MDPs is close to human-designed algorithms with optimality guarantees. On the large-scale side, we test RL² on a vision-based navigation task and show that it scales up to high-dimensional problems.
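The mechanism in the abstract is straightforward to sketch: the policy is an RNN whose per-step input concatenates the current observation with the previous action, reward, and termination flag, and whose hidden state is reset only when a new MDP is sampled, not at episode boundaries. Below is a minimal illustrative sketch in PyTorch under those assumptions; the names (`RL2Policy`, `obs_dim`, `n_actions`, `hidden_dim`) are ours rather than the paper's, and the slow outer loop that trains these weights with a general-purpose policy-gradient method is omitted.

```python
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    """Minimal sketch of an RL^2-style recurrent policy.

    The GRU input at each step is (observation, one-hot previous action,
    previous reward, previous done flag). The hidden state persists across
    episode boundaries within one sampled MDP, so its activations can serve
    as the state of a learned "fast" RL algorithm.
    """

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 128):
        super().__init__()
        input_dim = obs_dim + n_actions + 2  # obs + prev action + reward + done
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, n_actions)
        self.n_actions = n_actions
        self.hidden_dim = hidden_dim

    def initial_state(self, batch_size: int) -> torch.Tensor:
        # Called once per sampled MDP, NOT once per episode.
        return torch.zeros(1, batch_size, self.hidden_dim)

    def forward(self, obs, prev_action, prev_reward, prev_done, hidden):
        # obs: (batch, obs_dim); prev_action: (batch,) int64
        # prev_reward, prev_done: (batch,) floats
        action_onehot = nn.functional.one_hot(prev_action, self.n_actions).float()
        x = torch.cat(
            [obs, action_onehot, prev_reward.unsqueeze(-1), prev_done.unsqueeze(-1)],
            dim=-1,
        ).unsqueeze(1)  # add a time dimension of length 1
        out, hidden = self.gru(x, hidden)
        logits = self.policy_head(out.squeeze(1))
        return torch.distributions.Categorical(logits=logits), hidden
```

Because the hidden state survives episode resets, the network can, for example, keep running statistics of which bandit arms paid off in earlier episodes and exploit them later; maximizing cumulative reward across all episodes of each sampled MDP is what pressures the weights to encode such an exploration strategy.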

Similar Articles

EasyVideoR1: Easier RL for Video Understanding

Hugging Face Daily Papers

EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks. It features offline preprocessing with tensor caching for a 1.47x throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.

Generalizing from simulation

OpenAI Blog

OpenAI describes the challenges of applying conventional RL to robotics tasks and introduces Hindsight Experience Replay (HER), a new RL algorithm that lets agents learn from binary rewards by relabeling failed trajectories as successes for the goals they actually reached, combined with domain randomization for sim-to-real transfer.

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

Hugging Face Daily Papers

RAD-2 presents a unified generator-discriminator framework for autonomous driving that combines diffusion-based trajectory generation with RL-optimized reranking, achieving a 56% reduction in collision rate compared to diffusion-based planners. The approach introduces techniques such as Temporally Consistent Group Relative Policy Optimization and a BEV-Warp simulation environment for efficient large-scale training.

Reinforcement learning with prediction-based rewards

OpenAI Blog

OpenAI introduces Random Network Distillation (RND), a prediction-based method for encouraging exploration in RL agents through curiosity, achieving human-level performance on Montezuma's Revenge without demonstrations or game state access.
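The RND bonus described above contrasts with RL²'s learned exploration and is simple to sketch: a fixed, randomly initialized target network embeds each observation, a predictor network is trained to match that embedding, and the prediction error serves as an intrinsic curiosity reward that stays high on rarely visited states. A minimal, hypothetical sketch (class and parameter names are ours, not from the OpenAI release):

```python
import torch
import torch.nn as nn

def make_embed_net(obs_dim: int, embed_dim: int = 64) -> nn.Module:
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))

class RNDBonus:
    """Sketch of a Random-Network-Distillation intrinsic reward.

    `target` is randomly initialized and frozen; `predictor` is trained to
    match its output. States the predictor has rarely seen yield large
    errors, which act as a curiosity bonus added to the extrinsic reward.
    """

    def __init__(self, obs_dim: int):
        self.target = make_embed_net(obs_dim)
        for p in self.target.parameters():
            p.requires_grad_(False)  # the target stays fixed forever
        self.predictor = make_embed_net(obs_dim)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=1e-4)

    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        # Per-state bonus: squared error between predictor and frozen target.
        with torch.no_grad():
            target_feat = self.target(obs)
        error = (self.predictor(obs) - target_feat).pow(2).mean(dim=-1)
        self.opt.zero_grad()
        error.mean().backward()  # training the predictor shrinks the bonus
        self.opt.step()          # on states the agent keeps revisiting
        return error.detach()
```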