Generalizing from simulation

OpenAI Blog

Summary

OpenAI describes why conventional RL struggles on robotics tasks with sparse rewards and introduces Hindsight Experience Replay (HER), a new RL algorithm that lets agents learn from binary rewards by retroactively treating failed attempts as successes at the goals they actually achieved, combined with domain randomization for sim-to-real transfer.

Our latest robotics techniques allow robot controllers, trained entirely in simulation and deployed on physical robots, to react to unplanned changes in the environment as they solve simple tasks. In other words, we’ve used these techniques to build closed-loop systems rather than the open-loop ones we built before.

Source: [https://openai.com/index/generalizing-from-simulation/](https://openai.com/index/generalizing-from-simulation/)

The abundance of RL results with simulated robots can make it seem like RL easily solves most robotics tasks. But common RL algorithms work well only on tasks where small perturbations to your action produce an incremental change in the reward. Some robotics tasks have simple rewards, like walking, where you can be scored on distance traveled. But most tasks [do not](https://openai.com/index/learning-from-human-preferences/): to define a dense reward for block stacking, you’d need to encode that the arm is close to the block, that the arm approaches the block in the correct orientation, that the block is lifted off the ground, the distance of the block to the desired position, and so on.

We spent a number of months unsuccessfully trying to get conventional RL algorithms working on pick-and-place tasks before ultimately developing a new reinforcement learning algorithm, [Hindsight Experience Replay](https://arxiv.org/pdf/1707.01495.pdf) (HER), which allows agents to learn from a binary reward by pretending that a failure was what they wanted to do all along and learning from it accordingly. (By analogy, imagine looking for a gas station but ending up at a pizza shop. You still don’t know where to get gas, but you’ve now learned where to get pizza.) We also used [domain randomization](https://openai.com/index/spam-detection-in-the-physical-world/) on the visual shapes to learn a vision system robust enough for the physical world.

Our HER implementation uses the actor-critic technique with asymmetric information. (The *actor* is the policy, and the *critic* is a network that receives action/state pairs and estimates their Q-value, or sum of future rewards, providing a training signal to the actor.) While the critic has access to the full state of the simulator, the actor sees only RGB and depth data. Thus the critic can provide fully accurate feedback, while the actor uses only data that is available in the real world.
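To make the relabeling idea concrete, here is a minimal Python sketch of the "future" goal-relabeling strategy described in the HER paper. The transition keys (`obs`, `goal`, `achieved_goal`) and the distance tolerance in `reward_fn` are illustrative assumptions, not OpenAI's actual implementation:

```python
import random

import numpy as np


def reward_fn(achieved_goal, goal, tol=0.05):
    """Binary reward: 0 on success (within tol of the goal), -1 otherwise."""
    return 0.0 if np.linalg.norm(np.asarray(achieved_goal) - np.asarray(goal)) < tol else -1.0


def her_relabel(episode, replay_buffer, k=4):
    """Hindsight relabeling (a sketch of the idea, not OpenAI's code).

    `episode` is a list of transition dicts with hypothetical keys:
    'obs', 'action', 'next_obs', 'goal', 'reward', and 'achieved_goal'
    (the outcome actually reached after the action, e.g. where the
    block ended up).
    """
    for t, transition in enumerate(episode):
        # Store the original transition with the goal the agent was given.
        replay_buffer.append(dict(transition))

        # Additionally store k copies whose goal is replaced by an outcome
        # actually achieved later in the same episode. A failure at the
        # original goal becomes a success at the substituted one, so the
        # agent receives useful reward signal even from failed episodes.
        future_steps = episode[t:]
        for _ in range(k):
            future = random.choice(future_steps)
            relabeled = dict(transition)
            relabeled["goal"] = future["achieved_goal"]
            relabeled["reward"] = reward_fn(
                transition["achieved_goal"], relabeled["goal"]
            )
            replay_buffer.append(relabeled)
```

This is the "gas station vs. pizza shop" trick in code: the failed trajectory is stored a second time as a successful trajectory toward wherever the agent actually ended up.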
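Domain randomization amounts to resampling the visual properties of the simulated scene every episode, so the vision system cannot overfit to any one appearance. The sketch below shows the pattern; `sim` and its attributes stand in for a hypothetical simulator API, and the ranges are illustrative:

```python
import numpy as np


def randomize_scene(sim, rng):
    """Domain randomization sketch: resample appearance each episode.

    `sim` is a hypothetical simulator handle; the attribute names are
    illustrative. A vision model trained across many random appearances
    must rely on cues that also hold in the one real-world appearance.
    """
    # Random colors for every object in the scene.
    for obj in sim.objects:
        obj.color = rng.uniform(0.0, 1.0, size=3)
    # Random lighting direction and intensity.
    sim.light.direction = rng.normal(size=3)
    sim.light.intensity = rng.uniform(0.5, 1.5)
    # Small perturbations of camera pose and field of view.
    sim.camera.position += rng.normal(scale=0.01, size=3)
    sim.camera.fov = rng.uniform(40.0, 50.0)


# Usage: call once per episode with a seeded generator.
# rng = np.random.default_rng(0)
# randomize_scene(sim, rng)
```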
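The asymmetry between actor and critic can be made concrete with a short PyTorch sketch. The network sizes and the 4-channel RGB-D encoding are assumptions for illustration, not the architecture from the paper:

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Policy network: sees only RGB-D images, i.e. the observations
    that will also be available on the physical robot."""

    def __init__(self, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),  # 4 channels = RGB + depth
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, rgbd):
        return self.net(rgbd)


class Critic(nn.Module):
    """Q-network: sees the full simulator state (object poses,
    velocities), which exists only in simulation, plus the action,
    and estimates the Q-value that trains the actor."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, full_state, action):
        return self.net(torch.cat([full_state, action], dim=-1))
```

Only the actor needs to run on the physical robot; the critic, which depends on simulator-only state, exists purely to provide accurate training signal and is discarded at deployment.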

Similar Articles

Ingredients for robotics research

OpenAI Blog

OpenAI presents Hindsight Experience Replay (HER), a reinforcement learning technique that enables robots to learn from failed attempts by retroactively treating achieved alternative outcomes as successful goals, allowing learning even with sparse reward signals.

Hindsight Experience Replay

OpenAI Blog

OpenAI presents Hindsight Experience Replay (HER), a technique enabling sample-efficient reinforcement learning from sparse binary rewards without complex reward engineering. It is demonstrated on robotic arm manipulation tasks including pushing, sliding, and pick-and-place, and validated on physical robots.

Sim-to-real transfer of robotic control with dynamics randomization

OpenAI Blog

OpenAI researchers demonstrate a method to bridge the reality gap in robotic control by training policies with randomized simulator dynamics, enabling robots trained purely in simulation to successfully transfer to real-world tasks like object manipulation without physical training.

Reinforcement learning with prediction-based rewards

OpenAI Blog

OpenAI introduces Random Network Distillation (RND), a prediction-based method for encouraging exploration in RL agents through curiosity, achieving human-level performance on Montezuma's Revenge without demonstrations or game state access.