Generalizing from simulation

OpenAI Blog

Summary

OpenAI describes why conventional RL struggles on robotics tasks with sparse rewards and introduces Hindsight Experience Replay (HER), a new RL algorithm that lets agents learn from binary rewards by retroactively treating failed attempts as successes at the goals they actually achieved, combined with domain randomization for sim-to-real transfer.

Our latest robotics techniques allow robot controllers, trained entirely in simulation and deployed on physical robots, to react to unplanned changes in the environment as they solve simple tasks. In other words, we’ve used these techniques to build closed-loop systems rather than the open-loop ones we built before.

Source: [https://openai.com/index/generalizing-from-simulation/](https://openai.com/index/generalizing-from-simulation/)

The abundance of RL results with simulated robots can make it seem like RL easily solves most robotics tasks. But common RL algorithms work well only on tasks where small perturbations to your action produce an incremental change in the reward. Some robotics tasks have simple rewards, like walking, where you can be scored on distance traveled. But most tasks [do not](https://openai.com/index/learning-from-human-preferences/): to define a dense reward for block stacking, you’d need to encode that the arm is close to the block, that the arm approaches the block in the correct orientation, that the block is lifted off the ground, the distance of the block to the desired position, and so on.

We spent a number of months unsuccessfully trying to get conventional RL algorithms working on pick-and-place tasks before ultimately developing a new reinforcement learning algorithm, [Hindsight Experience Replay](https://arxiv.org/pdf/1707.01495.pdf) (HER), which allows agents to learn from a binary reward by pretending that a failure was what they wanted to do all along and learning from it accordingly. (By analogy, imagine looking for a gas station but ending up at a pizza shop. You still don’t know where to get gas, but you’ve now learned where to get pizza.) We also used [domain randomization](https://openai.com/index/spam-detection-in-the-physical-world/) on the visual shapes to learn a vision system robust enough for the physical world.

Our HER implementation uses the actor-critic technique with asymmetric information. (The *actor* is the policy, and the *critic* is a network that receives action/state pairs and estimates their Q-value, or sum of future rewards, providing a training signal to the actor.) While the critic has access to the full state of the simulator, the actor sees only RGB and depth data. Thus the critic can provide fully accurate feedback, while the actor uses only data that is available in the real world.
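To make the relabeling idea concrete, here is a minimal Python sketch of the "future" goal-relabeling strategy described in the HER paper. The transition keys (`obs`, `goal`, `achieved_goal`) and the distance tolerance in `reward_fn` are illustrative assumptions, not OpenAI's actual implementation:

```python
import random

import numpy as np


def reward_fn(achieved_goal, goal, tol=0.05):
    """Binary reward: 0 on success (within tol of the goal), -1 otherwise."""
    return 0.0 if np.linalg.norm(np.asarray(achieved_goal) - np.asarray(goal)) < tol else -1.0


def her_relabel(episode, replay_buffer, k=4):
    """Hindsight relabeling (a sketch of the idea, not OpenAI's code).

    `episode` is a list of transition dicts with hypothetical keys:
    'obs', 'action', 'next_obs', 'goal', 'reward', and 'achieved_goal'
    (the outcome actually reached after the action, e.g. where the
    block ended up).
    """
    for t, transition in enumerate(episode):
        # Store the original transition with the goal the agent was given.
        replay_buffer.append(dict(transition))

        # Additionally store k copies whose goal is replaced by an outcome
        # actually achieved later in the same episode. A failure at the
        # original goal becomes a success at the substituted one, so the
        # agent receives useful reward signal even from failed episodes.
        future_steps = episode[t:]
        for _ in range(k):
            future = random.choice(future_steps)
            relabeled = dict(transition)
            relabeled["goal"] = future["achieved_goal"]
            relabeled["reward"] = reward_fn(
                transition["achieved_goal"], relabeled["goal"]
            )
            replay_buffer.append(relabeled)
```

This is the "gas station vs. pizza shop" trick in code: the failed trajectory is stored a second time as a successful trajectory toward wherever the agent actually ended up.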
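Domain randomization amounts to resampling the visual properties of the simulated scene every episode, so the vision system cannot overfit to any one appearance. The sketch below shows the pattern; `sim` and its attributes stand in for a hypothetical simulator API, and the ranges are illustrative:

```python
import numpy as np


def randomize_scene(sim, rng):
    """Domain randomization sketch: resample appearance each episode.

    `sim` is a hypothetical simulator handle; the attribute names are
    illustrative. A vision model trained across many random appearances
    must rely on cues that also hold in the one real-world appearance.
    """
    # Random colors for every object in the scene.
    for obj in sim.objects:
        obj.color = rng.uniform(0.0, 1.0, size=3)
    # Random lighting direction and intensity.
    sim.light.direction = rng.normal(size=3)
    sim.light.intensity = rng.uniform(0.5, 1.5)
    # Small perturbations of camera pose and field of view.
    sim.camera.position += rng.normal(scale=0.01, size=3)
    sim.camera.fov = rng.uniform(40.0, 50.0)


# Usage: call once per episode with a seeded generator.
# rng = np.random.default_rng(0)
# randomize_scene(sim, rng)
```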
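The asymmetry between actor and critic can be made concrete with a short PyTorch sketch. The network sizes and the 4-channel RGB-D encoding are assumptions for illustration, not the architecture from the paper:

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Policy network: sees only RGB-D images, i.e. the observations
    that will also be available on the physical robot."""

    def __init__(self, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),  # 4 channels = RGB + depth
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, rgbd):
        return self.net(rgbd)


class Critic(nn.Module):
    """Q-network: sees the full simulator state (object poses,
    velocities), which exists only in simulation, plus the action,
    and estimates the Q-value that trains the actor."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, full_state, action):
        return self.net(torch.cat([full_state, action], dim=-1))
```

Only the actor needs to run on the physical robot; the critic, which depends on simulator-only state, exists purely to provide accurate training signal and is discarded at deployment.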

Similar Articles

Ingredients for robotics research

OpenAI Blog

OpenAI presents Hindsight Experience Replay (HER), a reinforcement learning technique that enables robots to learn from failed attempts by retroactively treating achieved alternative outcomes as successful goals, allowing learning even with sparse reward signals.

Hindsight Experience Replay

OpenAI Blog

OpenAI presents Hindsight Experience Replay (HER), a technique enabling sample-efficient reinforcement learning from sparse binary rewards without complex reward engineering. It is demonstrated on robotic arm manipulation tasks including pushing, sliding, and pick-and-place, and validated on physical robots.

Sim-to-real transfer of robotic control with dynamics randomization

OpenAI Blog

OpenAI researchers demonstrate a method to bridge the reality gap in robotic control by training policies with randomized simulator dynamics, enabling robots trained purely in simulation to successfully transfer to real-world tasks like object manipulation without physical training.

Reinforcement learning with prediction-based rewards

OpenAI Blog

OpenAI introduces Random Network Distillation (RND), a prediction-based method for encouraging exploration in RL agents through curiosity, achieving human-level performance on Montezuma's Revenge without demonstrations or game state access.