Ingredients for robotics research

OpenAI Blog Papers

Summary

OpenAI presents Hindsight Experience Replay (HER), a reinforcement learning technique that enables robots to learn from failed attempts by retroactively treating achieved alternative outcomes as successful goals, allowing learning even with sparse reward signals.

We’re releasing eight simulated robotics environments and a Baselines implementation of Hindsight Experience Replay, all developed for our research over the past year. We’ve used these environments to train models which work on physical robots. We’re also releasing a set of requests for robotics research.


Source: [https://openai.com/index/ingredients-for-robotics-research/](https://openai.com/index/ingredients-for-robotics-research/)

To understand what HER does, let's look at it in the context of [FetchSlide](https://gym.openai.com/envs/FetchSlide-v0), a task where we need to learn to slide a puck across the table and hit a target. Our first attempt very likely will not be successful. Unless we get very lucky, the next few attempts will likely not succeed either. Typical reinforcement learning algorithms would not learn anything from this experience, since they just obtain a constant reward (in this case: `-1`) that contains no learning signal.

The key insight that HER formalizes is what humans do intuitively: even though we have not succeeded at a specific goal, we have at least achieved a different one. So why not pretend that we wanted to achieve this goal to begin with, instead of the one we originally set out to achieve? By making this substitution, the reinforcement learning algorithm obtains a learning signal, since it has achieved *some* goal, even if it wasn't the one we originally intended. If we repeat this process, we will eventually learn how to achieve arbitrary goals, including the goals we really want to achieve.

This approach lets us learn to slide a puck across the table even though our reward is fully sparse, and even though we may never have actually hit the desired goal early on. We call this technique Hindsight Experience Replay because it replays experience (a technique often used in off-policy RL algorithms such as [DQN](https://openai.com/index/openai-baselines-dqn/) and [DDPG](https://arxiv.org/abs/1509.02971)) with goals that are chosen in hindsight, after the episode has finished.
HER can therefore be combined with any off-policy RL algorithm (for example, HER can be combined with DDPG, which we write as "DDPG + HER").
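The goal-substitution idea above can be sketched in a few lines. This is a minimal illustration, not the Baselines implementation: the function names, the dictionary-based transition format, and the use of the "future" goal-sampling strategy with `k` hindsight copies per step are all assumptions made for clarity.

```python
import random

def sparse_reward(achieved_goal, goal):
    """Sparse signal as in the article: -1 unless the desired goal was hit."""
    return 0.0 if achieved_goal == goal else -1.0

def relabel_episode(episode, k=4):
    """Hindsight relabeling sketch (illustrative, not the Baselines API).

    episode: list of steps, each a dict with 'obs', 'action',
             'achieved_goal', and the originally desired 'goal'.
    Returns the original transitions plus k hindsight copies per step,
    where the desired goal is replaced by a goal actually achieved at
    this step or later in the episode (the 'future' strategy).
    """
    transitions = []
    for t, step in enumerate(episode):
        # Original transition: with a sparse reward this is usually -1.
        transitions.append(
            {**step, "reward": sparse_reward(step["achieved_goal"], step["goal"])}
        )
        # Hindsight copies: pretend we wanted a goal we actually achieved.
        for _ in range(k):
            future = random.choice(episode[t:])
            new_goal = future["achieved_goal"]
            transitions.append(
                {**step, "goal": new_goal,
                 "reward": sparse_reward(step["achieved_goal"], new_goal)}
            )
    return transitions
```

Because each step's own achieved goal is a valid relabeling target, some hindsight transitions always carry a reward of `0`, giving the off-policy learner a signal even when the puck never reached the real target. Any off-policy algorithm that consumes `(obs, action, goal, reward)` transitions from a replay buffer can train on this relabeled data, which is why HER composes with DDPG or DQN.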

Similar Articles

Hindsight Experience Replay

OpenAI Blog

OpenAI presents Hindsight Experience Replay (HER), a technique enabling sample-efficient reinforcement learning from sparse binary rewards without complex reward engineering. It is demonstrated on robotic arm manipulation tasks including pushing, sliding, and pick-and-place, and validated on physical robots.

Generalizing from simulation

OpenAI Blog

OpenAI describes challenges with conventional RL on robotics tasks and introduces Hindsight Experience Replay (HER), a new RL algorithm that enables agents to learn from binary rewards by reframing failures as intended outcomes, combined with domain randomization for sim-to-real transfer.

Learning from human preferences

OpenAI Blog

OpenAI presents a method for training AI agents using human preference feedback, where an agent learns reward functions from human comparisons of behavior trajectories and uses reinforcement learning to optimize for the inferred goals. The approach demonstrates strong sample efficiency, requiring less than 1000 bits of human feedback to train an agent to perform a backflip.

Robots that learn

OpenAI Blog

OpenAI describes a robot learning system powered by two neural networks — a vision network trained on simulated images and an imitation network that generalizes task demonstrations to new configurations. The system is applied to block-stacking tasks, learning to infer and replicate task intent from paired demonstration examples.