Learning from human preferences

OpenAI Blog Papers

Summary

OpenAI presents a method for training AI agents using human preference feedback, where an agent learns a reward function from human comparisons of behavior trajectories and uses reinforcement learning to optimize for the inferred goal. The approach demonstrates strong sample efficiency, requiring less than 1000 bits of human feedback to train an agent to perform a backflip.

One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behavior. In collaboration with DeepMind’s safety team, we’ve developed an algorithm which can infer what humans want by being told which of two proposed behaviors is better.

# Learning from human preferences

Source: [https://openai.com/index/learning-from-human-preferences/](https://openai.com/index/learning-from-human-preferences/)

Our AI agent starts by acting randomly in the environment. Periodically, two video clips of its behavior are given to a human, and the human decides which of the two clips is closest to fulfilling its goal—in this case, a backflip. The AI gradually builds a model of the goal of the task by finding the reward function that best explains the human's judgments. It then uses RL to learn how to achieve that goal. As its behavior improves, it continues to ask for human feedback on trajectory pairs where it's most uncertain about which is better, and further refines its understanding of the goal.

Our approach demonstrates promising sample efficiency—as stated previously, the backflip video required under 1000 bits of human feedback. It took less than an hour of a human evaluator's time, while in the background the policy accumulated about 70 hours of overall experience (simulated at a much faster rate than real time). We will continue to work on reducing the amount of feedback a human needs to supply. You can see a sped-up version of the training process in the following video.
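
The post itself contains no code, but the reward-learning step it describes can be sketched: a small network scores every timestep of a clip, the per-clip scores are summed, and a Bradley-Terry style cross-entropy loss pushes the human-preferred clip toward a higher total predicted reward. The snippet below is a minimal PyTorch sketch under assumed names and sizes (`RewardNet`, `OBS_ACT_DIM`, `SEG_LEN` are illustrative, not the actual implementation).

```python
# Illustrative sketch (not OpenAI's code): fit a reward model to pairwise
# human preferences over trajectory segments, as described in the post.
import torch
import torch.nn as nn

OBS_ACT_DIM = 16   # assumed size of a concatenated observation+action vector
SEG_LEN = 25       # assumed number of timesteps per video-clip segment

class RewardNet(nn.Module):
    """Predicts a per-timestep reward r(o_t, a_t)."""
    def __init__(self, dim=OBS_ACT_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, segment):             # segment: (SEG_LEN, OBS_ACT_DIM)
        return self.net(segment).sum()      # total predicted reward of the clip

def preference_loss(model, seg_a, seg_b, human_prefers_a):
    """Bradley-Terry style loss: the clip the human preferred should
    receive the higher total predicted reward."""
    logits = torch.stack([model(seg_a), model(seg_b)])
    target = torch.tensor(0 if human_prefers_a else 1)
    return nn.functional.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

# Toy training step on a single labelled comparison.
model = RewardNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seg_a = torch.randn(SEG_LEN, OBS_ACT_DIM)
seg_b = torch.randn(SEG_LEN, OBS_ACT_DIM)
loss = preference_loss(model, seg_a, seg_b, human_prefers_a=True)
opt.zero_grad()
loss.backward()
opt.step()
```

In the full loop described by the post, the learned reward stands in for an environment reward when running the RL algorithm, and new clip pairs are sent to the human where the model is most uncertain about which clip it would prefer.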

Similar Articles

Gathering human feedback

OpenAI Blog

OpenAI releases RL-Teacher, an open-source tool for training AI systems through human feedback instead of hand-crafted reward functions, with applications to safe AI development and complex reinforcement learning problems.

Learning to summarize with human feedback

OpenAI Blog

OpenAI demonstrates a technique for improving language model summarization by training a reward model on human preferences and fine-tuning models with reinforcement learning, achieving significant quality improvements that generalize across datasets. This work advances model alignment through human feedback at scale, with applications beyond summarization.

Less human AI agents, please

Hacker News Top

A blog post argues that current AI agents exhibit overly human-like flaws such as ignoring hard constraints, taking shortcuts, and reframing unilateral pivots as communication failures, while citing Anthropic research on how RLHF optimization can lead to sycophancy and sacrifices in truthfulness.

Learning Montezuma’s Revenge from a single demonstration

OpenAI Blog

OpenAI demonstrates a method for training a reinforcement learning agent to play Montezuma's Revenge from a single human demonstration, addressing the challenge of sparse rewards through curriculum learning and careful hyperparameter tuning. The approach achieves strong performance on the notoriously difficult Atari game while showing generalization limitations on other titles.

Reinforcement learning with prediction-based rewards

OpenAI Blog

OpenAI introduces Random Network Distillation (RND), a prediction-based method for encouraging exploration in RL agents through curiosity, achieving human-level performance on Montezuma's Revenge without demonstrations or game state access.