Learning from human preferences

OpenAI Blog Papers

Summary

OpenAI presents a method for training AI agents using human preference feedback, where an agent learns a reward function from human comparisons of behavior trajectories and uses reinforcement learning to optimize for the inferred goal. The approach demonstrates strong sample efficiency, requiring less than 1000 bits of human feedback to train an agent to perform a backflip.

One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behavior. In collaboration with DeepMind’s safety team, we’ve developed an algorithm which can infer what humans want by being told which of two proposed behaviors is better.

# Learning from human preferences

Source: [https://openai.com/index/learning-from-human-preferences/](https://openai.com/index/learning-from-human-preferences/)

Our AI agent starts by acting randomly in the environment. Periodically, two video clips of its behavior are given to a human, and the human decides which of the two clips is closest to fulfilling its goal—in this case, a backflip. The AI gradually builds a model of the goal of the task by finding the reward function that best explains the human's judgments. It then uses RL to learn how to achieve that goal. As its behavior improves, it continues to ask for human feedback on trajectory pairs where it's most uncertain about which is better, and further refines its understanding of the goal.

Our approach demonstrates promising sample efficiency—as stated previously, the backflip video required under 1000 bits of human feedback. It took less than an hour of a human evaluator's time, while in the background the policy accumulated about 70 hours of overall experience (simulated at a much faster rate than real time). We will continue to work on reducing the amount of feedback a human needs to supply. You can see a sped-up version of the training process in the following video.
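
The post itself contains no code, but the reward-learning step it describes can be sketched: a small network scores every timestep of a clip, the per-clip scores are summed, and a Bradley-Terry style cross-entropy loss pushes the human-preferred clip toward a higher total predicted reward. The snippet below is a minimal PyTorch sketch under assumed names and sizes (`RewardNet`, `OBS_ACT_DIM`, `SEG_LEN` are illustrative, not the actual implementation).

```python
# Illustrative sketch (not OpenAI's code): fit a reward model to pairwise
# human preferences over trajectory segments, as described in the post.
import torch
import torch.nn as nn

OBS_ACT_DIM = 16   # assumed size of a concatenated observation+action vector
SEG_LEN = 25       # assumed number of timesteps per video-clip segment

class RewardNet(nn.Module):
    """Predicts a per-timestep reward r(o_t, a_t)."""
    def __init__(self, dim=OBS_ACT_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, segment):             # segment: (SEG_LEN, OBS_ACT_DIM)
        return self.net(segment).sum()      # total predicted reward of the clip

def preference_loss(model, seg_a, seg_b, human_prefers_a):
    """Bradley-Terry style loss: the clip the human preferred should
    receive the higher total predicted reward."""
    logits = torch.stack([model(seg_a), model(seg_b)])
    target = torch.tensor(0 if human_prefers_a else 1)
    return nn.functional.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

# Toy training step on a single labelled comparison.
model = RewardNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seg_a = torch.randn(SEG_LEN, OBS_ACT_DIM)
seg_b = torch.randn(SEG_LEN, OBS_ACT_DIM)
loss = preference_loss(model, seg_a, seg_b, human_prefers_a=True)
opt.zero_grad()
loss.backward()
opt.step()
```

In the full loop described by the post, the learned reward stands in for an environment reward when running the RL algorithm, and new clip pairs are sent to the human where the model is most uncertain about which clip it would prefer.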

Similar Articles

Gathering human feedback

OpenAI Blog

OpenAI releases RL-Teacher, an open-source tool for training AI systems through human feedback instead of hand-crafted reward functions, with applications to safe AI development and complex reinforcement learning problems.

Learning to summarize with human feedback

OpenAI Blog

OpenAI demonstrates a technique for improving language model summarization by training a reward model on human preferences and fine-tuning models with reinforcement learning, achieving significant quality improvements that generalize across datasets. This work advances model alignment through human feedback at scale, with applications beyond summarization.

Less human AI agents, please

Hacker News Top

A blog post argues that current AI agents exhibit overly human-like flaws such as ignoring hard constraints, taking shortcuts, and reframing unilateral pivots as communication failures, while citing Anthropic research on how RLHF optimization can lead to sycophancy and sacrifices in truthfulness.

Learning Montezuma’s Revenge from a single demonstration

OpenAI Blog

OpenAI demonstrates a method for training a reinforcement learning agent to play Montezuma's Revenge from a single human demonstration, addressing the challenge of sparse rewards through curriculum learning and careful hyperparameter tuning. The approach achieves strong performance on the notoriously difficult Atari game while showing generalization limitations on other titles.

Reinforcement learning with prediction-based rewards

OpenAI Blog

OpenAI introduces Random Network Distillation (RND), a prediction-based method for encouraging exploration in RL agents through curiosity, achieving human-level performance on Montezuma's Revenge without demonstrations or game state access.