Large-scale study of curiosity-driven learning
Summary
OpenAI presents a large-scale empirical study of curiosity-driven reinforcement learning in which agents are trained without extrinsic rewards across 54 benchmark environments. The study reports strong performance and investigates how the choice of feature space affects prediction-based reward signals.
Cached at: 04/20/26, 02:45 PM
Similar Articles
Reinforcement learning with prediction-based rewards
OpenAI introduces Random Network Distillation (RND), a prediction-based method for encouraging exploration in RL agents through curiosity, achieving human-level performance on Montezuma's Revenge without demonstrations or game state access.
#Exploration: A study of count-based exploration for deep reinforcement learning
OpenAI researchers demonstrate that a simple count-based exploration approach using hash codes can achieve near state-of-the-art performance on high-dimensional deep RL benchmarks, challenging the assumption that count-based methods cannot scale to continuous state spaces.
@OpenAI: As AI takes on longer, higher-stakes tasks, we want models to carry beneficial and safe behavior into new domains beyon…
OpenAI releases research on reinforcement learning for training models to exhibit beneficial traits like honesty and corrigibility, showing that such training generalizes across domains and persists under adversarial pressure.
Reinforcement learning towards broadly and persistently beneficial models (22 minute read)
OpenAI researchers show that reinforcement learning on realistic scenarios targeting beneficial traits (honesty, transparency, corrigibility) produces broad improvements across dozens of alignment benchmarks, with gains generalizing beyond training domains and persisting under adversarial pressure.
Learning from human preferences
OpenAI presents a method for training AI agents from human preference feedback: the agent learns a reward function from human comparisons of behavior trajectories, then uses reinforcement learning to optimize for the inferred goal. The approach is sample-efficient, requiring fewer than 1,000 bits of human feedback to train an agent to perform a backflip.