Tag: #exploration

Cards List

SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

arXiv cs.CL · 2026-04-21

Researchers propose SPS (Steering Probability Squeezing), a training paradigm that combines reinforcement learning with inverse reinforcement learning to counter probability squeezing in LLM reasoning training: probability mass concentrating too narrowly on a few high-reward trajectories, which limits exploration and multi-sample (Pass@k) performance. Experiments on five reasoning benchmarks demonstrate improved exploration and Pass@k metrics.
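
SPS itself isn't reproduced here, but the Pass@k metric it targets has a standard unbiased estimator: given n sampled solutions of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k). A minimal sketch of that metric (not of the SPS method):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples with c correct:
    1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# A "squeezed" policy can look fine at k=1 while multi-sample
# performance saturates; illustrative numbers only:
print(pass_at_k(n=64, c=8, k=1))   # 0.125
print(pass_at_k(n=64, c=8, k=16))  # ~0.91
```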

Plan online, learn offline: Efficient learning and exploration via model-based control

OpenAI Blog · 2018-11-05

OpenAI proposes POLO (Plan Online, Learn Offline), a framework combining model-based control with value function learning and coordinated exploration to enable efficient learning on complex control tasks like humanoid locomotion and dexterous manipulation with minimal real-world experience.
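
A rough sketch of the plan-online half of the idea: short-horizon random-shooting MPC whose rollouts are scored by running reward plus a learned terminal value, re-planned every step. The toy dynamics, reward, and value functions below are hypothetical placeholders, not POLO's models:

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(s, a):   # stand-in for the model used for planning
    return s + 0.1 * a

def reward(s, a):     # stand-in running reward r(s, a)
    return -(s ** 2) - 0.01 * a ** 2

def value(s):         # stand-in for the value function learned offline
    return -(s ** 2)

def mpc_action(s, horizon=5, n_samples=256):
    """Score sampled action sequences by short-horizon return plus a
    terminal value bootstrap; execute only the first action."""
    plans = rng.uniform(-1.0, 1.0, size=(n_samples, horizon))
    returns = np.zeros(n_samples)
    states = np.full(n_samples, s)
    for t in range(horizon):
        returns += reward(states, plans[:, t])
        states = dynamics(states, plans[:, t])
    returns += value(states)  # value extends the effective horizon
    return plans[np.argmax(returns), 0]

s = 2.0
for _ in range(20):
    s = dynamics(s, mpc_action(s))   # re-plan at every step
print(s)  # driven toward the origin
```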

Reinforcement learning with prediction-based rewards

OpenAI Blog · 2018-10-31

OpenAI introduces Random Network Distillation (RND), a prediction-based curiosity method for encouraging exploration in RL agents: the intrinsic reward is the agent's error in predicting the output of a fixed, randomly initialized network, so novel states are rewarding. RND achieves human-level performance on Montezuma's Revenge without demonstrations or access to the underlying game state.
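
The mechanism is compact enough to sketch: a predictor network is trained to match a fixed, randomly initialized target network on visited observations, and the prediction error is the intrinsic reward, high on novel states and shrinking on familiar ones. A minimal PyTorch sketch with assumed toy dimensions:

```python
import torch
import torch.nn as nn

obs_dim, feat_dim = 8, 32  # illustrative sizes

# Fixed, randomly initialized target network (never trained).
target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
for p in target.parameters():
    p.requires_grad_(False)

# Predictor, trained to match the target on states the agent visits.
predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def intrinsic_reward(obs):
    """Prediction error: large on novel observations, small on familiar ones."""
    with torch.no_grad():
        return (predictor(obs) - target(obs)).pow(2).mean(dim=-1)

obs = torch.randn(128, obs_dim)          # a batch of visited observations
print(intrinsic_reward(obs).mean())      # high before any training
loss = (predictor(obs) - target(obs)).pow(2).mean()
opt.zero_grad(); loss.backward(); opt.step()
print(intrinsic_reward(obs).mean())      # typically a bit lower after one update
```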

Some considerations on learning to explore via meta-reinforcement learning

OpenAI Blog · 2018-03-03

OpenAI researchers introduce E-MAML and E-RL², two meta-reinforcement learning algorithms designed to improve exploration in tasks where discovering optimal policies requires significant exploration. The work demonstrates these algorithms' effectiveness on novel environments including Krazy World and maze tasks.
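
The E-MAML and E-RL² objectives involve meta-gradient corrections beyond a short snippet, but the RL² backbone that E-RL² modifies is easy to outline: a recurrent policy fed the previous action, reward, and done flag, whose hidden state persists across episode boundaries within a task. A hypothetical minimal sketch of that backbone (not of the paper's exploration corrections):

```python
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    """RL^2-style recurrent policy: because the GRU state is NOT reset
    between episodes of the same task, rewards seen in early episodes
    can shape behavior in later ones, so exploration strategies become
    learnable by the recurrent dynamics."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        # Input: observation, one-hot previous action, previous reward, done flag.
        self.gru = nn.GRUCell(obs_dim + n_actions + 2, hidden)
        self.pi = nn.Linear(hidden, n_actions)

    def forward(self, obs, prev_a, prev_r, done, h):
        x = torch.cat([obs, prev_a, prev_r, done], dim=-1)
        h = self.gru(x, h)
        return torch.distributions.Categorical(logits=self.pi(h)), h

policy = RL2Policy(obs_dim=4, n_actions=3)
h = torch.zeros(1, 64)   # reset once per task, not once per episode
dist, h = policy(torch.randn(1, 4), torch.zeros(1, 3),
                 torch.zeros(1, 1), torch.zeros(1, 1), h)
print(dist.sample())
```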

Better exploration with parameter noise

OpenAI Blog · 2017-07-27

OpenAI presents parameter noise, a technique that adds adaptive noise to the parameters of the neural network policy rather than to its action output, enabling agents to learn tasks significantly faster than with traditional action-space noise. The method achieves 2x faster learning on HalfCheetah and sits in a middle ground between evolution strategies and deep RL methods such as TRPO and DDPG.
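
A minimal sketch of the adaptive scheme, with a hypothetical toy policy: perturb a copy of the policy's weights with Gaussian noise, measure how far the perturbed actions drift from the unperturbed ones on recent observations, and nudge the noise scale toward a target action-space distance (a small multiplicative factor, 1.01 here):

```python
import copy
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
sigma, target_dist = 0.1, 0.2   # noise scale and desired action-space effect

def perturbed_copy(net, sigma):
    """Return a copy of the policy with Gaussian noise added to every weight."""
    noisy = copy.deepcopy(net)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
    return noisy

obs = torch.randn(64, 8)   # stand-in batch of recent observations
noisy_policy = perturbed_copy(policy, sigma)
with torch.no_grad():
    dist = (policy(obs) - noisy_policy(obs)).pow(2).mean().sqrt()

# Adapt sigma multiplicatively so the induced action-space perturbation
# stays near the target distance.
sigma *= 1.01 if dist < target_dist else 1 / 1.01
print(float(dist), sigma)
```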

UCB exploration via Q-ensembles

OpenAI Blog · 2017-06-05

OpenAI presents a novel exploration strategy for deep reinforcement learning using ensembles of Q-functions with upper-confidence bounds (UCB), demonstrating significant performance improvements on the Atari benchmark.
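
The action-selection rule is compact: act greedily with respect to the ensemble mean plus a multiple of the ensemble standard deviation, which acts as an optimism bonus where the Q-heads disagree. A minimal sketch in which q_ensemble is a hypothetical stand-in for trained Q-heads:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, n_actions = 10, 4

def q_ensemble(state):
    """Stand-in for an ensemble of trained Q-functions:
    one row of Q(s, .) per head."""
    return rng.normal(loc=state.sum(), scale=1.0, size=(n_heads, n_actions))

def ucb_action(state, lam=1.0):
    """argmax_a [ mean_k Q_k(s, a) + lam * std_k Q_k(s, a) ]:
    disagreement across heads is treated as an uncertainty bonus."""
    q = q_ensemble(state)
    return int(np.argmax(q.mean(axis=0) + lam * q.std(axis=0)))

print(ucb_action(np.ones(4)))
```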

#Exploration: A study of count-based exploration for deep reinforcement learning

OpenAI Blog · 2016-11-15

OpenAI researchers demonstrate that a simple count-based exploration approach using hash codes can achieve near state-of-the-art performance on high-dimensional deep RL benchmarks, challenging the assumption that count-based methods cannot scale to continuous state spaces.
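
A minimal sketch of the counting mechanism, assuming SimHash-style hashing (the sign pattern of a random projection) as in the paper; the dimensions and bonus coefficient are illustrative:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
obs_dim, k = 16, 32                     # k hash bits control granularity
A = rng.standard_normal((k, obs_dim))   # fixed random projection (SimHash)
counts = defaultdict(int)

def hash_code(obs):
    """Map a continuous observation to a discrete code."""
    return tuple((A @ obs > 0).astype(np.int8))

def exploration_bonus(obs, beta=0.01):
    """Count-based bonus beta / sqrt(n(phi(s))) over hashed states."""
    h = hash_code(obs)
    counts[h] += 1
    return beta / np.sqrt(counts[h])

s = rng.standard_normal(obs_dim)
print(exploration_bonus(s))  # first visit: full bonus
print(exploration_bonus(s))  # revisits shrink the bonus
```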
