Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
Summary
This paper proposes an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high, improving performance on text-based and GUI-based benchmarks.
Cached at: 05/14/26, 08:20 PM
Source: https://huggingface.co/papers/2605.08978
Abstract
Agents use variational inference to evaluate exploratory actions and selectively explore only when uncertainty is high, improving performance on text-based and GUI-based benchmarks.
Recent advancements in agentic test-time scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically employ undifferentiated exploration strategies, lacking the ability to adaptively distinguish when exploration is truly required. In this paper, we propose an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high. Our method introduces a fine-grained reward function via variational inference that explicitly evaluates exploratory actions by estimating their potential to improve future decision-making, together with an exploration-aware grouping mechanism that separates exploratory actions from task-completion actions during optimization. By targeting informational gaps, this design allows agents to explore selectively and transition to execution as soon as the task context is clear. Empirically, we demonstrate that our approach achieves consistent improvements across a range of challenging text-based and GUI-based agent benchmarks. Code is available at https://github.com/HansenHua/EAPO-ICML26 and models are available at https://huggingface.co/hansenhua/EAPO-ICML26.
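The grouping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a GRPO-style normalization (subtract the group mean, divide by the group standard deviation), which the abstract does not specify. The names `Step` and `grouped_advantages` and the reward values are hypothetical and exist only for illustration.

```python
from statistics import mean, pstdev
from typing import List, NamedTuple


class Step(NamedTuple):
    """One agent action (hypothetical structure for illustration)."""
    reward: float         # scalar reward assigned to this action
    is_exploratory: bool  # True for information-gathering actions


def grouped_advantages(steps: List[Step], eps: float = 1e-8) -> List[float]:
    """Normalize rewards separately for exploratory and task-completion actions.

    Assumption: advantages are (reward - group mean) / (group std + eps).
    """
    advantages = [0.0] * len(steps)
    for flag in (True, False):
        # Collect the indices of the actions that belong to this group.
        idx = [i for i, s in enumerate(steps) if s.is_exploratory == flag]
        if not idx:
            continue
        rewards = [steps[i].reward for i in idx]
        mu = mean(rewards)
        sigma = pstdev(rewards)  # population standard deviation of the group
        for i in idx:
            advantages[i] = (steps[i].reward - mu) / (sigma + eps)
    return advantages


# Two exploratory actions and two task-completion actions with made-up rewards.
steps = [
    Step(0.2, True), Step(0.6, True),
    Step(1.0, False), Step(0.0, False),
]
adv = grouped_advantages(steps)
```

Because each group is normalized on its own scale, an exploratory action is compared only with other exploratory actions, so a modest information-gathering reward is not drowned out by the larger rewards of task-completion actions.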
Models citing this paper: 1
hansenhua/EAPO-ICML26 (Image-Text-to-Text, 8B)
Similar Articles
Look Before You Leap: Autonomous Exploration for LLM Agents
This paper identifies autonomous exploration as a critical capability for LLM agents and proposes the Explore-then-Act paradigm, which decouples information gathering from task execution to improve adaptability and real-world performance. It also introduces Exploration Checkpoint Coverage as a verifiable metric for evaluating exploration breadth.
Learning Agentic Policy from Action Guidance
The paper proposes ActGuide-RL, a method for training agentic policies in LLMs by using human action data as guidance to overcome exploration barriers in reinforcement learning without extensive supervised fine-tuning.
Some considerations on learning to explore via meta-reinforcement learning
OpenAI researchers introduce E-MAML and E-RL², two meta-reinforcement learning algorithms designed to improve exploration in tasks where discovering optimal policies requires significant exploration. The work demonstrates these algorithms' effectiveness on novel environments including Krazy World and maze tasks.
Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
NudgeRL is a framework that enhances reinforcement learning with verifiable rewards (RLVR) by introducing structured exploration and strategy nudging, achieving better reasoning performance in large language models more efficiently than brute-force scaling methods.
SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
Researchers propose SPS (Steering Probability Squeezing), a training paradigm combining reinforcement learning with inverse reinforcement learning to address probability squeezing in LLM reasoning training, where probability mass concentrates too narrowly on high-reward trajectories, limiting exploration and multi-sample performance (Pass@k). Experiments on five reasoning benchmarks demonstrate improved exploration and Pass@k metrics.