Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
Summary
This paper proposes an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high, improving performance on text-based and GUI-based benchmarks.
View Cached Full Text
Cached at: 05/14/26, 08:20 PM
Paper page - Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
Source: https://huggingface.co/papers/2605.08978
Abstract
Agents use variational inference to evaluate exploratory actions and selectively explore only when uncertainty is high, improving performance on text-based and GUI-based benchmarks.
Recent advancements in agentic test-time scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically employ undifferentiated exploration strategies, lacking the ability to adaptively distinguish when exploration is truly required. In this paper, we propose anexploration-awarereinforcement learningframework that enables LLM agents to adaptively explore only when uncertainty is high. Our method introduces afine-grained reward functionviavariational inferencethat explicitly evaluates exploratory actions by estimating their potential to improve future decision-making, together with anexploration-aware grouping mechanismthat separates exploratory actions from task-completion actions during optimization. By targetinginformational gaps, this design allows agents to explore selectively and transition to execution as soon as the task context is clear. Empirically, we demonstrate that our approach achieves consistent improvements across a range of challenging text-based and GUI-based agent benchmarks. Code is available at https://github.com/HansenHua/EAPO-ICML26 and models are available at https://huggingface.co/hansenhua/EAPO-ICML26.
View arXiv pageView PDFGitHub1Add to collection
Community
Get this paper in your agent:
hf papers read 2605\.08978
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper1
#### hansenhua/EAPO-ICML26 Image-Text-to-Text• 8B• Updated1 day ago • 129
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2605.08978 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2605.08978 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration
Proposes SCALE, a framework for self-improving web agents using cognitive-aware exploration with three adversarial roles and a graph exploration strategy. Also introduces a large-scale dataset SCALE-20k from real websites, showing significant improvements in MLLM-based web agents.
Look Before You Leap: Autonomous Exploration for LLM Agents
This paper identifies autonomous exploration as a critical capability for LLM agents and proposes the Explore-then-Act paradigm, which decouples information gathering from task execution to improve adaptability and real-world performance. It also introduces Exploration Checkpoint Coverage as a verifiable metric for evaluating exploration breadth.
Learning Agentic Policy from Action Guidance
The paper proposes ActGuide-RL, a method for training agentic policies in LLMs by using human action data as guidance to overcome exploration barriers in reinforcement learning without extensive supervised fine-tuning.
Some considerations on learning to explore via meta-reinforcement learning
OpenAI researchers introduce E-MAML and E-RL², two meta-reinforcement learning algorithms designed to improve exploration in tasks where discovering optimal policies requires significant exploration. The work demonstrates these algorithms' effectiveness on novel environments including Krazy World and maze tasks.
From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning
This paper proposes the LLM-as-Environment-Engineer framework, where a policy model analyzes failures to automatically redesign the training environment for reinforcement learning, and introduces MAPF-FrozenLake as a controllable testbed. The framework, using Qwen3-4B, outperforms larger models like GPT and Gemini, showing that policy learning improves the model's ability to diagnose weaknesses.