Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

Hugging Face Daily Papers 05/12/26, 12:00 AM Papers

Summary

This paper proposes an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high, improving performance on text-based and GUI-based benchmarks.

Recent advancements in agentic test-time scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically employ undifferentiated exploration strategies, lacking the ability to adaptively distinguish when exploration is truly required. In this paper, we propose an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high. Our method introduces a fine-grained reward function via variational inference that explicitly evaluates exploratory actions by estimating their potential to improve future decision-making, together with an exploration-aware grouping mechanism that separates exploratory actions from task-completion actions during optimization. By targeting informational gaps, this design allows agents to explore selectively and transition to execution as soon as the task context is clear. Empirically, we demonstrate that our approach achieves consistent improvements across a range of challenging text-based and GUI-based agent benchmarks. Code is available at https://github.com/HansenHua/EAPO-ICML26 and models are available at https://huggingface.co/hansenhua/EAPO-ICML26.
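The idea of rewarding an exploratory action by "its potential to improve future decision-making" can be illustrated with a small sketch. It assumes, purely for illustration, that the benefit of exploring is measured as the gain in log-probability the policy assigns to a reference final action after seeing the feedback, and that "uncertainty is high" means the normalized entropy of the action distribution exceeds a threshold. The function names, the threshold, and both proxies are hypothetical; the abstract does not spell out the paper's variational objective, and this is not a reproduction of it.

```python
import math


def exploration_reward(p_before: float, p_after: float, eps: float = 1e-8) -> float:
    """Hypothetical proxy reward for one exploratory action.

    p_before / p_after: probability the policy assigns to a reference final
    action before and after observing the feedback from exploring.
    Returns the log-probability gain (positive if exploring helped).
    """
    return math.log(p_after + eps) - math.log(p_before + eps)


def should_explore(action_probs: list[float], threshold: float = 0.5) -> bool:
    """Hypothetical gate: explore only when normalized entropy is high.

    Normalized entropy is 0 for a one-hot distribution and about 1 for a
    uniform one; the 0.5 threshold is an arbitrary illustrative choice.
    """
    n = len(action_probs)
    if n < 2:
        return False  # a single candidate action leaves nothing to resolve
    entropy = -sum(p * math.log(p) for p in action_probs if p > 0)
    return entropy / math.log(n) > threshold


# Feedback that raises confidence in the reference action earns a positive reward.
gain = exploration_reward(p_before=0.2, p_after=0.8)
# A near-uniform distribution triggers exploration; a peaked one does not.
uncertain = should_explore([0.25, 0.25, 0.25, 0.25])
confident = should_explore([0.97, 0.01, 0.01, 0.01])
```

Under these assumptions, an exploratory step that leaves the policy's beliefs unchanged earns a reward of zero, which matches the abstract's goal of discouraging exploration once the task context is clear.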

Original Article


Cached at: 05/14/26, 08:20 PM


Source: https://huggingface.co/papers/2605.08978

Abstract

A reward function derived via variational inference evaluates exploratory actions, so that agents explore selectively, only when uncertainty is high, improving performance on text-based and GUI-based benchmarks.

Recent advancements in agentic test-time scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically employ undifferentiated exploration strategies, lacking the ability to adaptively distinguish when exploration is truly required. In this paper, we propose an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high. Our method introduces a fine-grained reward function via variational inference that explicitly evaluates exploratory actions by estimating their potential to improve future decision-making, together with an exploration-aware grouping mechanism that separates exploratory actions from task-completion actions during optimization. By targeting informational gaps, this design allows agents to explore selectively and transition to execution as soon as the task context is clear. Empirically, we demonstrate that our approach achieves consistent improvements across a range of challenging text-based and GUI-based agent benchmarks. Code is available at https://github.com/HansenHua/EAPO-ICML26 and models are available at https://huggingface.co/hansenhua/EAPO-ICML26.
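One way to picture a grouping mechanism that "separates exploratory actions from task-completion actions during optimization" is to normalize rewards within each action type, so that exploratory steps are only compared against other exploratory steps. The sketch below assumes a group-relative (mean and standard deviation) advantage estimate and hypothetical labels "explore" and "execute"; the abstract does not give the paper's exact formulation, so treat this as an illustration of the general idea only.

```python
from statistics import mean, pstdev


def grouped_advantages(steps, eps=1e-8):
    """Normalize rewards separately per action kind (hypothetical sketch).

    steps: list of (kind, reward) pairs, kind in {"explore", "execute"}.
    Returns one advantage per step, standardized within its own kind.
    Steps with any other kind keep an advantage of 0.0.
    """
    advantages = [0.0] * len(steps)
    for kind in ("explore", "execute"):
        idx = [i for i, (k, _) in enumerate(steps) if k == kind]
        if not idx:
            continue
        rewards = [steps[i][1] for i in idx]
        mu, sigma = mean(rewards), pstdev(rewards)
        for i in idx:
            # eps avoids division by zero when all rewards in a group are equal
            advantages[i] = (steps[i][1] - mu) / (sigma + eps)
    return advantages


# Exploration rewards are small in scale, task rewards are large; grouping
# keeps the small exploration signal from being swamped by task rewards.
steps = [("explore", 0.1), ("explore", 0.3), ("execute", 1.0), ("execute", 0.0)]
adv = grouped_advantages(steps)
```

With this split, the better of the two exploratory steps receives a positive advantage even though its raw reward is far below that of the successful task-completion step.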


Community

Get this paper in your agent:

hf papers read 2605.08978

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### hansenhua/EAPO-ICML26 Image-Text-to-Text • 8B • Updated 1 day ago • 129

Similar Articles

Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration

Look Before You Leap: Autonomous Exploration for LLM Agents

Learning Agentic Policy from Action Guidance

Some considerations on learning to explore via meta-reinforcement learning

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning
