AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
Summary
This paper introduces AEM, a supervision-free method for agentic reinforcement learning that adapts entropy dynamics at the response level to improve exploration-exploitation trade-offs. It demonstrates performance gains on benchmarks like ALFWorld and SWE-bench by aligning uncertainty estimation with action granularity.
View Cached Full Text
Cached at: 05/11/26, 07:19 AM
Paper page - AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
Source: https://huggingface.co/papers/2605.00425 Authors:
,
,
,
,
,
,
,
,
,
,
Abstract
A novel supervision-free credit assignment method for reinforcement learning in language model agents that adapts entropy dynamics at the response level to improve exploration-exploitation trade-offs and task performance.
Reinforcement learning(RL) has substantially improved the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. However, effective agentic RL remains challenging: sparse outcome-only rewards provide limited guidance for assigning credit to individual steps within long interaction trajectories. Existing approaches often introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, which increases supervision and tuning complexity and may limit generalization across tasks and domains. We present AEM, a supervision-freecredit assignmentmethod that adaptively modulatesentropy dynamicsduring RL training to improve theexploration-exploitation trade-off. Since in agentic RL the environment is typically affected by a complete response, rather than an individual token, our analysis liftsentropy dynamicsfrom the token level to the response level, aligning uncertainty estimation with the effective action granularity of LLM agents and reducing sensitivity to token-level sampling noise. We further show that entropy drift undernatural-gradient updatesis governed by the interaction between the sampled-response advantage and its relative surprisal. Motivated by this result, AEM derives a practicalresponse-level uncertaintyproxy and uses it to rescale advantages, leveraging the evolving balance between positive and negative samples to naturally transition from exploration to exploitation. Extensive experiments on ALFWorld, WebShop, and SWE-bench-Verified with models ranging from 1.5B to 32B demonstrate that AEM consistently improves strong RL baselines, including a +1.4\% gain when integrated into a state-of-the-art software-engineering RL training framework.
View arXiv pageView PDFProject pageAdd to collection
Get this paper in your agent:
hf papers read 2605\.00425
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper0
No model linking this paper
Cite arxiv.org/abs/2605.00425 in a model README.md to link it from this page.
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2605.00425 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2605.00425 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
This paper proposes Adaptive Entropy Regularization (AER), a framework that dynamically balances exploration and exploitation in LLM reinforcement learning by addressing policy entropy collapse through difficulty-aware coefficient allocation and initial-anchored target entropy. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements in both accuracy and exploration capability.
How Maximum Entropy makes Reinforcement Learning Robust
This article explains how incorporating Shannon entropy into reinforcement learning objectives creates more robust agents capable of handling unexpected or adversarial changes in rewards and dynamics.
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
This paper proposes an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high, improving performance on text-based and GUI-based benchmarks.
Some considerations on learning to explore via meta-reinforcement learning
OpenAI researchers introduce E-MAML and E-RL², two meta-reinforcement learning algorithms designed to improve exploration in tasks where discovering optimal policies requires significant exploration. The work demonstrates these algorithms' effectiveness on novel environments including Krazy World and maze tasks.
@dair_ai: // Harnessing Agentic Evolution // Pay attention to this one if you run iterative agentic search loops. (bookmark it) A…
AEvo is a meta-editing framework that improves iterative agentic search by separating proposal and evaluation into two roles and using accumulated memory to guide future search. It achieves a 26% relative gain over baselines and state-of-the-art results on open-ended optimization tasks.