AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

Hugging Face Daily Papers 05/08/26, 12:00 AM Papers

reinforcement-learning llm-agents entropy-modulation credit-assignment multi-turn baidu-qianfan

Summary

This paper introduces AEM, a supervision-free credit assignment method for agentic reinforcement learning that modulates entropy dynamics at the response level to improve the exploration-exploitation trade-off. It reports gains over strong RL baselines on ALFWorld, WebShop, and SWE-bench-Verified by aligning uncertainty estimation with the granularity at which an agent actually acts: the complete response rather than the individual token.

Reinforcement learning (RL) has substantially improved the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. However, effective agentic RL remains challenging: sparse outcome-only rewards provide limited guidance for assigning credit to individual steps within long interaction trajectories. Existing approaches often introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, which increases supervision and tuning complexity and may limit generalization across tasks and domains. We present AEM, a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to improve the exploration-exploitation trade-off. Since in agentic RL the environment is typically affected by a complete response, rather than an individual token, our analysis lifts entropy dynamics from the token level to the response level, aligning uncertainty estimation with the effective action granularity of LLM agents and reducing sensitivity to token-level sampling noise. We further show that entropy drift under natural-gradient updates is governed by the interaction between the sampled-response advantage and its relative surprisal. Motivated by this result, AEM derives a practical response-level uncertainty proxy and uses it to rescale advantages, leveraging the evolving balance between positive and negative samples to naturally transition from exploration to exploitation. Extensive experiments on ALFWorld, WebShop, and SWE-bench-Verified with models ranging from 1.5B to 32B demonstrate that AEM consistently improves strong RL baselines, including a +1.4% gain when integrated into a state-of-the-art software-engineering RL training framework.
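The claim that entropy drift depends on the advantage and the relative surprisal of a response can be checked numerically in a toy setting. The sketch below is not the paper's implementation: it assumes a tabular softmax policy over a handful of candidate responses, models a natural-gradient step as adding `step_size * advantage` to each logit, and takes "relative surprisal" to mean the response surprisal minus the policy entropy. All function names are hypothetical. Under those assumptions, the first-order entropy change is `step_size * E[A(y) * (surprisal(y) - H)]`, so rewarding a surprising response raises entropy and rewarding an already likely one lowers it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def response_surprisal(token_logprobs):
    """Response-level surprisal: negative sum of the token log-probabilities."""
    return -sum(token_logprobs)


def predicted_entropy_drift(probs, advantages, step_size):
    """First-order drift: step_size * E[A(y) * (surprisal(y) - H)].

    Assumption: tabular softmax policy, natural-gradient step theta += step_size * A.
    """
    h = entropy(probs)
    return step_size * sum(
        p * a * (-math.log(p) - h) for p, a in zip(probs, advantages)
    )


def actual_entropy_drift(logits, advantages, step_size):
    """Entropy change after actually applying the assumed update to the logits."""
    new_logits = [t + step_size * a for t, a in zip(logits, advantages)]
    return entropy(softmax(new_logits)) - entropy(softmax(logits))


# Toy policy over four candidate responses; the first one dominates.
logits = [2.0, 0.5, 0.0, -1.0]
probs = softmax(logits)
step_size = 1e-3

# Case 1: only the rarest (most surprising) response gets positive advantage.
adv_rare = [0.0, 0.0, 0.0, 1.0]
drift_rare_pred = predicted_entropy_drift(probs, adv_rare, step_size)
drift_rare_actual = actual_entropy_drift(logits, adv_rare, step_size)

# Case 2: only the dominant (least surprising) response gets positive advantage.
adv_dominant = [1.0, 0.0, 0.0, 0.0]
drift_dominant_pred = predicted_entropy_drift(probs, adv_dominant, step_size)
drift_dominant_actual = actual_entropy_drift(logits, adv_dominant, step_size)

print("rare response rewarded:    ", drift_rare_pred, drift_rare_actual)
print("dominant response rewarded:", drift_dominant_pred, drift_dominant_actual)
```

In this toy model the sign of the drift is set jointly by the advantage and by whether the response is more or less surprising than average, which is the kind of interaction the abstract describes. How AEM turns this into its practical uncertainty proxy and advantage rescaling is not specified here and is left to the paper.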

Cached at: 05/11/26, 07:19 AM

Source: https://huggingface.co/papers/2605.00425

Abstract

A novel supervision-free credit assignment method for reinforcement learning in language model agents that adapts entropy dynamics at the response level to improve exploration-exploitation trade-offs and task performance.

Get this paper in your agent:

hf papers read 2605.00425

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

How Maximum Entropy makes Reinforcement Learning Robust

Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

Some considerations on learning to explore via meta-reinforcement learning

@dair_ai: // Harnessing Agentic Evolution // Pay attention to this one if you run iterative agentic search loops. (bookmark it) A…
