Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

arXiv cs.CL 04/20/26, 04:00 AM Papers

reinforcement-learning llm entropy-regularization reasoning rlvr exploration-exploitation

Summary

This paper proposes Adaptive Entropy Regularization (AER), a framework that dynamically balances exploration and exploitation in LLM reinforcement learning by addressing policy entropy collapse through difficulty-aware coefficient allocation and initial-anchored target entropy. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements in both accuracy and exploration capability.

arXiv:2510.10959v3 Announce Type: replace-cross

# Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

Source: https://arxiv.org/html/2510.10959

## Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

Xiaoyun Zhang¹,³,† Xiaojian Yuan²,† Di Huang¹ Wang You⁴ Chen Hu⁴ Jingqing Ruan³ Ai Jian⁴ Kejiang Chen² Xing Hu¹,*

¹State Key Lab of Processors, Institute of Computing Technology, CAS
²University of Science and Technology of China
³University of Chinese Academy of Sciences
⁴StepFun Inc

###### Abstract

Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy collapse, where the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. While entropy regularization is a common remedy, its effectiveness is highly sensitive to the fixed coefficient, making it unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of varying difficulty demand distinct exploration intensities, and (ii) balanced exploration may require the policy entropy to be maintained within a moderate range below its initial level. Therefore, we propose Adaptive Entropy Regularization (AER) — a framework that dynamically balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.

## 1 Introduction

Reasoning ability has become a crucial capability for Large Language Models (LLMs) to solve complex tasks in mathematics and coding. Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as an effective paradigm to enhance this capability, driving advances in state-of-the-art models such as OpenAI-o1 and DeepSeek-R1 ([Jaech et al., 2024](https://arxiv.org/html/2510.10959#bib.bib1); [Guo et al., 2025](https://arxiv.org/html/2510.10959#bib.bib2)). However, recent studies observe that *policy entropy collapse* may pose a significant bottleneck in RLVR training ([Cui et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib3); [He et al., 2025](https://arxiv.org/html/2510.10959#bib.bib4); [Cheng et al., 2025](https://arxiv.org/html/2510.10959#bib.bib11); [Dai et al., 2025](https://arxiv.org/html/2510.10959#bib.bib15)), closely tied to the long-standing exploration–exploitation dilemma ([Sutton et al., 1998](https://arxiv.org/html/2510.10959#bib.bib16)). Specifically, the model's policy often converges prematurely to a narrow set of exploitative reasoning trajectories, thereby suppressing exploration of the broader solution space ([Chen et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib5)). This premature convergence typically manifests as a rapid decline in policy entropy during the early stages of training ([Yu et al., 2025](https://arxiv.org/html/2510.10959#bib.bib6)), trapping the policy in local optima and leading to performance plateaus that constrain the model's overall reasoning potential ([Cui et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib3)).

[Figure 1](https://arxiv.org/html/2510.10959#S1.F1): An overview of the AER framework.

A conventional approach in reinforcement learning to alleviate policy entropy collapse is to introduce an entropy regularization term, which explicitly penalizes overly deterministic policies and encourages exploration ([Schulman et al., 2017](https://arxiv.org/html/2510.10959#bib.bib7)). Despite its simplicity and conceptual appeal, this technique is often omitted in recent RLVR pipelines for LLMs ([Yu et al., 2025](https://arxiv.org/html/2510.10959#bib.bib6); [Hu et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib13); [Liu et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib14); [Cui et al., 2025a](https://arxiv.org/html/2510.10959#bib.bib12)), as its effectiveness is highly sensitive to the choice of the entropy coefficient. Small coefficients cannot prevent entropy collapse, whereas excessively large coefficients may induce entropy explosion ([Cui et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib3); [Jiang et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib44)). Moreover, a slight change in the base model or dataset may flip the effect of a tuned coefficient from beneficial to harmful ([He et al., 2025](https://arxiv.org/html/2510.10959#bib.bib4)).
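To make the fixed-coefficient setup concrete, the following is a minimal sketch of an entropy bonus added to a policy loss. It is not taken from the paper: the function names (`token_entropy`, `mean_policy_entropy`, `regularized_loss`), the averaging over tokens, and the scalar `policy_loss` are illustrative assumptions, and a real RLVR pipeline would compute these on batched tensors.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(logits_per_token):
    """Average token-level entropy over a sequence (assumed aggregation)."""
    return sum(token_entropy(l) for l in logits_per_token) / len(logits_per_token)


def regularized_loss(policy_loss, logits_per_token, entropy_coef):
    """Policy loss minus a fixed-coefficient entropy bonus.

    Minimizing this loss rewards higher entropy, so a larger
    entropy_coef pushes the policy toward more exploration.
    """
    return policy_loss - entropy_coef * mean_policy_entropy(logits_per_token)
```

With a single fixed `entropy_coef`, the same exploration pressure is applied to every sample at every training step, which is the design the rest of this section questions.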

Intuitively, the balance between exploration (high entropy) and exploitation (low entropy) should be dynamic throughout training. Fixed coefficients struggle to deal with this evolving trade-off ([He et al., 2025](https://arxiv.org/html/2510.10959#bib.bib4); [Cui et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib3)). This naturally raises the question: *Can we adaptively adjust the coefficient for entropy regularization during RLVR training?*

In this work, we revisit entropy regularization in the context of RLVR for LLMs and argue that its potential has been largely underestimated due to the limitations of fixed-coefficient designs. To substantiate this claim, we conduct a preliminary analysis in Section 3 and make two observations: (*i*) tasks of different difficulty levels require distinct exploration intensities, suggesting the need for a difficulty-aware mechanism that enables sample-level control of entropy regularization; and (*ii*) effective exploration during training requires maintaining the policy entropy at a specific target value below its initial entropy.

Therefore, we propose *Adaptive Entropy Regularization (AER)*, shown in Figure 1, which dynamically balances exploration and exploitation through adaptive coefficients. It comprises three components: (*i*) *Difficulty-Aware Coefficient Allocation* estimates task difficulty relative to the current policy and assigns sample-level entropy coefficients to achieve fine-grained entropy regularization; (*ii*) *Initial-Anchored Target Entropy* sets the target entropy relative to each run's initial entropy, keeping the relative exploration budget consistent across settings; and (*iii*) *Dynamic Global Coefficient Adjustment* adjusts a global scaling factor for the entropy coefficients according to the current policy entropy, so that the policy entropy stays near the target during training. Together, these components form an adaptive controller that maintains policy entropy within a reasonable range, stabilizing training while retaining balanced exploration.
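The three components can be pictured as a small feedback controller. The sketch below is purely illustrative: this excerpt does not give AER's update rules, so the class name `EntropyController`, the `target_ratio` and `step_size` parameters, the fixed-step update, and the use of `1 - pass_rate` as a difficulty proxy are all assumptions rather than the authors' formulas.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EntropyController:
    """Hypothetical controller; all parameters and rules are assumptions."""
    initial_entropy: float      # policy entropy measured at the start of the run
    target_ratio: float = 0.5   # assumed fraction of the initial entropy to hold
    step_size: float = 0.01     # assumed step for the global coefficient
    max_coef: float = 1.0       # assumed upper bound on the global coefficient
    global_coef: float = 0.0

    @property
    def target_entropy(self) -> float:
        # Initial-anchored target: defined relative to this run's own starting entropy.
        return self.target_ratio * self.initial_entropy

    def update(self, current_entropy: float) -> float:
        # Dynamic global adjustment: raise the coefficient when entropy is
        # below the target, lower it otherwise, then clamp to [0, max_coef].
        if current_entropy < self.target_entropy:
            self.global_coef += self.step_size
        else:
            self.global_coef -= self.step_size
        self.global_coef = min(max(self.global_coef, 0.0), self.max_coef)
        return self.global_coef

    def sample_coefs(self, pass_rates: List[float]) -> List[float]:
        # Difficulty-aware allocation: prompts the current policy solves less
        # often (lower pass rate) receive a larger share of the coefficient.
        return [self.global_coef * (1.0 - r) for r in pass_rates]
```

For example, `EntropyController(initial_entropy=2.0)` targets an entropy of 1.0; calling `update` with a measured entropy below that value increases the global coefficient, and `sample_coefs` then scales it per prompt. Anchoring the target to the initial entropy means the same `target_ratio` can be reused across base models and datasets whose starting entropies differ.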

We evaluate AER on several challenging mathematical reasoning benchmarks, where it consistently improves both reasoning performance and diversity. Our contributions are summarized as follows:

- We conduct a preliminary analysis showing that exploration should adapt to task difficulty and that balanced exploration may require maintaining the policy entropy within a moderate range below its initial level, motivating adaptive, difficulty-aware entropy regularization.

- We introduce *Adaptive Entropy Regularization (AER)*, a framework that dynamically adjusts the coefficients of entropy regularization to better balance exploration and exploitation throughout training.

- Extensive experiments on various mathematical reasoning benchmarks demonstrate that AER consistently outperforms advanced baselines in both reasoning performance (pass@1) and exploration capability (pass@k), validating the potential of adaptive entropy regularization in RLVR training.
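For readers unfamiliar with the two metrics in the last contribution, the sketch below computes pass@k from `n` sampled answers of which `c` are correct, using the standard unbiased estimator 1 - C(n-c, k) / C(n, k). This excerpt does not state how the paper computes pass@k, so the choice of estimator and the function name `pass_at_k` are assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k answers, drawn without
    replacement from n samples with c correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `k = 1` the estimator reduces to the fraction of correct samples (pass@1), while larger `k` rewards a policy that reaches a correct answer on at least some of its samples, which is why pass@k is used here as a proxy for exploration capability.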

[Figure 2](https://arxiv.org/html/2510.10959#S3.F2): Preliminary experimental results. (a-b) show the effect of different entropy coefficients on test accuracy and average token length on easy and difficult datasets, respectively. (c) demonstrates that different base models, datasets, and sampling temperatures significantly affect the value of the initial entropy.

## 2 Related Work

#### Reinforcement Learning for LLMs

Reinforcement learning is an important paradigm for training LLMs ([Ouyang et al., 2022](https://arxiv.org/html/2510.10959#bib.bib17); [Team et al., 2023](https://arxiv.org/html/2510.10959#bib.bib18); [Lee et al., 2024](https://arxiv.org/html/2510.10959#bib.bib19); [Jian et al., 2026](https://arxiv.org/html/2510.10959#bib.bib67)). Recently, reinforcement learning with verifiable rewards (RLVR) has shown remarkable success, achieving strong performance on complex tasks ([Guo et al., 2025](https://arxiv.org/html/2510.10959#bib.bib2); [Jaech et al., 2024](https://arxiv.org/html/2510.10959#bib.bib1); [Yang et al., 2025](https://arxiv.org/html/2510.10959#bib.bib20); [He et al., 2025](https://arxiv.org/html/2510.10959#bib.bib4); [Team et al., 2025](https://arxiv.org/html/2510.10959#bib.bib21); [Liu et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib14); [Zeng et al., 2025](https://arxiv.org/html/2510.10959#bib.bib22); [Zhang et al., 2025a](https://arxiv.org/html/2510.10959#bib.bib65); [Luo et al., 2026](https://arxiv.org/html/2510.10959#bib.bib61)). In addition, a series of actor-only methods further reduce the resource burden and complexity of RLVR ([Shao et al., 2024](https://arxiv.org/html/2510.10959#bib.bib23); [Li et al., 2024](https://arxiv.org/html/2510.10959#bib.bib25); [Hu et al., 2025a](https://arxiv.org/html/2510.10959#bib.bib24); [Yu et al., 2025](https://arxiv.org/html/2510.10959#bib.bib6); [Zheng et al., 2025a](https://arxiv.org/html/2510.10959#bib.bib26); [Li et al., 2026](https://arxiv.org/html/2510.10959#bib.bib63)). However, RLVR still suffers from the exploration–exploitation dilemma, which manifests as a rapid decrease in policy entropy and thus limits the performance of LLMs ([Cui et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib3); [Cheng et al., 2025](https://arxiv.org/html/2510.10959#bib.bib11); [He et al., 2025](https://arxiv.org/html/2510.10959#bib.bib4); [Zhang et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib64)).

#### Exploration in Reinforcement Learning

Exploration is a central challenge in reinforcement learning, typically approached through theoretical analysis ([Cai et al., 2020](https://arxiv.org/html/2510.10959#bib.bib31); [Ishfaq et al., 2021](https://arxiv.org/html/2510.10959#bib.bib32)), curiosity-driven signals ([Pathak et al., 2017](https://arxiv.org/html/2510.10959#bib.bib33); [Burda et al.](https://arxiv.org/html/2510.10959#bib.bib34); [Raileanu and Rocktäschel](https://arxiv.org/html/2510.10959#bib.bib35); [Henaf et al., 2022](https://arxiv.org/html/2510.10959#bib.bib29)), and entropy maximization ([Ziebart et al., 2008](https://arxiv.org/html/2510.10959#bib.bib8); [Toussaint, 2009](https://arxiv.org/html/2510.10959#bib.bib36)). In the context of LLMs, several studies have employed entropy as a performance indicator ([Cui et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib3)) or as a heuristic for advantage shaping, enhancing the rollout phase, or loss masking ([Wang et al., 2025](https://arxiv.org/html/2510.10959#bib.bib37); [Cheng et al., 2025](https://arxiv.org/html/2510.10959#bib.bib11); [Zheng et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib38); [Li et al., 2025d](https://arxiv.org/html/2510.10959#bib.bib59)). Entropy regularization or KL penalty helps control policy distributions ([He et al., 2025](https://arxiv.org/html/2510.10959#bib.bib4); [Liu et al., 2025a](https://arxiv.org/html/2510.10959#bib.bib39)), while complementary techniques such as loss reweighting ([Wang et al., 2025](https://arxiv.org/html/2510.10959#bib.bib37); [Cui et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib3)) and clip-higher ([Yu et al., 2025](https://arxiv.org/html/2510.10959#bib.bib6)) further mitigate entropy collapse. 
Additional strategies for promoting exploration include adjusting sampling hyperparameters ([Chen et al., 2025a](https://arxiv.org/html/2510.10959#bib.bib40)), performing self-reflection ([Jiang et al., 2025a](https://arxiv.org/html/2510.10959#bib.bib41)), leveraging external verification ([Zha et al., 2025](https://arxiv.org/html/2510.10959#bib.bib42)), emphasizing high-entropy tokens via critical-token training ([Wang et al., 2025](https://arxiv.org/html/2510.10959#bib.bib37); [Li et al., 2025c](https://arxiv.org/html/2510.10959#bib.bib43); [Jiang et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib44)), and designing custom intrinsic signals ([Li et al., 2025a](https://arxiv.org/html/2510.10959#bib.bib46); [Dai et al., 2025](https://arxiv.org/html/2510.10959#bib.bib15); [Gao et al., 2025](https://arxiv.org/html/2510.10959#bib.bib45); [Song et al., 2025](https://arxiv.org/html/2510.10959#bib.bib60)). However, the necessity of entropy regularization in RLVR remains debated, with some studies questioning its impact on exploration effectiveness ([Ouyang et al., 2022](https://arxiv.org/html/2510.10959#bib.bib17); [Shao et al., 2024](https://arxiv.org/html/2510.10959#bib.bib23); [Hu et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib13); [Cui et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib3)).

## 3 Preliminary Analysis

Adding explicit entropy regularization (e.g., an entropy loss term in the objective) is a straightforward, plug-and-play remedy that keeps the policy from becoming overly deterministic and thus mitigates entropy collapse. Nevertheless, most recent works on RLVR for LLMs omit it ([Hu et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib13); [Liu et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib14); [Cui et al., 2025a](https://arxiv.org/html/2510.10959#bib.bib12); [Yu et al., 2025](https://arxiv.org/html/2510.10959#bib.bib6)), because entropy regularization is highly sensitive to its coefficient in practice: small coefficients cannot effectively prevent entropy collapse, while large coefficients lead to entropy explosion, causing training instability or performance degradation ([Cui et al., 2025b](https://arxiv.org/html/2510.10959#bib.bib3)). Moreover, slight changes in the experimental setup can cause a carefully selected coefficient to have the opposite effect ([He et al., 2025](https://arxiv.org/html/2510.10959#bib.bib4)).

Intuitively, the degree of exploration should correlate with task difficulty: excessive exploration on easy tasks may introduce unnecessary randomness and hinder convergence, whereas difficult tasks often require stronger exploration to escape local optima and discover effective reasoning trajectories ([Li et al., 2025a](https://arxiv.org/html/2510.10959#bib.bib46), [b](https://arxiv.org/html/2510.10959#bib.bib49); [Jiang et al., 2026](https://arxiv.org/html/2510.10959#bib.bib66)). To examine this intuition, we train Qwen3-4B-Base with GRPO on mathematical datasets of different difficulty levels, varying the strength of
