Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

arXiv cs.CL Papers

Summary

This paper introduces Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy in LLMs to balance reasoning depth and conciseness, achieving improved accuracy while reducing response length on mathematical benchmarks.

arXiv:2605.19358v1 Announce Type: new Abstract: Entropy-based deep reasoning has emerged as a promising direction for improving the reasoning capabilities of Large Language Models (LLMs), but existing methods often either increase response length indiscriminately or shorten responses at the cost of accuracy. To better balance this trade-off, we introduce Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy, enabling LLMs to produce concise solutions on simple problems while encouraging deeper exploration on hard ones. Built on DAPO, CES uses token-level entropy as an uncertainty signal and applies a conditional bidirectional policy: it penalizes high-entropy "forking point" tokens on correct reasoning paths to improve conciseness, and rewards them on incorrect paths to encourage exploration and error correction. We implement CES on DeepSeek-R1-Distill-7B and evaluate it on 12 mathematical benchmarks. CES consistently improves average accuracy while reducing response length relative to DAPO, and supplementary experiments show similar trends on a smaller 1.5B backbone and on out-of-domain benchmarks.

Cached at: 05/20/26, 08:25 AM

# Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning
Source: [https://arxiv.org/html/2605.19358](https://arxiv.org/html/2605.19358)
Shuyu Wei1,\*, Jian Sun2,\*, Delai Qiu2, Yining Wang2, Shengping Liu2, Jiaen Liang2, Ying Fu2, Wei Huang2, Jitao Sang1,†

1Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University; 2Unisound AI Technology Co\., Ltd\. \*Equal contribution\. †Corresponding author\.

###### Abstract

Entropy\-based deep reasoning has emerged as a promising direction for improving the reasoning capabilities of Large Language Models \(LLMs\), but existing methods often either increase response length indiscriminately or shorten responses at the cost of accuracy\. To better balance this trade\-off, we introduce Conditional Entropy Shaping \(CES\), a framework that dynamically controls token\-level response entropy, enabling LLMs to produce concise solutions on simple problems while encouraging deeper exploration on hard ones\. Built on DAPO, CES uses token\-level entropy as an uncertainty signal and applies a conditional bidirectional policy: it penalizes high\-entropy “forking point” tokens on correct reasoning paths to improve conciseness, and rewards them on incorrect paths to encourage exploration and error correction\. We implement CES on DeepSeek\-R1\-Distill\-7B and evaluate it on 12 mathematical benchmarks\. CES consistently improves average accuracy while reducing response length relative to DAPO, and supplementary experiments show similar trends on a smaller 1\.5B backbone and on out\-of\-domain benchmarks\.


## 1 Introduction

In recent years, Large Language Models \(LLMs\) have demonstrated remarkable capabilities in complex reasoning tasks such as mathematical derivation, code generation, and logical planning Wei et al\. \([2022](https://arxiv.org/html/2605.19358#bib.bib1)\); Kojima et al\. \([2022](https://arxiv.org/html/2605.19358#bib.bib2)\)\. Advanced reasoning models, exemplified by DeepSeek\-R1 Guo et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib3)\), the Qwen3 series Yang et al\. \([2025a](https://arxiv.org/html/2605.19358#bib.bib4)\), and the OpenAI o3 series, leverage explicit Chain\-of\-Thought \(CoT\) prompting to emulate human\-like thought processes, thereby achieving powerful problem\-solving abilities\. However, the very mechanism that enables this high performance introduces a fundamental tension with a second critical requirement: computational efficiency\. The explicit generation of reasoning steps, while crucial for accuracy on complex tasks, inherently increases the number of generated tokens, leading to high latency and computational costs that can hinder real\-world applications\. This underscores a core dilemma in the field\. On one hand, to achieve the highest possible performance, models are encouraged to explore detailed reasoning paths\. On the other hand, this may lead to significant inefficiency, a phenomenon often described as “overthinking”, where models produce unnecessarily lengthy thought processes for trivial questions like “What is 2\+3?” Chen et al\. \([2024](https://arxiv.org/html/2605.19358#bib.bib5)\); Ma et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib6)\); Yang et al\. \([2025b](https://arxiv.org/html/2605.19358#bib.bib7)\)\.

![Refer to caption](https://arxiv.org/html/2605.19358v1/x1.png)

Figure 1: Overview of our CES pipeline\.

A novel research direction, which we term entropy\-based deep reasoning, has emerged by leveraging token\-level entropy to analyze and guide the reasoning process\. One study revealed that a few high\-entropy tokens within a CoT often act as critical “forking points” in the reasoning path, serving as key levers for decision\-making Wang et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib8)\)\. The authors train the model exclusively on the top 20% high\-entropy tokens and report performance surpassing that of training on all tokens\. Another study demonstrated that rewarding high\-entropy tokens can encourage model exploration and significantly improve reasoning accuracy Cheng et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib9)\)\. Meanwhile, similar work identifies high\-covariance tokens as the primary cause of “entropy collapse” during reinforcement learning Cui et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib25)\)\. The authors restrict updates on these tokens to sustain exploration and ultimately improve the model’s reasoning accuracy\. While these approaches successfully improve model performance, they come with the adverse side effect of further elongating the thought process, thereby exacerbating the “overthinking” phenomenon and increasing inference costs\.

In parallel, another line of research has focused on improving reasoning efficiency through reinforcement learning, aiming to shorten responses and realize on\-demand thinking\. Initial efforts included relatively inflexible methods such as post\-hoc pruning of generated thoughts Muennighoff et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib13)\) or training models to adhere to manually specified length budgets Aggarwal and Welleck \([2025](https://arxiv.org/html/2605.19358#bib.bib17)\)\. Subsequent methods adopt finer\-grained reinforcement learning strategies to achieve conciseness\. For instance, GRPO\-LEAD Zhang and Zuo \([2025](https://arxiv.org/html/2605.19358#bib.bib15)\) penalizes correct responses that are longer than average\. AdaCoT Lou et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib26)\) uses reinforcement learning to learn an optimal policy for triggering the entire CoT process based on query complexity, while Ada\-R1 Luo et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib27)\) first merges long and short CoT models and then uses bi\-level preference training to select the most suitable reasoning style for a given problem\. While effective at reducing length, these approaches may face a critical trade\-off: the gains in efficiency frequently come at the cost of performance degradation on more complex problems that genuinely require deliberate reasoning\.

This presents a clear dilemma: methods that enhance efficiency may hurt accuracy, while methods that boost accuracy may hurt efficiency\. Inspired by recent advances in token\-level entropy Wang et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib8)\); Cheng et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib9)\); Cui et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib25)\), our work aims to resolve this trade\-off by conditioning the model’s exploratory behavior on the correctness of its reasoning path\. In contrast, the previous work Cheng et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib9)\) applies a single, fixed strategy regardless of the reasoning correctness\. Based on this core insight, we propose our novel framework, Conditional Entropy Shaping \(CES\)\. CES operates within the Decoupled Clip and Dynamic Sampling Policy Optimization \(DAPO\) Yu et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib10)\) reinforcement learning framework and modulates the model’s exploratory behavior based on the correctness of its reasoning\. As shown in Figure [1](https://arxiv.org/html/2605.19358#S1.F1), CES guides the model to:

1. Discourage Exploration: When a generated reasoning path is correct, CES applies a penalty to the highest\-entropy tokens within that path\. This encourages the model to become more confident and efficient, refining its thought process toward a concise, direct solution\.
2. Encourage Exploration: Conversely, when the path is incorrect, CES rewards these same high\-entropy “forking point” tokens\. This incentivizes the model to explore alternative pathways and correct its flawed logic\.

Empirical results across 12 math benchmarks show that CES improves both accuracy and efficiency on average\. The primary contributions of this paper are:

- We introduce CES, a novel reinforcement learning mechanism that implements a conditional and bidirectional control policy for LLM reasoning\.
- We demonstrate on 12 mathematical benchmarks that CES improves the average accuracy\-efficiency trade\-off over DAPO\. We further show the robustness of CES through additional experiments on a smaller 1\.5B backbone and out\-of\-domain benchmarks\.
- We provide a comprehensive analysis of CES’s learned behavior, revealing how it develops an adaptive, “on\-demand” reasoning strategy that strategically allocates computational effort\.

## 2 Method

Our proposed method, CES, introduces a novel advantage\-shaping mechanism into the DAPO framework\. DAPO is a reinforcement learning algorithm designed for eliciting complex reasoning in LLMs, which already incorporates several key techniques to stabilize training and improve performance in long CoT scenarios\. CES builds upon this strong foundation by introducing an explicit mechanism to manage the trade\-off between exploration for accuracy and conciseness for efficiency\. It achieves this by dynamically reshaping the token\-level advantage signal based on two factors: the correctness of a given model response and the generation entropy of its constituent tokens\. Specifically, for correct responses, CES penalizes high\-entropy tokens to encourage more direct and concise reasoning paths\. Conversely, for incorrect responses, it rewards high\-entropy tokens to stimulate exploration and facilitate error correction\.

### 2\.1 Preliminaries: The DAPO Framework

DAPO enhances the Group Relative Policy Optimization \(GRPO\) Shao et al\. \([2024](https://arxiv.org/html/2605.19358#bib.bib20)\) algorithm with a suite of techniques tailored for large\-scale reinforcement learning\. For a given prompt $x$, a policy $\pi_{\theta}$ generates a group of $N$ responses, $Y=\{y_1, y_2, \ldots, y_N\}$\. The core of the DAPO objective function is to learn a preference by maximizing the advantage of “winner” responses over “loser” responses within the group\. The full objective is given below; it follows DAPO’s own notation, in which the prompt is written $q$, the group size $G$, and the $i$\-th response $o_i$:

$$
\mathcal{J}_{\text{DAPO}}(\theta) = E_{\substack{(q,a)\sim\mathcal{D},\\ \{o_i\}\sim\pi_{\theta_{\text{old}}}}} \Biggl[ \frac{1}{\sum_{i=1}^{G}|o_i|} \sum_{i=1}^{G}\sum_{t=1}^{|o_i|} \min\Bigl( r_{i,t}(\theta)\hat{A}_{i,t},\ \operatorname{clip}\bigl(r_{i,t}(\theta),\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}\bigr)\hat{A}_{i,t} \Bigr) \Biggr] \tag{1}
$$

The key components of DAPO relevant to our work are:

- Group\-Relative Advantage \($\hat{A}_{i,t}$\): The advantage for a response $y_i$ is calculated by normalizing its reward $R_i$ against the mean and standard deviation of rewards within its group $\{R_j\}_{j=1}^{G}$\. This group\-normalized advantage is then applied to every token $t$ in the response $y_i$\.
- Token\-Level Policy Gradient Loss: DAPO’s objective is normalized by the total number of tokens in the batch \($\sum_{i=1}^{G}|o_i|$\), ensuring that each token contributes equally to the final loss, regardless of the length of the sequence it belongs to\. This prevents shorter sequences from being overshadowed by longer ones\.

CES intervenes directly at the level of the advantage calculation, $\hat{A}_{i,t}$, before it is used in the DAPO objective function\.
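
As a rough illustration of Equation 1, the sketch below evaluates the clipped surrogate for one group of responses and normalizes by the total token count. It is a minimal sketch, not the authors' implementation: the function name `dapo_token_objective`, the nested-list layout, and the default clipping bounds are assumptions made for this example, and the expectation over prompts is omitted.

```python
def dapo_token_objective(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """Clipped token-level surrogate for one group (inner part of Equation 1).

    ratios[i][t]     : importance ratio r_{i,t} for token t of response i
    advantages[i][t] : advantage A_hat_{i,t} for the same token
    eps_low/eps_high : illustrative clipping bounds (assumed, not from this paper)
    """
    total, n_tokens = 0.0, 0
    for resp_ratios, resp_advs in zip(ratios, advantages):
        for r, adv in zip(resp_ratios, resp_advs):
            # clip(r, 1 - eps_low, 1 + eps_high)
            clipped = min(max(r, 1.0 - eps_low), 1.0 + eps_high)
            total += min(r * adv, clipped * adv)
            n_tokens += 1
    # Normalize by the total number of tokens in the group, not per sequence.
    return total / n_tokens
```

Because the advantage is passed in per token, a shaping step such as CES only needs to change the `advantages` argument; the objective itself is untouched.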

### 2\.2 Conditional Entropy Shaping \(CES\)

CES modifies the advantage signal for each token to provide more nuanced guidance to the model\. The process involves three steps\.

#### 2\.2\.1 Step 1: Initial Group\-Wise Calculations

For a given prompt $x$, we generate a response set $Y=\{y_1, y_2, \ldots, y_N\}$ using the policy $\pi_{\theta}$\. We assign a composite reward $R(y_i)$ to each response, which is the sum of two binary components: an accuracy reward $r_{\text{acc}}(y_i)\in\{0,1\}$ based on the correctness of the final answer, and a format reward $r_{\text{fmt}}(y_i)\in\{0,1\}$ for adherence to the <think\>\.\.\.</think\> structure\. The total reward is $R(y_i)=r_{\text{acc}}(y_i)+r_{\text{fmt}}(y_i)$\.

The group accuracy $a$, which is crucial for our conditional mechanism, is computed based only on the correctness reward:

$$
a=\frac{1}{N}\sum_{i=1}^{N} r_{\text{acc}}(y_i) \tag{2}
$$

The initial, unshaped advantage for any token in response $y_i$ is the standard group\-normalized advantage, calculated using the total reward $R(y_i)$:

$$
A_i=\frac{R(y_i)-\text{mean}\bigl(\{R(y_j)\}_{j=1}^{N}\bigr)}{\text{std}\bigl(\{R(y_j)\}_{j=1}^{N}\bigr)} \tag{3}
$$
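
The group-wise quantities of Step 1 can be sketched in a few lines. This is an illustrative sketch only: the helper name `group_stats` is hypothetical, the population standard deviation is an assumption (the text does not say which estimator is used), and the small `eps` in the denominator is added here purely to avoid division by zero; it does not appear in Equation 3.

```python
from statistics import mean, pstdev


def group_stats(acc_rewards, fmt_rewards, eps=1e-6):
    """Return (rewards, group accuracy a, unshaped advantages A_i) for one group."""
    # Composite reward R(y_i) = r_acc(y_i) + r_fmt(y_i)
    rewards = [ra + rf for ra, rf in zip(acc_rewards, fmt_rewards)]
    # Equation 2: group accuracy uses the accuracy reward only
    a = mean(acc_rewards)
    # Equation 3: group-normalized advantage (eps is an assumed safeguard)
    mu, sigma = mean(rewards), pstdev(rewards)
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    return rewards, a, advantages
```

For a group of four responses where three are correct and all follow the format, `group_stats([1, 0, 1, 1], [1, 1, 1, 1])` gives a group accuracy of 0.75, positive advantages for the correct responses, and a negative advantage for the incorrect one.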

#### 2\.2\.2 Step 2: Dynamic Selection of High\-Entropy Tokens

Then, we compute the token\-level entropy\. The entropy $H(t_j|y_{i,<j})$ for a token $t_j$ in response $y_i$ at position $j$ is calculated as:

$$
H(t_j|y_{i,<j}) = -\sum_{v\in\mathcal{V}} p(v|y_{i,<j}) \log_2 p(v|y_{i,<j}) \tag{4}
$$

In Equation 4, $\mathcal{V}$ denotes the vocabulary\. We then select the top $k_i$ most entropic tokens in each response $y_i$ to form a set $S_H(y_i)$\. The number $k_i$ is determined dynamically to modulate the strength of our intervention:

$$
k_i = \lfloor |y_i| \cdot \tau \cdot b_i \rfloor \tag{5}
$$

In Equation 5, $|y_i|$ is the total length of response $y_i$, and $\tau$ is a base top\-rate hyperparameter\. The crucial component is the dynamic multiplier $b_i$, defined as:

$$
b_i=\begin{cases} a & \text{if } r_{\text{acc}}(y_i)=1\\ 1-a & \text{if } r_{\text{acc}}(y_i)=0 \end{cases} \tag{6}
$$

This design aims to apply a stronger intervention \(a larger $k_i$\) in two specific scenarios: \(1\) when penalizing a correct response in a group that was easy for the model \(high $a$\), and \(2\) when rewarding an incorrect response in a group that was difficult for the model \(low $a$\)\.
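
The selection step can be sketched as follows, assuming the per-token entropies have already been computed from the policy's next-token distributions. The function names and the plain-list representation are assumptions for illustration; ties between equal entropies are broken arbitrarily by the sort, which the text does not specify.

```python
import math


def token_entropy(probs):
    """Shannon entropy in bits of one next-token distribution (Equation 4)."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def select_high_entropy(entropies, is_correct, group_acc, tau):
    """Indices of the k_i highest-entropy tokens of one response (Equations 5 and 6)."""
    # Equation 6: b_i = a for correct responses, 1 - a for incorrect ones
    b = group_acc if is_correct else 1.0 - group_acc
    # Equation 5: k_i = floor(|y_i| * tau * b_i)
    k = math.floor(len(entropies) * tau * b)
    ranked = sorted(range(len(entropies)), key=lambda j: entropies[j], reverse=True)
    return set(ranked[:k])
```

With the same response and the same `tau`, a correct response in a high-accuracy group has more tokens selected than an incorrect response in that group, which matches the two scenarios described above.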

#### 2\.2\.3 Step 3: Entropy\-Based Advantage Shaping

Finally, we compute the reshaped advantage $A'_{i,j}$ for each token $t_j$ in response $y_i$\. The advantage is modified only for the selected high\-entropy tokens in the set $S_H(y_i)$\.

$$
A'_{i,j}=\begin{cases} A_i-\beta_1\cdot H(t_j|y_{i,<j}) & \text{if } r_{\text{acc}}(y_i)=1 \text{ and } t_j\in S_H(y_i)\\ A_i+\beta_2\cdot H(t_j|y_{i,<j}) & \text{if } r_{\text{acc}}(y_i)=0 \text{ and } t_j\in S_H(y_i)\\ A_i & \text{otherwise} \end{cases} \tag{7}
$$

In Equation 7, $\beta_1, \beta_2 > 0$ are hyperparameters scaling the magnitude of the entropy\-based shaping\. This final token\-level advantage $A'_{i,j}$ replaces the original $\hat{A}_{i,t}$ in the DAPO objective function \(Equation 1\), thereby injecting our fine\-grained control signal into the learning process\. The detailed pseudocode for CES is outlined in the Appendix\.
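
Equation 7 amounts to a three-way branch per token. The sketch below is a hedged illustration with a hypothetical function name and plain Python floats; it treats the entropy as a constant, so it shows only the forward computation of the shaped advantage and not the gradient flow through the entropy term discussed in the ablations.

```python
def shape_advantages(base_adv, entropies, selected, is_correct, beta1, beta2):
    """Token-level shaped advantages A'_{i,j} for one response (Equation 7).

    base_adv  : sequence-level advantage A_i shared by all tokens
    entropies : per-token entropies H(t_j | y_{i,<j})
    selected  : set of token indices in S_H(y_i)
    """
    shaped = []
    for j, h in enumerate(entropies):
        if j not in selected:
            shaped.append(base_adv)               # unselected tokens keep A_i
        elif is_correct:
            shaped.append(base_adv - beta1 * h)   # penalize entropy on correct paths
        else:
            shaped.append(base_adv + beta2 * h)   # reward entropy on incorrect paths
    return shaped
```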

## 3 Experimental Settings

### 3\.1 Backbone Model and Baselines

Our experiments are conducted in the context of advanced reasoning models\. We select the powerful, open\-source DeepSeek\-R1\-Distill\-Qwen\-7B Guo et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib3)\) as our backbone model, which is known for its strong long\-chain reasoning capabilities\. To isolate the impact of our proposed method, we establish three baselines for comparison:

1. Original R1\-7B: The pretrained DeepSeek\-R1\-Distill\-Qwen\-7B model without any reinforcement learning fine\-tuning\.
2. DAPO Baseline \(the key baseline\): The same backbone model fine\-tuned using the DAPO algorithm without the CES module\. This serves as our primary baseline to directly measure the improvements brought by CES\.
3. DAPO with “Entropy Advantage”: We compare CES with the previous work Cheng et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib9)\)\. Their work introduces an “Entropy Advantage” that unconditionally adds an entropy\-based advantage to all tokens to encourage more exploratory reasoning paths, with the primary goal of improving performance on reasoning tasks\. This provides a clear contrast to our conditional, bidirectional approach, which aims to balance both accuracy and efficiency\.

Table 1: Comparison of Accuracy and Response Length on Key Math Datasets\. The best result in each category is in bold\. The terms “Acc” and “Len” represent the mean accuracy and the mean response length across 4 assessments for each benchmark\.

### 3\.2 Training Details

We utilize the OpenRLHF framework Hu et al\. \([2024](https://arxiv.org/html/2605.19358#bib.bib22)\) to perform DAPO training, focusing on the domain of solving mathematical problems\. Due to resource constraints, our training set consists of only 2500 training samples randomly sampled from the DeepMath dataset He et al\. \([2025](https://arxiv.org/html/2605.19358#bib.bib23)\)\. All experiments were carried out on 2 NVIDIA A800 GPUs with 80GB of memory\.

Notably, we disable the Dynamic Sampling feature of DAPO when training our CES model\. Standard DAPO discards batches where all responses are correct or all are incorrect, as these yield zero advantage and thus no gradient for the sequence\-level policy update\. However, as CES reshapes advantage at the token level using entropy, these seemingly “solved” or “hopeless” batches also provide a valuable, non\-zero learning signal\. This signal is crucial for refining the model’s confidence and reasoning style, making every sample useful for training\. A comprehensive list of hyperparameters can be found in the Appendix\.
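
To make this point concrete, the sketch below works through a group in which every response is correct. The group-normalized advantage is zero for all responses, yet the CES penalty still produces non-zero token-level advantages on the selected tokens. The helper names, the entropy values, the selected index set, `beta1 = 0.1`, and the `eps` safeguard are all hypothetical choices for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages; eps is an assumed guard against zero std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def penalize_selected(base_adv, entropies, selected, beta1):
    """Correct-response branch of the CES shaping rule for one response."""
    return [base_adv - beta1 * h if j in selected else base_adv
            for j, h in enumerate(entropies)]


# All four responses are correct and well formatted, so every reward is 2.
base = group_advantages([2.0, 2.0, 2.0, 2.0])  # every entry is 0.0

# Hypothetical entropies for a 4-token response; tokens 1 and 3 are selected.
shaped = penalize_selected(base[0], [0.05, 1.2, 0.1, 0.9], {1, 3}, beta1=0.1)
```

Here `base` carries no learning signal, while `shaped` is negative on the two selected tokens, which is why such groups remain useful once the advantage is reshaped at the token level.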

### 3\.3 Evaluation

For a standardized and reproducible assessment, we employ the evaluation script from the GitHub repository for Qwen2\.5\-math Qwen Team \([2025](https://arxiv.org/html/2605.19358#bib.bib24)\)\. To avoid repetition and instability in long\-form reasoning models, we adopt a non\-greedy decoding strategy, setting a temperature of 0\.4, top\-$p$ sampling with $p=0.95$, and a repetition penalty of 1\.05\. For each problem in the test sets, we independently generate 4 responses to ensure a stable and representative measurement\. Our evaluation focuses on two primary metrics:

1. Accuracy \(Acc\): The average correctness of the final answers\.
2. Average Response Length \(Len\): The average number of tokens in the generated responses\.
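
The two metrics can be computed per benchmark as sketched below. This is an assumed aggregation, pooling all sampled responses (4 per problem) before averaging; the function name and the `(is_correct, n_tokens)` record layout are hypothetical and not taken from the evaluation script.

```python
def benchmark_metrics(samples):
    """Return (Acc in percent, Len in tokens) for one benchmark.

    samples: list of (is_correct, n_tokens) pairs, one per generated response,
             pooled over all problems and all sampled generations.
    """
    n = len(samples)
    acc = 100.0 * sum(1 for ok, _ in samples if ok) / n
    avg_len = sum(tokens for _, tokens in samples) / n
    return acc, avg_len
```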

We conduct an extensive evaluation across 12 diverse mathematical reasoning benchmarks: AIME24, AMC23, CMATH, CN Middle School 24, College Math, GaoKao Math Cloze, GaoKao 2023 En, GSM8K, Minerva Math, Olympiad Bench, SVAMP, and TABMWP\.

### 3\.4 Generalization Experiments

Our main experiments are conducted on DeepSeek\-R1\-Distill\-Qwen\-7B\. To assess robustness beyond the primary setting, we further replicate CES on a smaller DeepSeek\-R1\-Distill\-1\.5B backbone and evaluate the resulting models on out\-of\-domain coding and general\-reasoning benchmarks\. These additional results are reported in Appendix E\.

## 4 Results

As shown in Table 1, CES demonstrates superior performance by achieving the best overall balance between accuracy and efficiency\. On average across all 12 mathematical reasoning datasets, CES achieves the highest accuracy of 72\.1% while simultaneously producing the shortest average response length of 1965 tokens\. This represents a significant improvement over our primary baseline, DAPO, with an average accuracy gain of \+2\.5% and a substantial average length reduction of 411 tokens\.
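
The averages quoted above imply the following DAPO baseline figures, which agree with the values listed for DAPO in Table 2. The accuracy gain is a difference in percentage points; the relative length reduction is derived here from the reported numbers and is not stated in the text.

```latex
\begin{align*}
\text{Acc}_{\text{DAPO}} &= 72.1 - 2.5 = 69.6 \\
\text{Len}_{\text{DAPO}} &= 1965 + 411 = 2376 \\
\text{relative length reduction} &= \frac{411}{2376} \approx 17.3\%
\end{align*}
```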

CES learns to generate more effective and efficient reasoning paths across a wide spectrum of difficulties\. For instance, on AIME24, a notoriously difficult competition\-level dataset, CES boosts accuracy by a remarkable \+6\.7% while cutting the response length by 997 tokens\. Similarly, on AMC23 and Olympiad Bench, CES achieves accuracy gains of \+1\.9% and \+2\.2% respectively, along with massive efficiency improvements, shortening the reasoning paths by 1014 and 839 tokens\. This “win\-win” outcome indicates that CES is not merely pruning the responses, but simultaneously improving the quality and directness of the model’s problem\-solving strategies\. In addition, on test sets such as CN Middle School 24 and GSM8K, it correctly identifies an opportunity where a modest investment in length \(\+36/\+32 tokens\) can yield a considerable gain in accuracy \(\+9\.1%/\+3\.6%\)\. This behavior shows that CES is not a naive length reduction algorithm but an intelligent controller that strategically allocates computational budget\.

We also observe consistent robustness in the 1\.5B backbone\. We defer the detailed table to Appendix E\.1 due to space limits\.

## 5 Analysis

### 5\.1 Training Dynamics

![Refer to caption](https://arxiv.org/html/2605.19358v1/x2.png) \(a\) Response length
![Refer to caption](https://arxiv.org/html/2605.19358v1/x3.png) \(b\) Entropy
![Refer to caption](https://arxiv.org/html/2605.19358v1/x4.png) \(c\) Accuracy

Figure 2: Training dynamics of average response length \(a\), entropy \(b\), and accuracy \(c\) for the DAPO baseline \(blue\) and our CES method \(green\)\.

To gain deeper insight into the mechanism of CES, we analyze the evolution of key metrics throughout the training process\. Figure [2](https://arxiv.org/html/2605.19358#S5.F2) plots the average response length, average token entropy and average group accuracy, comparing our CES\-enhanced DAPO training against the standard DAPO baseline\.

A striking pattern emerges in the Response Length and Entropy plots\. For the first 1000 training samples, both the CES and baseline models exhibit similar behavior, maintaining a high and stable average length and entropy\. This initial phase can be interpreted as the primary task acquisition stage, where both models are focused on learning the fundamental mechanics of solving the problems to achieve a reward\. During this period, the policy is highly exploratory, and the CES mechanism has not yet become a dominant optimization force\.

However, a clear divergence occurs after 1000 training samples\. While the DAPO baseline’s length and entropy remain high and relatively constant, the CES model’s metrics begin a steep and consistent decline\. The average response length drops from over 5000 to nearly 3000 tokens, and the average entropy falls from 0\.4 to below 0\.2\. This second phase demonstrates the onset of CES’s core effect, where the entropy penalty on correct answers becomes a powerful and consistent training signal\. The model learns that it can maximize its reward not just by being correct, but by being correct and confident\. The strong correlation between the decline in entropy and length empirically validates our hypothesis that penalizing high\-entropy “forking points” effectively prunes unnecessary, verbose exploration, leading to more concise reasoning paths\.

In Figure [2](https://arxiv.org/html/2605.19358#S5.F2)\(c\), we observe that neither the baseline nor the CES model shows a significant, sustained upward trend in accuracy, with both curves fluctuating similarly throughout training\. This behavior is likely attributable to the limited size of our training set \(2500 samples\) and the absence of a carefully designed data strategy\. However, it also reflects that the improvements in efficiency \(i\.e\., shorter length and lower entropy\) achieved by CES are realized without sacrificing the model’s problem\-solving performance\. The CES model maintains an accuracy level competitive with the baseline, while operating at a significantly lower computational budget\. In general, these dynamics reveal that CES successfully introduces a distinct optimization phase into training: after the initial task acquisition, it effectively teaches the model to become more efficient and decisive, achieving conciseness without compromising its learned reasoning capabilities\.

### 5\.2 Analysis of increasing response length on simple test sets

A notable observation from our main results is that while CES significantly shortens responses on most datasets, it increases the average response length on four specific datasets: CMATH, CN Middle School 24, GSM8K and TABMWP\. A common characteristic of these datasets is their relatively shorter response length \(typically under 1000 tokens\) and higher performance, suggesting they are simpler overall\. This phenomenon seems to contradict our goal of improving efficiency\.

We hypothesize that this is a characteristic of adaptive reasoning manifested by CES\. The key to understanding this lies in moving beyond dataset\-level averages and analyzing model behavior on a finer\-grained, per\-question difficulty level\. To test this, we stratified the questions within these four datasets into two categories based on the original R1\-7B model’s performance:

1. “Simple Questions”: Questions where the R1\-7B model’s accuracy is greater than 50%\.
2. “Difficult Questions”: Questions where the R1\-7B model’s accuracy is less than or equal to 50%\.
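
The stratification above can be sketched as follows. The function names and the dictionary layout (question id mapped to the base model's per-question accuracy, and to a model's average response length) are assumptions for illustration; only the 50% threshold and the two category definitions come from the text.

```python
def stratify_by_difficulty(base_acc, threshold=0.5):
    """Split question ids using the base model's per-question accuracy."""
    simple = sorted(q for q, acc in base_acc.items() if acc > threshold)
    difficult = sorted(q for q, acc in base_acc.items() if acc <= threshold)
    return simple, difficult


def mean_length(lengths, question_ids):
    """Average response length of one model over a set of questions."""
    return sum(lengths[q] for q in question_ids) / len(question_ids)
```

Comparing `mean_length` for the DAPO and CES models within each stratum gives the kind of per-difficulty comparison described in this section.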

![Refer to caption](https://arxiv.org/html/2605.19358v1/x5.png)

Figure 3: Comparison of average response length, stratified by question difficulty on four simpler datasets\.

Figure [3](https://arxiv.org/html/2605.19358#S5.F3) presents the average response length of both DAPO and CES models across these stratified difficulties\. For “Difficult Questions” within these simpler datasets, CES triggers a significant increase in response length in most datasets\. Conversely, for “Simple Questions”, the response lengths remain relatively stable or change minimally\.

The failure mode of the original R1\-7B model on these “difficult” questions may be insufficient exploration\. Accustomed to the simple patterns of the dataset, it applies a short, inadequate template and fails\. CES, through its mechanism of rewarding entropy on incorrect answers, correctly identifies these failures and provides a strong incentive for deeper exploration\. It forces the model to abandon the failed template and invest more effort in finding a correct solution\. In contrast, on complex datasets like Olympiad Bench, R1\-7B’s failure mode is often inefficient overthinking, producing long, verbose, and incorrect reasoning\. There, CES’s primary role is to prune this redundancy\. In summary, the strategic investment in reasoning for difficult problems outweighs the minor length changes on simple ones, leading to an increase in the dataset’s overall average response length\.

### 5\.3 Ablation Studies

To validate the key components of our CES framework, we conduct two main ablation studies\. These experiments are designed to investigate the importance of our dynamic token selection mechanism and the role of the entropy gradient in our advantage shaping formula\.

| Method | Acc ↑ | Len ↓ |
| --- | --- | --- |
| Original R1\-7B | 69\.1 | 2583 |
| DAPO \(Baseline\) | 69\.6 | 2376 |
| CES w/o Dynamic $b$ | 69\.5 | 2462 |
| CES w/o Entropy Gradient | 69\.4 | 2539 |
| CES \(Ours\) | 72\.1 | 1965 |

Table 2: Ablation study on the core components of CES\.

#### 5\.3\.1 The Importance of Dynamic Token Selection

A core feature of CES is the dynamic calculation of $k$, the number of high\-entropy tokens to be shaped in each response\. This number is modulated by a dynamic multiplier $b$ \(where $b=a$ for correct responses and $b=1-a$ for incorrect ones, with $a$ being the group accuracy\), which adjusts the intervention strength based on the perceived difficulty of the problem\. To test the necessity of this design, we trained an ablated model, “RemoveAcc”, where we removed this dynamic multiplier by fixing $b=1$\. In this setting, a constant percentage of tokens with the highest entropy is always selected for entropy shaping, regardless of group accuracy\.

The results shown in Table [2](https://arxiv.org/html/2605.19358#S5.T2) indicate that the “RemoveAcc” model’s average accuracy drops to 69\.5%, nearly identical to the DAPO baseline \(69\.6%\) and significantly underperforming the full CES model \(72\.1%\)\. Furthermore, its average response length increases to 2462, making it even less efficient than the DAPO baseline \(2376\)\.

While the behavior of a fixed $b=1$ is identical to our dynamic $b$ at the absolute extremes \(when group accuracy $a=1$ or $a=0$\), the critical difference emerges in the vast majority of training scenarios where the model’s performance is mixed \($0<a<1$\)\. Consider a difficult problem where the model finds a correct solution for the first time, resulting in a low group accuracy \(e\.g\., $a=0.25$\)\. Our full CES method applies a very gentle penalty, scaling the intervention by $b=a=0.25$\. This protects the newfound, likely inefficient reasoning path, acknowledging that it is a valuable success on a difficult problem\. The “RemoveAcc” ablation, in contrast, applies the maximal penalty \($b=1$\)\. It aggressively punishes the high\-entropy tokens in this fragile, correct solution, effectively signaling to the model that this “messy” path to success is undesirable\. This can cause the model to discard the correct reasoning logic in subsequent updates, leading to performance degradation\.
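
A small worked instance of Equation 5 shows the size of this difference. The response length $|y_i| = 4000$ and the base rate $\tau = 0.2$ are hypothetical values chosen only for illustration; the group accuracy $a = 0.25$ is the example from the text.

```latex
\begin{align*}
\text{Full CES, correct response:} \quad k_i &= \lfloor 4000 \cdot 0.2 \cdot 0.25 \rfloor = 200 \\
\text{RemoveAcc } (b = 1)\text{, correct response:} \quad k_i &= \lfloor 4000 \cdot 0.2 \cdot 1 \rfloor = 800 \\
\text{Full CES, incorrect response:} \quad k_i &= \lfloor 4000 \cdot 0.2 \cdot 0.75 \rfloor = 600
\end{align*}
```

Under these assumed values, the ablation penalizes four times as many tokens in the rare correct response as full CES does, while full CES concentrates its shaping on the incorrect responses of the same group.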

Therefore, the dynamic multiplier $b$ acts as a crucial adaptive regularizer. It provides a proportional response: applying gentle, protective pressure on novel solutions to difficult problems, while applying strong, optimizing pressure on mastered solutions to easy problems. Without this calibration, the “RemoveAcc” model regresses to baseline accuracy with longer responses, demonstrating that the dynamic selection of tokens is essential for robustly learning to be both accurate and efficient.
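The contrast between the dynamic multiplier and the “RemoveAcc” ablation can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `shaped_token_count`, the response length of 2000 tokens, and the value $\tau=0.01$ are assumptions chosen for the example, while the formula $k=\lfloor |y| \cdot \tau \cdot b \rfloor$ follows the definition of $k$ used by CES.

```python
import math

def shaped_token_count(length, tau, group_acc, is_correct, dynamic=True):
    """Number of high-entropy tokens to shape: floor(|y| * tau * b)."""
    if dynamic:
        # Full CES: b = a for correct responses, b = 1 - a for incorrect ones.
        b = group_acc if is_correct else 1.0 - group_acc
    else:
        # "RemoveAcc" ablation: the multiplier is fixed at b = 1.
        b = 1.0
    return math.floor(length * tau * b)

# Hypothetical hard problem: a first correct answer with group accuracy a = 0.25.
length, tau, a = 2000, 0.01, 0.25
k_ces = shaped_token_count(length, tau, a, is_correct=True)
k_removeacc = shaped_token_count(length, tau, a, is_correct=True, dynamic=False)
print(k_ces, k_removeacc)  # 5 20
```

Under these assumed values, the full method penalizes a quarter as many tokens in the fragile correct response as the ablation does, and the two variants coincide only when $a=1$ (correct responses) or $a=0$ (incorrect responses).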

#### 5.3.2 The Role of the Entropy Gradient in Bidirectional Control

In our CES formulation, the entropy term $H$ is included in the computation graph, meaning the model’s policy is explicitly optimized to produce outputs that align with our entropy-based objectives. However, a related work, Cheng et al. ([2025](https://arxiv.org/html/2605.19358#bib.bib9)), that also uses an entropy-based advantage term introduces a “detach” operation in its implementation. This prevents the gradient of the entropy term from being computed, using it only to scale the magnitude of the existing policy gradient rather than setting a new optimization goal. To investigate this choice, we trained an ablated model, “Detach”, in which we detached our entropy shaping term from the computation graph.

The results in Table [2](https://arxiv.org/html/2605.19358#S5.T2) indicate that this change is detrimental to our method. The Detach model’s accuracy (69.4%) regresses to that of the DAPO baseline (69.6%), while its average length balloons to 2539, making it the least efficient of all training configurations. The reason for this failure lies in the fundamental difference in goals between CES and the method of Cheng et al. ([2025](https://arxiv.org/html/2605.19358#bib.bib9)). As their objective is unconditional exploration, detaching the entropy term serves as a clever way to amplify the existing policy updates at uncertain steps without asking the model to learn to be “more uncertain”.

However, CES has a dual, conditional objective. The “inhibit exploration” part of our mechanism ($A' \leftarrow A - \beta \cdot H$ for correct answers) is predicated on teaching the model to become more efficient by producing lower-entropy outputs. This requires a non-zero gradient so the model can learn to directly reduce entropy to avoid the penalty. Detaching the term completely breaks this crucial learning signal. Without the gradient, the penalty becomes a simple, static reduction in advantage that provides no direction for how to improve efficiency. This leads to the observed outcome: baseline accuracy with uncontrolled, verbose responses. Therefore, maintaining the entropy gradient is essential for the bidirectional control at the heart of CES to function as intended.
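The effect of detaching can be made explicit with a simplified per-token surrogate. The form below is an assumption made for illustration: it drops DAPO's clipping and token-level averaging and considers a single selected token in a correct response, with importance ratio $r_t(\theta) = \pi_\theta(y_t \mid \cdot) / \pi_{\theta_{\text{old}}}(y_t \mid \cdot)$ and token entropy $H_t(\theta)$.

$$
J_t(\theta) = r_t(\theta)\,\bigl(A - \beta H_t(\theta)\bigr),
\qquad
\nabla_\theta J_t = \bigl(A - \beta H_t\bigr)\,\nabla_\theta r_t \;-\; \beta\, r_t\, \nabla_\theta H_t .
$$

$$
\nabla_\theta J_t^{\text{detach}} = \bigl(A - \beta H_t\bigr)\,\nabla_\theta r_t .
$$

Because $r_t > 0$, the extra term $-\beta\, r_t\, \nabla_\theta H_t$ pushes $H_t$ down under gradient ascent, which is the direct entropy-reduction signal described above. With the entropy detached, that term vanishes and $H_t$ only rescales the ordinary policy-gradient step. For incorrect responses the sign flips ($A + \beta H_t$), so the same term would instead push entropy up.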

## 6 Related Work

### 6.1 Reinforcement Learning for LLMs

Reinforcement learning is a core technique for aligning pretrained language models. Early RLHF pipelines commonly relied on Proximal Policy Optimization (PPO) Schulman et al. ([2017](https://arxiv.org/html/2605.19358#bib.bib18)) with a separately trained reward model, while more recent work has shifted toward direct optimization methods to improve stability and simplify training. A representative example is Direct Preference Optimization (DPO) Rafailov et al. ([2023](https://arxiv.org/html/2605.19358#bib.bib19)), which derives an optimization signal directly from preference data. This paradigm has been extended to reasoning settings that compare multiple responses to the same prompt, leading to algorithms such as GRPO and our baseline DAPO, which optimize policies using sequence-level preferences. Building on this line, our method CES introduces a more fine-grained mechanism by intervening at the token level and dynamically shaping the learning signal within the DAPO framework.

### 6.2 Entropy in LLMs

Entropy quantifies the uncertainty of a probability distribution. In LLMs, token-level entropy measures the uncertainty of the predicted distribution over the vocabulary at each generation step: higher entropy corresponds to a flatter distribution and lower confidence in selecting the next token (Li et al., [2025](https://arxiv.org/html/2605.19358#bib.bib11)).
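As a concrete illustration of this definition, the sketch below computes the Shannon entropy of the softmax distribution for one generation step. It uses only the standard library. The function name `token_entropy` and the four-token toy vocabulary are assumptions for the example, not part of the paper's implementation.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one generation step."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy four-token vocabulary (hypothetical logits).
flat = token_entropy([0.0, 0.0, 0.0, 0.0])     # uniform: maximal uncertainty, ln(4)
peaked = token_entropy([10.0, 0.0, 0.0, 0.0])  # one dominant token: near zero
print(flat > peaked)  # True
```

The flat distribution reaches the maximum entropy $\ln 4$ for four tokens, while the peaked one is close to zero, matching the reading of high entropy as low confidence in the next token.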

## 7 Conclusion

In this work, we address the fundamental challenge of balancing performance and efficiency in LLM reasoning\. To resolve this trade\-off, we propose CES, a framework that enables models to adapt their reasoning strategy: thinking concisely when confident, and reasoning deeply when uncertain\. CES achieves consistent improvements in both accuracy and computational efficiency across diverse mathematical reasoning benchmarks, alleviating the inherent trade\-off between exploration and exploitation\.

Beyond empirical gains, this work suggests a broader principle: LLMs can learn not just to reason accurately, but to regulate how they reason\. This opens directions for building fine\-grained, resource\-aware reasoning systems that require cost\-sensitive inference\.

## Limitations

While CES achieves an average win-win by conditionally shaping token-level advantages with entropy, several limitations remain. First, the current formulation still relies on outcome-verifiable correctness signals to compute the group accuracy $a$ and to determine the direction of entropy shaping. As a result, applying the same mechanism to tasks with ambiguous, subjective, or weakly verifiable outcomes is less straightforward. Second, CES remains moderately sensitive to hyperparameters such as $\tau$ and $\beta$. In practice, the method is robust within a reasonable range, but achieving the best accuracy–efficiency balance may still require light calibration when transferring to a new backbone or task distribution.

## References

- L1: controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697.
- X. Chen, J. Xu, T. Liang, Z. He, J. Pang, D. Yu, L. Song, Q. Liu, M. Zhou, Z. Zhang, et al. (2024) Do not think that much for 2+3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187.
- D. Cheng, S. Huang, X. Zhu, B. Dai, W. X. Zhao, Z. Zhang, and F. Wei (2025) Reasoning with exploration: an entropy perspective. arXiv preprint arXiv:2506.14758.
- G. Cui, Y. Zhang, J. Chen, L. Yuan, Z. Wang, Y. Zuo, H. Li, Y. Fan, H. Chen, W. Chen, et al. (2025) The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617.
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- Z. He, T. Liang, J. Xu, Q. Liu, X. Chen, Y. Wang, L. Song, D. Yu, Z. Liang, W. Wang, et al. (2025) Deepmath-103k: a large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. arXiv preprint arXiv:2504.11456.
- J. Hu, X. Wu, Z. Zhu, W. Wang, D. Zhang, Y. Cao, et al. (2024) Openrlhf: an easy-to-use, scalable and high-performance rlhf framework. arXiv preprint arXiv:2405.11143.
- T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa (2022) Large language models are zero-shot reasoners. Advances in neural information processing systems 35, pp. 22199–22213.
- X. Li, E. Callanan, X. Zhu, M. Sibue, A. Papadimitriou, M. Mahfouz, Z. Ma, and X. Liu (2025) Entropy-aware branching for improved mathematical reasoning. arXiv preprint arXiv:2503.21961.
- C. Lou, Z. Sun, X. Liang, M. Qu, W. Shen, W. Wang, Y. Li, Q. Yang, and S. Wu (2025) AdaCoT: pareto-optimal adaptive chain-of-thought triggering via reinforcement learning. arXiv preprint arXiv:2505.11896.
- H. Luo, H. He, Y. Wang, J. Yang, R. Liu, N. Tan, X. Cao, D. Tao, and L. Shen (2025) Ada-r1: hybrid-cot via bi-level adaptive reasoning optimization. arXiv preprint arXiv:2504.21659.
- W. Ma, J. He, C. Snell, T. Griggs, S. Min, and M. Zaharia (2025) Reasoning models can be effective without thinking. arXiv preprint arXiv:2504.09858.
- N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. Hashimoto (2025) S1: simple test-time scaling. arXiv preprint arXiv:2501.19393.
- Qwen Team (2025) Qwen2.5-math. Note: [https://github.com/QwenLM/Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math) Accessed: 2025-07-22.
- R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. Advances in neural information processing systems 36, pp. 53728–53741.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X. Chen, J. Yang, Z. Zhang, et al. (2025) Beyond the 80/20 rule: high-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939.
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, pp. 24824–24837.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025a) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- C. Yang, N. Srebro, D. McAllester, and Z. Li (2025b) Pencil: long thoughts with short memory. arXiv preprint arXiv:2503.14337.
- Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. (2025) Dapo: an open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
- J. Zhang and C. Zuo (2025) Grpo-lead: a difficulty-aware reinforcement learning approach for concise mathematical reasoning in language models. arXiv preprint arXiv:2504.09696.

## Appendix A Algorithm

Algorithm [1](https://arxiv.org/html/2605.19358#alg1) details the complete procedure for implementing Conditional Entropy Shaping (CES) within the DAPO framework.

Algorithm 1: CES within DAPO Framework

Input: Prompt $x$, current policy $\pi_\theta$

Parameters: Generation group size $N$, top-rate hyperparameter $\tau$, entropy scaling factors $\beta_1, \beta_2$

Output: A set of shaped, token-level advantages $\mathcal{A}'$ for gradient update

1: Generate response set $Y=\{y_1,\dots,y_N\}$ from $\pi_\theta(\cdot|x)$.

2: Compute rewards $R(y_i), r_{\text{acc}}(y_i)$ for each $y_i \in Y$.

3: Compute group accuracy $a$ based on $\{r_{\text{acc}}(y_i)\}$.

4: Initialize set of all shaped advantages $\mathcal{A}' \leftarrow \emptyset$.

5: for each response $y_i$ in $Y$ do

6: $A_i \leftarrow \text{GroupNormalize}(\{R(y_j)\}, R(y_i))$.

7: if $r_{\text{acc}}(y_i)=1$ then

8: $b_i \leftarrow a$

9: else

10: $b_i \leftarrow 1-a$

11: end if

12: Compute number of tokens to select $k_i = \lfloor |y_i| \cdot \tau \cdot b_i \rfloor$.

13: $S_H(y_i) \leftarrow$ Identify top $k_i$ high-entropy tokens in $y_i$.

14: for each token $t_j$ in $y_i$ do

15: $A'_{i,j} \leftarrow A_i$ {Initialize with base advantage}

16: if $t_j \in S_H(y_i)$ then

17: $H_j \leftarrow H(t_j | y_{i,<j})$ {Calculate entropy}

18: if $r_{\text{acc}}(y_i)=1$ then

19: $A'_{i,j} \leftarrow A_i - \beta_1 \cdot H_j$ {Apply entropy penalty}

20: else

21: $A'_{i,j} \leftarrow A_i + \beta_2 \cdot H_j$ {Apply entropy reward}

22: end if

23: end if

24: Add $A'_{i,j}$ to $\mathcal{A}'$.

25: end for

26: end for

27: return $\mathcal{A}'$ {Return token-level shaped advantages $\mathcal{A}'$ for computing policy gradients in DAPO}
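The procedure in Algorithm 1 can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: the name `ces_advantages` is hypothetical, GroupNormalize is assumed to be mean/standard-deviation normalization with a small `eps`, and token entropies are passed in as precomputed floats. Plain floats carry no gradient, so the sketch only shows the shaped advantage values, whereas the method itself keeps the entropy term in the computation graph.

```python
import math
from statistics import mean, pstdev

def ces_advantages(rewards, correct, entropies, tau, beta1, beta2, eps=1e-6):
    """Return one list of shaped token-level advantages per response.

    rewards:   R(y_i) for each response in the group
    correct:   r_acc(y_i) as booleans
    entropies: precomputed per-token entropies H_j for each response
    """
    a = sum(correct) / len(correct)          # group accuracy
    mu, sigma = mean(rewards), pstdev(rewards)
    shaped = []
    for r, ok, ent in zip(rewards, correct, entropies):
        adv = (r - mu) / (sigma + eps)       # assumed form of GroupNormalize
        b = a if ok else 1.0 - a             # dynamic multiplier b_i
        k = math.floor(len(ent) * tau * b)   # number of tokens to shape
        # Indices of the k highest-entropy tokens in this response.
        order = sorted(range(len(ent)), key=lambda j: ent[j], reverse=True)
        top = set(order[:k])
        row = []
        for j, h in enumerate(ent):
            if j not in top:
                row.append(adv)              # keep the base advantage
            elif ok:
                row.append(adv - beta1 * h)  # entropy penalty (correct)
            else:
                row.append(adv + beta2 * h)  # entropy reward (incorrect)
        shaped.append(row)
    return shaped

# Toy group of two 4-token responses; tau = 0.5 is far larger than in the
# paper and is used only so that one token per response gets selected.
shaped = ces_advantages(
    rewards=[1.0, -1.0],
    correct=[True, False],
    entropies=[[0.1, 2.0, 0.3, 0.2], [0.5, 0.1, 1.5, 0.2]],
    tau=0.5, beta1=0.4, beta2=0.4,
)
```

In this toy group, only the highest-entropy token of each response is shaped: it is pushed down by $\beta_1 H_j$ in the correct response and up by $\beta_2 H_j$ in the incorrect one, while all other tokens keep the group-normalized advantage.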

## Appendix B Hyperparameters and Prompt

### B.1 Training hyperparameters

Table 3 lists the hyperparameters for our reinforcement learning experiments\.

Table 3: Hyperparameters for RL training.

### B.2 Evaluation prompt

For all evaluation scenarios, we used the following standardized prompt to ensure the model generates answers in a step\-by\-step manner and formats the final result correctly:

You are a helpful and harmless assistant. You should think step-by-step. Please put your final answer within \boxed{}.

## Appendix C Sensitivity to key hyperparameters

To investigate the sensitivity of CES to its core hyperparameters and validate its robustness, we conduct an ablation study on the top-rate $\tau$ and the entropy scaling factors $\beta_1, \beta_2$. We evaluate five different hyperparameter configurations on a representative subset of five datasets and compare their average performance against the DAPO baseline. The results are summarized in Table 4.

| $\tau$ | $\beta_1, \beta_2$ | Acc ↑ | Len ↓ |
|---|---|---|---|
| 0.005 | 1.0 | 75.1 | 2855 |
| 0.01 | 1.0 | 74.2 | 2992 |
| 0.05 | 1.0 | 70.7 | 2818 |
| **0.01** | **0.4** | **76.9** | **2757** |
| 0.01 | 1.0 | 74.2 | 2992 |
| 0.01 | 2.0 | 73.4 | 2997 |
| DAPO (Baseline) | | 72.6 | 3407 |

Table 4: Hyperparameter sensitivity analysis for CES on the average of 5 datasets (AIME24, AMC23, GaoKao Math Cloze, GaoKao 2023 En and SVAMP). The optimal configuration is highlighted in bold.

### C.1 Analysis of Top-rate $\tau$

The hyperparameter $\tau$ controls the proportion of selected high-entropy tokens. With $\beta_1, \beta_2$ fixed at 1.0, we tested $\tau$ values of 0.005, 0.01, and 0.05. The results indicate that a smaller, more targeted intervention is more effective. As $\tau$ increases from 0.01 to 0.05, the average accuracy drops sharply from 74.2% to 70.7%, falling below the DAPO baseline. This suggests that selecting too many tokens introduces noise by including tokens that are not critical “forking points”, thereby diluting the learning signal and degrading the policy.

### C.2 Analysis of Entropy Scaling Factors $\beta_1, \beta_2$

The hyperparameters $\beta_1, \beta_2$ set the scaling magnitude of the entropy penalty and reward, respectively. With $\tau$ fixed at 0.01, we tested $\beta_1, \beta_2$ values of 0.4, 1.0, and 2.0. The results show a clear trend: as $\beta_1, \beta_2$ increase, the average accuracy decreases while the average response length increases. A larger setting of $\beta_1, \beta_2$ gives excessive weight to the entropy shaping term, particularly the exploratory reward on incorrect answers. This can cause the model to over-optimize for the process of exploration rather than the outcome of correctness, leading to longer, less focused reasoning chains that do not necessarily improve accuracy.

### C.3 Robustness of CES

Across four of the five tested hyperparameter settings, our method simultaneously outperforms the DAPO baseline in both accuracy and length\. This demonstrates that CES provides consistent benefits across a reasonable range of hyperparameters, validating it as a stable and effective method for improving reasoning models\.
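The “four of five” count can be checked directly against the Table 4 numbers. The snippet below is a small sanity check, not part of the paper's code: the variable names are illustrative, and the duplicated row ($\tau=0.01$, $\beta_1,\beta_2=1.0$) is listed once so that the five distinct settings are counted.

```python
# (tau, beta, accuracy, length) for the five distinct settings in Table 4.
configs = [
    (0.005, 1.0, 75.1, 2855),
    (0.01, 1.0, 74.2, 2992),
    (0.05, 1.0, 70.7, 2818),
    (0.01, 0.4, 76.9, 2757),
    (0.01, 2.0, 73.4, 2997),
]
baseline_acc, baseline_len = 72.6, 3407  # DAPO baseline

# Settings that beat the baseline on accuracy AND response length.
wins = [c for c in configs if c[2] > baseline_acc and c[3] < baseline_len]
# Settings that are at least shorter than the baseline.
shorter = [c for c in configs if c[3] < baseline_len]
print(len(wins), len(shorter))  # 4 5
```

All five settings shorten responses relative to DAPO. The only one that loses on accuracy is $\tau=0.05$, consistent with the analysis of the top-rate in C.1.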

## Appendix D GPU Cost

Table 5: Training GPU wall-clock time under the same setup.

Table 6: Cross-scale generalization on DeepSeek-R1-Distill-1.5B. The best result in each category is in bold. “Acc” and “Len” denote the mean accuracy and the mean response length across 4 assessments for each benchmark.

A natural concern is whether CES introduces noticeable additional computation during training. Compared with vanilla DAPO, CES indeed adds two operations: token-level entropy computation and selection/shaping of high-entropy tokens. However, these additions do not require extra model forward or backward passes. In practice, entropy is computed directly from the logits that are already produced during rollout sampling, so the overhead is limited to lightweight vector reductions and top-k selection rather than additional Transformer backbone computation.

Results in Table [5](https://arxiv.org/html/2605.19358#A4.T5) indicate that the additional GPU time overhead of CES is negligible. For DeepSeek-R1-Distill-7B, DAPO takes 1.43 days, while CES takes 1.44 days, corresponding to only about +0.7% relative overhead. For DeepSeek-R1-Distill-1.5B, DAPO takes 9.36 hours, while CES takes 9.14 hours, making CES slightly faster, by about 2.3%. These results suggest that the small overhead of entropy computation is largely offset by shorter rollouts during training.

## Appendix E Generalization Experiments

Although the main paper trains only on math data, the mechanism of CES is not inherently math-specific. It relies on token-level uncertainty (entropy) and correctness-conditioned shaping, which are domain-agnostic signals. We therefore evaluate generalization from two perspectives:

1. Across model scales.
2. Across domains.

We find that the benefits of CES are not limited to the original 7B math setting, but generalize to a smaller backbone and to out\-of\-domain tasks in both general reasoning and code generation\.

### E.1 Cross-Scale Generalization

To test whether the benefits of CES depend on a single backbone scale, we additionally train DeepSeek\-R1\-Distill\-1\.5B under the same training protocol as the 7B model, and evaluate it on the same 12 math benchmarks\.

Results in Table 6 show that CES remains effective on the 1\.5B backbone\. The average accuracy improves from 52\.5 to 56\.2, while the average response length is reduced from 3581 to 3283\. Meanwhile, CES improves accuracy on most of the 12 benchmarks, and also shortens responses on the majority of them\. Although a few benchmarks exhibit small length increases or minor accuracy fluctuations, the overall average still shows a clear win\-win trend\.

These results indicate that the benefit of CES is not tied to the 7B setting. When the backbone is scaled down to 1.5B, CES still improves the accuracy–efficiency trade-off on average, suggesting that it is a generally useful training mechanism rather than a technique specific to a single model size.

Table 7: Cross-domain generalization on general reasoning benchmarks. The best result in each category is in bold.

### E.2 Cross-Domain Generalization: General Reasoning

To further test whether CES generalizes beyond the training distribution, we evaluate it on three general-reasoning benchmarks: ARC-Challenge, CommonsenseQA, and OpenBookQA. Importantly, these datasets are outside the training domain, since training uses only math data.

Results in Table 7 show that the gains of CES do not simply come from forcing the model to generate shorter outputs. On ARC and CommonsenseQA, CES substantially improves accuracy while keeping the output length almost unchanged. On OpenBookQA, CES spends slightly more tokens in exchange for a meaningful gain in accuracy. In other words, CES does not learn a fixed preference for shorter responses; instead, it learns to allocate reasoning budget on demand. It permits additional reasoning when that helps correctness, and suppresses redundant exploration when it does not.

Therefore, these results suggest that the adaptive reasoning behavior learned by CES is not limited to math, but transfers to broader knowledge and commonsense reasoning tasks\.

### E.3 Cross-Domain Generalization: Coding Benchmarks

In addition to general reasoning, we also evaluate on code\-generation benchmarks using EvalPlus\. Specifically, we test on HumanEval / HumanEval\+ and MBPP / MBPP\+, where the “\+” versions include stricter extra tests in addition to the original base tests\.

Table 8: Cross-domain generalization on coding benchmarks. The best result in each category is in bold.

Results are shown in Table 8, where CES outperforms DAPO on all four coding metrics. These results provide strong evidence of generalization, since code generation differs substantially from math reasoning in output format, structural constraints, and failure modes. Nevertheless, CES still improves both pass rate and token efficiency, suggesting that its optimization signal is task-agnostic. In addition, the improvements on HumanEval+ and MBPP+ indicate that the gains are robust under stricter extra tests, rather than appearing only on easier evaluation settings. Finally, the reduced generation length shows that CES does not improve coding results by “thinking longer”, but by producing higher-quality solutions more efficiently.

Taken together, the coding results further support that CES generalizes beyond the math training domain to a structurally different reasoning task\.

## Appendix F Statistics of Simple vs. Hard Cases

Table 9: Statistics of simple and hard cases on several representative benchmarks. Following the definition in the main text, a question is classified as *simple* if the original R1-7B achieves accuracy greater than 50% on that question; otherwise, it is classified as *hard*.

Section 5.2 of the main paper notes that on a few relatively simple datasets, CES slightly increases the average response length. We report the numbers of simple and hard cases within these datasets in Table 9.
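The simple/hard split described in the Table 9 caption amounts to a strict threshold on per-question accuracy. The sketch below is only an illustration of that rule: the function name `classify_question` and the sample accuracies are hypothetical, and per-question accuracy is assumed to be the fraction of correct sampled responses from the base model.

```python
from collections import Counter

def classify_question(base_accuracy):
    """Label a question 'simple' if base-model accuracy exceeds 50%, else 'hard'."""
    return "simple" if base_accuracy > 0.5 else "hard"

# Hypothetical per-question accuracies of the base model (not from the paper).
accuracies = [1.0, 0.75, 0.5, 0.25, 0.0]
counts = Counter(classify_question(x) for x in accuracies)
print(counts["simple"], counts["hard"])  # 2 3
```

Note that a question answered correctly exactly half the time falls on the hard side, since the rule requires accuracy strictly greater than 50%.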
