
# Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training
Source: [https://arxiv.org/html/2605.07316](https://arxiv.org/html/2605.07316)

Chen Wang 1,2,∗ · Hexuan Deng 2,3 · Yining Zhang 2,4 · Yuchen Zhang 2,5 · Jionghao Bai 2,6 · Zhaochun Li 2,7 · Ge Lan 1,† · Yue Wang 2,†

1 College of Software, Nankai University; 2 Zhongguancun Academy; 3 Harbin Institute of Technology; 4 Institute of Automation, Chinese Academy of Sciences; 5 East China Normal University; 6 Zhejiang University; 7 Beijing Institute of Technology

###### Abstract

Reinforcement learning with verifiable rewards improves LLM reasoning but often induces overthinking, where models generate unnecessarily long reasoning traces. Existing methods mainly rely on length penalties or early-exit strategies; however, the former may degrade accuracy and induce underthinking, whereas the latter assumes that substantial portions of reasoning traces can be safely truncated. To obtain a compression signal without these limitations, we revisit the training dynamics of existing compression methods. We observe that the length–accuracy correlation is initially negative but continually increases during compression, indicating that shorter responses are initially more likely to be correct but gradually lose this property as the policy moves toward underthinking. Based on this observation, we formalize overthinking: a negative correlation indicates an overthinking regime, while a positive one indicates underthinking. When overthinking, the shortest correct responses are shorter than the group-average response length in expectation, making them natural compression targets already present in on-policy rollouts. We therefore propose *Implicit Compression Regularization* (ICR), an on-policy regularization method whose compression signal comes from a virtual shorter distribution induced by the shortest correct responses in rollout groups, guiding the policy toward concise yet correct trajectories. Training dynamics show that ICR maintains a better length–accuracy correlation during compression, indicating that short responses remain better aligned with correctness instead of drifting toward underthinking. Experiments on three reasoning backbones and multiple mathematical and knowledge-intensive benchmarks show that ICR consistently shortens responses while preserving or improving accuracy, achieving a stronger accuracy–length Pareto frontier.

∗ Email: s-wc25@bjzgca.edu.cn.
† Correspondence to Ge Lan, email: lange@nankai.edu.cn.
† Correspondence to Yue Wang, email: yuewang@bza.edu.cn.

## 1 Introduction

Large language models (LLMs) have achieved strong reasoning performance by scaling test-time computation through long chain-of-thought reasoning [29, 14, 27]. Reinforcement learning with verifiable rewards (RLVR) further strengthens this capability by optimizing models with outcome-level correctness signals, enabling them to explore, reflect, and revise reasoning trajectories [22, 14, 27]. However, longer reasoning is not always beneficial. During RL post-training, models may generate redundant intermediate steps, repeat self-reflections, or allocate excessive computation to questions that have already been solved [25, 1]. This phenomenon, commonly known as *overthinking*, increases inference cost and can even hurt correctness by introducing spurious alternatives or unnecessary self-correction [5, 26]. Therefore, an important problem is how to reduce redundant reasoning while preserving the reasoning capability acquired through RL.

Existing methods mainly address overthinking in two ways. The first category adds length penalties to the RL reward [27, 34, 33, 18]. Although effective at reducing token usage, these methods make response length an explicit optimization target, which can degrade accuracy and make the policy prone to underthinking [21, 35]. The second category uses early-exit or truncation-style strategies to stop reasoning once sufficient evidence is estimated to be available [10, 4, 6]. However, these methods rely on the assumption that large portions of reasoning traces are redundant and can be safely discarded, which may fail on harder, information-dense problems where later steps remain tightly coupled with correctness. These limitations motivate a different question: can we obtain a compression signal during on-policy training without length penalties or reasoning truncation?

Motivated by this question, we revisit the training dynamics of existing compression methods. By tuning the length coefficient, we observe that stronger length penalties shorten responses faster, but also accelerate accuracy degradation and worsen the accuracy–length Pareto frontier. More importantly, we find that the group-wise length–accuracy correlation starts negative but continually increases during compression. This indicates that shorter responses are initially more likely to be correct, suggesting the existence of safe compression opportunities within the current rollout distribution. However, as training proceeds, this property gradually vanishes, implying that the policy is pushed from removing redundancy toward underthinking. This suggests that compression itself is not inherently harmful, but that directly optimizing shortness lets the policy exploit an easier optimization direction than improving correctness. Based on this observation, we formalize overthinking by the expected group-wise correlation between correctness and response length: a negative value indicates an overthinking regime, while a positive value indicates an underthinking regime. In the overthinking regime, correct responses are shorter than the group average in expectation, so the shortest correct samples naturally provide safe compression targets already present in on-policy rollouts. Based on this insight, we propose *Implicit Compression Regularization* (ICR), an on-policy regularization method that extracts compression signals from these shortest correct samples. Instead of adding a handcrafted length-dependent reward or truncating reasoning traces, ICR uses the shortest correct responses within rollout groups to induce a virtual shorter distribution. This distribution guides the policy toward concise yet correct trajectories already discovered by its own rollouts. Training dynamics show that ICR maintains a better length–accuracy correlation during compression, indicating that short responses remain better aligned with correctness instead of drifting toward underthinking. Experiments across three reasoning backbones and multiple mathematical and knowledge-intensive benchmarks show that ICR consistently reduces response length while preserving or improving accuracy, achieves a stronger accuracy–length Pareto frontier, and remains compatible with mild length penalties when stronger compression is required.

Our contributions are summarized as follows:

- We reveal a key training dynamic behind compression: the group-wise length–accuracy correlation starts negative but continually increases, showing that short responses are initially more likely to be correct but gradually lose this advantage as compression moves toward underthinking.
- We formalize overthinking by the expected group-wise correlation between correctness and response length: a negative value indicates an overthinking regime, while a positive value indicates an underthinking regime.
- We propose *Implicit Compression Regularization* (ICR), an on-policy regularization method that extracts compression signals from the shortest correct samples within rollout groups, without introducing explicit length penalties or truncating reasoning traces.
- We demonstrate across multiple backbones and benchmarks that ICR achieves accuracy-preserving compression, maintains a better length–accuracy correlation, and yields a stronger accuracy–length Pareto frontier. We further show that ICR is compatible with length penalties when stronger compression is required.

## 2 Related Work

#### RL post-training and GRPO.

Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for improving the reasoning ability of LLMs. Given a query $q$ and a sampled response $o$, RLVR evaluates the response with a verifiable reward function $R(q,o)$ and maximizes

$$\mathcal{J}_{\rm RL}(\theta)=\mathbb{E}_{q\sim P(Q),\,o\sim\pi_{\theta}}\big[R(q,o)\big]. \qquad (1)$$

Among existing RLVR methods, Group Relative Policy Optimization (GRPO) is widely used for reasoning LLMs. For each query $q$, GRPO samples a group of $G$ responses $\{o_i\}_{i=1}^{G}$ from $\pi_{\theta_{\rm old}}$ and optimizes

$$\mathcal{J}_{\rm GRPO}(\theta)=\mathbb{E}_{q\sim P(Q),\,\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\rm old}}}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(r_{i,t}(\theta)A_i,\ \mathrm{clip}\big(r_{i,t}(\theta),1-\epsilon_{\text{low}},1+\epsilon_{\text{high}}\big)A_i\Big)\right], \qquad (2)$$

where $\epsilon_{\text{low}}=\epsilon_{\text{high}}=0.2$ and

$$r_{i,t}(\theta)=\frac{\pi_{\theta}(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\rm old}}(o_{i,t}\mid q,o_{i,<t})},\qquad A_i=\frac{R_i-\mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}. \qquad (3)$$

The group-normalized advantage reduces variance and stabilizes policy optimization. However, optimizing only final correctness often encourages long reasoning traces, leading to *overthinking*, where models generate redundant reflections, repeated verification steps, or unnecessarily detailed derivations [25, 1, 5, 26]. Such behavior increases inference cost and may even hurt correctness on hard or noisy problems [13, 11, 9].
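As a concrete illustration of the group-normalized advantage in Eq. (3), the following minimal NumPy sketch computes $A_i$ for one rollout group of binary correctness rewards; function and variable names are illustrative and not taken from the paper's codebase.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages A_i = (R_i - mean(R)) / std(R) for one rollout group."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std < eps:  # all rewards identical: no relative signal within this group
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

# Example: a rollout group of G = 8 binary correctness rewards.
print(grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0]))  # positive for correct, negative for incorrect
```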

#### Length penalties.

A major line of work mitigates overthinking by adding a length-dependent term to the RL reward:

$$R_i=R_i^{\mathrm{corr}}+\lambda\,R_i^{\mathrm{len}}, \qquad (4)$$

where $R_i^{\mathrm{corr}}$ is the correctness reward, $R_i^{\mathrm{len}}$ is the length reward, and $\lambda$ controls the compression strength. Since GRPO normalizes scalar rewards within each rollout group, the length term directly changes the relative advantage and participates in policy optimization. Existing length penalties can be summarized as follows:

- LP-F. Fixed-reference penalties use predefined length bounds or budgets to reward shorter responses, such as DAPO with a soft length shaping [34].
- LP-G. Group-wise penalties normalize response lengths within the current rollout group and favor shorter samples relative to other responses, as in Kimi-k1.5 and related methods [27, 2, 19, 20, 16, 15, 24].

Although these methods can effectively reduce token usage, they explicitly couple response length with reward optimization. This makes training sensitive to coefficient tuning and may shift the policy toward superficial shortening rather than genuine reasoning improvement, causing underthinking or accuracy degradation [18, 21, 35].
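To make the two penalty families concrete, here is a small sketch of the shaped reward in Eq. (4) with one fixed-reference (LP-F-style) and one group-wise (LP-G-style) length term. The exact shaping functions used by DAPO and Kimi-k1.5 differ in detail; the forms below are simplified illustrations, with bounds taken from the settings in Appendix B.

```python
import numpy as np

def lp_f(length, l_min=4096, l_max=8192):
    """Fixed-reference length reward: 0 up to l_min, then linearly down to -1 at l_max."""
    if length <= l_min:
        return 0.0
    if length >= l_max:
        return -1.0
    return (l_min - length) / (l_max - l_min)

def lp_g(lengths):
    """Group-wise length reward via min-max normalization: shorter responses score higher."""
    lengths = np.asarray(lengths, dtype=float)
    span = lengths.max() - lengths.min()
    if span == 0:
        return np.zeros_like(lengths)
    return 0.5 - (lengths - lengths.min()) / span  # values in [-0.5, 0.5]

# Shaped scalar reward of Eq. (4): R_i = R_i^corr + lambda * R_i^len.
lengths, correct, lam = [3100, 5200, 7800, 8300], [1, 1, 1, 0], 0.5
shaped = [c + lam * lp_f(l) for c, l in zip(correct, lengths)]
```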

![Refer to caption](https://arxiv.org/html/2605.07316v1/img/qwen-truncate.png)

Figure 1: Accuracy under different maximum response lengths on mathematical reasoning benchmarks. Accuracy increases monotonically with the allowed reasoning length across both Qwen3-4B and Qwen3-8B, indicating that truncation can damage the quality of inference.
#### Early exit and truncation.

Another line of work reduces overthinking by constructing, selecting, or truncating reasoning trajectories. Some route queries between thinking and no-thinking modes [36, 30]. Early-exit and truncation-style methods instead shorten reasoning by stopping generation once a fixed budget is reached or once sufficient evidence is estimated to be available, using signals such as confidence, verification, or entropy [10, 4, 6]. Although these methods can reduce inference cost, they rely on the assumption that later reasoning steps are mostly redundant. As shown in Fig. 1, reasoning accuracy increases monotonically with the allowed response length on hard mathematical benchmarks, indicating that truncated steps often still contain useful reasoning computation. Therefore, early stopping may hurt performance when later steps are tightly coupled with correctness. In contrast, ICR does not discard reasoning trajectories by hard truncation or early exit, but extracts compression signals from concise correct samples already present in on-policy rollout groups.

## 3 Method

We begin by presenting an empirical observation that motivates our method. Although the length penalty is widely adopted to mitigate overthinking in RL post-training, we find that it inevitably incurs a performance loss. This observation reveals the limitation of existing reward-shaping strategies and motivates the method proposed in this section.

![Refer to caption](https://arxiv.org/html/2605.07316v1/img/LP1-accuracy.png) ![Refer to caption](https://arxiv.org/html/2605.07316v1/img/LP1-length.png) ![Refer to caption](https://arxiv.org/html/2605.07316v1/img/LP1-correlation2.png)

(a) LP-F ($\ell_{\min}=4096$, $\ell_{\max}=8192$).

![Refer to caption](https://arxiv.org/html/2605.07316v1/img/LP2-accuracy.png) ![Refer to caption](https://arxiv.org/html/2605.07316v1/img/LP2-length.png) ![Refer to caption](https://arxiv.org/html/2605.07316v1/img/LP2-correlation2.png)

(b) LP-G (Kimi-k1.5).

Figure 2: Coefficient-tuning results for the two length reward designs with $\lambda\in\{0.5,1,2\}$. In both cases, increasing the length coefficient shortens responses faster, but also accelerates accuracy degradation. The accuracy–length correlation starts negative and gradually moves toward zero, indicating that the initial compatibility between shorter responses and correctness weakens during training.

### 3.1 Observations

We tune the length coefficient $\lambda\in\{0.5,1,2\}$ in Eq. (4) for both reward designs and track three quantities during training: training accuracy, average response length, and the correlation coefficient between response length and correctness reward. The correlation is computed within each rollout group and then averaged over all groups in the batch.
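The diagnostic described above can be computed as follows; this is a sketch that assumes Pearson correlation and skips degenerate groups where either quantity is constant, details the paper does not specify.

```python
import numpy as np

def batch_length_accuracy_corr(batch):
    """Average over rollout groups of the within-group correlation between
    response length and correctness. `batch` is a list of (lengths, correct) pairs."""
    corrs = []
    for lengths, correct in batch:
        lengths, correct = np.asarray(lengths, float), np.asarray(correct, float)
        if lengths.std() == 0 or correct.std() == 0:  # skip degenerate groups
            continue
        corrs.append(np.corrcoef(lengths, correct)[0, 1])
    return float(np.mean(corrs)) if corrs else float("nan")

# A negative value means correct responses tend to be shorter (overthinking regime).
print(batch_length_accuracy_corr([([3100, 5200, 7800, 8300], [1, 1, 0, 0])]))
```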

**Negative but weakening accuracy–length correlation.** As shown in Fig. 2, for both penalty designs, the accuracy–length correlation is negative at the early stage of training. This means that, within the current rollout groups, shorter responses are often more likely to be correct, indicating that there is still room for safe compression. However, this negative correlation gradually moves toward zero as training proceeds. In other words, the advantage of shorter responses over longer ones becomes weaker during optimization. This trend suggests that length penalties exploit an initially available compression opportunity, but continuously applying the same pressure makes length reduction progressively less aligned with correctness.

**Accuracy–length Pareto frontier.** As shown in Fig. 2, a larger length coefficient $\lambda$ makes the model shorten its responses faster on the training set, but the training accuracy also drops more rapidly. This indicates that stronger length pressure does not merely remove redundant tokens; it already begins to interfere with correctness optimization during training. Consistently, Fig. 3 shows that this degradation transfers to held-out mathematical competition benchmarks. Increasing $\lambda$ not only reduces performance, but also produces a worse accuracy–length Pareto frontier: for the same amount of length reduction, the model suffers a larger accuracy drop. Therefore, although a stronger length reward accelerates compression, it degrades the effective trade-off between reasoning budget and reasoning performance during both training and evaluation.

These observations suggest that explicit length penalties exploit the easier optimization direction of shortening responses, which can gradually shift the policy from removing redundant reasoning toward underthinking. This motivates a compression mechanism that uses the structure of on-policy rollouts without directly rewarding shortness.

### 3.2 Implicit Compression Regularization

Based on this observation, we formalize the connection between overthinking and the group-wise correlation between accuracy and response length.

###### Claim 1 (Overthinking / underthinking).

For a policy, a negative expected group-wise correlation between correctness and response length indicates an *overthinking regime*, while a positive one indicates an *underthinking regime*.

A negative value means that correct responses are, on average, shorter within rollout groups, suggesting removable redundancy; a positive value suggests that correct responses are typically longer, so further compression may remove useful reasoning.

###### Claim 2 (ICR).

For a policy, if the expected group-wise correlation between correctness and response length is negative, then the shortest correct responses are shorter than the group-average responses in expectation.

This follows because a negative expected correlation means that correct responses are, on average, shorter than the group average, and the shortest correct response is no longer than the average correct response. Therefore, the shortest correct samples provide natural compression targets. Based on this, we propose *Implicit Compression Regularization* (ICR), which extracts compression signals from the shortest correct samples in on-policy rollout groups.
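As a concrete, hypothetical illustration: in a rollout group with lengths {2000, 4000, 6000, 8000} tokens and correctness {1, 1, 1, 0}, length and correctness are negatively correlated, the group-average length is 5000 tokens, and the shortest correct response (2000 tokens) lies well below it, so reinforcing it compresses the policy without leaving the set of correct trajectories.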

For a query $q$, let $\{o_i\}_{i=1}^{G}$ denote the rollout group sampled from $\pi_{\theta_{\rm old}}$. Each group already contains a natural compression target: the shortest correct sample in that group. We define the set of such samples as

$$\mathcal{S}(q)=\arg\min_{i\in\{1,\dots,G\},\,R_i^{\rm corr}=1}|o_i|. \qquad (5)$$
ICR augments RL with an additional preference toward these shortest correct responses:

$$\mathcal{J}_{\rm ICR}(\theta)=\mathcal{J}_{\rm GRPO}(\theta)+\alpha\,\mathbb{E}_{q\sim P(Q),\,\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\rm old}}}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(r_{i,t}(\theta),\ \mathrm{clip}\big(r_{i,t}(\theta),1-\epsilon_{\rm low},1+\epsilon_{\rm high}\big)\Big)\mathbf{1}\big[o_i\in\mathcal{S}(q)\big]\right]. \qquad (6)$$
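A minimal sketch of the regularizer in Eqs. (5)–(6) for a single rollout group is shown below. It is PyTorch-style pseudocode with illustrative inputs (per-token log-probabilities under the current and old policies, correctness flags, and token counts); in practice this term is added to the GRPO loss inside the training framework, and the exact tensor layout is an assumption.

```python
import torch

def icr_term(logp_new, logp_old, correct, lengths, eps_low=0.2, eps_high=0.2):
    """ICR regularizer for one rollout group, following Eq. (6).
    logp_new / logp_old: per-token log-prob tensors for each response o_i.
    correct: 0/1 correctness flags R_i^corr; lengths: token counts |o_i|."""
    correct_ids = [i for i, c in enumerate(correct) if c == 1]
    if not correct_ids:                      # empty correct set: regularizer inactive
        return torch.tensor(0.0)
    min_len = min(lengths[i] for i in correct_ids)
    selected = [i for i in correct_ids if lengths[i] == min_len]   # S(q), Eq. (5)

    total = torch.tensor(0.0)
    for i in selected:
        ratio = (logp_new[i] - logp_old[i].detach()).exp()         # r_{i,t}(theta)
        clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
        total = total + torch.minimum(ratio, clipped).mean()       # (1/|o_i|) sum over tokens
    return total / len(lengths)                                    # (1/G) sum over selected i

# Full objective (sketch): maximize J_GRPO + alpha * icr_term(...), averaged over queries.
```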
![Refer to caption](https://arxiv.org/html/2605.07316v1/img/LP1-math.png)

(a) Accuracy–length trajectory of LP-F.

![Refer to caption](https://arxiv.org/html/2605.07316v1/img/LP2-math.png)

(b) Accuracy–length trajectory of LP-G.

Figure 3: Accuracy–length trajectories on held-out mathematical competition benchmarks (AIME24, AIME25, and HMMT25). A larger length coefficient reaches shorter responses with fewer steps, but incurs a larger drop in reasoning accuracy.

The regularizer term is activated only by the shortest correct samples in the current rollout group. Hence, ICR does not reward shortness itself; instead, it selectively amplifies trajectories that are both correct and already concise. This makes compression a consequence of reinforcing successful short reasoning paths, rather than an independent optimization target. This design gives ICR three key properties:

- **On-policy optimization.** ICR is fully on-policy. The selected samples come from the original rollout groups, and the compression signal in Eq. (6) is obtained by a deterministic selection rule applied within each group. Therefore, ICR introduces neither an external policy nor an additional distributional mismatch, and avoids the risk of discarding useful later reasoning steps as in truncation-based methods.
- **No explicit length penalty.** ICR does not add length penalties to the scalar reward. The primary optimization target remains reasoning correctness, while the compression signal is extracted from the group-wise structure of on-policy samples. In this way, ICR avoids directly reshaping the reward landscape toward the easier objective of shortening responses, which is the main source of degradation in length-penalty methods.
- **No extra training cost.** ICR reuses the original rollout groups and only adds a lightweight group-wise selection step together with a simple advantage modification. Thus, ICR provides an effective compression signal without extra training cost.

The core mechanism of ICR is to turn the shortest correct samples in each rollout group into an implicit regularization signal. Since these samples are generated by the current policy and already achieve correct answers, they define a naturally “shorter” and successful region inside the on-policy distribution. From this perspective, the selected samples induce a virtual shorter distribution $\pi_s$, which retains the shortest successful trajectories from the original rollout distribution $\pi_\theta$. This induced distribution serves as the regularizer that guides compression: instead of optimizing a handcrafted length reward, ICR pulls the policy toward the concise successful trajectories that the policy itself has already discovered. As shown in Fig. 4, $\pi_s$ consistently produces shorter responses than GRPO across different models, while ICR stays between them. This suggests that the shorter distribution $\pi_s$ effectively guides the policy toward concise reasoning, but does so softly enough to preserve the correctness-oriented optimization of RL.

![Refer to caption](https://arxiv.org/html/2605.07316v1/img/qwen3-4b-length.png) (a) Qwen3-4B.
![Refer to caption](https://arxiv.org/html/2605.07316v1/img/qwen3-8b-length.png) (b) Qwen3-8B.
![Refer to caption](https://arxiv.org/html/2605.07316v1/img/dsqw-7b-length.png) (c) DSQW-7B.

Figure 4: Response length trajectories of $\pi_s$, ICR, and GRPO during training, showing that the shorter distribution $\pi_s$ guides the compression behavior of ICR.

## 4 Experiments

![Refer to caption](https://arxiv.org/html/2605.07316v1/img/ICR-math.png)

Figure 5: ICR and ICR w/ LP-F achieve a stronger accuracy–length Pareto frontier.

We evaluate ICR on three reasoning backbones, Qwen3-4B [31], Qwen3-8B [31], and DeepSeek-R1-Distill-Qwen-7B (DSQW-7B) [14], using DAPO-17K [34] for RL training. We compare ICR with representative methods from vanilla RL optimization, length-penalty methods, and inference-time compression, including GRPO [22], LP-F [34], LP-G [27], ShorterBetter [33], and LC-R1 [6]. Evaluation covers both mathematical reasoning benchmarks and knowledge-intensive generalization benchmarks. Detailed baselines, datasets, and implementation settings are provided in Appendix B.

Table 1: Main results on ICR and other length-aware reasoning compression methods. Each cell reports average response length / accuracy.

| Method | AIME24 | AIME25 | HMMT25 | GSM8K | Math500 | AMC23 | Olympiad | Average |
|---|---|---|---|---|---|---|---|---|
| Qwen3-8B | 12611 / 64.38 | 13499 / 50.00 | 14606 / 29.48 | 1799 / 95.68 | 4934 / 93.00 | 7961 / 87.97 | 9340 / 61.78 | 9250 / 68.90 |
| +GRPO | 8825 / 66.35 | 8969 / 52.70 | 10688 / 31.67 | 1018 / 96.05 | 2714 / 93.60 | 4134 / 89.06 | 5636 / 65.03 | 5998 / 70.64 |
| +ICR | 7109 / 67.50 | 8145 / 53.75 | 9239 / 34.06 | 713 / 96.13 | 2208 / 95.00 | 3645 / 89.53 | 4869 / 66.22 | 5133 / 71.74 |
| +LP-G (Kimi) | 6282 / 54.58 | 6424 / 40.41 | 7609 / 24.37 | 513 / 94.76 | 1576 / 91.80 | 2850 / 85.54 | 2517 / 59.55 | 3967 / 64.43 |
| +ShorterBetter | 6224 / 53.64 | 6693 / 43.95 | 7801 / 24.79 | 458 / 95.22 | 1504 / 91.80 | 2577 / 86.64 | 3267 / 61.03 | 4075 / 65.30 |
| +LC-R1 | 7035 / 56.87 | 7873 / 42.39 | 9289 / 23.95 | 508 / 94.54 | 1879 / 91.60 | 2634 / 86.48 | 4373 / 61.48 | 4799 / 65.33 |
| +LP-F | 6603 / 60.72 | 6479 / 44.68 | 7630 / 28.33 | 618 / 95.52 | 1757 / 92.60 | 3271 / 88.35 | 3817 / 63.11 | 4311 / 67.62 |
| +ICR w/ LP-F | 6171 / 62.60 | 5878 / 45.10 | 6929 / 29.68 | 533 / 95.52 | 1684 / 94.40 | 2292 / 88.75 | 3382 / 62.96 | 3838 / 68.43 |
| Qwen3-4B | 12354 / 61.88 | 12759 / 49.48 | 13920 / 31.15 | 1602 / 94.84 | 4681 / 91.60 | 7360 / 87.50 | 8990 / 62.37 | 8809 / 68.40 |
| +GRPO | 8636 / 61.87 | 9087 / 50.52 | 9427 / 31.67 | 915 / 96.05 | 2533 / 93.60 | 4199 / 89.06 | 5027 / 63.40 | 5689 / 69.45 |
| +ICR | 6588 / 66.97 | 7274 / 52.60 | 8442 / 32.60 | 612 / 95.14 | 1884 / 93.60 | 3878 / 93.04 | 4333 / 65.33 | 4716 / 71.33 |
| +LP-G (Kimi) | 4444 / 43.75 | 5867 / 37.29 | 5394 / 20.52 | 437 / 93.33 | 1008 / 88.60 | 1677 / 82.26 | 2122 / 56.59 | 2993 / 60.33 |
| +ShorterBetter | 4891 / 46.25 | 5825 / 34.68 | 6481 / 20.72 | 336 / 93.47 | 1189 / 90.40 | 1799 / 82.89 | 2590 / 57.77 | 3302 / 60.88 |
| +LC-R1 | 6882 / 41.35 | 7134 / 36.52 | 7611 / 23.43 | 362 / 93.70 | 1724 / 88.80 | 2890 / 81.64 | 4338 / 55.85 | 4420 / 60.18 |
| +LP-F | 6674 / 54.79 | 7350 / 44.47 | 8535 / 26.14 | 777 / 94.54 | 2127 / 91.80 | 3250 / 89.06 | 4269 / 63.11 | 4712 / 66.27 |
| +ICR w/ LP-F | 5783 / 55.83 | 6622 / 46.97 | 7644 / 30.31 | 549 / 94.76 | 1695 / 92.40 | 2699 / 88.13 | 3716 / 64.14 | 4101 / 67.51 |
| DSQW-7B | 10836 / 41.25 | 11639 / 28.54 | 12831 / 14.69 | 1091 / 90.60 | 3302 / 87.20 | 6222 / 77.50 | 7448 / 49.19 | 7624 / 55.57 |
| +GRPO | 10119 / 47.70 | 11352 / 35.72 | 13443 / 19.79 | 1185 / 92.34 | 3132 / 90.20 | 5226 / 84.06 | 7229 / 54.37 | 7384 / 60.60 |
| +ICR | 8399 / 50.52 | 9980 / 36.56 | 12123 / 19.79 | 1006 / 92.49 | 2791 / 91.00 | 4159 / 85.31 | 6677 / 55.85 | 6448 / 61.65 |
| +LP-G (Kimi) | 8559 / 46.04 | 10618 / 35.10 | 9644 / 20.94 | 596 / 91.21 | 2216 / 90.60 | 3392 / 85.47 | 5521 / 54.67 | 5792 / 60.58 |
| +ShorterBetter | 7543 / 47.91 | 8738 / 34.37 | 9105 / 20.00 | 516 / 91.81 | 1910 / 91.40 | 3172 / 86.64 | 5021 / 55.70 | 5144 / 61.12 |
| +LC-R1 | 8460 / 46.97 | 9714 / 34.79 | 11499 / 20.10 | 554 / 92.34 | 2165 / 91.40 | 3618 / 87.18 | 5572 / 56.14 | 5940 / 61.27 |
| +LP-F | 8203 / 47.70 | 9121 / 34.27 | 9571 / 20.52 | 734 / 92.41 | 2313 / 91.80 | 3779 / 87.50 | 5374 / 57.03 | 5585 / 61.60 |
| +ICR w/ LP-F | 7714 / 49.89 | 8537 / 35.20 | 8657 / 21.45 | 616 / 93.32 | 2078 / 93.00 | 2698 / 87.89 | 5106 / 57.62 | 5058 / 62.62 |

Table 2: Generalization results on knowledge-intensive benchmarks, including ARC-Challenge, MMLU-Pro, and SuperGPQA. Each cell reports average response length / accuracy.

| Method | Qwen3-4B ARC | Qwen3-4B MMLU-Pro | Qwen3-4B S-GPQA | Qwen3-8B ARC | Qwen3-8B MMLU-Pro | Qwen3-8B S-GPQA | DSQW-7B ARC | DSQW-7B MMLU-Pro | DSQW-7B S-GPQA |
|---|---|---|---|---|---|---|---|---|---|
| Base | 1231 / 93.31 | 2312 / 72.86 | 6858 / 37.20 | 1398 / 92.64 | 2631 / 76.07 | 7262 / 44.20 | 1321 / 78.60 | 2993 / 56.61 | 6238 / 30.00 |
| +GRPO | 886 / 92.57 | 1268 / 72.32 | 4143 / 41.00 | 981 / 92.57 | 1486 / 79.10 | 4022 / 42.60 | 1386 / 81.60 | 2515 / 57.32 | 5452 / 31.80 |
| +ICR | 622 / 93.64 | 830 / 73.39 | 3082 / 41.20 | 804 / 93.31 | 1195 / 79.46 | 3498 / 44.00 | 1177 / 82.61 | 2136 / 59.82 | 4635 / 33.20 |
| +LP-G (Kimi) | 464 / 92.30 | 720 / 71.25 | 1300 / 37.20 | 646 / 92.64 | 751 / 75.17 | 2280 / 39.40 | 855 / 79.60 | 1877 / 58.39 | 3277 / 31.80 |
| +ShorterBetter | 418 / 92.97 | 566 / 71.25 | 2009 / 34.80 | 614 / 91.97 | 725 / 76.78 | 2269 / 40.80 | 707 / 81.27 | 1676 / 57.85 | 2967 / 31.60 |
| +LC-R1 | 505 / 91.63 | 805 / 72.14 | 2884 / 36.20 | 701 / 92.64 | 946 / 76.42 | 3079 / 40.40 | 878 / 83.27 | 1693 / 57.67 | 3832 / 32.60 |
| +LP-F | 732 / 93.64 | 1000 / 71.60 | 3189 / 38.20 | 699 / 93.64 | 811 / 79.28 | 3392 / 41.60 | 792 / 82.60 | 1688 / 57.85 | 3390 / 33.40 |
| +ICR w/ LP-F | 608 / 93.97 | 752 / 73.39 | 2392 / 39.60 | 602 / 93.97 | 754 / 78.03 | 2312 / 41.60 | 603 / 84.28 | 1666 / 59.28 | 2957 / 35.20 |

### 4.1 Main results

**Accuracy-preserving compression.** Tables 1 and 2 report the main results on mathematical reasoning and knowledge-intensive generalization benchmarks, respectively. Overall, ICR consistently achieves accuracy-preserving compression. Compared with GRPO, ICR produces shorter responses while further improving accuracy, showing that overthinking can be reduced without sacrificing reasoning quality. The training dynamics in Fig. 6 provide further evidence. ICR maintains the most stable accuracy curve throughout training, whereas length-penalty baselines, especially LP-G, exhibit stronger late-stage degradation. More importantly, ICR keeps a more negative accuracy–length correlation during training, indicating that shorter samples remain more aligned with correctness. This suggests that ICR preserves the quality of concise responses during compression, rather than pushing the model toward underthinking. Moreover, Fig. 5 shows that ICR achieves the most favorable accuracy–length Pareto frontier, indicating that its compression is not only effective but also better aligned with reasoning performance.

![Refer to caption](https://arxiv.org/html/2605.07316v1/img/ICR-accuracy.png)

![Refer to caption](https://arxiv.org/html/2605.07316v1/img/ICR-length.png)

![Refer to caption](https://arxiv.org/html/2605.07316v1/img/ICR-correlation2.png)

Figure 6: Training dynamics of ICR and ICR w/ LP-F on mathematical reasoning benchmarks. From left to right: training accuracy, average response length, and the group-wise accuracy–length correlation averaged over the batch.

**Compatibility with length penalty.** ICR provides a soft implicit compression signal that preserves accuracy but usually compresses more moderately than aggressive length-reward baselines. Therefore, an important question is whether ICR can complement length penalties when stronger compression is required. We find that ICR is highly compatible with length penalties. When combined with LP-F, ICR further strengthens the compression effect of the soft length reward while maintaining the correctness-oriented preference induced by the shortest correct samples. As shown in Table 1 and Fig. 5, ICR w/ LP-F achieves the best overall performance and the most favorable accuracy–length Pareto frontier among all baselines. Moreover, compared with LP-F alone, ICR w/ LP-F produces even shorter responses with higher accuracy. The training dynamics in Fig. 6 further show that ICR w/ LP-F keeps a more negative accuracy–length correlation than LP-F, indicating that ICR strengthens the quality of short samples while enhancing compression. This suggests that ICR does not conflict with length penalties; instead, it acts as a complementary on-policy regularizer that guides the model toward concise yet correct reasoning trajectories.

**Generalization.** Another important finding is the strong generalization of ICR. The gains of ICR are consistent across all three backbones, including Qwen3-4B, Qwen3-8B, and DSQW-7B, and transfer well from mathematical competition benchmarks to knowledge-intensive benchmarks such as ARC-Challenge, MMLU-Pro, and SuperGPQA. These results suggest that the benefit of ICR does not depend on a specific model family or benchmark type, but reflects a more general improvement in the reasoning compression trade-off.

**Baseline behaviors.** Compared with the base models, GRPO already provides a favorable starting point by shortening responses while maintaining, and often improving, reasoning accuracy. In contrast, length-penalty methods achieve substantially stronger compression, but their gains in efficiency are often accompanied by noticeable performance degradation, especially on mathematical reasoning benchmarks. The early-exit baseline LC-R1 performs relatively well on DSQW-7B and on knowledge-intensive generalization benchmarks. However, its accuracy drop becomes much more pronounced on more reasoning-intensive base models, such as Qwen3, and on harder competition-level mathematical benchmarks. This suggests that early-exit methods rely on the existence of long but largely redundant reasoning prefixes that can be safely truncated. When reasoning becomes more information-dense, however, later steps are less redundant and more tightly coupled to correctness, making early exit much less effective as a compression strategy. In contrast, ICR does not depend on reward penalties or premature truncation, but instead extracts compression signals from successful short trajectories already contained in on-policy rollouts. This allows ICR to achieve a more favorable balance between compression and accuracy across both reasoning-intensive and general-domain settings.

### 4.2 Ablations

We ablate three key designs of ICR on Qwen3-4B: selecting all samples, selecting negative samples, and using only the regularizer without the GRPO objective. Table 3 reports the final results, and the curves below show the corresponding training dynamics.

**All samples.** When all samples enter $\mathcal{S}(q)$, the regularizer loses its preference for shorter samples. The length curve confirms that compression is largely weakened, and the average length increases from 4716 to 6541 tokens. Accuracy also drops slightly from 71.33 to 70.64, suggesting that applying the unnormalized regularization signal to all samples introduces higher-variance updates.

**Negative samples.** Allowing negative samples in $\mathcal{S}(q)$ brings only limited compression, with the average length increasing to 5888 tokens. The training curve also stays consistently above ICR. This shows that under-reasoned short samples are not useful compression targets; their additional optimization signal offsets the benefit of reinforcing concise correct samples, leading to a slight accuracy drop.

**Only regularizer.** Using only the regularizer achieves strong compression, with an average length of 4815 tokens and the fastest length reduction in training. However, accuracy drops clearly from 71.33 to 68.44. This indicates that the shortest-correct regularizer can induce concise responses, but cannot preserve reasoning accuracy alone. Therefore, the normalized GRPO objective is crucial to ICR, and the compression regularizer should remain an auxiliary term.
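The selection-rule variants behind these ablations can be summarized as follows; this is an illustrative sketch, and in particular our reading of the "negative samples" variant (admitting incorrect rollouts into the selected set) is an assumption about how it is implemented.

```python
def select_targets(correct, lengths, variant="icr"):
    """Which rollout indices enter S(q) under each ablation variant (illustrative)."""
    if variant == "all":        # every rollout is reinforced: no preference for shortness
        return list(range(len(lengths)))
    if variant == "negative":   # also admit incorrect (negative) rollouts: shortest overall
        m = min(lengths)
        return [i for i, l in enumerate(lengths) if l == m]
    # default ICR rule: shortest correct sample(s) only
    ids = [i for i, c in enumerate(correct) if c == 1]
    if not ids:
        return []
    m = min(lengths[i] for i in ids)
    return [i for i in ids if lengths[i] == m]
```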

Table 3: Ablation results of different variants of ICR on Qwen3-4B. Each cell reports average response length / accuracy; the Average column shows the change relative to ICR in parentheses.

| Variant | AIME24 | AIME25 | HMMT25 | GSM8K | Math500 | AMC23 | Olympiad | Average |
|---|---|---|---|---|---|---|---|---|
| All samples in $\mathcal{S}(q)$ | 8332 / 64.89 | 10652 / 51.25 | 11713 / 33.12 | 1127 / 95.22 | 2963 / 92.80 | 4749 / 93.97 | 6254 / 63.25 | 6541 (+1825) / 70.64 (−0.69) |
| Negative samples in $\mathcal{S}(q)$ | 8910 / 66.77 | 9113 / 52.50 | 10086 / 32.70 | 974 / 94.61 | 2625 / 92.80 | 4006 / 92.64 | 5499 / 63.85 | 5888 (+1172) / 70.84 (−0.49) |
| Only regularizer | 7859 / 61.08 | 7906 / 48.75 | 8310 / 29.58 | 614 / 94.16 | 1725 / 91.60 | 2860 / 91.56 | 4433 / 62.37 | 4815 (+99) / 68.44 (−2.89) |

![[Uncaptioned image]](https://arxiv.org/html/2605.07316v1/img/ICR-ab.png)

## 5 Conclusion

In this work, we studied overthinking in RL post-training from the perspective of the relationship between response length and correctness. We formalized overthinking using the expected group-wise correlation between correctness and response length, where negative correlation indicates that shorter responses are more likely to be correct, while positive correlation suggests a drift toward underthinking. Based on this view, we proposed *Implicit Compression Regularization* (ICR), which extracts compression signals from the shortest correct samples already present in on-policy rollout groups. By inducing a virtual shorter distribution, ICR regularizes the policy toward concise successful trajectories without adding a handcrafted length reward. Experiments across multiple backbones and benchmarks demonstrate that ICR achieves accuracy-preserving compression, generalizes beyond mathematical reasoning, and remains compatible with length penalties for stronger compression.

As discussed in Limitations (Appendix A), future work should therefore move beyond designing stronger compression heuristics and further investigate the underlying relationship between response length and reasoning accuracy. A key direction is to understand what determines the accuracy–length Pareto frontier during RL post-training, including how model capacity, task difficulty, rollout diversity, reward structure, and optimization dynamics shape the attainable trade-off. Such analysis may help distinguish removable redundancy from reasoning steps that are genuinely necessary for correctness. Ultimately, this line of research points toward lossless reasoning compression, where models can reduce unnecessary computation while preserving the full reasoning accuracy of long responses.

## References

- [1] M. A. Alomrani, Y. Zhang, D. Li, Q. Sun, S. Pal, Z. Zhang, Y. Hu, R. D. Ajwani, A. Valkanas, R. Karimi, et al. (2025) Reasoning on a budget: a survey of adaptive and controllable test-time compute in LLMs. arXiv preprint arXiv:2507.02076.
- [2] (2025) Training language models to reason efficiently. arXiv preprint arXiv:2502.04463.
- [3] M. Balunović, J. Dekoninck, I. Petrov, N. Jovanović, and M. Vechev (2025) MathArena: evaluating LLMs on uncontaminated math competitions. arXiv preprint arXiv:2505.23281.
- [4] Y. Bin, T. Jiang, Y. Ding, K. Zhu, F. Ma, J. Song, Y. Yang, and H. T. Shen (2025) Explore briefly, then decide: mitigating LLM overthinking via cumulative entropy regulation. arXiv preprint arXiv:2510.02249.
- [5] X. Chen, J. Xu, T. Liang, Z. He, J. Pang, D. Yu, L. Song, Q. Liu, M. Zhou, Z. Zhang, et al. (2024) Do not think that much for 2+3=? On the overthinking of o1-like LLMs. arXiv preprint arXiv:2412.21187.
- [6] Z. Cheng, D. Chen, M. Fu, and T. Zhou (2025) Optimizing length compression in large reasoning models. arXiv preprint arXiv:2506.14755.
- [7] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
- [8] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [9] A. Cuadron, D. Li, W. Ma, X. Wang, Y. Wang, S. Zhuang, S. Liu, L. G. Schroeder, T. Xia, H. Mao, et al. (2025) The danger of overthinking: examining the reasoning-action dilemma in agentic tasks. arXiv preprint arXiv:2502.08235.
- [10] M. Dai, C. Yang, and Q. Si (2025) S-GRPO: early exit via reinforcement learning in reasoning models. arXiv preprint arXiv:2505.07686.
- [11] R. Dang, Z. Li, S. Huang, and J. Chen (2025) The first impression problem: internal bias triggers overthinking in reasoning models. arXiv preprint arXiv:2505.16448.
- [12] X. Du, Y. Yao, K. Ma, B. Wang, T. Zheng, K. Zhu, M. Liu, Y. Liang, X. Jin, Z. Wei, et al. (2025) SuperGPQA: scaling LLM evaluation across 285 graduate disciplines. arXiv preprint arXiv:2502.14739.
- [13] C. Fan, M. Li, L. Sun, and T. Zhou (2025) Missing premise exacerbates overthinking: are reasoning models losing critical thinking skill? arXiv preprint arXiv:2504.06514.
- [14] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [15] C. Hu, Q. Hu, Y. Xu, J. Chen, R. Wang, S. Liu, J. Li, F. Wu, and G. Chen (2026) SmartThinker: progressive chain-of-thought length calibration for efficient large language model reasoning. arXiv preprint arXiv:2603.08000.
- [16] T. Liang, W. Jiao, Z. He, J. Xu, H. Mi, and D. Yu (2025) DeepCompress: a dual reward strategy for dynamically exploring and compressing reasoning chains. arXiv preprint arXiv:2510.27419.
- [17] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let's verify step by step. arXiv preprint arXiv:2305.20050.
- [18] F. Liu, Y. Yin, P. Shi, S. Yang, Z. Zeng, and H. Qiu (2026) Length-unbiased sequence policy optimization: revealing and controlling response length variation in RLVR. arXiv preprint arXiv:2602.05261.
- [19] W. Liu, R. Zhou, Y. Deng, Y. Huang, J. Liu, Y. Deng, Y. Zhang, and J. He (2025) Learn to reason efficiently with adaptive length-based reward shaping. arXiv preprint arXiv:2505.15612.
- [20] Q. Luo, S. Ren, X. Chen, R. Liu, J. Fang, N. Tan, and S. Huang (2026) Compress the easy, explore the hard: difficulty-aware entropy regularization for efficient LLM reasoning. arXiv preprint arXiv:2602.22642.
- [21] D. Nohara, T. Nakamura, and R. Yokota (2026) On the optimal reasoning length for RL-trained language models. arXiv preprint arXiv:2602.09591.
- [22] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- [23] G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu (2025) HybridFlow: a flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, pp. 1279–1297.
- [24] J. Su and C. Cardie (2025) Thinking fast and right: balancing accuracy and reasoning length with adaptive rewards. arXiv preprint arXiv:2505.18298.
- [25] J. Su, J. Healey, P. Nakov, and C. Cardie (2025) Between underthinking and overthinking: an empirical study of reasoning length and correctness in LLMs. arXiv preprint arXiv:2505.00127.
- [26] Y. Sui, Y. Chuang, G. Wang, J. Zhang, T. Zhang, J. Yuan, H. Liu, A. Wen, S. Zhong, N. Zou, et al. (2025) Stop overthinking: a survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419.
- [27] K. Team, A. Du, B. Gao, B. Xing, C. Jiang, C. Chen, C. Li, C. Xiao, C. Du, C. Liao, et al. (2025) Kimi k1.5: scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599.
- [28] Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. (2024) MMLU-Pro: a more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems 37, pp. 95266–95290.
- [29] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, pp. 24824–24837.
- [30] S. Xu, W. Xie, L. Zhao, and P. He (2025) Chain of draft: thinking faster by writing less. arXiv preprint arXiv:2502.18600.
- [31] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [32] Z. Yaowei, L. Junting, W. Shenzhi, F. Zhangchi, K. Dongdong, and X. Yuwen (2025) EasyR1: an efficient, scalable, multi-modality RL training framework. https://github.com/hiyouga/EasyR1
- [33] J. Yi, J. Wang, and S. Li (2025) ShorterBetter: guiding reasoning models to find optimal inference length for efficient reasoning. arXiv preprint arXiv:2504.21370.
- [34] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, T. Fan, G. Liu, L. Liu, X. Liu, et al. (2025) DAPO: an open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
- [35] Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, S. Song, and G. Huang (2025) Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837.
- [36] J. Zhang, N. Lin, L. Hou, L. Feng, and J. Li (2025) AdaptThink: reasoning models can learn when to think. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 3716–3730.

## Appendix A Limitations

A limitation of this work lies in the granularity of our definition of overthinking and underthinking. To the best of our knowledge, we make the first attempt to distinguish these two regimes using the length–accuracy correlation during RL post-training: a negative value indicates an overthinking regime, while a positive value indicates an underthinking regime. This definition provides a simple and useful batch-level diagnostic for identifying whether shorter responses are still aligned with correctness. However, it is also coarse-grained. Since the correlation is computed over rollout groups and averaged at the batch level, it cannot determine whether a specific response, reasoning step, or query is truly redundant. A batch may also contain both overthinking and underthinking examples, which are only summarized by a single aggregate statistic. Therefore, this definition should be viewed as an initial operational boundary rather than a fine-grained characterization of reasoning redundancy. Developing instance-level or step-level criteria for distinguishing removable redundancy from necessary reasoning remains an important direction for future work.

Moreover, ICR provides a soft compression signal rather than an explicit length-control mechanism. This property is desirable for preserving accuracy, since ICR does not directly optimize a handcrafted notion of shortness. Our experiments show that ICR can achieve a more favorable accuracy–length Pareto frontier by guiding the policy toward concise yet correct trajectories. However, this also means that ICR does not directly enforce a target response length or a strict inference budget. When aggressive compression is required, ICR may need to be combined with length penalties or other budget-aware mechanisms. Nevertheless, this compatibility also reflects a limitation: achieving stronger or controllable compression may still require reintroducing explicit length-based constraints. Future work should explore how to achieve stronger compression without explicit length rewards while further preserving reasoning accuracy.

## Appendix B Detailed implementation

Table 4: Detailed implementation for all experiments.

| Setting | Value |
|---|---|
| Hardware | 8× A800 GPUs (40 GB) |
| Training framework | EasyR1 / VeRL |
| Training dataset | DAPO-17K |
| **RL settings** | |
| Maximum response length | 8192 |
| Batch size $\lvert B\rvert$ | 128 |
| Mini-batch size | 64 |
| Rollout group size $G$ | 8 |
| Sampling temperature | 1.0 |
| Learning rate | $1\times 10^{-6}$ |
| Clip range $\epsilon=\epsilon_{\rm low}=\epsilon_{\rm high}$ | 0.2 |
| Reward type | Binary correctness reward |
| **ICR settings** | |
| Selection rule | Shortest correct sample(s) in each group |
| Empty correct set | No ICR regularization for this group |
| ICR coefficient | $\alpha=\alpha_0\lvert B\rvert/\lvert\mathcal{S}(q)\rvert$, $\alpha_0=0.5$ |
| **Length-penalty settings** | |
| LP-F coefficient $\lambda$ | 0.5 |
| LP-F lower bound $\ell_{\min}$ | 4096 |
| LP-F upper bound $\ell_{\max}$ | 8192 |
| LP-G setting | Group-wise min-max normalization |
| ICR w/ LP-F | ICR combined with the same LP-F setting |
| **Evaluation settings** | |
| Maximum response length | 16384 |
| Top-p | 0.95 |
| Temperature | 0.1 |

#### Experimental setup.

We implement ICR in the EasyR1 and VeRL frameworks [32, 23] and follow the default EasyR1 setup unless otherwise specified. Experiments are conducted on Qwen3-4B, Qwen3-8B, and DeepSeek-R1-Distill-Qwen-7B (DSQW-7B), using DAPO-17K [34] for training. Baselines are chosen from three categories: vanilla RL optimization, length-penalty methods, and inference-time compression. Specifically, we compare with GRPO [22], LP-F from DAPO with $\ell_{\min}=4096$ and $\lambda=0.5$ [34], LP-G from Kimi-k1.5 [27], ShorterBetter, and LC-R1.

#### Evaluation and implementation details.

We evaluate in-domain mathematical reasoning on AIME24 [3], AIME25 [3], HMMT25 [3], AMC23 [17], GSM8K [8], MATH500 [17], and Olympiad [17], and further evaluate out-of-domain generalization on ARC-Challenge [7], MMLU-Pro [28], and SuperGPQA [12]. All experiments are conducted on 8 A800 GPUs, and the full training configuration is listed in Table 4. For ICR, the shortest correct samples are selected within each rollout group. If a group contains no correct response, the ICR regularizer is not applied to that group. If multiple correct responses share the minimum length, all of them are selected, and the regularization coefficient is normalized by the number of selected samples.
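A small sketch of this tie handling and coefficient normalization is shown below. The $\alpha=\alpha_0\lvert B\rvert/\lvert\mathcal{S}(q)\rvert$ form is taken directly from Table 4; treating it as a per-group weight that is split evenly across tied shortest correct samples is our reading of the description above, so the exact bookkeeping is an assumption.

```python
def icr_weights(selected, alpha0=0.5, batch_size=128):
    """Per-sample ICR weight for one rollout group: the group's coefficient is
    divided by the number of selected (tied shortest correct) samples."""
    if not selected:          # no correct response in the group: regularizer skipped
        return {}
    alpha = alpha0 * batch_size / len(selected)   # alpha = alpha0 * |B| / |S(q)| (Table 4)
    return {i: alpha for i in selected}

# Two correct responses tie at the minimum length: the group coefficient is shared.
print(icr_weights(selected=[2, 5]))
```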

## NeurIPS Paper Checklist

1. 1\.Claims
2. Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
3. Answer:\[Yes\]
4. Justification: The main claims are stated in the abstract and introduction.
5. Guidelines: - •The answer\[N/A\]means that the abstract and introduction do not include the claims made in the paper\. - •The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations\. A\[No\]or\[N/A\]answer to this question will not be perceived well by the reviewers\. - •The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings\. - •It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper\.
6. 2\.Limitations
7. Question: Does the paper discuss the limitations of the work performed by the authors?
8. Answer:\[Yes\]
9. Justification: We discuss limitations in the Limitations section of the appendix.
10. Guidelines: - •The answer\[N/A\]means that the paper has no limitation while the answer\[No\]means that the paper has limitations, but those are not discussed in the paper\. - •The authors are encouraged to create a separate “Limitations” section in their paper\. - •The paper should point out any strong assumptions and how robust the results are to violations of these assumptions \(e\.g\., independence assumptions, noiseless settings, model well\-specification, asymptotic approximations only holding locally\)\. The authors should reflect on how these assumptions might be violated in practice and what the implications would be\. - •The authors should reflect on the scope of the claims made, e\.g\., if the approach was only tested on a few datasets or with a few runs\. In general, empirical results often depend on implicit assumptions, which should be articulated\. - •The authors should reflect on the factors that influence the performance of the approach\. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting\. Or a speech\-to\-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon\. - •The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size\. - •If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness\. - •While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper\. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community\. Reviewers will be specifically instructed to not penalize honesty concerning limitations\.
11. 3\.Theory assumptions and proofs
12. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete \(and correct\) proof?
13. Answer:\[N/A\]
14. Justification: There are no theoretical results in this paper\.
15. Guidelines: - •The answer\[N/A\]means that the paper does not include theoretical results\. - •All the theorems, formulas, and proofs in the paper should be numbered and cross\-referenced\. - •All assumptions should be clearly stated or referenced in the statement of any theorems\. - •The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition\. - •Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material\. - •Theorems and Lemmas that the proof relies upon should be properly referenced\.
16. 4\.Experimental result reproducibility
17. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper \(regardless of whether the code and data are provided or not\)?
18. Answer:\[Yes\]
19. Justification: We use open-source data and models, so the main results are straightforward to reproduce.
20. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •If the paper includes experiments, a\[No\]answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not\. - •If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable\. - •Depending on the contribution, reproducibility can be accomplished in various ways\. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model\. In general\. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model \(e\.g\., in the case of a large language model\), releasing of a model checkpoint, or other means that are appropriate to the research performed\. - •While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution\. For example 1. \(a\)If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm\. 2. \(b\)If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully\. 3. \(c\)If the contribution is a new model \(e\.g\., a large language model\), then there should either be a way to access this model for reproducing the results or a way to reproduce the model \(e\.g\., with an open\-source dataset or instructions for how to construct the dataset\)\. 4. \(d\)We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility\. In the case of closed\-source models, it may be that access to the model is limited in some way \(e\.g\., to registered users\), but it should be possible for other researchers to have some path to reproducing or verifying the results\.
21. 5\.Open access to data and code
22. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
23. Answer:\[Yes\]
24. Justification: We provide code in the supplemental material. The data and models we use are all open source.
25. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments requiring code\. - •While we encourage the release of code and data, we understand that this might not be possible, so\[No\]is an acceptable answer\. Papers cannot be rejected simply for not including code, unless this is central to the contribution \(e\.g\., for a new open\-source benchmark\)\. - •The instructions should contain the exact command and environment needed to run to reproduce the results\. See the NeurIPS code and data submission guidelines \([https://neurips\.cc/public/guides/CodeSubmissionPolicy](https://neurips.cc/public/guides/CodeSubmissionPolicy)\) for more details\. - •The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc\. - •The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines\. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why\. - •At submission time, to preserve anonymity, the authors should release anonymized versions \(if applicable\)\. - •Providing as much information as possible in supplemental material \(appended to the paper\) is recommended, but including URLs to data and code is permitted\.
26. 6\.Experimental setting/details
27. Question: Does the paper specify all the training and test details \(e\.g\., data splits, hyperparameters, how they were chosen, type of optimizer\) necessary to understand the results?
28. Answer:\[Yes\]
29. Justification: Full training and evaluation details are provided in Appendix B (Detailed implementation).
30. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them\. - •The full details can be provided either with the code, in appendix, or as supplemental material\.
31. 7\.Experiment statistical significance
32. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
33. Answer:\[No\]
34. Justification: RL experiments on LLMs are too computationally expensive to repeat multiple times.
35. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The authors should answer\[Yes\]if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper\. - •The factors of variability that the error bars are capturing should be clearly stated \(for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions\)\. - •The method for calculating the error bars should be explained \(closed form formula, call to a library function, bootstrap, etc\.\) - •The assumptions made should be given \(e\.g\., Normally distributed errors\)\. - •It should be clear whether the error bar is the standard deviation or the standard error of the mean\. - •It is OK to report 1\-sigma error bars, but one should state it\. The authors should preferably report a 2\-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified\. - •For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range \(e\.g\., negative error rates\)\. - •If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text\.
36. 8\.Experiments compute resources
37. Question: For each experiment, does the paper provide sufficient information on the computer resources \(type of compute workers, memory, time of execution\) needed to reproduce the experiments?
38. Answer:\[Yes\]
39. Justification: Compute resources are detailed in the appendix (Table 4).
40. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage\. - •The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute\. - •The paper should disclose whether the full research project required more compute than the experiments reported in the paper \(e\.g\., preliminary or failed experiments that didn’t make it into the paper\)\.
41. 9\.Code of ethics
42. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
43. Answer:\[Yes\]
44. Justification: We follow the NeurIPS Code of Ethics.
45. Guidelines: - •The answer\[N/A\]means that the authors have not reviewed the NeurIPS Code of Ethics\. - •If the authors answer\[No\], they should explain the special circumstances that require a deviation from the Code of Ethics\. - •The authors should make sure to preserve anonymity \(e\.g\., if there is a special consideration due to laws or regulations in their jurisdiction\)\.
46. 10\.Broader impacts
47. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
48. Answer:\[N/A\]
49. Justification: There is no societal impact of the work performed\.
50. Guidelines: - •The answer\[N/A\]means that there is no societal impact of the work performed\. - •If the authors answer\[N/A\]or\[No\], they should explain why their work has no societal impact or why the paper does not address societal impact\. - •Examples of negative societal impacts include potential malicious or unintended uses \(e\.g\., disinformation, generating fake profiles, surveillance\), fairness considerations \(e\.g\., deployment of technologies that could make decisions that unfairly impact specific groups\), privacy considerations, and security considerations\. - •The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments\. However, if there is a direct path to any negative applications, the authors should point it out\. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate Deepfakes for disinformation\. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster\. - •The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from \(intentional or unintentional\) misuse of the technology\. - •If there are negative societal impacts, the authors could also discuss possible mitigation strategies \(e\.g\., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML\)\.
51. 11\.Safeguards
52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse \(e\.g\., pre\-trained language models, image generators, or scraped datasets\)?
53. Answer:\[N/A\]
54. Justification: The paper poses no such risks.
55. Guidelines: - •The answer\[N/A\]means that the paper poses no such risks\. - •Released models that have a high risk for misuse or dual\-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters\. - •Datasets that have been scraped from the Internet could pose safety risks\. The authors should describe how they avoided releasing unsafe images\. - •We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort\.
56. 12\.Licenses for existing assets
57. Question: Are the creators or original owners of assets \(e\.g\., code, data, models\), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
58. Answer:\[Yes\]
59. Justification: We follow the licenses and terms of use of all assets used in our experiments.
60. Guidelines: - •The answer\[N/A\]means that the paper does not use existing assets\. - •The authors should cite the original paper that produced the code package or dataset\. - •The authors should state which version of the asset is used and, if possible, include a URL\. - •The name of the license \(e\.g\., CC\-BY 4\.0\) should be included for each asset\. - •For scraped data from a particular source \(e\.g\., website\), the copyright and terms of service of that source should be provided\. - •If assets are released, the license, copyright information, and terms of use in the package should be provided\. For popular datasets,[paperswithcode\.com/datasets](https://arxiv.org/html/2605.07316v1/paperswithcode.com/datasets)has curated licenses for some datasets\. Their licensing guide can help determine the license of a dataset\. - •For existing datasets that are re\-packaged, both the original license and the license of the derived asset \(if it has changed\) should be provided\. - •If this information is not available online, the authors are encouraged to reach out to the asset’s creators\.
61. 13\.New assets
62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
63. Answer:\[N/A\]
64. Justification: The paper does not release new assets.
65. Guidelines: - •The answer\[N/A\]means that the paper does not release new assets\. - •Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates\. This includes details about training, license, limitations, etc\. - •The paper should discuss whether and how consent was obtained from people whose asset is used\. - •At submission time, remember to anonymize your assets \(if applicable\)\. You can either create an anonymized URL or include an anonymized zip file\.
66. 14\.Crowdsourcing and research with human subjects
67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation \(if any\)?
68. Answer:\[N/A\]
69. Justification: The paper does not involve crowdsourcing nor research with human subjects\.
70. Guidelines: - •The answer\[N/A\]means that the paper does not involve crowdsourcing nor research with human subjects\. - •Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper\. - •According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector\.
71. 15\.Institutional review board \(IRB\) approvals or equivalent for research with human subjects
72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board \(IRB\) approvals \(or an equivalent approval/review based on the requirements of your country or institution\) were obtained?
73. Answer:\[N/A\]
74. Justification: The paper does not involve crowdsourcing nor research with human subjects\.
75. Guidelines: - •The answer\[N/A\]means that the paper does not involve crowdsourcing nor research with human subjects\. - •Depending on the country in which research is conducted, IRB approval \(or equivalent\) may be required for any human subjects research\. If you obtained IRB approval, you should clearly state this in the paper\. - •We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution\. - •For initial submissions, do not include any information that would break anonymity \(if applicable\), such as the institution conducting the review\.
76. 16\.Declaration of LLM usage
77. Question: Does the paper describe the usage of LLMs if it is an important, original, or non\-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does*not*impact the core methodology, scientific rigor, or originality of the research, declaration is not required\.
78. Answer:\[N/A\]
79. Justification: The core method development in this research does not involve LLMs as any important, original, or non\-standard components\.
80. Guidelines: - •The answer\[N/A\]means that the core method development in this research does not involve LLMs as any important, original, or non\-standard components\. - •Please refer to our LLM policy in the NeurIPS handbook for what should or should not be described\.
