Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models
Summary
This paper introduces Dynamic Rollout Editing (DRE), a training-time intervention to reduce overthinking in GRPO-style reinforcement learning for reasoning models. DRE edits successful trajectories by preserving the solution-reachable prefix and preferring verified shorter edits, weakening the preference for unnecessary thinking.
# Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models
Source: [https://arxiv.org/html/2606.17890](https://arxiv.org/html/2606.17890)
Zihao Wei1,2, Wenjie Shi, Liang Pang1 (corresponding author), Jingcheng Deng1,2, Shicheng Xu1,2, Shasha Guo1, Zenghao Duan1,2, Jiahao Liu, Jingang Wang, Huawei Shen1, Xueqi Cheng1

1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2 University of Chinese Academy of Sciences, Beijing, China

{weizihao22z, pangliang}@ict.ac.cn, shiwj09@163.com
###### Abstract
Long\-form chain\-of\-thought reasoning can improve LLM performance on complex tasks, but models often continue generating unnecessary reasoning after a correct answer has emerged\. We refer to this behavior as overthinking\. We study this phenomenon from the perspective of GRPO\-style reinforcement learning \(RL\) post\-training, framing it as a training\-time credit\-assignment problem rather than merely a decoding\-time stopping problem\. In rollouts sampled at the onset of GRPO training, we observe that successful trajectories can exhibit a slightly higher degree of overthinking than unsuccessful trajectories for the same prompts\. This early imbalance provides a starting point for an undesirable feedback loop: because GRPO assigns sequence\-level credit, it cannot distinguish the solution\-reaching prefix from the unnecessary continuation that lengthens a successful trajectory\. Both receive positive update signal, allowing the initial imbalance to grow into more severe overthinking during training\. To address this issue, we introduce Dynamic Rollout Editing \(DRE\), a training\-time intervention for successful trajectories that continue thinking after answer emergence\. DRE preserves the accepted verified prefix, edits the remaining thinking, and prefers the edited trajectory within the same RL group, weakening the preference signal for unnecessary thinking without penalizing the reasoning needed to reach the answer\. Experiments across diverse tasks show the effectiveness of DRE\.
## 1 Introduction
Figure 1: Schematic view of training-time overthinking. (a) Vanilla GRPO attaches positive credit to successful trajectories with greater overthinking. (b) Dynamic rollout editing prefers verified shorter edits, reducing the association between success and overthinking.

Large language models (LLMs) have substantially improved reasoning performance by generating long-form chain-of-thought processes before producing final responses Guo et al. ([2025](https://arxiv.org/html/2606.17890#bib.bib2)); Yang et al. ([2025](https://arxiv.org/html/2606.17890#bib.bib4)); Meituan LongCat Team ([2025](https://arxiv.org/html/2606.17890#bib.bib55)). This paradigm is central to reinforcement learning post-training, where models explore reasoning trajectories and learn from outcome feedback. However, longer thinking is not always better: generation length and answer quality are not monotonically related Ghosal et al. ([2025](https://arxiv.org/html/2606.17890#bib.bib57)); Han et al. ([2026](https://arxiv.org/html/2606.17890#bib.bib58)). Once a correct answer has appeared, further thinking can increase inference cost and may even move the model away from the correct solution, a phenomenon commonly described as *overthinking* Sui et al. ([2025](https://arxiv.org/html/2606.17890#bib.bib11)); Wang et al. ([2025](https://arxiv.org/html/2606.17890#bib.bib12)).
Existing work on overthinking largely treats it as a problem of controlling reasoning length: deciding when the model should stop or how its thinking process can be shortened without hurting accuracy. Studies of reasoning dynamics identify boundaries such as the reasoning completion point, where useful reasoning gives way to unnecessary thinking Wu et al. ([2026](https://arxiv.org/html/2606.17890#bib.bib32)); Su et al. ([2025](https://arxiv.org/html/2606.17890#bib.bib59)); Wei et al. ([2025](https://arxiv.org/html/2606.17890#bib.bib60)). Building on this view, early-exit decoding, RL-based stopping controllers, and length-oriented regularization aim to reduce excessive generation Yang et al. ([2026](https://arxiv.org/html/2606.17890#bib.bib30)); Dai et al. ([2026](https://arxiv.org/html/2606.17890#bib.bib34)); Kimi Team ([2025](https://arxiv.org/html/2606.17890#bib.bib18)). While useful, these methods mainly intervene after the policy has learned its reasoning behavior. They leave open how reinforcement learning may reinforce overthinking after a correct answer is already verifiable.
We study this issue in Group Relative Policy Optimization (GRPO) Shao et al. ([2024](https://arxiv.org/html/2606.17890#bib.bib65)), a widely used post-training method for reasoning models. GRPO assigns sequence-level rewards and updates the policy using group-relative advantages. This design improves task success, but can also create a credit-assignment problem: a successful trajectory may contain both thinking that leads to the correct answer and unnecessary thinking after the answer has emerged. Since the same positive advantage is applied to the trajectory as a whole, GRPO does not distinguish the reasoning prefix that leads to the verified answer from the later overthinking that only lengthens the trajectory.
To examine this credit\-assignment problem, we make answer emergence observable through a parser\-verifier analysis\. For each trajectory, we identify the first point at which a correct answer can be extracted and externally verified, allowing us to compare behavior before and after answer emergence\. In rollouts sampled early in GRPO training, successful trajectories already show a slight tendency to contain more overthinking after answer emergence than unsuccessful trajectories sampled for the same prompts\. This initial asymmetry can be amplified by sequence\-level updates: when unnecessary thinking is part of a successful trajectory, GRPO may reinforce it together with the thinking that leads to the answer\. As training improves task success, this coupled credit can make overthinking after answer emergence more persistent, including redundant checking, answer revisits, and repeated verification\. This observation motivates a different intervention\. Rather than only stopping a trained model earlier, we aim to change the training signal so that task success is less coupled with overthinking after answer emergence\. A simple length penalty may suppress useful thinking, while penalizing an entire successful trajectory may weaken the prefix needed to solve the problem\. An effective intervention should preserve the verified thinking prefix while reducing the preference signal assigned to subsequent overthinking\.
In this work, we introduce *dynamic rollout editing* (DRE), a training-time intervention for successful trajectories that contain overthinking after answer emergence. Figure [1](https://arxiv.org/html/2606.17890#S1.F1) contrasts vanilla GRPO with DRE: vanilla GRPO can reinforce successful trajectories that include overthinking after answer emergence, whereas DRE shifts preference toward verified shorter counterparts. For a successful trajectory with overthinking after answer emergence, DRE constructs an edited counterpart that preserves the verified prefix up to a conservative boundary and replaces the subsequent overthinking. The edited trajectory is then preferred within the same GRPO group, allowing the model to retain a positive training signal for reaching the correct answer while weakening the association between success and overthinking. We further introduce two complementary training strategies: prefix masking avoids directly penalizing the shared verified prefix, while GClip provides a learning signal for edited tokens near the edit point.
Contributions. (1) We identify overthinking after answer emergence as a training-time credit-assignment problem in GRPO-style post-training. Our analysis shows that sequence-level advantages can mix credit between reasoning that leads to the answer and unnecessary thinking after answer emergence. (2) We propose DRE, a training-time intervention that weakens the association between task success and unnecessary thinking after answer emergence. DRE assigns a higher preference to shorter edited counterparts within the same GRPO group while preserving the verified reasoning prefix that reaches the answer. (3) We demonstrate the broad effectiveness of DRE across mathematical reasoning, science QA, and code generation. DRE reduces thinking length, mainly after answer emergence, while preserving task performance.
## 2 Overthinking After Answer Emergence
Figure 2: Behavioral evidence from an R1-Zero-like GRPO run of Qwen3-4B-Base. (a) AIME24 accuracy improves during training. (b) Verification and checking keywords increase both before and after answer emergence. (c) For the same prompts, successful rollouts show more answer revisits than unsuccessful rollouts.

To examine this phenomenon, this section first formulates the generation process and defines answer emergence as the earliest point at which a verified answer can be extracted from a trajectory. This point provides a concrete boundary for analysis: tokens before it may be needed to reach the answer, whereas tokens after it allow us to measure overthinking once the answer has already appeared. We then present preliminary observations showing that, for the same prompts, successful rollouts often contain more overthinking after answer emergence than unsuccessful rollouts. Finally, we analyze how GRPO can reinforce this behavior: because rewards are assigned at the sequence level, the update does not explicitly distinguish reasoning that leads to the answer from overthinking that follows it when both occur in a successful trajectory.
### 2.1 Problem Formulation
We formalize overthinking after answer emergence in terms of thinking trajectories. Given a prompt $x$, the model generates a thinking process $t$ followed by a content process $y$. We model this generation process as $\pi_{\theta}(t,y\mid x)=\pi_{\theta}(t\mid x)\,\pi_{\theta}(y\mid x,t)$. The resulting trajectory is denoted as $\tau=(x,\texttt{<think>}~t~\texttt{</think>}~y)$.

For each trajectory, we split the thinking process into sentences and scan sentence prefixes from left to right. At each prefix, a parser extracts a candidate answer, and a verifier checks whether the candidate is correct for the given prompt. We define $k^{\star}(\tau)$ as the first sentence index whose prefix passes this verification check. If no prefix contains a verified answer, we set $k^{\star}(\tau)=K_{t}(\tau)$, where $K_{t}(\tau)$ is the number of thinking sentences in $\tau$.

Thus, $k^{\star}(\tau)$ marks the earliest sentence-level point at which a correct answer becomes verifiable in the thinking process. We use this point as an analysis boundary for measuring overthinking after answer emergence, without assuming that every subsequent token is redundant.
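The prefix scan that defines $k^{\star}(\tau)$ can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the sentence splitter, `toy_parse`, and `toy_verify` are hypothetical stand-ins for the parser and verifier, which the text does not specify.

```python
import re
from typing import Callable, List, Optional


def split_sentences(thinking: str) -> List[str]:
    """Naive splitter on ., !, ? followed by whitespace (an assumption)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", thinking.strip()) if s]


def answer_emergence_index(
    thinking: str,
    parse_answer: Callable[[str], Optional[str]],
    verify: Callable[[str], bool],
) -> int:
    """Return k*: the 1-based index of the first sentence whose prefix
    yields a verified answer, or K_t (the sentence count) if none does."""
    sentences = split_sentences(thinking)
    for k in range(1, len(sentences) + 1):
        candidate = parse_answer(" ".join(sentences[:k]))
        if candidate is not None and verify(candidate):
            return k
    return len(sentences)


# Hypothetical parser: take the last "answer is <int>" pattern in the prefix.
def toy_parse(prefix: str) -> Optional[str]:
    matches = re.findall(r"answer is (-?\d+)", prefix)
    return matches[-1] if matches else None


# Hypothetical verifier: the reference answer for this toy prompt is 42.
def toy_verify(candidate: str) -> bool:
    return candidate == "42"


solved = ("We need 6 times 7. So the answer is 42. "
          "Let me double check. Yes, the answer is 42.")
unsolved = "The answer is 41. Hmm. Moving on."

k_star_solved = answer_emergence_index(solved, toy_parse, toy_verify)
k_star_unsolved = answer_emergence_index(unsolved, toy_parse, toy_verify)
```

In the solved example the answer first becomes verifiable at the second sentence, so the two sentences after it are the candidates for overthinking. In the unsolved example the function returns the sentence count, matching the convention $k^{\star}(\tau)=K_{t}(\tau)$.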
### 2.2 Preliminary Observations
#### Setting\.
We train Qwen3-4B-Base with GRPO following the DeepSeek-R1-Zero Guo et al. ([2025](https://arxiv.org/html/2606.17890#bib.bib2)) setting, using only rule-based rewards for answer correctness and format. This setup allows us to examine whether overthinking after answer emergence increases during RL training.
For each rollout in which a verified answer appears, $k^{\star}(\tau)$ denotes the first sentence index at which the prefix contains that answer. We count keywords related to verification and checking in the sentences before and after $k^{\star}(\tau)$. Keywords in sentences before $k^{\star}(\tau)$ may reflect useful checking before the answer is reached, whereas keywords in sentences after $k^{\star}(\tau)$ serve as indicators of possible overthinking, since the model continues to check or verify after the answer has already appeared. We then examine whether both quantities increase during training.

To compare successful and unsuccessful rollouts for the same prompt, we use answer revisits as a rollout-level measure, because unsuccessful rollouts may not contain a valid $k^{\star}(\tau)$. For each rollout with an extractable final answer, we count how often that answer is repeated after its first occurrence, and compare this count across rollouts sampled for the same prompt.
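One way to operationalize the answer-revisit count is sketched below. The matching rule (whole-token match on the extracted answer string) is an assumption made for illustration; the text does not state how repetitions are detected.

```python
import re


def count_answer_revisits(text: str, final_answer: str) -> int:
    """Count occurrences of the extracted final answer after its first one.

    Hypothetical matching rule: the answer must not be embedded in a longer
    word or number (so "142" and "42.5" do not count as "42").
    """
    pattern = re.compile(
        rf"(?<![\w.]){re.escape(final_answer)}(?!\w|\.\d)"
    )
    occurrences = len(pattern.findall(text))
    return max(occurrences - 1, 0)


rollout = "So x = 42. Check: 6*7 = 42. Final answer: 42. Not 142 or 42.5."
revisits = count_answer_revisits(rollout, "42")
```

Here the answer appears three times as a standalone value, so the rollout has two revisits. Because the measure needs only an extractable final answer, it applies to unsuccessful rollouts as well.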
Observation: RL training improves accuracy while amplifying overthinking behaviors such as answer repetition and redundant verification. Figure [2](https://arxiv.org/html/2606.17890#S2.F2)(a) illustrates the accuracy gain during training. Figure [2](https://arxiv.org/html/2606.17890#S2.F2)(b) shows that verification and checking keywords increase both before and after $k^{\star}(\tau)$. The increase before $k^{\star}(\tau)$ is consistent with more useful checking before the verified answer appears. The increase after $k^{\star}(\tau)$ suggests more unnecessary overthinking after the answer has already emerged. Figure [2](https://arxiv.org/html/2606.17890#S2.F2)(c) presents a similar pattern within GRPO sampling groups: for the same prompt, successful rollouts repeat their extracted answers more often than unsuccessful rollouts. A qualitative example in Appendix [7](https://arxiv.org/html/2606.17890#A7.F7) illustrates this pattern. At an earlier training step, the model still requires additional thinking to reach the correct answer; at a later step, it reaches the verified answer but continues with an additional verification step.
These observations motivate the analysis below: sequence\-level credit can conflate the thinking that leads to answer emergence with subsequent overthinking, thereby reinforcing the latter when it is associated with positive advantage\.
### 2.3 Analysis of Overthinking under GRPO
Under GRPO, each prompt is paired with multiple sampled trajectories. The complete output receives a reward, which is normalized within the sampled group to form a group-relative advantage $A(\tau)$. The argument below is not the literal clipped GRPO estimator used in training. Instead, it is an unclipped first-order approximation for analysis, evaluated at or near $\theta_{\mathrm{old}}$, that suppresses policy-ratio, clipping, and content-process terms in order to isolate where the scalar group advantage is broadcast. For readability, we describe the broadcast effect at the sentence level, while the training objective is applied at the token level.

Let $\psi_{s}(\tau)=\nabla_{\theta}\log\pi_{\theta}(t_{s}\mid x,t_{<s})$ denote the score term for the $s$-th thinking sentence. Let $Z_{t}(\tau)$ denote the number of unmasked thinking units in this simplified sentence-level view. Write $\mathbb{E}_{\mathrm{old}}$ for expectation under $\tau\sim\pi_{\theta_{\mathrm{old}}}$, and abbreviate $k^{\star}(\tau)$ and $K_{t}(\tau)$ as $k^{\star}$ and $K_{t}$ inside the display. For trajectories with no verified thinking prefix, our convention $k^{\star}(\tau)=K_{t}(\tau)$ only makes the post-boundary sum empty; it should not be interpreted as answer emergence at the final sentence. A first-order thinking-component approximation can be written as

$$
\begin{aligned}
g_{\mathrm{approx}}(\theta) &= \mathbb{E}_{\mathrm{old}}\!\left[\frac{A(\tau)}{Z_{t}(\tau)}\,C(\tau)\right], \\
C(\tau) &= \sum_{s\leq k^{\star}}\psi_{s}(\tau)+\sum_{k^{\star}<s\leq K_{t}}\psi_{s}(\tau).
\end{aligned}
\tag{1}
$$

Equation ([1](https://arxiv.org/html/2606.17890#S2.E1)) shows the credit-broadcasting pathway: the same group-relative advantage multiplies score terms before and after the first verified answer boundary. This does not mean that every post-$k^{\star}$ token is necessarily reinforced. Rather, when post-answer continuation is correlated with positive advantage, continuation after $k^{\star}$ can inherit the same positive training signal as reasoning that may be needed to expose the verified answer. The matched-prompt evidence above supports this condition: successful rollouts contain more answer revisits and more post-answer checking behavior. Because $Z_{t}(\tau)$ token-normalizes the rollout loss, this analysis concerns the direction of broadcast credit rather than gradient magnitude as post-$k^{\star}$ continuation length increases.
This analysis suggests two intervention principles\. First, the verified prefix before the editable boundary should be preserved, as it may contain reasoning needed to expose the answer\. Second, the preference signal should target overthinking after a conservative boundary, rather than penalize total thinking length uniformly\.
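The broadcast in Equation (1) can be made concrete with a small numeric sketch. It assumes mean and standard-deviation normalization of rewards within the group (population standard deviation plus a small `eps`, both assumptions) and uses the simplified sentence-level view; the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize sequence-level rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_credit(
    advantage: float, k_star: int, num_sentences: int
) -> Tuple[List[float], List[float]]:
    """Per-sentence coefficient A / Z_t on each score term psi_s,
    split into sentences up to k* and sentences after k*."""
    coeff = advantage / num_sentences
    pre = [coeff] * k_star
    post = [coeff] * (num_sentences - k_star)
    return pre, post


# Four rollouts for one prompt: two successful (reward 1), two not (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A successful rollout with 8 thinking sentences and k* = 3.
pre, post = broadcast_credit(advantages[0], k_star=3, num_sentences=8)
```

Every entry of `pre` and `post` is the same positive number: the five sentences after $k^{\star}$ receive exactly the coefficient given to the three sentences that reached the answer. Nothing in the sequence-level signal separates the two sums in $C(\tau)$.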
Figure 3: Overview of dynamic rollout editing. After $k^{\star}$ marks answer emergence, prefix verification selects an editable boundary $\hat{k}$. The accepted edit preserves the thinking process through $\hat{k}$, removes the subsequent overthinking, appends a verified final answer, and regenerates the content process. The expanded GRPO group assigns higher preference to accepted verified edits than to verified originals or unedited copies. Prefix masking prevents the shared verified prefix on the original branch from receiving direct negative credit, while GClip preserves learning signal for the preferred edited suffix.
## 3 Dynamic Rollout Editing
The analysis in Section [2](https://arxiv.org/html/2606.17890#S2) motivates our intervention: when overthinking after a verified answer is correlated with positive advantage, trajectory-level rewards can reinforce it through the same gradient signal. We address this issue with *dynamic rollout editing*, which constructs an edited counterpart for an overthinking rollout within the same GRPO group and assigns higher preference to the accepted edit than to the original. The edit preserves the verified prefix, ends the thinking process at an accepted editable boundary, and reduces preference for overthinking beyond that boundary.
### 3.1 Edited Rollout Synthesis
For each rollout, dynamic rollout editing searches near answer emergence for an editable boundary, proposes a trajectory that stops thinking there, and accepts it only if both prefix verification and regenerated\-content verification succeed\. Accepted edits preserve the verified prefix, remove later thinking, and regenerate the content process; failed edits fall back to an unedited copy\.
#### Editable Boundary Identification\.
The answer-emergence boundary $k^{\star}$ marks the earliest prefix from which the parser-verifier extracts the correct answer, but it is not necessarily a safe edit point: the answer may appear inside an unfinished calculation, provisional statement, or incomplete final-answer pattern. We therefore use $k^{\star}$ as the starting point for prefix verification, not as the edit point itself. Starting from $k^{\star}$, we append `\nFinal Answer` to each candidate sentence prefix and let the policy generate a verified-answer closure. The first prefix whose closure exposes the same verified answer and reaches `</think>` is accepted as the editable boundary $\hat{k}$.
#### Boundary\-Based Rollout Synthesis\.
Given $\hat{k}$, the edit preserves thinking through $\hat{k}$, removes the original overthinking after $\hat{k}$, appends the verified answer closure, and regenerates the content process. We keep the edited trajectory only when the regenerated content exposes the same verified answer. Otherwise, the method falls back to an unedited copy. Accepted edits therefore remove overthinking after $\hat{k}$ without modifying the accepted verified prefix. Figure [3](https://arxiv.org/html/2606.17890#S2.F3) summarizes boundary identification, rollout synthesis, and training with edited rollouts. Appendix [H](https://arxiv.org/html/2606.17890#A8) shows a concrete accepted edit in this prefix-preserving form.
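The two steps above (boundary search, then synthesis with fallback) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: `generate_closure` and `regenerate_content` stand in for policy sampling calls, and the toy stubs, the integer parser, and the string-level treatment of `</think>` are all hypothetical.

```python
import re
from typing import Callable, List, Optional, Tuple

FINAL_ANSWER_CUE = "\nFinal Answer"  # cue appended to each candidate prefix
END_THINK = "</think>"


def find_editable_boundary(sentences, k_star, generate_closure,
                           parse_answer, verified_answer):
    """Scan prefixes from k* onward; return (k_hat, closure) or None."""
    for k in range(k_star, len(sentences) + 1):
        prompt = " ".join(sentences[:k]) + FINAL_ANSWER_CUE
        closure = generate_closure(prompt)
        reaches_end = closure.rstrip().endswith(END_THINK)
        if reaches_end and parse_answer(closure) == verified_answer:
            return k, closure
    return None


def synthesize_edit(sentences, content, k_star, generate_closure,
                    regenerate_content, parse_answer, verified_answer):
    """Return ((thinking, content), accepted). Falls back to the original
    trajectory when either verification step fails. Intended for rollouts
    that actually contain a verified prefix."""
    original = (" ".join(sentences), content)
    found = find_editable_boundary(sentences, k_star, generate_closure,
                                   parse_answer, verified_answer)
    if found is None:
        return original, False
    k_hat, closure = found
    edited_thinking = " ".join(sentences[:k_hat]) + FINAL_ANSWER_CUE + closure
    new_content = regenerate_content(edited_thinking)
    if parse_answer(new_content) != verified_answer:
        return original, False
    return (edited_thinking, new_content), True


# --- Hypothetical stubs standing in for the parser and the policy ---
def toy_parse(text: str) -> Optional[str]:
    numbers = re.findall(r"-?\d+", text)
    return numbers[-1] if numbers else None


def toy_closure(prompt: str) -> str:
    body = prompt[: -len(FINAL_ANSWER_CUE)]
    if body.endswith("but wait."):          # provisional statement: no clean closure
        return " is unclear, let me reconsider"
    return ": 42 </think>"


sentences = ["6 times 7 is 42, but wait.", "Recomputing gives 42.",
             "Let me check again.", "Still 42."]

edited, accepted = synthesize_edit(
    sentences, "The answer is 42.", 1, toy_closure,
    lambda thinking: "The answer is 42.", toy_parse, "42")

fallback, fallback_accepted = synthesize_edit(
    sentences, "The answer is 42.", 1, toy_closure,
    lambda thinking: "The answer is 41.", toy_parse, "42")
```

In the toy run, $k^{\star}=1$ is rejected because the answer sits inside a provisional statement, so the accepted boundary is $\hat{k}=2$ and the last two sentences are dropped. The second call shows the conservative fallback: when the regenerated content does not expose the verified answer, the original trajectory is returned unchanged.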
### 3.2 Edited Rollout Scoring
The scoring rule turns accepted edits into preferences within the group. For each sampled original trajectory $\tau_{i}$, the GRPO group contains one auxiliary trajectory: the accepted edit $\tilde{\tau}_{i}$ when both prefix verification and verification of the regenerated content succeed, and an unedited copy otherwise. Let $V(\rho)=1$ indicate that the content process of trajectory $\rho$ is verified, and let $E(\rho)=1$ indicate that $\rho$ is an accepted edit. The scoring rule is a shaped preference score for the expanded GRPO group, not a calibrated task reward. It encodes an additional preference among verified trajectories: when an accepted edit preserves the verified answer while removing continuation after $\hat{k}$, the edit should be preferred over the original continuation.

$$
q(\rho)=\begin{cases}
1, & V(\rho)=1\text{ and }E(\rho)=1,\\
0, & V(\rho)=1\text{ and }E(\rho)=0,\\
-1, & V(\rho)=0.
\end{cases}
\tag{2}
$$

Thus, accepted verified edits are ranked above verified originals or unedited copies, which are in turn ranked above unverified trajectories. The preference between an accepted edit and a verified original is not a claim that the edit is more correct; both pass the verifier. It is reward shaping that adds a training preference for stopping after a conservative verified boundary. Because the scores are normalized within the group to form advantages, their absolute scale is not calibrated; what matters is the induced within-group ranking and spacing.
When the original is also verified, the edit isolates the preference for removing post\-boundary continuation while preserving the same verified answer\. When the original is not verified but the edit is accepted, the edit is best interpreted as correcting answer drift after a verified prefix\. This case changes both correctness and length, whereas verified\-original edits isolate the preference for removing post\-boundary continuation\.
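A toy expanded group shows how Equation (2) turns into advantages. The normalization (mean and population standard deviation with a small `eps`) is an assumption, and the group composition is invented for illustration.

```python
from statistics import mean, pstdev
from typing import List


def preference_score(verified: bool, accepted_edit: bool) -> int:
    """Shaped preference score q(rho) from Equation (2)."""
    if not verified:
        return -1
    return 1 if accepted_edit else 0


def normalize(scores: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages from shaped scores (assumed normalization)."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(q - mu) / (sigma + eps) for q in scores]


# (label, V, E): each original is paired with one auxiliary trajectory.
group = [
    ("original A", True, False),
    ("accepted edit of A", True, True),
    ("original B", True, False),
    ("unedited copy of B", True, False),   # edit was rejected, fallback copy
    ("original C", False, False),
    ("unedited copy of C", False, False),
]

scores = [preference_score(v, e) for _, v, e in group]
advantages = normalize(scores)
```

The accepted edit receives the largest advantage, verified originals and copies share a smaller one, and unverified trajectories are lowest. In this particular toy group the verified original's advantage remains slightly positive, which illustrates that the shaping reorders preferences among verified trajectories rather than treating the original as a failure.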
### 3.3 Training with Edited Rollouts
This shaped preference changes the GRPO training signal without redefining task correctness: it weakens positive credit for the original thinking after the verified answer boundary. For accepted edit pairs, this preference reversal creates two optimization issues during training:

(1) Negative credit on the shared prefix. The original trajectory receives a lower preference score even though it shares the verified prefix with the edited trajectory. Without a safeguard, the resulting lower advantage would be applied to the entire original trajectory, including both the accepted verified prefix and the overthinking after $\hat{k}$.

(2) Upper clipping on edited tokens. The edited suffix begins with the verified final answer string and regenerated content rather than the original rollout continuation. These positively preferred tokens can have low probability under the old policy near $\hat{k}$, so ordinary clipping can remove the ascent signal where the edit needs to be learned.
#### Prefix Masking for Preserving the Verified Prefix\.
For an accepted pair $(\tau,\tilde{\tau})$, the original and edited trajectories share the same generated prefix through $\hat{k}$. After this prefix, the edited rollout begins the verified final answer string, whereas the original keeps thinking after $\hat{k}$. Prefix masking applies only to the original trajectory: it excludes the generated tokens in this shared prefix from that trajectory's GRPO loss, while keeping unit weight on tokens after the divergence. The edited trajectory and all other trajectories keep unit weights. Thus, negative relative credit from the original targets its post-$\hat{k}$ continuation rather than the accepted verified prefix shared with the edit.
Table 1: Main comparison across three Qwen3 reasoning models. Percentages in parentheses report thinking-token use relative to the Raw Model under the same base model and benchmark. The best result is shown in bold.
#### GClip for Learning from Edited Rollouts\.
*GClip* handles the second issue by using a clipped ratio operator with a custom backward pass for edited tokens. Accepted edits are synthetic continuations constructed after rollout sampling, so the expanded-group objective should be read as a GRPO-style preference surrogate over original and edited trajectories rather than as an unbiased on-policy GRPO estimator. We retain the ratio to $\pi_{\theta_{\mathrm{old}}}$ to keep the forward surrogate clipped and to measure compatibility with the old policy. Because the edited suffix no longer follows the original rollout path, its early tokens can have very low old-policy probability near $\hat{k}$. Positively preferred edited tokens can then hit the upper clip boundary quickly, causing ordinary clipping to remove the ascent signal where the edit needs to be learned.

For generated tokens $z_{1},\ldots,z_{m}$, let the token-level policy ratio be

$$
r_{s}(\theta)\triangleq\frac{\pi_{\theta}(z_{s}\mid x,z_{<s})}{\pi_{\theta_{\mathrm{old}}}(z_{s}\mid x,z_{<s})}.
\tag{3}
$$

For any scalar ratio $r>0$ and clip bounds $0<\ell<u$, GClip has the same forward scalar value as the ordinary clipped ratio, but it changes the backward pass through a stop-gradient construction:

$$
\operatorname{GClip}(r;\ell,u)\triangleq\operatorname{sg}\!\bigl(\mathrm{clip}(r,\ell,u)\bigr)\cdot\frac{r}{\operatorname{sg}(r)}.
\tag{4}
$$

Here $\operatorname{sg}(\cdot)$ denotes the stop-gradient operator. GClip should therefore be understood as a straight-through clipped-ratio operator: it preserves the clipped forward surrogate value while replacing the zero-gradient behavior of the ordinary clipped branch outside the clipping interval. In the training objective below, GClip replaces the clipped-ratio term in the GRPO-style surrogate. Theorem [1](https://arxiv.org/html/2606.17890#Thmtheorem1) formalizes this straight-through derivative.
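As a reading aid, the forward value and gradient of GClip follow directly from the definition in Equation (4) and the usual stop-gradient rules ($\operatorname{sg}(v)$ evaluates to $v$ and has zero derivative). The short derivation below is a sketch based only on that definition, not a restatement of the paper's theorem.

$$
\begin{aligned}
\text{Forward:}\quad \operatorname{GClip}(r;\ell,u) &= \mathrm{clip}(r,\ell,u)\cdot\frac{r}{r} = \mathrm{clip}(r,\ell,u),\\[4pt]
\text{Backward:}\quad \nabla_{\theta}\operatorname{GClip}(r_{s}(\theta);\ell,u) &= \mathrm{clip}(r_{s},\ell,u)\cdot\frac{\nabla_{\theta}r_{s}(\theta)}{r_{s}(\theta)}
= \mathrm{clip}(r_{s},\ell,u)\,\nabla_{\theta}\log\pi_{\theta}(z_{s}\mid x,z_{<s}),\\[4pt]
\text{Ordinary clip:}\quad \nabla_{\theta}\,\mathrm{clip}(r_{s}(\theta),\ell,u) &= \mathbb{1}[\ell<r_{s}<u]\;r_{s}\,\nabla_{\theta}\log\pi_{\theta}(z_{s}\mid x,z_{<s}).
\end{aligned}
$$

The second line uses $\nabla_{\theta}r_{s}/r_{s}=\nabla_{\theta}\log r_{s}=\nabla_{\theta}\log\pi_{\theta}$, since $\pi_{\theta_{\mathrm{old}}}$ does not depend on $\theta$. Inside the interval the two operators have the same gradient; outside it, ordinary clipping gives zero while GClip gives a gradient scaled by the bound $\ell$ or $u$.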
#### Overall Training Objective\.
Let $A(\rho)$ be the group-relative advantage obtained by normalizing the scores in Equation ([2](https://arxiv.org/html/2606.17890#S3.E2)) within the expanded GRPO group, and write $\operatorname{GClip}_{\epsilon}(r)=\operatorname{GClip}(r;1-\epsilon,1+\epsilon)$. We integrate GClip into GRPO by replacing the clipped-ratio term with $\operatorname{GClip}_{\epsilon}$, giving the token-level surrogate

$$
\ell_{s}(\theta;\rho)=-\min\!\Bigl(r_{s}(\theta)A(\rho),\ \operatorname{GClip}_{\epsilon}(r_{s}(\theta))A(\rho)\Bigr).
\tag{5}
$$

For accepted edited suffixes, the main saturation case is $A(\rho)>0$ with $r_{s}>1+\epsilon$: ordinary clipping would remove the ascent signal on the clipped branch. GClip keeps the same clipped forward value while supplying a straight-through backward signal, so preferred edited tokens can still be learned. This should be read as an optimization surrogate rather than a formal PPO trust-region guarantee; Appendix [E](https://arxiv.org/html/2606.17890#A5) gives the full derivative analysis, including the lower-side saturation case.
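The saturation cases can be checked numerically with a small helper that returns the scalar multiplying $\nabla_{\theta}\log\pi_{\theta}$ in the gradient of the surrogate $\min(r A,\ \cdot\ A)$ for a single token. The case analysis is derived here from Equations (4) and (5) and is an illustration only; the helper name, the tie-breaking toward the unclipped branch, and the default $\epsilon=0.2$ are assumptions.

```python
def clip(r: float, lo: float, hi: float) -> float:
    return max(lo, min(r, hi))


def surrogate_grad_coeff(r: float, advantage: float, eps: float = 0.2,
                         straight_through: bool = True) -> float:
    """Scalar c such that grad of min(r*A, C(r)*A) equals c * grad log pi.

    straight_through=True  -> C is GClip (gradient passes through the clip)
    straight_through=False -> C is the ordinary clipped ratio
    """
    clipped = clip(r, 1.0 - eps, 1.0 + eps)
    if r * advantage <= clipped * advantage:
        # Unclipped branch is the minimum (ties resolved here): grad r = r * grad log pi.
        return r * advantage
    # Clipped branch is the minimum, which only happens outside the interval.
    return clipped * advantage if straight_through else 0.0


# Upper saturation: positively preferred token with ratio above 1 + eps.
upper_plain = surrogate_grad_coeff(1.5, 1.0, straight_through=False)
upper_gclip = surrogate_grad_coeff(1.5, 1.0, straight_through=True)

# Lower saturation: negatively preferred token with ratio below 1 - eps.
lower_plain = surrogate_grad_coeff(0.5, -1.0, straight_through=False)
lower_gclip = surrogate_grad_coeff(0.5, -1.0, straight_through=True)

# Unsaturated cases are identical under both operators.
inside_plain = surrogate_grad_coeff(1.1, 1.0, straight_through=False)
inside_gclip = surrogate_grad_coeff(1.1, 1.0, straight_through=True)
```

With ordinary clipping both saturated cases give a zero coefficient, while the straight-through variant keeps a signal scaled by the clip bound. In an autodiff framework the same effect would come from implementing Equation (4) with the framework's stop-gradient primitive.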
For an original trajectory paired with an accepted edit, $w_{s}(\rho)=0$ on generated tokens in the shared prefix through $\hat{k}$ and $w_{s}(\rho)=1$ otherwise. For its edited counterpart and all other trajectories, $w_{s}(\rho)=1$ at every generated token. The rollout loss is

$$
\mathcal{L}(\theta;\rho)=\frac{\sum_{s}w_{s}(\rho)\,\ell_{s}(\theta;\rho)}{\sum_{s}w_{s}(\rho)}.
\tag{6}
$$

This objective targets post-$\hat{k}$ thinking continuation in accepted edited pairs while preventing the lower-scored original branch from directly penalizing the shared verified prefix.
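The forward value of the masked rollout loss in Equation (6) can be sketched in plain Python. Only forward values are computed, so GClip and ordinary clipping coincide here; the function names, the default $\epsilon=0.2$, and the zero return for a fully masked rollout are assumptions for illustration.

```python
from typing import List


def clip(r: float, lo: float, hi: float) -> float:
    return max(lo, min(r, hi))


def token_loss(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Forward value of Equation (5) for one token."""
    return -min(ratio * advantage,
                clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)


def prefix_mask_weights(num_tokens: int, shared_prefix_len: int,
                        is_original_of_accepted_pair: bool) -> List[int]:
    """w_s = 0 on the shared prefix of an original paired with an accepted
    edit; w_s = 1 everywhere else and for every other trajectory."""
    if not is_original_of_accepted_pair:
        return [1] * num_tokens
    return [0] * shared_prefix_len + [1] * (num_tokens - shared_prefix_len)


def rollout_loss(ratios: List[float], advantage: float,
                 weights: List[int], eps: float = 0.2) -> float:
    """Weighted mean of token losses, Equation (6)."""
    total_weight = sum(weights)
    if total_weight == 0:          # assumed guard for a fully masked rollout
        return 0.0
    weighted = sum(w * token_loss(r, advantage, eps)
                   for r, w in zip(ratios, weights))
    return weighted / total_weight


# Original trajectory: 5 generated tokens, the first 3 shared with its edit.
orig_weights = prefix_mask_weights(5, 3, is_original_of_accepted_pair=True)
orig_loss = rollout_loss([1.0] * 5, advantage=-0.5, weights=orig_weights)

# Edited counterpart: every generated token keeps unit weight.
edit_weights = prefix_mask_weights(4, 3, is_original_of_accepted_pair=False)
edit_loss = rollout_loss([1.0] * 4, advantage=1.5, weights=edit_weights)
```

Only the two post-divergence tokens of the original contribute to its loss, so its lower advantage never touches the shared prefix. The same prefix tokens still receive the edit's positive advantage through the edited trajectory.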
## 4 Experiments
### 4.1 Experimental Setups
Benchmarks. We conduct experiments on diverse benchmarks spanning mathematical reasoning, science QA, and code generation: AIME24, AIME25, and AIME26 Mathematical Association of America ([2026](https://arxiv.org/html/2606.17890#bib.bib46)) for math, GPQA Diamond (GPQA-D) Rein et al. ([2024](https://arxiv.org/html/2606.17890#bib.bib43)) for graduate-level science QA, and LiveCodeBench V6 (LCB-V6) Jain et al. ([2025](https://arxiv.org/html/2606.17890#bib.bib44)) for code.
Metrics. We report accuracy (Acc), mean thinking tokens (Think Tok.), and token efficiency (TE). Acc indicates task performance, and Think Tok. measures the average length of the generated thinking process. TE summarizes the accuracy–token tradeoff and is defined as $\mathrm{TE}=1000\cdot\overline{\mathrm{Acc}}/\overline{L_{t}}$, where $\overline{\mathrm{Acc}}$ and $\overline{L_{t}}$ denote the average accuracy and thinking-token count over the benchmarks in the table. More details are provided in Appendix [B](https://arxiv.org/html/2606.17890#A2).
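The TE definition reduces to a one-line computation. The sketch below uses invented numbers purely to show the arithmetic, and assumes accuracy is expressed in percentage points; neither the values nor the unit are taken from the paper's tables.

```python
from statistics import mean
from typing import Sequence


def token_efficiency(accuracies: Sequence[float],
                     thinking_tokens: Sequence[float]) -> float:
    """TE = 1000 * mean accuracy / mean thinking-token count."""
    return 1000.0 * mean(accuracies) / mean(thinking_tokens)


# Hypothetical per-benchmark values for one method (not results from the paper).
acc = [80.0, 70.0, 60.0]            # accuracy, in percentage points (assumed unit)
tokens = [10000.0, 14000.0, 12000.0]  # mean thinking tokens per benchmark

te = token_efficiency(acc, tokens)  # 1000 * 70 / 12000
```

Because both averages are taken over benchmarks before the ratio is formed, a benchmark with very long thinking traces influences TE through the denominator more than a short one.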
Baselines. We evaluate Dynamic Rollout Editing (DRE) on three Qwen3 reasoning models: Qwen3-4B-Thinking-2507, Qwen3-8B, and Qwen3-30B-A3B-Thinking-2507 Yang et al. ([2025](https://arxiv.org/html/2606.17890#bib.bib4)). For each backbone, *Raw Model* denotes the corresponding original Qwen3 model without further training. We compare DRE with Raw Model, GRPO Shao et al. ([2024](https://arxiv.org/html/2606.17890#bib.bib65)), GRPO + LP Kimi Team ([2025](https://arxiv.org/html/2606.17890#bib.bib18)), and S-GRPO Dai et al. ([2026](https://arxiv.org/html/2606.17890#bib.bib34)). All trainable methods are trained on DAPO-Math-17k Yu et al. ([2025](https://arxiv.org/html/2606.17890#bib.bib66)).
Figure 4: Evaluation on selected AIME24 samples across training steps. Colors and markers indicate different methods. (a) Edited positive trajectories contain fewer answer revisits than their original counterparts. (b) DRE reduces thinking mainly after $k^{\star}$; solid curves show total thinking length, dashed curves show thinking up to $k^{\star}$, and the shaded region highlights the reduction after $k^{\star}$ under DRE. (c) Answer revisits decrease more substantially under DRE than under the baselines.
### 4\.2 Main Results
As shown in Table [1](https://arxiv.org/html/2606.17890#S3.T1), we make four empirical observations:
\(1\) DRE substantially reduces thinking tokens while preserving benchmark accuracy\. Among methods with complete LCB\-V6 results, DRE has the lowest five\-benchmark average thinking length and the highest token efficiency in every model block, using only 72\.1%, 77\.7%, and 75\.3% of Raw Model thinking tokens on the three backbones\. On Qwen3\-4B\-Thinking\-2507 and Qwen3\-8B, this reduction is accompanied by higher average accuracy than both Raw Model and GRPO\. On Qwen3\-30B\-A3B\-Thinking\-2507, where all trained variants trail Raw Model, DRE stays closest to Raw Model in accuracy while using the fewest thinking tokens among trained methods\. The largest\-model result should therefore be read as stronger length control with competitive accuracy, rather than as a uniform accuracy gain\.
\(2\) The reduction is not confined to the math training distribution\. Although all trainable methods use DAPO\-Math\-17k, DRE shortens thinking across held\-out AIME evaluations, GPQA\-D, and out\-of\-domain LCB\-V6\. Relative to GRPO, DRE uses fewer thinking tokens on all nine AIME model–benchmark rows, all three GPQA\-D rows, and all three LCB\-V6 rows\. DRE also matches or exceeds GRPO accuracy on seven AIME rows, improves GPQA\-D accuracy in each model block, and stays within 0\.6 accuracy points of the best completed method on LCB\-V6\. This cross\-benchmark pattern is consistent with reducing redundant post\-solution continuation rather than only pruning math\-specific thinking\.
\(3\) Generic length pressure does not reproduce the same pattern\. GRPO \+ LP sometimes shortens thinking relative to GRPO, but it remains longer than DRE on the five\-benchmark average and yields less stable accuracy\. This contrast is consistent with the design difference: DRE edits after the verified solution boundary while preserving the solution\-reaching prefix, whereas a length penalty does not distinguish necessary thinking from redundant continuation\.
\(4\) The edited preference signal covers most sampled trajectories\. During DRE training, the edit\-and\-verify procedure accepted edits for 87\.20% of sampled trajectories, and the remaining 12\.80% used the conservative unedited fallback\. Among accepted edits, 99\.40% preserved verifier correctness after content regeneration\. Thus the preference term in Equation \([2](https://arxiv.org/html/2606.17890#S3.E2)\) is supported by most sampled rollouts, rather than by a small exceptional subset\.
### 4\.3 Overthinking Analysis
We next examine whether the reduction in thinking length occurs mainly after answer emergence\. Figure [4](https://arxiv.org/html/2606.17890#S4.F4) evaluates selected AIME24 samples across training steps\. Panel \(a\) compares original positive trajectories, original negative trajectories, and edited positive trajectories\. Original positive trajectories show more answer revisits than original negative trajectories, whereas edited positive trajectories contain fewer revisits than their original counterparts and become the group with the fewest revisits during training\. Panel \(b\) shows that DRE reduces total thinking length much more than thinking before $k^{\star}$, indicating that most of the reduction occurs after answer emergence\. Panel \(c\) shows the same trend at the method level: answer revisits decrease more under DRE than under the baselines\. In contrast, GRPO \+ LP leaves these measures largely unchanged, suggesting that the effect is not explained by a generic preference for shorter trajectories\. Appendix [J](https://arxiv.org/html/2606.17890#A10) provides a complementary semantic\-space visualization of the same intervention effect, and Appendix [C](https://arxiv.org/html/2606.17890#A3) reports ablations that remove each component individually\.
## 5 Related Work
Prior work reduces overthinking by controlling reasoning length during post\-training, shortening inference\-time generation, or explicitly deciding when to stop Sui et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib11)\); Wang et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib12)\)\. Post\-training methods often use short or variable\-length thinking supervision, length rewards or penalties, or latent reasoning schemes Yu et al\. \([2024](https://arxiv.org/html/2606.17890#bib.bib13)\); Kimi Team \([2025](https://arxiv.org/html/2606.17890#bib.bib18)\); Aggarwal and Welleck \([2025](https://arxiv.org/html/2606.17890#bib.bib20)\); Hao et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib21)\)\. Inference\-time methods instead shorten an already trained policy through prompts, decoding controls, early\-exit signals, or stopping controllers Yang et al\. \([2026](https://arxiv.org/html/2606.17890#bib.bib30)\); Dai et al\. \([2026](https://arxiv.org/html/2606.17890#bib.bib34)\); Wei et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib60)\)\. Our work is complementary: we analyze how GRPO\-style credit assignment can reinforce overthinking after answer emergence, and use dynamic rollout editing to preserve necessary reasoning while reducing preference for unnecessary thinking after the answer becomes verifiable\. Appendix [A](https://arxiv.org/html/2606.17890#A1) gives a fuller discussion\.
## 6 Conclusion
We study why RL\-trained reasoning models continue thinking after a correct answer has already emerged\. Our analysis suggests that, in GRPO\-style training, successful rollouts can contain more overthinking than unsuccessful rollouts sampled for the same prompts, and sequence\-level rewards may assign positive credit to both reaching the answer and overthinking after it\. In this work, we propose DRE, which prefers verified shorter trajectories within the same RL group while preserving the reasoning prefix that reaches the answer\. Experiments on diverse benchmarks covering mathematical reasoning, science QA, and code generation validate the effectiveness of DRE\. The results show that DRE reduces thinking length while preserving overall task performance, and that this reduction occurs mainly after answer emergence\.
## Limitations
RL training in this work is conducted only on DAPO\-Math\-17k Yu et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib66)\), a dataset of math problems with verifiable answers\. This setting supports reliable identification of answer emergence, which is required by dynamic rollout editing, but it also limits the diversity of training signals used to learn shorter reasoning behavior\. Although the resulting model reduces thinking tokens on out\-of\-domain code\-generation benchmarks, the gains are smaller than those observed on math tasks\. Extending dynamic rollout editing to more diverse verifiable RL training data remains an important direction for improving and assessing token savings across domains\.
## Ethical Considerations
This work aims to reduce overthinking after answer emergence in RL\-trained reasoning models\. The main potential benefit is improved inference efficiency, which can lower compute cost and associated resource use\. Since shorter thinking processes may not be equally suitable for all tasks, deployments should be preceded by evaluation in the intended target domain, especially when additional verification steps are important for reliability\.
## References
- P\. Aggarwal and S\. Welleck \(2025\)L1: controlling how long a reasoning model thinks with reinforcement learning\.InSecond Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=4jdIxXBNve)Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.17890#S5.p1.1)\.
- J\. Cheng and B\. V\. Durme \(2024\)Compressed chain of thought: efficient reasoning through dense representations\.CoRRabs/2412\.13171\.External Links:[Link](https://doi.org/10.48550/arXiv.2412.13171),[Document](https://dx.doi.org/10.48550/ARXIV.2412.13171),2412\.13171Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px1.p1.1)\.
- M\. Dai, C\. Yang, and Q\. Si \(2026\)S\-grpo: early exit via reinforcement learning in reasoning models\.Advances in Neural Information Processing Systems38,pp\. 48178–48204\.Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.17890#S1.p2.1),[§4\.1](https://arxiv.org/html/2606.17890#S4.SS1.p3.1),[§5](https://arxiv.org/html/2606.17890#S5.p1.1)\.
- S\. S\. Ghosal, S\. Chakraborty, A\. Reddy, Y\. Lu, M\. Wang, D\. Manocha, F\. Huang, M\. Ghavamzadeh, and A\. S\. Bedi \(2025\)Does thinking more always help? mirage of test\-time scaling in reasoning models\.InAdvances in Neural Information Processing Systems,Vol\.38\.External Links:[Link](https://openreview.net/forum?id=tKPqbamNb9)Cited by:[§1](https://arxiv.org/html/2606.17890#S1.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang,et al\.\(2025\)DeepSeek\-R1 incentivizes reasoning in LLMs through reinforcement learning\.Nature645\(8081\),pp\. 633–638\.Note:Formal journal version of arXiv:2501\.12948External Links:[Document](https://dx.doi.org/10.1038/s41586-025-09422-z),[Link](https://doi.org/10.1038/s41586-025-09422-z)Cited by:[§1](https://arxiv.org/html/2606.17890#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.17890#S2.SS2.SSS0.Px1.p1.1)\.
- J\. Han, Y\. Huang, Y\. Liao, H\. Zhao, Z\. Jiang, X\. Wang, X\. Lu, G\. Zhou, S\. Jiang, J\. Liang, W\. Zhou, Z\. Sun, F\. Yu, and Y\. Xiao \(2026\)Your models have thought enough: training large reasoning models to stop overthinking\.InThe Fourteenth International Conference on Learning Representations, ICLR 2026,External Links:[Link](https://openreview.net/forum?id=2u5ZRzDyS0)Cited by:[§1](https://arxiv.org/html/2606.17890#S1.p1.1)\.
- T\. Han, Z\. Wang, C\. Fang, S\. Zhao, S\. Ma, and Z\. Chen \(2025\)Token\-budget\-aware LLM reasoning\.InFindings of the Association for Computational Linguistics: ACL 2025,Vienna, Austria,pp\. 24842–24855\.External Links:[Link](https://aclanthology.org/2025.findings-acl.1274/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.1274)Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px2.p1.1)\.
- S\. Hao, S\. Sukhbaatar, D\. Su, X\. Li, Z\. Hu, J\. E\. Weston, and Y\. Tian \(2025\)Training large language models to reason in a continuous latent space\.InSecond Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=Itxz7S4Ip3)Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.17890#S5.p1.1)\.
- J\. Huang, B\. Lin, G\. Feng, J\. Chen, D\. He, and L\. Hou \(2026a\)Efficient reasoning for large reasoning language models via certainty\-guided reflection suppression\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 31176–31184\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/40379),[Document](https://dx.doi.org/10.1609/aaai.v40i37.40379)Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px2.p1.1)\.
- Z\. Huang, X\. Xia, Y\. Ren, J\. Zheng, X\. Wang, Z\. Zhang, H\. Xie, S\. Liang, Z\. Chen, X\. Xiao, F\. Zhuang, J\. Li, Y\. Ban, and D\. Wang \(2026b\)Does your reasoning model implicitly know when to stop thinking?\.CoRRabs/2602\.08354\.External Links:[Link](https://doi.org/10.48550/arXiv.2602.08354),[Document](https://dx.doi.org/10.48550/ARXIV.2602.08354),2602\.08354Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px3.p1.1)\.
- N\. Jain, K\. Han, A\. Gu, W\. Li, F\. Yan, T\. Zhang, S\. Wang, A\. Solar\-Lezama, K\. Sen, and I\. Stoica \(2025\)LiveCodeBench: holistic and contamination free evaluation of large language models for code\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=chfJJYC3iL)Cited by:[§4\.1](https://arxiv.org/html/2606.17890#S4.SS1.p1.1)\.
- Y\. Kang, X\. Sun, L\. Chen, and W\. Zou \(2025\)C3oT: generating shorter chain\-of\-thought without compromising effectiveness\.InAAAI\-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 \- March 4, 2025, Philadelphia, PA, USA,T\. Walsh, J\. Shah, and Z\. Kolter \(Eds\.\),pp\. 24312–24320\.External Links:[Link](https://doi.org/10.1609/aaai.v39i23.34608),[Document](https://dx.doi.org/10.1609/AAAI.V39I23.34608)Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px1.p1.1),[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px2.p1.1)\.
- Kimi Team \(2025\)Kimi k1\.5: scaling reinforcement learning with LLMs\.CoRRabs/2501\.12599\.External Links:[Link](https://doi.org/10.48550/arXiv.2501.12599),[Document](https://dx.doi.org/10.48550/ARXIV.2501.12599),2501\.12599Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px1.p1.1),[Appendix B](https://arxiv.org/html/2606.17890#A2.SS0.SSS0.Px4.p1.7),[§1](https://arxiv.org/html/2606.17890#S1.p2.1),[§4\.1](https://arxiv.org/html/2606.17890#S4.SS1.p3.1),[§5](https://arxiv.org/html/2606.17890#S5.p1.1)\.
- A\. Lee, E\. Che, and T\. Peng \(2025\)How well do LLMs compress their own chain\-of\-thought? A token complexity approach\.CoRRabs/2503\.01141\.External Links:[Link](https://doi.org/10.48550/arXiv.2503.01141),[Document](https://dx.doi.org/10.48550/ARXIV.2503.01141),2503\.01141Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px2.p1.1)\.
- X\. Liu and L\. Wang \(2025\)Answer convergence as a signal for early stopping in reasoning\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 17896–17907\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.904/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.904),ISBN 979\-8\-89176\-332\-6Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px3.p1.1)\.
- H\. Luo, L\. Shen, H\. He, Y\. Wang, S\. Liu, W\. Li, N\. Tan, X\. Cao, and D\. Tao \(2025\)O1\-Pruner: length\-harmonizing fine\-tuning for O1\-Like reasoning pruning\.CoRRabs/2501\.12570\.External Links:[Link](https://doi.org/10.48550/arXiv.2501.12570),[Document](https://dx.doi.org/10.48550/ARXIV.2501.12570),2501\.12570Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px1.p1.1)\.
- X\. Ma, G\. Wan, R\. Yu, G\. Fang, and X\. Wang \(2025\)CoT\-Valve: length\-compressible chain\-of\-thought tuning\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Vienna, Austria,pp\. 6025–6035\.External Links:[Link](https://aclanthology.org/2025.acl-long.300/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.300)Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px1.p1.1)\.
- Mathematical Association of America \(2026\)MAA invitational competitions\.External Links:[Link](https://maa.org/maa-invitational-competitions/)Cited by:[§4\.1](https://arxiv.org/html/2606.17890#S4.SS1.p1.1)\.
- Meituan LongCat Team \(2025\)Introducing LongCat\-Flash\-Thinking: A technical report\.CoRRabs/2509\.18883\.External Links:[Link](https://doi.org/10.48550/arXiv.2509.18883),[Document](https://dx.doi.org/10.48550/ARXIV.2509.18883),2509\.18883Cited by:[§1](https://arxiv.org/html/2606.17890#S1.p1.1)\.
- N\. Muennighoff, Z\. Yang, W\. Shi, X\. L\. Li, L\. Fei\-Fei, H\. Hajishirzi, L\. Zettlemoyer, P\. Liang, E\. J\. Candès, and T\. Hashimoto \(2025\)s1: simple test\-time scaling\.CoRRabs/2501\.19393\.External Links:[Link](https://doi.org/10.48550/arXiv.2501.19393),[Document](https://dx.doi.org/10.48550/ARXIV.2501.19393),2501\.19393Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px2.p1.1)\.
- OpenAI \(2025\)gpt\-oss\-120b & gpt\-oss\-20b model card\.External Links:2508\.10925,[Link](https://arxiv.org/abs/2508.10925)Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px2.p1.1)\.
- K\. Peng, L\. Ding, Y\. Ouyang, M\. Fang, and D\. Tao \(2025\)Revisiting overthinking in long chain\-of\-thought from the perspective of self\-doubt\.CoRRabs/2505\.23480\.External Links:[Link](https://doi.org/10.48550/arXiv.2505.23480),[Document](https://dx.doi.org/10.48550/ARXIV.2505.23480),2505\.23480Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px2.p1.1)\.
- X\. Pu, M\. Saxon, W\. Hua, and W\. Y\. Wang \(2025\)THOUGHTTERMINATOR: benchmarking, calibrating, and mitigating overthinking in reasoning models\.CoRRabs/2504\.13367\.External Links:[Link](https://doi.org/10.48550/arXiv.2504.13367),[Document](https://dx.doi.org/10.48550/ARXIV.2504.13367),2504\.13367Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px3.p1.1)\.
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2024\)GPQA: a graduate\-level google\-proof Q&A benchmark\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=Ti67584b98)Cited by:[§4\.1](https://arxiv.org/html/2606.17890#S4.SS1.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.CoRRabs/2402\.03300\.External Links:[Link](https://doi.org/10.48550/arXiv.2402.03300),[Document](https://dx.doi.org/10.48550/ARXIV.2402.03300),2402\.03300Cited by:[§1](https://arxiv.org/html/2606.17890#S1.p3.1),[§4\.1](https://arxiv.org/html/2606.17890#S4.SS1.p3.1)\.
- Z\. Shen, H\. Yan, L\. Zhang, Z\. Hu, Y\. Du, and Y\. He \(2025\)CODI: compressing chain\-of\-thought into continuous space via self\-distillation\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,Suzhou, China,pp\. 677–693\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.36/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.36)Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px1.p1.1)\.
- J\. Su, J\. Healey, P\. Nakov, and C\. Cardie \(2025\)Between underthinking and overthinking: an empirical study of reasoning length and correctness in LLMs\.CoRRabs/2505\.00127\.External Links:[Link](https://doi.org/10.48550/arXiv.2505.00127),[Document](https://dx.doi.org/10.48550/ARXIV.2505.00127),2505\.00127Cited by:[§1](https://arxiv.org/html/2606.17890#S1.p2.1)\.
- Y\. Sui, Y\. Chuang, G\. Wang, J\. Zhang, T\. Zhang, J\. Yuan, H\. Liu, A\. Wen, S\. Zhong, N\. Zou, H\. Chen, and X\. Hu \(2025\)Stop overthinking: A survey on efficient reasoning for large language models\.Trans\. Mach\. Learn\. Res\.2025\.External Links:[Link](https://openreview.net/forum?id=HvoG8SxggZ)Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.p1.1),[§1](https://arxiv.org/html/2606.17890#S1.p1.1),[§5](https://arxiv.org/html/2606.17890#S5.p1.1)\.
- R\. Sun, W\. Cheng, D\. Li, H\. Chen, and W\. Wang \(2025\)Stop when enough: adaptive early\-stopping for chain\-of\-thought reasoning\.CoRRabs/2510\.10103\.External Links:[Link](https://doi.org/10.48550/arXiv.2510.10103),[Document](https://dx.doi.org/10.48550/ARXIV.2510.10103),2510\.10103Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px3.p1.1)\.
- R\. Wang, H\. Wang, B\. Xue, J\. Pang, S\. Liu, Y\. Chen, J\. Qiu, D\. F\. Wong, H\. Ji, and K\. Wong \(2025\)Harnessing the reasoning economy: A survey of efficient reasoning for large language models\.CoRRabs/2503\.24377\.External Links:[Link](https://doi.org/10.48550/arXiv.2503.24377),[Document](https://dx.doi.org/10.48550/ARXIV.2503.24377),2503\.24377Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.p1.1),[§1](https://arxiv.org/html/2606.17890#S1.p1.1),[§5](https://arxiv.org/html/2606.17890#S5.p1.1)\.
- Z\. Wei, L\. Pang, J\. Liu, W\. Shi, J\. Deng, S\. Xu, Z\. Duan, F\. Sun, H\. Shen, and X\. Cheng \(2025\)The evolution of thought: tracking LLM overthinking via reasoning dynamics analysis\.CoRRabs/2508\.17627\.External Links:[Link](https://doi.org/10.48550/arXiv.2508.17627),[Document](https://dx.doi.org/10.48550/ARXIV.2508.17627),2508\.17627Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px3.p1.1),[Appendix J](https://arxiv.org/html/2606.17890#A10.p1.6),[Appendix D](https://arxiv.org/html/2606.17890#A4.p1.1),[§1](https://arxiv.org/html/2606.17890#S1.p2.1),[§5](https://arxiv.org/html/2606.17890#S5.p1.1)\.
- Y\. Wu, Y\. Wang, Z\. Ye, T\. Du, S\. Jegelka, and Y\. Wang \(2026\)When more is less: understanding chain\-of\-thought length in LLMs\.InThe Fourteenth International Conference on Learning Representations, ICLR 2026,External Links:[Link](https://openreview.net/forum?id=6QDFsYxtI1)Cited by:[§1](https://arxiv.org/html/2606.17890#S1.p2.1)\.
- H\. Xia, C\. T\. Leong, W\. Wang, Y\. Li, and W\. Li \(2025\)TokenSkip: controllable chain\-of\-thought compression in LLMs\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,Suzhou, China,pp\. 3351–3363\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.165/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.165)Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px1.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§1](https://arxiv.org/html/2606.17890#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.17890#S4.SS1.p3.1)\.
- C\. Yang, Q\. Si, Y\. Duan, Z\. Zhu, C\. Zhu, Q\. Li, M\. Chen, Z\. Lin, and W\. Wang \(2026\)Dynamic early exit in reasoning models\.InThe Fourteenth International Conference on Learning Representations, ICLR 2026,External Links:[Link](https://openreview.net/forum?id=NpU7ZXafRi)Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px3.p1.1),[Appendix D](https://arxiv.org/html/2606.17890#A4.p1.1),[§1](https://arxiv.org/html/2606.17890#S1.p2.1),[§5](https://arxiv.org/html/2606.17890#S5.p1.1)\.
- P\. Yu, J\. Xu, J\. Weston, and I\. Kulikov \(2024\)Distilling System 2 into System 1\.CoRRabs/2407\.06023\.External Links:[Link](https://doi.org/10.48550/arXiv.2407.06023),[Document](https://dx.doi.org/10.48550/ARXIV.2407.06023),2407\.06023Cited by:[Appendix A](https://arxiv.org/html/2606.17890#A1.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.17890#S5.p1.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, J\. Liu, L\. Liu, X\. Liu, H\. Lin, Z\. Lin, B\. Ma, G\. Sheng, Y\. Tong, C\. Zhang, M\. Zhang, W\. Zhang, H\. Zhu, J\. Zhu, J\. Chen, J\. Chen, C\. Wang, H\. Yu, Y\. Song, X\. Wei, H\. Zhou, J\. Liu, W\. Ma, Y\. Zhang, L\. Yan, M\. Qiao, Y\. Wu, and M\. Wang \(2025\)DAPO: an open\-source LLM reinforcement learning system at scale\.InAdvances in Neural Information Processing Systems,Vol\.38,pp\. 113222–113244\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2025/file/a4277440d50f1f15d2cb4c14f7e0c0d2-Paper-Conference.pdf)Cited by:[Appendix B](https://arxiv.org/html/2606.17890#A2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.17890#S4.SS1.p3.1),[Limitations](https://arxiv.org/html/2606.17890#Sx1.p1.1)\.
## Appendix A Related Work
We organize related work according to the stage at which prior methods intervene to control reasoning length: during post\-training, at inference through prompting or decoding, or via explicit early\-exit mechanisms\. This organization follows recent taxonomies of efficient reasoning Sui et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib11)\); Wang et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib12)\)\.
#### Post\-training methods\.
Training\-time approaches control reasoning length by modifying supervision signals, reward functions, or the latent reasoning space\. Representative directions include supervised fine\-tuning with variable\-length chain\-of\-thought data Yu et al\. \([2024](https://arxiv.org/html/2606.17890#bib.bib13)\); Kang et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib14)\); Xia et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib15)\); Ma et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib16)\), RL objectives with explicit length rewards or penalties Kimi Team \([2025](https://arxiv.org/html/2606.17890#bib.bib18)\); Luo et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib19)\); Aggarwal and Welleck \([2025](https://arxiv.org/html/2606.17890#bib.bib20)\), and latent\-space reasoning schemes Hao et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib21)\); Cheng and Durme \([2024](https://arxiv.org/html/2606.17890#bib.bib23)\); Shen et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib22)\)\. The proposed method also intervenes during post\-training, but differs in mechanism: rather than globally rewarding concise reasoning or requiring short chain\-of\-thought supervision, it edits high\-reward RL rollouts at the point where a verified solution first appears\.
#### Prompt\-based methods\.
Prompting strategies elicit shorter or more disciplined chains of thought by instructing the model to be concise, decisive, or self\-monitoring Han et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib25)\); Muennighoff et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib24)\); Lee et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib26)\); OpenAI \([2025](https://arxiv.org/html/2606.17890#bib.bib38)\); Kang et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib14)\); Peng et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib48)\); Huang et al\. \([2026a](https://arxiv.org/html/2606.17890#bib.bib56)\)\. These methods are appealing because they require no retraining, but they tend to be brittle: their effectiveness depends heavily on prompt wording, and a single prompting policy often generalizes poorly across tasks with different reasoning depths or answer formats\.
#### Early\-exit methods\.
Early\-exit approaches intervene directly in the generation process by terminating reasoning once the model appears to have reached a solution\. Some methods rely on external monitors or additional training, such as auxiliary classifiers Liu and Wang \([2025](https://arxiv.org/html/2606.17890#bib.bib52)\), bandit controllers Sun et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib53)\), difficulty estimators Pu et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib54)\), RL\-based stopping policies such as S\-GRPO Dai et al\. \([2026](https://arxiv.org/html/2606.17890#bib.bib34)\), or self\-aware guided sampling as in SAGE\-RL Huang et al\. \([2026b](https://arxiv.org/html/2606.17890#bib.bib35)\)\. Others pursue lower\-overhead stopping rules based on intrinsic model signals, such as DEER Yang et al\. \([2026](https://arxiv.org/html/2606.17890#bib.bib30)\), or are motivated by reasoning\-dynamics analyses that characterize an instance\-specific post\-convergence regime via a reasoning completion point Wei et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib60)\)\. The present work is complementary to this line of research: rather than stopping generation through a controller or sampling rule, it edits sampled rollouts after verified answer emergence and trains against the later continuation\.
## Appendix B Experimental Details
#### Training data and rollout sampling\.
All training\-time variants are trained on DAPO\-Math\-17k Yu et al\. \([2025](https://arxiv.org/html/2606.17890#bib.bib66)\)\. During training, we use a maximum generation length of 28,000 tokens\. Each optimization step uses a batch size of 64 prompts, and each prompt contributes an effective group of 8 trajectories or candidates for group\-relative normalization\. For GRPO and GRPO \+ LP, these are 8 sampled rollouts\. For DRE, we sample 4 original rollouts and construct 4 auxiliary counterparts; each auxiliary trajectory is an accepted edit when both prefix verification and verification of the regenerated content process succeed, and an unedited copy otherwise\. For S\-GRPO, we sample 1 rollout and construct 8 candidates from that rollout following its stopping\-policy construction\.
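The DRE group construction described above (4 sampled rollouts plus 4 auxiliary counterparts, with an unedited copy as fallback) can be sketched as follows. This is a minimal illustration and not the authors' implementation: `sample_rollout` and `try_edit` are hypothetical callables, and `try_edit` is assumed to return `None` whenever prefix verification or verification of the regenerated content fails.

```python
from typing import Callable, List, Optional


def build_dre_group(
    prompt: str,
    sample_rollout: Callable[[str], str],            # hypothetical sampler
    try_edit: Callable[[str, str], Optional[str]],   # hypothetical edit-and-verify step
    num_original: int = 4,
) -> List[str]:
    """Return the sampled rollouts followed by one auxiliary counterpart each."""
    originals = [sample_rollout(prompt) for _ in range(num_original)]
    auxiliaries = []
    for rollout in originals:
        edited = try_edit(prompt, rollout)  # None when either verification fails
        # Conservative fallback: reuse the unedited rollout if the edit is rejected.
        auxiliaries.append(edited if edited is not None else rollout)
    return originals + auxiliaries
```

With `num_original=4`, the returned group has 8 entries, matching the effective group size used for group-relative normalization.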
#### Training horizon and checkpoint selection\.
For DRE, we select the first checkpoint whose validation\-set mean full thinking length is shorter than the raw model’s validation\-set mean $k^{\star}$ token position\. This stopping rule selects the point at which the trained model’s complete thinking process has become shorter than the raw model’s answer\-emergence boundary on the validation set\. For the other trained methods, we train to 100 optimization steps and report the step\-100 checkpoint\. In S\-GRPO runs that collapse before step 100 on a training set, we instead report the last checkpoint before collapse and do not use post\-collapse checkpoints\.
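The DRE checkpoint rule above amounts to a first-crossing search over validation measurements. The sketch below assumes the per-checkpoint validation means are already available as `(step, mean_thinking_length)` pairs; the function name and the `None` return for "no checkpoint qualifies yet" are illustrative choices, not details from the paper.

```python
from typing import Optional, Sequence, Tuple


def select_dre_checkpoint(
    val_think_len_by_step: Sequence[Tuple[int, float]],
    raw_mean_k_star: float,
) -> Optional[int]:
    """Return the earliest step whose mean thinking length is below raw_mean_k_star."""
    for step, mean_len in sorted(val_think_len_by_step):
        if mean_len < raw_mean_k_star:
            return step
    return None  # assumption: no checkpoint has crossed the threshold yet
```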
#### Optimization hyperparameters\.
We do not use a KL penalty in any training variant, i\.e\., the KL coefficient is set to zero\. For ordinary clipped GRPO objectives, including the GRPO, GRPO \+ LP, S\-GRPO, and *Standard Clip* settings, we use asymmetric ratio clipping with lower\-side $\epsilon_{\downarrow}=0.20$ and upper\-side $\epsilon_{\uparrow}=0.28$, corresponding to ratio bounds $[0.80, 1.28]$\. For GClip in DRE and the GClip\-based ablations, we use $\epsilon_{\downarrow}=\epsilon_{\uparrow}=0.10$, corresponding to ratio bounds $[0.90, 1.10]$\.
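For reference, a PPO-style clipped surrogate with the asymmetric bounds stated above can be written per token as below. This is a minimal sketch that assumes the standard form min(ratio * A, clip(ratio, 1 - eps_low, 1 + eps_high) * A); the function name and signature are illustrative. GClip is defined elsewhere in the paper and is not modeled here beyond the fact that it uses symmetric bounds of 0.10.

```python
def clipped_surrogate(
    ratio: float,
    advantage: float,
    eps_low: float = 0.20,
    eps_high: float = 0.28,
) -> float:
    """Per-token clipped objective with asymmetric ratio bounds."""
    # Restrict the probability ratio to [1 - eps_low, 1 + eps_high].
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With the defaults, a positive-advantage token stops gaining objective once its ratio exceeds 1.28, and a negative-advantage token stops once its ratio falls below 0.80.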
#### GRPO \+ LP length reward\.
GRPO \+ LP is a length\-penalty baseline following the group\-relative length reward used in Kimi k1\.5 Kimi Team \([2025](https://arxiv.org/html/2606.17890#bib.bib18)\)\. Within each sampled group, let $r(x,y_{i},y^{\star})\in\{0,1\}$ be the rule\-based correctness indicator for response $y_{i}$, let $\operatorname{len}(i)$ be its generated length, and let $\operatorname{min\_len}$ and $\operatorname{max\_len}$ be the minimum and maximum generated lengths in the same group\. If $\operatorname{max\_len}=\operatorname{min\_len}$, we set $\lambda_{i}=0$\. Otherwise, we set
$$\lambda_{i}=0.5-\frac{\operatorname{len}(i)-\operatorname{min\_len}}{\operatorname{max\_len}-\operatorname{min\_len}},$$
and define the auxiliary length reward as
$$r_{\mathrm{len}}(i)=\begin{cases}\lambda_{i},&r(x,y_{i},y^{\star})=1,\\ \min(0,\lambda_{i}),&r(x,y_{i},y^{\star})=0.\end{cases}$$
Thus shorter correct responses receive larger auxiliary reward, while incorrect responses receive no positive length bonus\.
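The length reward defined above can be computed for one sampled group as in the following sketch. The function name and the list-based interface are illustrative assumptions. How this auxiliary reward is combined with the correctness reward is not specified in this section and is not modeled here.

```python
from typing import List, Sequence


def length_rewards(lengths: Sequence[int], correct: Sequence[bool]) -> List[float]:
    """Group-relative auxiliary length reward for each response in one group."""
    min_len, max_len = min(lengths), max(lengths)
    rewards = []
    for n, is_correct in zip(lengths, correct):
        if max_len == min_len:
            lam = 0.0  # degenerate group: all responses have the same length
        else:
            lam = 0.5 - (n - min_len) / (max_len - min_len)
        # Correct responses keep lambda; incorrect ones never get a positive bonus.
        rewards.append(lam if is_correct else min(0.0, lam))
    return rewards
```

For example, with lengths `[100, 200, 300]` and correctness `[True, False, True]`, the lambdas are 0.5, 0.0, and -0.5, so the shortest correct response receives 0.5 and the longest correct response receives -0.5.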
#### Evaluation protocol\.
At evaluation time, we use temperature 0\.6 for all benchmarks\. Unless otherwise noted, for AIME \(AIME I and II combined for each year\), we sample 16 responses per problem with a maximum generation length of 81,920 tokens\. For GPQA\-D, we sample 8 responses per problem with a maximum generation length of 32,768 tokens\. For LCB\-V6, we sample 8 responses per problem with a maximum generation length of 81,920 tokens\. For Qwen3\-8B, we use a 40,960\-token generation cap for all benchmarks due to our serving and evaluation configuration\. A small number of method–benchmark runs use fewer completed evaluation samples because of time constraints\. We report Acc as per\-sample accuracy over completed sampled responses\. Think Tok\. is the mean number of generated tokens inside the delimited thinking process, counted between the first <think\> marker and the first subsequent </think\> marker\. If </think\> is absent, we count generated tokens after <think\> until the generation stops or reaches the maximum generation length\. Truncated outputs are scored as generated, without repair or regeneration\. For mechanism\-facing analyses, we additionally track mean $k^{\star}$, answer revisits after $k^{\star}$, and old\-policy token probabilities on accepted edited suffixes after $\hat{k}$\.
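The Think Tok. counting rule above can be sketched on a list of generated token strings. Two assumptions are made for illustration only: each marker appears as a single token, and a response with no opening marker is counted as having zero thinking tokens (the text does not specify this case).

```python
from typing import Sequence


def count_think_tokens(
    tokens: Sequence[str],
    open_tag: str = "<think>",
    close_tag: str = "</think>",
) -> int:
    """Count tokens between the first open_tag and the first subsequent close_tag."""
    if open_tag not in tokens:
        return 0  # assumption: no thinking span without an opening marker
    start = tokens.index(open_tag) + 1
    try:
        end = tokens.index(close_tag, start)
    except ValueError:
        end = len(tokens)  # closing marker absent: count to the end of generation
    return end - start
```

Averaging this count over all sampled responses of a benchmark gives the reported mean under these assumptions.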
#### Token efficiency\.
In Table [1](https://arxiv.org/html/2606.17890#S3.T1), TE denotes token efficiency:

$$\mathrm{TE} = 1000 \cdot \frac{\overline{\mathrm{Acc}}}{\overline{L_t}},$$

where $\overline{\mathrm{Acc}}$ is the average accuracy in percentage points and $\overline{L_t}$ is the average thinking-token count over the benchmarks included in the table. TE can be read as accuracy points per 1,000 thinking tokens, so larger values indicate a better accuracy–token tradeoff.
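As a quick arithmetic illustration of the TE definition, the sketch below averages accuracy and thinking-token counts over benchmarks before taking the ratio. The function name and the numbers are made up for the example and are not results from the paper.

```python
from typing import Sequence


def token_efficiency(acc_points: Sequence[float],
                     think_tokens: Sequence[float]) -> float:
    """TE = 1000 * mean accuracy (percentage points) / mean thinking tokens."""
    mean_acc = sum(acc_points) / len(acc_points)
    mean_len = sum(think_tokens) / len(think_tokens)
    return 1000.0 * mean_acc / mean_len


# Hypothetical two-benchmark table: 60% and 40% accuracy,
# 8,000 and 12,000 mean thinking tokens.
print(token_efficiency([60.0, 40.0], [8000, 12000]))
```

Note that the ratio is taken between the two averages, not averaged per benchmark, which matches the formula as written.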
## Appendix C Ablations
We report leave-one-out ablations evaluated on selected AIME24 samples across training steps. The *Full Method* combines prefix verification, boundary-based editing, prefix masking, and GClip. *No Prefix Verification* removes the prefix-closure check while retaining the final regenerated-content correctness check. *No Prefix Masking* removes the pair-local mask that excludes the original branch's direct loss on the shared thinking prefix. *Standard Clip* replaces GClip with ordinary clipping near the edited boundary and in the early edited suffix. These ablations test whether each safeguard is needed to reduce overthinking while preserving task performance. Figure [5](https://arxiv.org/html/2606.17890#A3.F5) reports accuracy, total thinking length, thinking before $k^{\star}$, and content process length.
Figure 5: Leave-one-out ablations evaluated on selected AIME24 samples across training steps. The *Full Method* maintains accuracy while reducing total thinking and keeping content process length stable. (a) Accuracy. (b) Total thinking. (c) Thinking before $k^{\star}$. (d) Content process length.

The ablations show distinct failure modes. Without prefix verification, accuracy declines while content length increases, suggesting that some accepted prefixes are not yet ready to expose the verified solution and the regenerated content process must complete reasoning that should have remained in thinking. Without prefix masking, thinking before $k^{\star}$ drops the most and accuracy later degrades, consistent with negative credit leaking into the shared verified prefix. With *Standard Clip*, total thinking declines more slowly than under the *Full Method*, consistent with ordinary clipping weakening the intended ascent signal near the edited boundary; Appendix [F](https://arxiv.org/html/2606.17890#A6) gives a representative old-policy mismatch example. Overall, the ablation results support combining prefix verification, prefix masking, and GClip.
## Appendix D Supplementary Early-Exit Comparison
Table [2](https://arxiv.org/html/2606.17890#A4.T2) situates DRE against two training-free inference-time stopping baselines, DEER (Yang et al., [2026](https://arxiv.org/html/2606.17890#bib.bib30)) and RCPD (Wei et al., [2025](https://arxiv.org/html/2606.17890#bib.bib60)). We report this supplementary comparison on AIME24, AIME25, AIME26, and GPQA-D, where we ran both early-exit baselines under the same evaluation pipeline. LCB-V6 is omitted from this table because we did not validate RCPD under the same LCB-V6 protocol. The Average columns are therefore computed over these four benchmarks and should be interpreted within this supplementary comparison, rather than compared directly with the five-benchmark averages in Table [1](https://arxiv.org/html/2606.17890#S3.T1).
Table 2:Supplementary comparison with Raw Model, training\-free early\-exit methods, and DRE on the shared evaluation set\. Average is computed over AIME24, AIME25, AIME26, and GPQA\-D\.
## Appendix E GClip as a Straight-Through Clipped-Ratio Operator
This appendix formalizes the straight-through property of GClip used in Section [3.3](https://arxiv.org/html/2606.17890#S3.SS3). Throughout, let $0 < \ell < u$ and define

$$\operatorname{GClip}(x; \ell, u) \;\triangleq\; \operatorname{sg}\!\bigl(\mathrm{clip}(x, \ell, u)\bigr) \cdot \frac{x}{\operatorname{sg}(x)},$$

for any policy ratio $x > 0$. Here $\operatorname{sg}$ denotes stop-gradient: it leaves the forward value unchanged but blocks gradients through its argument.
###### Theorem 1 (GClip identity and straight-through derivative).
Let $s(x) = \operatorname{GClip}(x; \ell, u)$. In the forward pass,

$$s(x) = \mathrm{clip}(x, \ell, u).$$

Under automatic differentiation with stop-gradient, the scalar derivative with respect to $x$ is

$$D^{\mathrm{sg}} s(x) = \frac{\mathrm{clip}(x, \ell, u)}{x} = \mathrm{clip}\!\left(1, \frac{\ell}{x}, \frac{u}{x}\right),$$

or equivalently,

$$D^{\mathrm{sg}} s(x) = \begin{cases} \ell / x, & 0 < x < \ell, \\ 1, & \ell \leq x \leq u, \\ u / x, & x > u. \end{cases}$$

Consequently, let $x = x_{\theta}$ be a policy ratio that depends on the model parameters, and let $L(\theta) = -A\, s(x_{\theta})$ denote the clipped branch of a token-level loss for a positively preferred token, where $A > 0$ is treated as constant during differentiation. If $x_{\theta} > u$, then

$$\nabla_{\theta}^{\mathrm{sg}} L(\theta) = -A\, \frac{u}{x_{\theta}}\, \nabla_{\theta} x_{\theta},$$

which is non-zero whenever $\nabla_{\theta} x_{\theta} \neq 0$. By contrast, the standard clipped branch

$$L_{\mathrm{clip}}(\theta) = -A\, \mathrm{clip}(x_{\theta}, \ell, u)$$

has zero automatic-differentiation gradient for $x_{\theta} > u$. Thus GClip leaves the clipped forward scalar value unchanged while replacing the ordinary clipped branch with a nonzero straight-through gradient in this selected saturation case.

Moreover, for every interior point $x_0 \in (\ell, u)$,

$$s(x_0) = x_0 \quad \text{and} \quad D^{\mathrm{sg}} s(x_0) = 1.$$

Therefore, inside the clipping interval, this token-level branch has the same first-order direction as the unclipped branch.
###### Proof.
Forward value. Fix any $x > 0$. Since $\operatorname{sg}$ leaves the forward value unchanged, the first factor equals $\mathrm{clip}(x, \ell, u)$ in the forward pass. The second factor satisfies

$$\frac{x}{\operatorname{sg}(x)} = 1$$

in the forward pass because the denominator has the same numerical value as $x$. Therefore

$$s(x) = \operatorname{sg}\!\bigl(\mathrm{clip}(x, \ell, u)\bigr) \cdot \frac{x}{\operatorname{sg}(x)} = \mathrm{clip}(x, \ell, u),$$

which proves the first claim.

Surrogate derivative. For the derivative, the stop-gradient operator treats both $\operatorname{sg}\!\bigl(\mathrm{clip}(x, \ell, u)\bigr)$ and $\operatorname{sg}(x)$ as constants during backpropagation. Hence the derivative of the clipped factor itself is not used, including at the ordinary nondifferentiable points $x = \ell$ and $x = u$. We have

$$\begin{aligned} D^{\mathrm{sg}} s(x) &= \operatorname{sg}\!\bigl(\mathrm{clip}(x, \ell, u)\bigr) \cdot D^{\mathrm{sg}}\!\left(\frac{x}{\operatorname{sg}(x)}\right) \\ &= \mathrm{clip}(x, \ell, u) \cdot \frac{1}{x} \\ &= \frac{\mathrm{clip}(x, \ell, u)}{x}. \end{aligned}$$

Because $x > 0$, this is equivalently

$$D^{\mathrm{sg}} s(x) = \begin{cases} \ell / x, & 0 < x < \ell, \\ 1, & \ell \leq x \leq u, \\ u / x, & x > u, \end{cases}$$

and also $D^{\mathrm{sg}} s(x) = \mathrm{clip}(1, \ell / x, u / x)$.

Upper-side saturation. Let $x = x_{\theta}$, let $L(\theta) = -A\, s(x_{\theta})$ with $A > 0$ treated as constant during differentiation, and suppose $x_{\theta} > u$. The derivative identity gives

$$D^{\mathrm{sg}} s(x_{\theta}) = \frac{u}{x_{\theta}}.$$

By the chain rule,

$$\nabla_{\theta}^{\mathrm{sg}} L(\theta) = -A\, D^{\mathrm{sg}} s(x_{\theta})\, \nabla_{\theta} x_{\theta} = -A\, \frac{u}{x_{\theta}}\, \nabla_{\theta} x_{\theta}.$$

Since $A > 0$, $u > 0$, and $x_{\theta} > u$, this gradient is non-zero whenever $\nabla_{\theta} x_{\theta} \neq 0$. For comparison, the standard clipped branch $L_{\mathrm{clip}}(\theta) = -A\, \mathrm{clip}(x_{\theta}, \ell, u)$ is constant in $x_{\theta}$ on the region $x_{\theta} > u$, so its automatic-differentiation gradient is zero there. This proves that GClip preserves a straight-through signal after upper-side clip saturation.

Interior agreement. Now let $x_0 \in (\ell, u)$. Because $x_0$ lies strictly inside the clipping interval, $\mathrm{clip}(x_0, \ell, u) = x_0$. The forward identity from the first claim gives

$$s(x_0) = \mathrm{clip}(x_0, \ell, u) = x_0.$$

The derivative identity from the second claim gives

$$D^{\mathrm{sg}} s(x_0) = \frac{\mathrm{clip}(x_0, \ell, u)}{x_0} = \frac{x_0}{x_0} = 1.$$

Thus GClip agrees exactly with the unclipped map to first order at every interior point of the clipping interval. The proof is complete. ∎
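The derivative identity can be checked numerically without an autodiff framework. The sketch below is an illustration under stated assumptions, not the authors' implementation: it emulates stop-gradient by freezing the two $\operatorname{sg}(\cdot)$ terms at the evaluation point $x_0$, which leaves a function that is linear in $x$, and compares its finite-difference slope with $\mathrm{clip}(x_0, \ell, u)/x_0$. All function names and the bounds $\ell = 0.8$, $u = 1.2$ are chosen for the example. In an autodiff library, $\operatorname{sg}$ would typically be expressed with an operation such as `jax.lax.stop_gradient` or `torch.Tensor.detach`.

```python
def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(x, hi))


def gclip_forward(x: float, lo: float, hi: float) -> float:
    # sg(clip(x)) * x / sg(x): the ratio x / sg(x) equals 1 numerically.
    return clip(x, lo, hi) * (x / x)


def gclip_st_derivative(x: float, lo: float, hi: float) -> float:
    # Closed form from Theorem 1: clip(x, lo, hi) / x.
    return clip(x, lo, hi) / x


def frozen_surrogate(x: float, x0: float, lo: float, hi: float) -> float:
    # Emulates stop-gradient: both sg(...) terms are constants fixed at x0.
    return clip(x0, lo, hi) * x / x0


def finite_difference(x0: float, lo: float, hi: float, eps: float = 1e-6) -> float:
    up = frozen_surrogate(x0 + eps, x0, lo, hi)
    down = frozen_surrogate(x0 - eps, x0, lo, hi)
    return (up - down) / (2 * eps)


LO, HI = 0.8, 1.2  # illustrative clip bounds, not taken from the paper
for x0 in (0.4, 1.0, 2.0):  # below, inside, and above the clipping interval
    print(x0, gclip_forward(x0, LO, HI),
          gclip_st_derivative(x0, LO, HI), finite_difference(x0, LO, HI))
```

For the three probe points, the forward value equals the ordinary clipped ratio, while the slope stays positive on both saturated sides instead of collapsing to zero.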
###### Corollary 1 (Selected branches in the clipped surrogate).
Let

$$\phi_A(x) = -\min\bigl(xA,\ \operatorname{GClip}(x; \ell, u)\, A\bigr)$$

be the token-level loss branch used in Equation ([5](https://arxiv.org/html/2606.17890#S3.E5)), with $A$ treated as constant during differentiation and away from branch-tie points. If $A > 0$ and $x > u$, the clipped branch is selected and

$$\nabla_{\theta}^{\mathrm{sg}} \phi_A(x_{\theta}) = -A\, \frac{u}{x_{\theta}}\, \nabla_{\theta} x_{\theta}.$$

If $A < 0$ and $0 < x < \ell$, the clipped branch is also selected and

$$\nabla_{\theta}^{\mathrm{sg}} \phi_A(x_{\theta}) = -A\, \frac{\ell}{x_{\theta}}\, \nabla_{\theta} x_{\theta}.$$

Therefore, Equation ([5](https://arxiv.org/html/2606.17890#S3.E5)) preserves nonzero straight-through signal in the upper-side positive-advantage case and in the lower-side negative-advantage case. The first case supports learning positively preferred edited suffixes; the second case continues to penalize lower-ranked tokens whose ratios have moved below the lower clip bound.
###### Proof.
This follows directly from the branch selected by the minimum and the derivative identity $D^{\mathrm{sg}} \operatorname{GClip}(x; \ell, u) = \mathrm{clip}(x, \ell, u) / x$. Thus GClip leaves the clipped forward scalar value unchanged while replacing the ordinary clipped branch with a nonzero straight-through gradient in the saturation cases selected by the surrogate. ∎
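The branch selection in Corollary 1 can be made concrete with a small sketch. It returns the scalar coefficient $c$ such that the loss gradient equals $c\, \nabla_{\theta} x_{\theta}$, once for the GClip surrogate and once for ordinary clipping. The function names, the clip bounds, and the handling of exact ties are illustrative assumptions rather than details taken from the paper.

```python
def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(x, hi))


def gclip_grad_coeff(x: float, adv: float, lo: float, hi: float) -> float:
    """Coefficient c with grad(phi_A) = c * grad(x) under the GClip surrogate."""
    unclipped = x * adv
    clipped = clip(x, lo, hi) * adv
    if clipped < unclipped:
        # The minimum selects the clipped branch: straight-through slope clip(x)/x.
        return -adv * clip(x, lo, hi) / x
    # Unclipped branch. Inside [lo, hi] the two branches tie and share slope 1.
    return -adv


def standard_grad_coeff(x: float, adv: float, lo: float, hi: float) -> float:
    """Same coefficient under ordinary clipping, where a saturated branch is flat."""
    unclipped = x * adv
    clipped = clip(x, lo, hi) * adv
    if clipped < unclipped:
        return 0.0
    return -adv


LO, HI = 0.8, 1.2  # illustrative clip bounds
cases = [(2.0, 1.0), (0.4, -1.0), (0.4, 1.0), (2.0, -1.0)]
for x, adv in cases:
    print(x, adv, gclip_grad_coeff(x, adv, LO, HI), standard_grad_coeff(x, adv, LO, HI))
```

The first two cases are the saturated ones named in the corollary, where ordinary clipping gives a zero coefficient and GClip does not; in the last two cases the minimum selects the unclipped branch and both variants agree.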
## Appendix F Edited-Suffix Distribution Mismatch
The *Standard Clip* ablation in Appendix [C](https://arxiv.org/html/2606.17890#A3) is motivated by the local distribution mismatch introduced when an accepted training-time edit replaces the original continuation after the accepted editable boundary $\hat{k}$. The edited suffix no longer follows the sampled trajectory under $\pi_{\theta_{\mathrm{old}}}$, so its early tokens can receive low old-policy probability near the boundary. Figure [6](https://arxiv.org/html/2606.17890#A6.F6) shows a representative accepted pair over the first 80 suffix positions; the shaded cue region starts at $\hat{k}$ and includes the inserted final-answer cue. The original suffix remains on the original sampled trajectory, whereas the edited suffix begins with the verified final-answer string that closes the thinking process, followed by the regenerated content process. This mismatch can make positively preferred edited tokens more likely to enter upper-side clip saturation once their current-policy probability increases, which is the case where GClip preserves straight-through learning signal.
This mismatch also clarifies the status of the objective. The edited suffix is deliberately introduced as a verified synthetic alternative to the original continuation. Consequently, the edited-trajectory update should not be read as satisfying the usual on-policy sampling assumptions behind PPO-style clipping. Instead, DRE uses the clipped-ratio form as a bounded forward surrogate for a preference constructed from verifier-accepted edits.
Figure 6: Representative old-policy probability mismatch near an accepted editable boundary. The shaded cue region starts at the accepted boundary and includes the inserted final-answer cue. After the shared thinking prefix, the edited suffix often receives low old-policy probability near the boundary and over early edited-suffix tokens, while the original suffix remains on the original sampled trajectory.
## Appendix G Representative Rollouts Across Training
Figure [7](https://arxiv.org/html/2606.17890#A7.F7) shows a single prompt across early, middle, and late training steps from the same GRPO training run discussed in Section [2](https://arxiv.org/html/2606.17890#S2). The example illustrates the contrast between thinking before and after answer emergence in Figure [2](https://arxiv.org/html/2606.17890#S2.F2). Thinking before answer emergence can be useful: the middle training step fixes the early training step's failure to convert “3 hours and 24 minutes” into minutes before boxing the answer. Thinking after answer emergence can become overthinking: the late training step already boxes the correct answer, then enters an additional verification block and boxes the same answer again. To keep the appendix compact, we show only the final portion of each response after the shared algebraic setup.
**Prompt.** Every morning Aya goes for a 9-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s + 2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s + \tfrac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop.

The three excerpts are taken directly from stored model outputs; [earlier steps omitted] marks omitted prefix text, and line breaks are adjusted for layout.

**Early Training Step**

[earlier steps omitted] Using the first equation:

$$\frac{9}{2.5} + \frac{t}{60} = 4$$

$$3.6 + \frac{t}{60} = 4$$

$$\frac{t}{60} = 0.4$$

$$t = 24 \text{ minutes}$$

Step 5: Calculate the total time when Aya walks at $s + \frac{1}{2} = 3$ km/h.

$$\text{Total time} = \frac{9}{3} + \frac{24}{60} = 3 + 0.4 = 3.4 \text{ hours}$$

Convert 0.4 hours to minutes: $0.4 \times 60 = 24 \text{ minutes}$

So, the total time is 3 hours and 24 minutes.

Final Answer: $\boxed{344}$

**Middle Training Step**

[earlier steps omitted] Given $s = 2.5$ km/h, $s + \frac{1}{2} = 2.5 + 0.5 = 3 \text{ km/h}$

First, find the walking time at 3 km/h for 9 km: $T_w = \frac{9}{3} = 3 \text{ hours}$

Then, the total time including coffee shop time: $T = T_w + t = 3 + 0.4 = 3.4 \text{ hours}$

Convert 3.4 hours to hours and minutes: $0.4 \text{ hours} = 0.4 \times 60 = 24 \text{ minutes}$

So, $T = 3 \text{ hours and } 24 \text{ minutes}$

But the problem asks for the number of minutes. So, convert 3 hours and 24 minutes to minutes:

$$3 \times 60 + 24 = 180 + 24 = \mathbf{204} \text{ minutes}$$

Final Answer: $\boxed{\mathbf{204}}$

**Late Training Step**

[earlier steps omitted] Thus, the final answer is: $\boxed{\mathbf{204}}$

Wait, let’s ensure all steps are accurate.

Verification

Given:

1. $\frac{9}{s} + \frac{t}{60} = 4$ (1)
2. $\frac{9}{s+2} + \frac{t}{60} = 2.4$ (2)

Found: $s = 2.5$ km/h, $t = 24$ minutes

For third scenario:

- Speed: $s + \frac{1}{2} = 2.5 + 0.5 = 3$ km/h
- Time walking: $\frac{9}{3} = 3$ hours
- Time in coffee shop: $t = 24$ minutes $= 0.4$ hours
- Total time: $3 + 0.4 = 3.4$ hours
- Convert to minutes: $3.4 \times 60 = \mathbf{204}$ minutes

Everything seems consistent. Therefore, the answer is: $\boxed{\mathbf{204}}$
Figure 7: Single-prompt rollout progression across training. Each panel is an abridged excerpt from the stored model output at that training step; omitted text is marked explicitly, and the remaining wording is unchanged apart from markdown-to-LaTeX formatting. The early training step reaches the correct verbal conclusion but boxes $\boxed{344}$, showing that thinking before the final answer has not yet completed the required format conversion. The middle training step completes the needed conversion before boxing $\boxed{204}$. The late training step boxes $\boxed{204}$ and then continues into an additional verification block before boxing the same answer again. This example complements Figure [2](https://arxiv.org/html/2606.17890#S2.F2): thinking before $k^{\star}$ can improve correctness, while thinking after $k^{\star}$ can become overthinking attached to successful trajectories.
## Appendix H Representative Accepted Edit
Figure [8](https://arxiv.org/html/2606.17890#A8.F8) shows an accepted edit from the rollout-editing data used for the method-facing analyses. The example is chosen to make the edit operation visible: the two trajectories share a long thinking prefix, diverge after the accepted editable boundary $\hat{k}$, and the edited branch closes the thinking process with the verified answer before regenerating the content process. The left branch is the lower-scored original continuation, and the right branch is the preferred accepted edit.
**Shared thinking prefix through $\hat{k}$.**

[earlier thinking omitted] The step-by-step sum with the verified $a_n$ values gives 450, so I think that’s the answer.

Just to make 100% sure, let’s compute the sum again in a different order. Starting from $n = 11$ down to $n = 1$, the alternating sum gives $450$ again after the $n = 3$, $n = 2$, and $n = 1$ terms.

Yes! Same result. So regardless of the order, we get 450.

▼ accepted editable boundary $\hat{k}$ ▼

**Original continuation after $\hat{k}$**

I think that’s solid. The key was not getting confused by the initial simplification, computing the first few $a_n$ correctly, finding the pattern for the minima at the median, deriving the formulas, verifying with multiple methods, then carefully summing with the correct signs.

The problem says “for a positive integer $n$,” so $n$ starts at 1, which we did, and the sum is from $n = 1$ to 11, which we covered.

I don’t see any mistakes in the $a_n$ values or the summation now, so I think the answer is 450.

`</think>`

To solve the problem, we begin by simplifying the function:

$$f_n(x) = \left(\sum_{i=1}^{n} |x - i|\right)^{2} - \sum_{i=1}^{n} (x - i)^{2}.$$

**Accepted edit**

Final Answer

$\boxed{450}$

`</think>`

To solve the problem, we begin by analyzing the function:

$$f_n(x) = \left(\sum_{i=1}^{n} |x - i|\right)^{2} - \sum_{i=1}^{n} (x - i)^{2}.$$

This expression simplifies significantly because $|x - i|^{2} = (x - i)^{2}$. Expanding the square gives

$$f_n(x) = 2 \sum_{1 \leq i < j \leq n} |x - i|\,|x - j|.$$
Figure 8: A concrete accepted edit produced by dynamic rollout editing. The shared thinking prefix is preserved through the accepted editable boundary $\hat{k}$. The original branch continues with additional post-answer checking, while the accepted edit closes the thinking process with the verified answer and regenerates the content process. During training, the accepted edit is preferred over the original continuation, and prefix masking prevents the lower-scored original branch from directly penalizing the shared prefix.
## Appendix I Case Study: DRE Reduces Post-Solution Continuation
Figure [9](https://arxiv.org/html/2606.17890#A9.F9) presents a paired AIME24 example comparing the Raw Model and the DRE-trained model on the same prompt. Once the verified answer emerges, the Raw Model continues to reflect on and re-check the exposed value, producing a long post-solution continuation. In contrast, the DRE-trained model closes the thinking process shortly after deriving the same value.
**Prompt.** Jen picks $4$ distinct numbers from $\{1, \ldots, 10\}$, and $4$ numbers are randomly chosen from the same set. She wins a prize if at least two of her numbers are among the chosen numbers, and wins the grand prize if all four match. If the conditional probability of the grand prize given any prize is $m/n$, find $m + n$.

The panels summarize the same sample slot in the stored Raw Model and DRE outputs. The long repeated checks are compressed, and notation and line breaks are normalized for layout.

**Raw Model: Long Post-Answer Checking**

Answer emergence.

$$\begin{aligned} N_{\geq 2} &= \binom{4}{2}\binom{6}{2} + \binom{4}{3}\binom{6}{1} + \binom{4}{4} = 90 + 24 + 1 = 115, \\ \Pr(\mathrm{grand} \mid \mathrm{prize}) &= \frac{1}{115}, \\ m + n &= \mathbf{116}. \end{aligned}$$

*“$m + n = \mathbf{116}$? Wait, but wait, wait a second, is this correct?”*

The Raw Model then continues for over twelve thousand characters. It reopens the interpretation of “at least two,” recomputes the match counts, checks small examples and lottery variants, and repeatedly returns to the same value.

Late continuation.

$$\binom{4}{2}\binom{6}{2} = 90, \quad \binom{4}{3}\binom{6}{1} = 24, \quad \binom{4}{4} = 1.$$

*“So the answer should be $1 + 115 = \mathbf{116}$.”*

$\boxed{\mathbf{116}}$

**DRE-Trained Model: Immediate Closure**

Near the end of thinking. For exactly $k$ matches, choose $k$ numbers from Jen’s ticket and $4 - k$ from the other six numbers.

$$\binom{4}{2}\binom{6}{2} = 90, \quad \binom{4}{3}\binom{6}{1} = 24, \quad \binom{4}{4} = 1.$$

$$N_{\geq 2} = 90 + 24 + 1 = 115.$$

Only $1$ of these $115$ cases is the grand prize, so

$$\Pr(\mathrm{grand} \mid \mathrm{prize}) = \frac{1}{115}.$$

$$m = 1, \quad n = 115, \qquad m + n = 116.$$

Final Answer $\boxed{116}$

`</think>`

The number of prize-winning draws is $90 + 24 + 1 = 115$, and only $1$ of these draws is the grand prize. Hence

$$P(\text{Grand Prize} \mid \text{Prize}) = \frac{1}{115}.$$

Thus, $m = 1$, $n = 115$, and $m + n = 116$.

$\boxed{116}$
Figure 9: Paired Raw Model and DRE-trained outputs for an AIME24 prompt where both models reach the verified answer $116$. The Raw Model exposes the final requested value and then continues with an extended post-answer checking phase, including repeated reinterpretation and consistency checks. The DRE-trained model performs the necessary counting calculation and closes the thinking process almost immediately after answer emergence. This qualitative case illustrates the intended effect of DRE: reducing overthinking after answer emergence without changing a correct final answer.
## Appendix J Semantic Dynamics Under DRE
Figure [10](https://arxiv.org/html/2606.17890#A10.F10) gives a representative semantic-space analysis of the same target behavior. Following the reasoning-dynamics analysis of LLM overthinking (Wei et al., [2025](https://arxiv.org/html/2606.17890#bib.bib60)), we examine whether DRE shortens post-solution continuation after answer emergence without erasing thinking that may be needed before the answer-emergence boundary $k^{\star}$. For each thinking prefix $k$, we force the thinking process to stop at that prefix, sample content continuations, embed the sampled content processes, and average them into a prefix-level centroid $e_k$. The left panel projects these centroids into a shared two-dimensional PCA space for visualization only. The right panel reports the high-dimensional semantic residual $1 - \cos(e_k, \bar{e}_{\geq k^{\star}})$, where $\bar{e}_{\geq k^{\star}}$ is the centroid of the post-$k^{\star}$ prefix embeddings for the corresponding trajectory.
The original model continues through a longer sequence of prefix states and moves within a broad post-$k^{\star}$ semantic neighborhood. In contrast, the DRE-trained model reaches a compact region over a shorter prefix range. Because each residual curve uses its own post-$k^{\star}$ centroid, the residual values should be read as within-trajectory concentration rather than distances to a shared semantic target. This qualitative analysis is consistent with the main evidence in Section [4](https://arxiv.org/html/2606.17890#S4): DRE reduces continuation after answer emergence while preserving pre-$k^{\star}$ reasoning that may be needed to expose the verified answer.
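The semantic residual can be sketched as follows, assuming the prefix-level centroids $e_k$ have already been computed and are stored as plain lists of floats, ordered by prefix index. The zero-based indexing, the function names, and the toy two-dimensional vectors are assumptions made for illustration; the embedding model and the sampling step are outside this sketch.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty collection of vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_residuals(prefix_centroids: Sequence[Vector], k_star: int) -> List[float]:
    """1 - cos(e_k, e_bar) for each prefix k, with e_bar the mean of e_k for k >= k_star."""
    post_centroid = centroid(prefix_centroids[k_star:])
    return [1.0 - cosine(e_k, post_centroid) for e_k in prefix_centroids]


# Toy trajectory: one pre-emergence prefix, then two post-emergence prefixes.
toy = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(semantic_residuals(toy, k_star=1))
```

Because the reference centroid is recomputed per trajectory, two trajectories with similar residual curves need not end in the same semantic region, which is the caveat noted above.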
Figure 10: Supplementary representative semantic-dynamics analysis for the original model and the DRE-trained model. (a) Prefix-level content centroids are projected with a shared PCA basis; color intensity encodes the thinking prefix index $k$ separately for each trajectory. The red marker indicates the answer-emergence boundary $k^{\star}$ for the original model, and the shaded region summarizes the original model’s post-$k^{\star}$ neighborhood. (b) Semantic residual is computed in the original embedding space as distance to each trajectory’s own post-$k^{\star}$ centroid, so it measures within-trajectory concentration rather than distance to a shared target. The DRE-trained trajectory covers fewer prefix states and reaches a low-residual region within its shorter prefix range, while the original model continues with post-$k^{\star}$ semantic movement.
## Appendix K Declaration of AI Use
#### Writing assistance\.
Following the ACL policy on AI writing assistance, we disclose that ChatGPT was used only for language polishing during manuscript preparation\. Its use was limited to editing author\-written text for grammar, clarity, and readability\. It was not used to develop the method, design experiments, formulate scientific claims, interpret results, or produce experimental data\. The authors reviewed all AI\-assisted edits and remain responsible for the final manuscript\.
#### Experimental context\.
The Qwen3-family models and DeepSeek-R1-style RL setup discussed in this work are part of the research subject and experimental context. These models and settings were not used as writing assistants for drafting the manuscript.