Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking

arXiv cs.CL Papers

Summary

This paper introduces DASH, a method that uses intermediate answer commitments within reasoning traces to assign segment-level credit, reducing overthinking behaviors and improving accuracy on competition-level math benchmarks.

arXiv:2607.00482v1 Announce Type: new

# Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking
Source: [https://arxiv.org/html/2607.00482](https://arxiv.org/html/2607.00482)
Chia\-Hsuan Lee, Sihui Dai, Mingyang Zhou, Isha Slavin, Shi\-Xiong Zhang, Sambit Sahu, William Campbell \(Capital One\)

###### Abstract

Reasoning language models frequently overthink: generating extended chains of behaviors such as hedging, approach abandonment, and self contradiction that consume tokens without improving answers\. We show that these behaviors are not merely a consequence of length; even when controlling for response length, incorrect traces exhibit higher rates of unproductive self\-reflection than correct ones\. Addressing this requires identifying where self\-reflection helps vs hurts, but obtaining these step\-level annotations is costly\. We observe that intermediate answer commitments within reasoning traces can provide a cheap proxy: by comparing each final answer candidate in the trace to the ground truth, we can determine whether subsequent reflection is productive without any additional supervision\. Building on this insight, we propose DASH \(Drift Aware advantage SHaping\), which assigns segment\-level credit based on whether each reasoning segment leads toward or away from correctness\. On competition\-level math benchmarks, DASH achieves the highest accuracy where overthinking is prevalent \(AIME25: 50\.8% vs\. 45\.4% GRPO\) while reducing overthinking behaviors and achieving more productive self\-correction than baselines\.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2607.00482v1/x1.png)

Figure 1: Overview of DASH\. We decompose reasoning traces into segments bounded by intermediate answer checkpoints\. Segments leading to correct answers \(green\) receive positive advantage; segments leading to incorrect answers \(red\) receive escalating negative advantage\. Standard GRPO assigns uniform negative advantage to all tokens in an incorrect trace, discarding the structure within\.

Reasoning\-focused language models, such as DeepSeek\-R1 \(DeepSeek\-AI, [2025](https://arxiv.org/html/2607.00482#bib.bib4)\), achieve strong performance through extended chains of thought\. However, longer reasoning does not always help: models frequently exhibit overthinking behaviors such as hedging, re\-verifying, or switching approaches \(Wang et al\., [2025b](https://arxiv.org/html/2607.00482#bib.bib35); Peng et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib25)\), which can lead the model to an incorrect final answer\.

This motivates a natural question: can we train models to retain productive self\-reflection while suppressing unproductive self\-reflection? A significant challenge is cost: identifying whether self\-reflection is helpful at each reasoning step would require step\-wise labels from a process reward model \(Lightman et al\., [2024](https://arxiv.org/html/2607.00482#bib.bib14); Wang et al\., [2024](https://arxiv.org/html/2607.00482#bib.bib32)\), an LLM\-as\-a\-judge, or manual annotation \(Lightman et al\., [2024](https://arxiv.org/html/2607.00482#bib.bib14)\)\. In this work, we propose a cheaper alternative\.

Our key observation is that reasoning models can commit to intermediate answers within their thinking traces–for example, writing "the answer is X" or boxing a result before continuing to reason\. These commitments provide verifiable demonstrations of productive and unproductive self\-reflection: by comparing each to the ground truth, we know whether subsequent reflection improved or degraded the answer, without any external supervision\. When a model reaches a correct intermediate answer and then reflects its way to an incorrect one, we have direct evidence that this self\-reflection was harmful\.

Based on this intuition, we propose DASH \(Drift\-Aware advantage SHaping\), which turns traces whose intermediate answer drifts from correct to incorrect from wasted negatives into informative training examples\. Rather than assigning a single scalar advantage to the entire rollout, DASH divides each trace into segments bounded by consecutive answer checkpoints and assigns advantages based on whether each segment moves towards or away from the correct answer\. A single drift trace simultaneously teaches the model to reinforce the reasoning that found the correct answer and to suppress the overthinking that abandoned it, extracting a dual training signal from what GRPO would treat as a flat negative example\.

We complement DASH with six lightweight linguistic overthinking signals \(repetition, hedging, abandonment, contradiction, recomputation, and length outlier\) that characterize reasoning quality without requiring intermediate answer extraction\. These signals serve as evaluation metrics to verify that accuracy gains correspond to genuine behavioral improvements rather than superficial shortcuts\.

Experiments on Nemotron\-4B across four competition\-level math benchmarks show:

- Best accuracy where drift is worst\. DASH achieves 50\.8% on AIME25 \(vs\. 45\.4% GRPO, 46\.1% base\), the benchmark with the highest drift prevalence\.
- Self\-correction over spiraling\. DASH’s correct traces exhibit nearly twice as many contradiction\-then\-resolution patterns as GRPO’s \(0\.92 vs\. 0\.47 per trace\), while showing fewer blind approach abandonments: the model reasons longer but more productively\.

## 2 Analyzing Overthinking in Reasoning Traces

Prior work on reasoning efficiency has primarily characterized overthinking through response length: longer traces are treated as less efficient, and length penalties or early\-stopping mechanisms are used to encourage brevity \(Muennighoff et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib21)\)\. However, length alone does not provide a detailed signal of which overthinking behaviors the model exhibits\.

To address this, we first introduce a set of linguistic signals to analyze overthinking patterns\. All signals are regex\- or $n$\-gram\-based, requiring no learned components:

- S1 \(Repetition\): Maximum $n$\-gram overlap between sliding windows\. Captures circular reasoning loops where the model rephrases without progressing \(Duan et al\., [2026](https://arxiv.org/html/2607.00482#bib.bib5)\)\.
- S2 \(Hedging\): Density of uncertainty markers \(“wait, no,” “let me reconsider”\) per 100 tokens\. Operationalizes the self\-doubt mechanism preceding negative flips \(Zhou et al\., [2026](https://arxiv.org/html/2607.00482#bib.bib43)\)\.
- S3 \(Abandonment\): Count of explicit strategy switches \(“this approach is wrong,” “let me try another way”\)\. The strongest individual failure predictor: 4\.1–4\.3$\times$ more common in incorrect traces across all models tested\.
- S4 \(Contradiction\): Count of self\-contradiction markers \(“which is impossible,” “can’t be right”\)\. Captures unresolved inconsistencies \(Mündler et al\., [2024](https://arxiv.org/html/2607.00482#bib.bib22); Yang et al\., [2026](https://arxiv.org/html/2607.00482#bib.bib38)\)\.
- S5 \(Recomputation\): Numerical values derived 3\+ times in computation contexts\. Targets confirmatory re\-checking that rarely catches errors \(Long et al\., [2026](https://arxiv.org/html/2607.00482#bib.bib17)\)\.
- S6 \(Length outlier\): Per\-prompt group $z$\-score of response length\. Adaptive: hard problems warrant long responses, but within\-group outliers indicate pathology\.
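As a rough illustration of how such marker\-based signals can be computed \(a sketch under stated assumptions, not the authors' implementation\), the Python below counts hedging, abandonment, and contradiction markers and computes a within\-group length z\-score\. The marker lists contain only the example phrases quoted above; the full lists, the whitespace tokenization, the use of the population standard deviation, and all function names are assumptions\.

```python
import re
from statistics import mean, pstdev

# Hypothetical marker lists, seeded only with the example phrases quoted in the text.
HEDGING = [r"wait, no", r"let me reconsider"]
ABANDONMENT = [r"this approach is wrong", r"let me try another way"]
CONTRADICTION = [r"which is impossible", r"can[’']t be right"]


def count_markers(text: str, patterns: list[str]) -> int:
    """Count case-insensitive occurrences of any marker pattern (S3/S4-style count)."""
    return sum(len(re.findall(p, text, flags=re.IGNORECASE)) for p in patterns)


def density_per_100_tokens(text: str, patterns: list[str]) -> float:
    """Marker count per 100 whitespace-separated tokens (S2-style density)."""
    n_tokens = len(text.split())
    return 100.0 * count_markers(text, patterns) / n_tokens if n_tokens else 0.0


def length_zscores(lengths: list[int]) -> list[float]:
    """Z-score of each response length within one prompt's rollout group (S6)."""
    mu, sigma = mean(lengths), pstdev(lengths)
    return [0.0 if sigma == 0 else (n - mu) / sigma for n in lengths]
```

Because S6 is normalized within each prompt group, a long response to a hard problem is flagged only when it is long relative to the other rollouts for that same problem\.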

Using these signals, we analyze reasoning traces generated by Llama\-3\.1\-Nemotron\-Nano\-4B\-v1\.1 \(NVIDIA, [2025](https://arxiv.org/html/2607.00482#bib.bib23)\) on AIME 2025\. We bucket generated responses by length and analyze the correlation between correctness, length, and overthinking behaviors\. We present results for the abandonment and self\-contradiction dimensions in Figure [2](https://arxiv.org/html/2607.00482#S2.F2) and a full analysis in Appendix [C](https://arxiv.org/html/2607.00482#A3)\.

![Refer to caption](https://arxiv.org/html/2607.00482v1/x2.png)

Figure 2: Overthinking signals in Nemotron\-4B reasoning traces on AIME 2025 \(960 traces, 32 per problem\)\. Traces are grouped into quintiles by word count\. \(a\) Accuracy drops sharply with response length\. \(b–c\) Approach abandonment and self\-contradiction density \(per 100 words\), split by correctness within each length bucket\. Even controlling for length, incorrect traces exhibit higher rates of unproductive self\-reflection than correct traces, indicating these linguistic signals carry information beyond response length alone\.

Figure [2](https://arxiv.org/html/2607.00482#S2.F2) \(a\) shows that accuracy decreases monotonically with response length, confirming that the model’s additional computation in longer traces is largely unproductive\. Panels \(b\) and \(c\) reveal a more nuanced finding: even controlling for response length, incorrect traces exhibit higher rates of approach abandonment and self\-contradiction than correct traces within the same length bucket\. The gap is most pronounced for approach abandonment, where incorrect traces show consistently elevated density across all buckets, while self\-contradiction shows a similar but more moderate trend\.

These patterns suggest that much of the model’s unproductive computation involves cycles of self\-reflection that actively steer reasoning in the wrong direction—abandoning viable approaches or arriving at contradictions that undermine earlier progress\. This motivates our central hypothesis: if we can discourage self\-reflective behavior that ultimately leads to incorrect answers while preserving reflection that aids error correction, we may improve both reasoning efficiency and accuracy without simply truncating generation length\.

## 3 Drift\-Aware Advantage Shaping

One challenge is identifying where self\-reflection is needed to correct errors in reasoning and where it hurts performance\. In this section, we consider a cheap proxy: rather than annotating every self\-reflective step as helpful or unhelpful, we extract signals from self\-reflection that occurs *after* the model arrives at a candidate final answer\. Each candidate can be verified against the ground truth, which tells us whether the subsequent self\-reflection helped or hurt\.

### 3\.1 Preliminaries: GRPO

In Group Relative Policy Optimization \(Shao et al\., [2024](https://arxiv.org/html/2607.00482#bib.bib27)\), for each prompt $x$, the model generates $n$ rollouts $\{y_1, \ldots, y_n\}$\. Each rollout receives a reward $r_i$, and advantages are computed by group normalization:

$$A_i = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^{n})}{\text{std}(\{r_j\}_{j=1}^{n})} \tag{1}$$

This scalar $A_i$ is broadcast identically to every token position in rollout $y_i$, yielding per\-token advantages $a_t = A_i$ for all $t \in \{1, \ldots, |y_i|\}$\. The policy gradient loss is:

$$\mathcal{L} = -\frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\left(\rho_t A_i,\ \text{clip}(\rho_t, 1 \pm \epsilon) A_i\right) \tag{2}$$

where $\rho_t = \pi_\theta(y_t | x, y_{<t}) / \pi_{\text{ref}}(y_t | x, y_{<t})$\.
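To make Eqs\. \(1\) and \(2\) concrete, the following is a minimal Python sketch of group\-normalized advantages and of the clipped per\-token term inside the sum\. The small `eps` added to the denominator, the use of the population standard deviation, and the function names are assumptions made for illustration; they are not specified in the text\.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Eq. (1): subtract the group mean and divide by the group std.

    `eps` is an assumed guard against a zero std when all rewards are equal.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(rho: float, advantage: float, eps_clip: float = 0.2) -> float:
    """One summand of Eq. (2): min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_rho = max(1.0 - eps_clip, min(rho, 1.0 + eps_clip))
    return min(rho * advantage, clipped_rho * advantage)


def broadcast(advantage: float, length: int) -> list[float]:
    """Standard GRPO: every token of a rollout gets the same scalar advantage."""
    return [advantage] * length
```

With binary rewards, a group in which every rollout is correct \(or every rollout is incorrect\) yields zero advantage for all rollouts, which is why groups need a mix of outcomes to produce a training signal\.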

### 3\.2 Identifying Answer Drift

Our proposed algorithm is centered around the key idea of detecting portions of self\-reflection where the final answer is incorrect even though the trace had reached the correct solution at some point\. Formally, we define answer drift as:

###### Definition 3\.1 \(Answer Drift\)\.

Let $y$ be a reasoning trace produced for a question with ground\-truth answer $a^*$\. Suppose we extract from $y$ an ordered sequence of intermediate answer candidates $(\hat{a}_1, \ldots, \hat{a}_K)$, representing successive points at which the model commits to an answer before potentially reconsidering\. We say $y$ exhibits *answer drift* if $\hat{a}_K \neq a^*$ and there exists some $i \in \{1, \ldots, K-1\}$ such that $\hat{a}_i = a^*$, that is, the final answer is incorrect even though an earlier candidate was correct\.

To identify drift, we extract intermediate answer commitments within the reasoning trace by matching patterns including `\boxed{...}`, “the answer is $X$”, and natural\-language answer commitments\. Instances where drift occurs are prime examples of self\-reflection harming performance, which is the behavior we aim to penalize\.
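A minimal Python sketch of checkpoint extraction and drift detection is given below, following the correct\-to\-incorrect reading in the prose above \(the final candidate is wrong although an earlier one was right\)\. The regular expression covers only non\-nested boxed expressions and the literal phrase "the answer is"; the actual pattern set, the exact\-string comparison against the ground truth, and the function names are assumptions\.

```python
import re

# Hypothetical pattern: non-nested \boxed{...} or "the answer is X".
CHECKPOINT_RE = re.compile(
    r"\\boxed\{([^{}]*)\}|the answer is\s+([^\s.,;]+)", re.IGNORECASE
)


def extract_checkpoints(trace: str) -> list[tuple[int, str]]:
    """Return (end character offset, candidate answer) per commitment, in order."""
    checkpoints = []
    for m in CHECKPOINT_RE.finditer(trace):
        answer = m.group(1) if m.group(1) is not None else m.group(2)
        checkpoints.append((m.end(), answer.strip()))
    return checkpoints


def exhibits_drift(candidates: list[str], truth: str) -> bool:
    """True if the last candidate is wrong but some earlier candidate was right."""
    if not candidates:
        return False
    return candidates[-1] != truth and truth in candidates[:-1]
```

The offsets here are character positions; mapping them to token positions would depend on the tokenizer and is left out of this sketch\.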

### 3\.3 Segment\-Based Advantage Shaping

One important consideration is that we would like to maintain positive self\-correction even within traces with drift\. For example, a trace can oscillate between a correct answer candidate and an incorrect answer candidate, and we would like to encourage behavior where the model’s self\-reflection changes to the correct answer while penalizing shifts from correct to incorrect\. To handle this, rather than broadcasting a single advantage to all tokens, we divide each rollout into *segments* bounded by consecutive answer checkpoints $(\hat{a}_{j-1}, \hat{a}_j]$ and assign segment\-specific advantages\.

##### Segment construction\.

Given checkpoints at token positions $p_1 < p_2 < \ldots < p_K$ within the response, we define segments:

- Neutral segment $S_0$: tokens before the first checkpoint \($t < p_1$\)
- Positive segment $S_j^{+}$: tokens in segment $(\hat{a}_{j-1}, \hat{a}_j]$ where checkpoint $\hat{a}_j$ is correct
- Negative segment $S_j^{-}$: tokens in segment $(\hat{a}_{j-1}, \hat{a}_j]$ where checkpoint $\hat{a}_j$ is incorrect
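The segment construction above can be sketched in Python as follows\. This is an illustrative reading, not the authors' code: token positions are assumed to be 0\-indexed, and tokens that the definitions do not cover \(the token at the first checkpoint itself and any tokens after the last checkpoint\) are simply left out\. The `Segment` type and `build_segments` name are hypothetical\.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: int  # first token index, inclusive
    end: int    # last token index, inclusive
    label: str  # "neutral", "positive", or "negative"


def build_segments(positions: list[int], correct: list[bool]) -> list[Segment]:
    """Split a rollout into segments from checkpoint positions p_1 < ... < p_K.

    positions[j] is the token index of checkpoint j; correct[j] says whether
    that checkpoint's answer matches the ground truth.
    """
    if len(positions) != len(correct):
        raise ValueError("positions and correct must have the same length")
    segments: list[Segment] = []
    if not positions:
        return segments
    if positions[0] > 0:
        # Neutral segment S_0: tokens strictly before the first checkpoint.
        segments.append(Segment(0, positions[0] - 1, "neutral"))
    for j in range(1, len(positions)):
        # Half-open interval (p_{j-1}, p_j], labeled by checkpoint j.
        label = "positive" if correct[j] else "negative"
        segments.append(Segment(positions[j - 1] + 1, positions[j], label))
    return segments
```

For a trace that oscillates, such as correct, then incorrect, then correct again, this yields a negative segment followed by a positive one, so the drift and the later recovery are labeled separately\.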

##### Advantage assignment\.

For each token at position $t$:

$$a_t = \begin{cases} +|A_i| \cdot \alpha_+ & \text{if } t \in S_j^{+} \\ -|A_i| \cdot \alpha_- \cdot w(t) & \text{if } t \in S_j^{-} \\ A_i \cdot \alpha_n & \text{if } t \in S_0 \text{ (conditional)} \end{cases} \tag{3}$$

where $\alpha_+, \alpha_-$ are positive and negative scale factors, $\alpha_n$ is the neutral scale, and $w(t)$ is a length penalty weight defined below\.

##### Length penalty within negative segments\.

To encode “the longer you continue past a correct answer, the worse it gets,” tokens in negative segments receive an escalating penalty:

$$w(t) = 1 + \alpha \cdot \frac{t - t_{\text{start}}^{(k)}}{t_{\text{end}}^{(k)} - t_{\text{start}}^{(k)}} \tag{4}$$

where $\alpha$ controls the ramp rate and the penalty is capped at $w_{\max}$ to prevent gradient explosion\.
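Eqs\. \(3\) and \(4\) can be combined into a small per\-token function, sketched below in Python\. The default values are the ones reported in the training configuration \(Section 4\.1\.3\); treating the start and end indices as the bounds of the negative segment containing the token, and returning weight 1 for a zero\-length segment, are assumptions of this sketch\. Function and parameter names are hypothetical\.

```python
def ramp_weight(t: int, t_start: int, t_end: int,
                alpha: float = 3.0, w_max: float = 3.0) -> float:
    """Eq. (4): linear ramp from 1 at t_start up to 1 + alpha at t_end, capped at w_max."""
    span = t_end - t_start
    frac = (t - t_start) / span if span > 0 else 0.0  # assumed guard for empty spans
    return min(1.0 + alpha * frac, w_max)


def token_advantage(label: str, a_i: float, t: int, t_start: int, t_end: int,
                    alpha_pos: float = 1.0, alpha_neg: float = 1.0,
                    alpha_n: float = 0.1, alpha: float = 3.0,
                    w_max: float = 3.0) -> float:
    """Eq. (3): per-token advantage for a token in a labeled segment."""
    if label == "positive":
        return abs(a_i) * alpha_pos
    if label == "negative":
        return -abs(a_i) * alpha_neg * ramp_weight(t, t_start, t_end, alpha, w_max)
    if label == "neutral":
        return a_i * alpha_n
    raise ValueError(f"unknown segment label: {label}")
```

Note that with $\alpha = 3.0$ and $w_{\max} = 3.0$ the uncapped ramp would reach 4 at the end of a segment, so the cap becomes active for the last third of each negative segment\.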

##### Conditional neutral mode\.

Tokens before the first answer checkpoint \(the “neutral” segment\) present a design choice\. We implement three modes:

- zero: No gradient on neutral tokens \(conservative\)\.
- standard: Apply the standard GRPO advantage\.
- conditional: Apply a scaled advantage based on whether the trace eventually drifts, providing a weak signal that the initial reasoning trajectory correlates with drift\.

### 3\.4 Reward Shaping for Drift Traces

Before advantage computation, drift traces receive a*shaped reward*that reflects their partial correctness:

$$r_{\text{drift}} = r_{\text{incorrect}} + \delta \cdot \left(1 - \frac{L_{\text{post-drift}}}{L_{\text{total}}}\right) \tag{5}$$

where $\delta$ is the drift partial credit; the shaped reward decreases as the post\-drift section grows\. The shaped reward is bounded below, $r_{\text{drift}} \geq r_{\text{incorrect}}$, ensuring drift traces are never penalized more harshly than pure\-incorrect traces\.
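Eq\. \(5\) and its lower bound can be written as a short Python function\. This is a sketch only: the value of $\delta$ is not given in this section, so the numbers in the usage note are purely illustrative, and the function name is hypothetical\.

```python
def drift_reward(r_incorrect: float, delta: float,
                 post_drift_len: int, total_len: int) -> float:
    """Eq. (5): partial credit that shrinks as the post-drift portion grows.

    The result is floored at r_incorrect so that a drift trace is never
    rewarded less than a purely incorrect one.
    """
    if total_len <= 0:
        raise ValueError("total_len must be positive")
    shaped = r_incorrect + delta * (1.0 - post_drift_len / total_len)
    return max(shaped, r_incorrect)
```

For example, with an illustrative `delta` of 0\.5 and `r_incorrect` of 0, a trace that drifts only at its very end receives close to 0\.5, while a trace whose second half is post\-drift receives 0\.25\.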

## 4 Experimental Results

### 4\.1 Experimental Setup

#### 4\.1\.1 Model and Data

We evaluate on Llama\-3\.1\-Nemotron\-Nano\-4B\-v1\.1 \(NVIDIA, [2025](https://arxiv.org/html/2607.00482#bib.bib23)\), a 4B\-parameter reasoning model with a “detailed thinking” mode triggered by a system prompt\. This model exhibits high drift rates \($\sim$20% on AIME\-level problems\), making it a natural testbed for drift\-aware training\. Training data consists of 16\.5K mathematical reasoning problems sampled from OpenR1\-Math\-220K \(Open R1 Team, [2025](https://arxiv.org/html/2607.00482#bib.bib24)\), sourced from NuminaMath 1\.5 \(AI\-MO Team, [2024](https://arxiv.org/html/2607.00482#bib.bib2)\) with reasoning traces from DeepSeek\-R1 \(DeepSeek\-AI, [2025](https://arxiv.org/html/2607.00482#bib.bib4)\) verified by Math\-Verify\. We restrict to problems with both correct and incorrect verified traces to ensure meaningful GRPO signal\.

#### 4\.1\.2 Evaluation

We evaluate on four competition\-level mathematical reasoning benchmarks: OlympiadBench \(He et al\., [2024](https://arxiv.org/html/2607.00482#bib.bib9)\), AMC 2023 \(Mathematical Association of America, [2023](https://arxiv.org/html/2607.00482#bib.bib20)\), AIME 2024 \(Math\-AI, [2025](https://arxiv.org/html/2607.00482#bib.bib18)\), and AIME 2025 \(Math\-AI, [2026](https://arxiv.org/html/2607.00482#bib.bib19)\)\. For AMC23, AIME24, and AIME25, we report avg@32 \(average accuracy over 32 sampled solutions per problem\)\. For OlympiadBench, we report pass@1\.
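The avg@32 metric can be computed as in the short Python sketch below, which averages per\-problem accuracy over the sampled solutions and then averages across problems\. The function name and the input layout \(one list of correctness flags per problem\) are assumptions for illustration\.

```python
def avg_at_k(correct_by_problem: list[list[bool]]) -> float:
    """Mean accuracy (in percent) over k sampled solutions, averaged across problems."""
    per_problem = [sum(samples) / len(samples) for samples in correct_by_problem]
    return 100.0 * sum(per_problem) / len(per_problem)
```

With 32 samples per problem, each inner list would have length 32; unlike pass@k, a problem contributes full credit only if all of its samples are correct\.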

#### 4\.1\.3 Training Configuration

All runs use GRPO with group size $n = 16$, trained on 4 nodes of 8 H100 GPUs\. For our drift\-aware \(DASH\) runs, we set the positive and negative advantage scales to $\alpha_+ = 1.0$ and $\alpha_- = 1.0$, use a length penalty with $\alpha = 3.0$ and $w_{\max} = 3.0$, and apply the conditional neutral mode with $\alpha_n = 0.1$\. Full training hyperparameters are provided in Appendix [F](https://arxiv.org/html/2607.00482#A6)\.

#### 4\.1\.4 Baselines

Along with the base model and standard GRPO, we compare against two baselines designed to address lengthy reasoning traces:

##### DR\-GRPO \(Liu et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib16)\)\.

DR\-GRPO is a debiased variant of GRPO that addresses an implicit length bias in the optimization objective\. It makes two modifications: \(1\) replacing per\-response length normalization in the policy loss with a constant scaling factor, which ensures that longer and shorter responses receive equal per\-token gradient weight, and \(2\) computing advantages as $\hat{A}_i = r_i - \text{mean}(\{r_j\}_{j=1}^{n})$ without dividing by $\text{std}(\{r_j\}_{j=1}^{n})$, which removes a question\-level difficulty bias\. These changes have been shown to improve token efficiency and reduce the length of incorrect responses\. This baseline tests whether explicit credit assignment over the reasoning trajectory is more effective at reducing drift and overthinking behavior than a debiased optimizer\.

##### GRPO \+ Brevity Bonus\.

We introduce a simple reward shaping baseline that explicitly encourages shorter correct traces\. For each prompt group, we identify the length of the shortest correct response $l_{\min}$\. Each correct trace of length $l_i$ then receives an additive per\-token bonus $b_i = \frac{\beta \cdot l_{\min}}{l_i^2}$, where $\beta$ is a scale hyperparameter\. Summed over its $l_i$ tokens, a trace therefore receives a total bonus of $\beta \cdot l_{\min} / l_i$: the shortest correct trace receives the maximum total bonus of $\beta$, while longer correct traces receive proportionally less\. Incorrect traces receive no bonus\. This baseline tests whether a simple length pressure on correct traces, without any fine\-grained credit assignment over the reasoning trajectory, suffices to reduce drift and overthinking behavior\. In our experiments, we use $\beta = 0.2$\.
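A minimal Python sketch of the brevity bonus for one prompt group is shown below\. The function name and the handling of groups with no correct trace \(all bonuses zero\) are assumptions\.

```python
def brevity_bonuses(lengths: list[int], correct: list[bool],
                    beta: float = 0.2) -> list[float]:
    """Per-token bonus beta * l_min / l_i**2 for correct traces, 0 for incorrect ones."""
    correct_lengths = [n for n, ok in zip(lengths, correct) if ok]
    if not correct_lengths:
        return [0.0] * len(lengths)  # assumed: no correct trace, no bonus
    l_min = min(correct_lengths)
    return [beta * l_min / (n * n) if ok else 0.0
            for n, ok in zip(lengths, correct)]
```

Multiplying each per\-token bonus by the trace length recovers the total bonus: for correct traces of 100 and 200 tokens with $\beta = 0.2$, the totals are 0\.2 and 0\.1 respectively\.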

### 4\.2 Results

#### 4\.2\.1 Main Results

Table [1](https://arxiv.org/html/2607.00482#S4.T1) presents accuracy across benchmarks\.

| Method | OlympiadBench | AMC23 | AIME24 | AIME25 | Average |
|---|---|---|---|---|---|
| *Baselines* | | | | | |
| Nemotron\-4B | 60\.9 | 90\.8 | 60\.5 | 46\.1 | 64\.57 |
| GRPO | **67\.3** | 92\.8 | 61\.8 | 45\.4 | 66\.83 |
| DR\-GRPO | 66\.7 | **94\.3** | **62\.2** | 50\.3 | **68\.38** |
| GRPO \+ Brevity Bonus | 66\.5 | 91\.6 | 58\.5 | 45\.9 | 65\.63 |
| DASH \(Ours\) | 65\.2 | 91\.8 | 61\.1 | **50\.8** | 67\.22 |

Table 1: Main results\. DASH compared against RL baselines on competition math benchmarks\. Best per column in **bold**\. Average is computed over OlympiadBench, AMC23, AIME24, and AIME25\.

##### Key observations\.

\(1\) Strongest performance on the hardest benchmark\. DASH achieves the highest AIME25 accuracy \(50\.8%\), outperforming GRPO and DR\-GRPO\. AIME25 is the benchmark with the highest drift prevalence and longest reasoning traces, suggesting that segment\-level credit assignment is most impactful where overthinking is most severe\.

\(2\) GRPO’s blind spot on AIME25\. Standard GRPO improves over the base model on OlympiadBench \(\+6\.4\) and AIME24 \(\+1\.3\) but *degrades* on AIME25 \(−0\.7\)\. DASH avoids this regression, improving AIME25 by \+4\.7 over base\. This aligns with our drift analysis: AIME25 has the highest correct\-to\-wrong drift rate, and GRPO’s uniform negative advantage on drift traces inadvertently penalizes the valid reasoning prefix, discouraging the strategies that found the correct intermediate answer\.

#### 4\.2\.2 Drift Reduction

![Refer to caption](https://arxiv.org/html/2607.00482v1/x3.png)

Figure 3: Accuracy vs\. self\-correction rate on AIME 2024 \(top\) and AIME 2025 \(bottom\)\. The x\-axis reports the percentage of answer flips that are beneficial \(wrong→right\), computed as W→R / \(W→R \+ R→W\) across all traces where consecutive answer checkpoints differ in correctness\. Methods in the upper\-right region achieve both high accuracy and effective self\-correction\.

In Figure [3](https://arxiv.org/html/2607.00482#S4.F3), we plot the rate of positive answer changes \(Wrong → Right\) as a percentage of all flips in answer correctness, for traces generated on AIME by DASH and the baselines\. We observe that DASH maintains the highest combined accuracy and self\-correction rate among methods that do not sacrifice accuracy, while Brevity Bonus achieves a high W→R rate but at the cost of reduced accuracy\.

## 5 Analysis

### 5\.1 Ablation Studies

| Variant | OlympiadBench | AMC23 | AIME24 | AIME25 | Average |
|---|---|---|---|---|---|
| DASH \(full\) | 65\.2 | **91\.8** | 61\.1 | **50\.8** | **67\.22** |
| w/o length penalty on negative | 63\.1 | 91\.2 | **61\.3** | 48\.6 | 66\.05 |
| w/o conditional neutral | **66\.7** | 90\.4 | 61\.1 | 48\.9 | 66\.78 |

Table 2: Ablation study on DASH components\. Best per column in **bold**\. Average is computed over OlympiadBench, AMC23, AIME24, and AIME25\.

##### Ablation results\.

Table [2](https://arxiv.org/html/2607.00482#S5.T2) isolates the contribution of two DASH components\. Removing the length penalty on negative segments \(which escalates the penalty for tokens further past a correct answer\) reduces AIME25 by 2\.2 points and the average by 1\.18, confirming that the “longer you continue past a correct answer, the worse it gets” signal is load\-bearing\. Removing the conditional neutral mode \(which provides a weak gradient on tokens before the first answer checkpoint\) has a smaller effect on AIME25 \(−1\.9\) but interestingly improves OlympiadBench \(\+1\.5\), suggesting that pre\-answer tokens on OlympiadBench contain less drift\-predictive signal\. The full DASH configuration achieves the best average and the best AIME25 score, indicating that both components contribute complementary value\.

### 5\.2 Overthinking Signal Analysis

![Refer to caption](https://arxiv.org/html/2607.00482v1/figures/signal_radar_emnlp.png)

Figure 4: Overthinking signal profile on AIME25\. Each axis represents one of six linguistic signals, normalized to $[0, 1]$\. Smaller area = less overthinking\. DASH achieves the lowest or near\-lowest values on five of six signals while attaining the highest accuracy \(50\.8% avg@32\)\. The sole elevated signal, contradiction \(s4\), reflects productive self\-monitoring \(§[D](https://arxiv.org/html/2607.00482#A4)\)\.

Figure [4](https://arxiv.org/html/2607.00482#S5.F4) compares five training methods across six overthinking signals on AIME25\. DASH achieves the tightest profile: lowest repetition \(s1: 0\.40 vs\. 0\.47 Nemotron\-4B\), lowest abandonment \(s3: 2\.59 vs\. 2\.70 GRPO\), and lowest length\-outlier \(s6: 0\.008 vs\. 0\.010 GRPO\), while producing 11\.8% longer traces and achieving the highest accuracy \(50\.8% vs\. 45\.4% GRPO avg@32\)\.

##### Contradiction as self\-monitoring\.

The one axis where DASH exceeds GRPO is contradiction \(s4: 2\.67 vs\. 1\.63\)\. Rather than indicating overthinking, this reflects a qualitative shift in error\-handling strategy\. In GRPO, contradictions are 6\.1$\times$ more frequent in incorrect than correct traces; they are almost exclusively failure symptoms\. In DASH, this discrimination ratio drops to 3\.9$\times$ because DASH’s *correct* traces contain nearly twice as many contradictions \(0\.92 vs\. 0\.47 per trace\), using them as an error\-monitoring mechanism: detect inconsistency → diagnose root cause → resolve → proceed\. GRPO instead responds to confusion by silently abandoning approaches \(8 abandonments per trace in the example of Figure [6](https://arxiv.org/html/2607.00482#A4.F6)\) without ever identifying what went wrong\. Full discrimination statistics are in Table [4](https://arxiv.org/html/2607.00482#A4.T4) \(Appendix [D](https://arxiv.org/html/2607.00482#A4)\), and a head\-to\-head trace comparison on AIME25 Problem 20 is in Figure [6](https://arxiv.org/html/2607.00482#A4.F6)\.

In summary: DASH’s reasoning is longer but*not*more wasteful—it exhibits less repetition, less abandonment, and more deliberate error\-checking, consistent with a model that has learned to self\-correct rather than spiral\.

### 5\.3 Qualitative Analysis

| | AIME 2025 Problem \#1 | AIME 2025 Problem \#2 |
|---|---|---|
| Problem | Let $N$ be the number of 8\-digit positive integers using digits $1$–$8$ exactly once divisible by $22$\. Find $N - 2025$\. \(Answer: 279\) | Let $A$ be the set of positive integer divisors of 2025 and $B$ a random subset of $A$\. The probability that $B$ is nonempty with $\mathrm{lcm}(B) = 2025$ is $m/n$\. Find $m + n$\. \(Answer: 237\) |
| Base Model | ✓ “2304 − 2025 = 279\.”<br>↺ “Is that the answer? Wait, let me verify…”<br>✓ “Then, 2304 − 2025 = 279\.”<br>↺ “Is that correct? Let me verify with another approach…”<br>✓ “Maybe the answer is 279, but I need to verify\.”<br>↺ “Wait, let’s check another resource…”<br>⋯<br>✗ Token limit exhausted\. No final answer\. | ✓ “m \+ n = 237\.”<br>↺ “Wait, hold on\. Let me double\-check…”<br>✓ “Therefore, m \+ n = 237\.”<br>↺ “But wait, let me think again\. Let’s think differently…”<br>⋯<br>✗ Token limit exhausted\. No final answer\. |
| DASH | ✓ “So 2304 − 2025 = 279\.”<br>⊳ “Wait, let me double\-check the logic…”<br>✓ “The calculation seems correct\.”<br>✓ $\boxed{279}$ Correct\. | ✓ “The answer is 109 \+ 128 = 237\.”<br>⊳ “Let me just make sure I didn’t make a mistake…”<br>✓ “So m \+ n = 237\.”<br>✓ $\boxed{237}$ Correct\. |

Table 3: Qualitative comparison of reasoning traces on two AIME 2025 problems\. The base model reaches the correct answer repeatedly \(✓\) but enters self\-verification loops \(↺\) until the token limit is exhausted\. DASH commits to the answer after brief verification \(⊳\), avoiding drift entirely\.

To demonstrate how our drift\-aware algorithm mitigates answer drift, we analyze two concrete case studies from the AIME 2025 evaluation set in Table [3](https://arxiv.org/html/2607.00482#S5.T3)\. In both instances, the base model finds the correct solution in its intermediate reasoning trace but then moves away from it \(often numerous times\) by re\-verifying or trying alternative approaches\. It ultimately exhausts the token limit without generating a final answer\. In contrast, the DASH\-trained model stabilizes after a single validation phase and commits to the correct solution\.

## 6 Related Work

### 6\.1 Overthinking and Reasoning Efficiency

Excessive reasoning in LLMs has been documented across multiple studies\. Chen et al\. \([2024](https://arxiv.org/html/2607.00482#bib.bib3)\) first named the “overthinking” phenomenon in o1\-like models, showing that models over\-allocate compute to simple problems\. Su et al\. \([2025](https://arxiv.org/html/2607.00482#bib.bib29)\) demonstrated a U\-shaped relationship between reasoning length and accuracy, and the “Reasoning Completion Point” framework \(Wei et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib36)\) formally showed that once a model reaches its peak correctness probability, continued reasoning almost never improves the answer; it primarily re\-confirms or flips to wrong\. Wang et al\. \([2025b](https://arxiv.org/html/2607.00482#bib.bib35)\) link drift to thought\-switching, and Peng et al\. \([2025](https://arxiv.org/html/2607.00482#bib.bib25)\) trace the mechanism to self\-doubt after correct answers\. Rather than hoping continued reasoning self\-corrects, we train the model to recognize when to stop *within* a trace\.

### 6\.2 Length Control in RL for Reasoning

Several methods address the length problem in GRPO training\.

##### Length\-penalized rewards

Dr\. GRPO \(Liu et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib16)\) corrects a per\-token normalization bias that dilutes penalties for long incorrect traces; DAPO \(Yu et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib41)\) uses token\-level loss aggregation; ShorterBetter \(Yi et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib40)\) introduces “Sample Optimal Length” as a reward signal\.

##### Progressive constraints

ThinkPrune \(Hou et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib10)\) iteratively tightens token budgets, and L1/LCPO \(Aggarwal and Welleck, [2025](https://arxiv.org/html/2607.00482#bib.bib1)\) adds explicit length\-controlled objectives\.

##### Decoupled normalization

DRPO \(Li et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib13)\) normalizes correct and incorrect rollouts separately, preventing length effects from corrupting advantage estimates\. Our method adopts decoupled normalization as a component but adds segment\-level granularity within individual traces\.

All of these approaches use token count as a proxy for overthinking—a lossy signal that penalizes all length equally, including legitimate complex reasoning\. Our method instead targets*why*a response is long \(detecting actual answer drift\) rather than*that*it is long\.

### 6\.3 Token\- and Segment\-Level Credit Assignment

Standard GRPO broadcasts a single advantage to all tokens\. Several methods provide finer\-grained credit through learned weighting \(GTPO \(Tan et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib30)\), $\lambda$\-GRPO \(Wang et al\., [2025a](https://arxiv.org/html/2607.00482#bib.bib34)\)\), external judges \(CAPO \(Xie et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib37)\)\), or Monte Carlo value estimation \(VinePPO \(Kazemnejad et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib12)\), SPO \(Guo et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib8)\)\)\.
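For reference, the trajectory-level broadcast mentioned above can be sketched as follows. This is a minimal illustration of group-normalized outcome advantages as GRPO is commonly described, not code from any of the cited systems. The function name `grpo_token_advantages`, the `eps` constant, and the choice of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, lengths, eps=1e-6):
    """Broadcast one group-normalized advantage to every token of each rollout.

    rewards: one scalar outcome reward per rollout in the prompt group.
    lengths: number of generated tokens in each rollout.
    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    per_rollout = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in rollout i receives the same value per_rollout[i].
    return [[a] * n for a, n in zip(per_rollout, lengths)]


if __name__ == "__main__":
    adv = grpo_token_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 1])
    for row in adv:
        print(row)
```

Because each row holds a single repeated value, a useful prefix and a harmful suffix inside the same rollout are indistinguishable to the optimizer; the finer-grained methods listed here differ in how they break that tie.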

Most closely related to our work, VPPO \(Liu et al\., [2026](https://arxiv.org/html/2607.00482#bib.bib15)\) uses a process reward model \(PRM\) to localize the first incorrect step in a failed rollout, partitioning the trajectory into a verified correct prefix \(rewarded\) and an erroneous suffix \(penalized\)\. This shares our intuition that not all tokens in a failed trace deserve equal blame\. However, VPPO relies on an external PRM for error localization and targets *correctness* boundaries, whereas our method uses the model’s own intermediate answers to detect *drift* boundaries\. Our method therefore requires no additional models and applies to both correct and incorrect traces, since a correct trace that drifts before self\-correcting still wastes compute\. VPPO’s binary prefix/suffix partition also does not capture the richer structure of traces with multiple answer changes, which our segment\-based formulation handles naturally\.

### 6\.4 Inference\-Time Early Exit

Orthogonal to our training\-time approach, several methods diagnose overthinking at inference \(Fu et al\., [2026](https://arxiv.org/html/2607.00482#bib.bib7); Yang et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib39); Wang et al\., [2026](https://arxiv.org/html/2607.00482#bib.bib33); Muennighoff et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib21)\)\. These are complementary to ours: a model trained with drift\-aware shaping could additionally use inference\-time early exit for further efficiency gains\. We discuss extended related work in Appendix [G](https://arxiv.org/html/2607.00482#A7)\.

## 7 Conclusion

We presented DASH, a segment\-level credit assignment method for reducing overthinking in reasoning language models\. The core idea is to use the model’s own answer checkpoints as free supervision: when a trace reaches a correct answer and later moves away from it, the trace reveals both a useful reasoning prefix and a harmful reflective suffix\. DASH preserves this structure by rewarding segments that lead to correct checkpoints and penalizing segments that lead away from them\.
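The checkpoint idea can be illustrated with a small sketch. It is not the DASH advantage estimator: it only shows how answer checkpoints, compared against the ground truth, could split a trace into segments that lead toward or away from correctness. The function name `label_segments`, the three label strings, exact string matching of answers, and the treatment of trailing tokens as neutral are all assumptions made for illustration.

```python
def label_segments(checkpoints, ground_truth, trace_len):
    """Split a trace at answer checkpoints and label each segment.

    checkpoints: list of (token_index, answer) pairs in trace order, where
        token_index is the position at which the answer was committed.
    Returns a list of (start, end, label) tuples with end exclusive.
    """
    segments = []
    start, prev_correct = 0, False
    for end, answer in checkpoints:
        correct = answer == ground_truth  # assumption: exact string match
        if correct:
            label = "toward"   # segment leads to a correct checkpoint
        elif prev_correct:
            label = "away"     # drift: a correct answer was abandoned
        else:
            label = "neutral"  # wrong before and still wrong
        segments.append((start, end, label))
        start, prev_correct = end, correct
    if start < trace_len:
        # Assumption: tokens after the last checkpoint carry no signal.
        segments.append((start, trace_len, "neutral"))
    return segments


if __name__ == "__main__":
    trace = [(10, "237"), (25, "240"), (40, "237")]
    for seg in label_segments(trace, "237", trace_len=50):
        print(seg)
```

In this toy trace the model reaches 237, drifts to 240, and recovers, so the middle segment is the only one labeled as moving away from correctness.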

Our analysis shows why this distinction matters\. Overthinking is not simply a matter of response length; even among traces of comparable length, incorrect solutions contain more abandonment and unresolved contradiction\. On competition math benchmarks, DASH is most effective where this failure mode is most prevalent, achieving the best AIME25 accuracy while reducing repetition, abandonment, and length outliers\. The remaining contradiction signal becomes more productive: DASH uses contradictions for diagnosis and recovery rather than as a prelude to spiraling\.

These results suggest that efficient reasoning training should focus less on making models universally shorter and more on teaching them when reflection has stopped being useful\. Segment\-level credit from answer checkpoints offers a simple way to make that distinction without process labels or external judges\.

## Limitations

Our work has several limitations\. First, experiments are conducted on a 4B model; while our analysis suggests drift patterns are scale\-invariant, training dynamics may differ at larger scales\. Second, our method requires extractable intermediate answers for drift detection, which limits applicability to domains with verifiable checkpoints \(mathematics, code with test cases\)\. Open\-ended reasoning tasks without clear answer markers would require alternative drift indicators\. Third, the slight accuracy decrease on some easy benchmarks for Nemotron \(MATH\-500: $-1.7$ pp vs\. base\) represents a real trade\-off that practitioners must weigh against gains on hard problems\. Finally, our evaluation is limited to mathematical reasoning; generalization to other reasoning domains \(logical, scientific, commonsense\) remains to be validated\.

Future work could explore difficulty\-adaptive methods that modulate drift penalty strength based on problem difficulty or model uncertainty\.

As with any RL fine\-tuning method, DASH inherits the safety properties and failure modes of the base model; we do not introduce safety mitigations beyond those of the upstream models\.

## References

- Aggarwal and Welleck \(2025\) Pranjal Aggarwal and Sean Welleck\. 2025\. [L1: Controlling how long a reasoning model thinks with reinforcement learning](https://openreview.net/forum?id=4jdIxXBNve)\. In *Proceedings of the Second Conference on Language Modeling*\.
- AI\-MO Team \(2024\) AI\-MO Team\. 2024\. NuminaMath 1\.5\. [https://huggingface\.co/datasets/AI\-MO/NuminaMath\-1\.5](https://huggingface.co/datasets/AI-MO/NuminaMath-1.5)\.
- Chen et al\. \(2024\) Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu\. 2024\. [Do NOT think that much for 2\+3=? on the overthinking of o1\-like LLMs](https://arxiv.org/abs/2412.21187)\. *Preprint*, arXiv:2412\.21187\.
- DeepSeek\-AI \(2025\) DeepSeek\-AI\. 2025\. [DeepSeek\-R1: Incentivizing reasoning capability in LLMs via reinforcement learning](https://arxiv.org/abs/2501.12948)\. *Preprint*, arXiv:2501\.12948\.
- Duan et al\. \(2026\) Zenghao Duan, Liang Pang, Zihao Wei, Wenbin Duan, Yuxin Tian, Shicheng Xu, Jingcheng Deng, Zhiyi Yin, and Xueqi Cheng\. 2026\. [Circular reasoning: Understanding self\-reinforcing loops in large reasoning models](https://arxiv.org/abs/2601.05693)\. *Preprint*, arXiv:2601\.05693\.
- Fang et al\. \(2025\) Gongfan Fang, Xinyin Ma, and Xinchao Wang\. 2025\. [Thinkless: LLM learns when to think](https://openreview.net/forum?id=ariVQf0KZx)\. In *Advances in Neural Information Processing Systems*\.
- Fu et al\. \(2026\) Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Yonghao Zhuang, Yian Ma, Aurick Qiao, Tajana Rosing, Ion Stoica, and Hao Zhang\. 2026\. [Efficiently scaling LLM reasoning programs with certaindex](https://openreview.net/forum?id=nn51ewu5k2)\. In *The Thirty\-ninth Annual Conference on Neural Information Processing Systems*\.
- Guo et al\. \(2025\) Yiran Guo, Lijie Xu, Jie Liu, Ye Dan, and Shuang Qiu\. 2025\. [Segment policy optimization: Effective segment\-level credit assignment in RL for large language models](https://openreview.net/forum?id=9osvTOYbT4)\. In *Advances in Neural Information Processing Systems*\.
- He et al\. \(2024\) Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun\. 2024\. [OlympiadBench: A challenging benchmark for promoting AGI with olympiad\-level bilingual multimodal scientific problems](https://aclanthology.org/2024.acl-long.211/)\. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 3828–3850\. Association for Computational Linguistics\.
- Hou et al\. \(2025\) Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang\. 2025\. [Thinkprune: Pruning long chain\-of\-thought of LLMs via reinforcement learning](https://arxiv.org/abs/2504.01296)\. *Preprint*, arXiv:2504\.01296\.
- Hu \(2025\) Jian Hu\. 2025\. [REINFORCE\+\+: A simple and efficient approach for aligning large language models](https://arxiv.org/abs/2501.03262)\. *Preprint*, arXiv:2501\.03262\.
- Kazemnejad et al\. \(2025\) Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux\. 2025\. [VinePPO: Refining credit assignment in RL training of LLMs](https://proceedings.mlr.press/v267/kazemnejad25a.html)\. In *Proceedings of the 42nd International Conference on Machine Learning*, volume 267 of *Proceedings of Machine Learning Research*, pages 29557–29590\. PMLR\.
- Li et al\. \(2025\) Gang Li, Yan Chen, Ming Lin, and Tianbao Yang\. 2025\. [DRPO: Efficient reasoning via decoupled reward policy optimization](https://arxiv.org/abs/2510.04474)\. *Preprint*, arXiv:2510\.04474\.
- Lightman et al\. \(2024\) Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe\. 2024\. Let’s verify step by step\. In *International Conference on Learning Representations*, volume 2024, pages 39578–39601\.
- Liu et al\. \(2026\) Haolin Liu, Dian Yu, Sidi Lu, Yujun Zhou, Rui Liu, Zhenwen Liang, Haitao Mi, Chen\-Yu Wei, and Dong Yu\. 2026\. [Save the good prefix: Precise error penalization via process\-supervised RL to enhance LLM reasoning](https://arxiv.org/abs/2601.18984)\. *Preprint*, arXiv:2601\.18984\.
- Liu et al\. \(2025\) Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin\. 2025\. [Understanding R1\-Zero\-Like training: A critical perspective](https://openreview.net/forum?id=5PAF7PAY2Y)\. In *Proceedings of the Second Conference on Language Modeling*\.
- Long et al\. \(2026\) Quanyu Long, Kai Jie Jiang, Jianda Chen, Xu Guo, Leilei Gan, and Wenya Wang\. 2026\. [Self\-verification dilemma: Experience\-driven suppression of overused checking in LLM reasoning](https://arxiv.org/abs/2602.03485)\. *Preprint*, arXiv:2602\.03485\.
- Math\-AI \(2025\) Math\-AI\. 2025\. Aime24: Math reasoning benchmark\. [https://huggingface\.co/datasets/math\-ai/aime24](https://huggingface.co/datasets/math-ai/aime24)\.
- Math\-AI \(2026\) Math\-AI\. 2026\. Aime25: American invitational mathematics examination 2025\. [https://huggingface\.co/datasets/math\-ai/aime25](https://huggingface.co/datasets/math-ai/aime25)\.
- Mathematical Association of America \(2023\) Mathematical Association of America\. 2023\. [*American Mathematics Competitions \(AMC\) 8, 10A, 10B, 12A, 12B*](https://maa.org/student-programs/amc/)\. Mathematical Association of America, Washington, DC\. Competitions aimed at middle and high school students\.
- Muennighoff et al\. \(2025\) Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei\-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto\. 2025\. [s1: Simple test\-time scaling](https://doi.org/10.18653/v1/2025.emnlp-main.1025)\. In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 20275–20321, Suzhou, China\. Association for Computational Linguistics\.
- Mündler et al\. \(2024\) Niels Mündler, Jingxuan He, Slobodan Jenko, and Martin Vechev\. 2024\. Self\-contradictory hallucinations of large language models: Evaluation, detection and mitigation\. In *The Twelfth International Conference on Learning Representations*\.
- NVIDIA \(2025\) NVIDIA\. 2025\. [Llama\-nemotron: Efficient reasoning models](https://arxiv.org/abs/2505.00949)\. *Preprint*, arXiv:2505\.00949\.
- Open R1 Team \(2025\) Open R1 Team\. 2025\. OpenR1\-Math\-220k\. [https://huggingface\.co/datasets/open\-r1/OpenR1\-Math\-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k)\. Apache 2\.0 License\.
- Peng et al\. \(2025\) Keqin Peng, Liang Ding, Yuanxin Ouyang, Meng Fang, and Dacheng Tao\. 2025\. [Revisiting overthinking in long chain\-of\-thought from the perspective of self\-doubt](https://arxiv.org/abs/2505.23480)\. *Preprint*, arXiv:2505\.23480\.
- Pipis et al\. \(2025\) Charilaos Pipis, Shivam Garg, Vasilis Kontonis, Vaishnavi Shrivastava, Akshay Krishnamurthy, and Dimitris Papailiopoulos\. 2025\. Wait, wait, wait… why do reasoning models loop? *Preprint*, arXiv:2512\.12895\.
- Shao et al\. \(2024\) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y\. K\. Li, Y\. Wu, and Daya Guo\. 2024\. [DeepSeekMath: Pushing the limits of mathematical reasoning in open language models](https://arxiv.org/abs/2402.03300)\. *Preprint*, arXiv:2402\.03300\.
- Sheng et al\. \(2025\) Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu\. 2025\. Hybridflow: A flexible and efficient rlhf framework\. In *Proceedings of the Twentieth European Conference on Computer Systems*, pages 1279–1297\.
- Su et al\. \(2025\) Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie\. 2025\. [Between underthinking and overthinking: An empirical study of reasoning length and correctness in LLMs](https://arxiv.org/abs/2505.00127)\. *Preprint*, arXiv:2505\.00127\.
- Tan et al\. \(2025\) Hongze Tan, Zihan Wang, Jianfei Pan, Jinghao Lin, Hao Wang, Yifan Wu, Tao Chen, Zhihang Zheng, Zhihao Tang, and Haihua Yang\. 2025\. [GTPO and GRPO\-S: Token and sequence\-level reward shaping with policy entropy](https://arxiv.org/abs/2508.04349)\. *Preprint*, arXiv:2508\.04349\.
- Tran et al\. \(2025\) Hieu Tran, Zonghai Yao, and Hong Yu\. 2025\. [Exploiting tree structure for credit assignment in RL training of LLMs](https://arxiv.org/abs/2509.18314)\. *Preprint*, arXiv:2509\.18314\.
- Wang et al\. \(2024\) Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui\. 2024\. [Math\-shepherd: Verify and reinforce LLMs step\-by\-step without human annotations](https://doi.org/10.18653/v1/2024.acl-long.510)\. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 9426–9439, Bangkok, Thailand\. Association for Computational Linguistics\.
- Wang et al\. \(2026\) Xinyan Wang, Xiaogeng Liu, and Chaowei Xiao\. 2026\. [ROM: Real\-time overthinking mitigation via streaming detection and intervention](https://arxiv.org/abs/2603.22016)\. *Preprint*, arXiv:2603\.22016\.
- Wang et al\. \(2025a\) Yining Wang, Jinman Zhao, Chuangxin Zhao, Shuhao Guan, Gerald Penn, and Shinan Liu\. 2025a\. [$\lambda$\-GRPO: Unifying the GRPO frameworks with learnable token preferences](https://arxiv.org/abs/2510.06870)\. *Preprint*, arXiv:2510\.06870\.
- Wang et al\. \(2025b\) Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu\. 2025b\. [Thoughts are all over the place: On the underthinking of o1\-like LLMs](https://arxiv.org/abs/2501.18585)\. *Preprint*, arXiv:2501\.18585\.
- Wei et al\. \(2025\) Zihao Wei, Liang Pang, Jiahao Liu, Jingcheng Deng, Shicheng Xu, Zenghao Duan, Jingang Wang, Fei Sun, Xunliang Cai, Huawei Shen, and Xueqi Cheng\. 2025\. [Stop spinning wheels: Mitigating LLM overthinking via mining patterns for early reasoning exit](https://arxiv.org/abs/2508.17627)\. *Preprint*, arXiv:2508\.17627\.
- Xie et al\. \(2025\) Guofu Xie, Yunsheng Shi, Hongtao Tian, Ting Yao, and Xiao Zhang\. 2025\. [CAPO: Towards enhancing LLM reasoning through generative credit assignment](https://arxiv.org/abs/2508.02298)\. *Preprint*, arXiv:2508\.02298\.
- Yang et al\. \(2026\) Bangji Yang, Hongbo Ma, Jiajun Fan, and Ge Liu\. 2026\. [Batched contextual reinforcement: A task\-scaling law for efficient reasoning](https://arxiv.org/abs/2604.02322)\. *Preprint*, arXiv:2604\.02322\.
- Yang et al\. \(2025\) Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Minghui Chen, Zheng Lin, and Weiping Wang\. 2025\. [Dynamic early exit in reasoning models](https://arxiv.org/abs/2504.15895)\. *Preprint*, arXiv:2504\.15895\.
- Yi et al\. \(2025\) Jingyang Yi, Justin Wang, and Sida Li\. 2025\. [ShorterBetter: Guiding reasoning models to find optimal inference length for efficient reasoning](https://openreview.net/forum?id=MJvwM5dBZM)\. In *Advances in Neural Information Processing Systems*\.
- Yu et al\. \(2025\) Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, and 1 other\. 2025\. [DAPO: An open\-source LLM reinforcement learning system at scale](https://openreview.net/forum?id=2a36EMSSTp)\. In *Advances in Neural Information Processing Systems*\.
- Yue et al\. \(2025\) Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, Tiantian Fan, Zhengyin Du, Xiangpeng Wei, Xiangyu Yu, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, and 8 others\. 2025\. [VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks](https://arxiv.org/abs/2504.05118)\. *Preprint*, arXiv:2504\.05118\.
- Zhou et al\. \(2026\) Shu Zhou, Rui Ling, Junan Chen, Xin Wang, Tao Fan, and Hao Wang\. 2026\. [When more thinking hurts: Overthinking in LLM test\-time compute scaling](https://arxiv.org/abs/2604.10739)\. *Preprint*, arXiv:2604\.10739\.

## Appendix A Signal Implementation Details

##### S1: Repetition\.

Window size 200 tokens, stride 50\. Jaccard similarity over 5\-grams between non\-adjacent window pairs\. Score = max similarity, clipped to $[0,1]$\. Threshold: $>0.4$\.
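The S1 description can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the trace is already tokenized into a list, and it reads "non-adjacent" as "sharing no tokens", which is only one plausible interpretation. The function names are hypothetical.

```python
def ngrams(tokens, n=5):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def repetition_score(tokens, window=200, stride=50, n=5):
    """Maximum n-gram Jaccard similarity over pairs of disjoint windows."""
    starts = list(range(0, max(len(tokens) - window, 0) + 1, stride))
    grams = [ngrams(tokens[s:s + window], n) for s in starts]
    best = 0.0
    for i in range(len(starts)):
        for j in range(i + 1, len(starts)):
            # Assumption: "non-adjacent" means the two windows do not overlap.
            if starts[j] - starts[i] < window:
                continue
            best = max(best, jaccard(grams[i], grams[j]))
    return min(max(best, 0.0), 1.0)


if __name__ == "__main__":
    looping = [str(i) for i in range(100)] * 6   # same 100 tokens repeated
    fresh = [str(i) for i in range(600)]         # no repeated 5-grams
    print(repetition_score(looping), repetition_score(fresh))
```

A trace would then be flagged for repetition when the returned score exceeds the 0.4 threshold stated above.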

##### S2: Hedging\.

Case\-insensitive regex: `wait`, `hmm`, `actually`, `hold on`, `let me reconsider`, `I’m confused`, `not sure`, `on second thought`\. Density = count per 100 tokens\.

##### S3: Abandonment\.

Regex: `this approach is wrong`, `let me try another`, `let’s restart`, `going back to`, `scrapping this`, `dead end`, `alternatively`\. Raw count per trace\.

##### S4: Contradiction\.

Regex: `contradicts the previous`, `which is impossible`, `this is impossible`, `can’t be right`, `that’s not possible`, `inconsistent with`, `but we just showed`\. Raw count per trace\.
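The three phrase-based signals (S2 to S4) can be sketched with Python's standard `re` module. The phrase lists come from the text above; everything else is an assumption made for illustration: word-boundary matching, case-insensitivity for all three lists, normalizing typographic apostrophes, and whitespace tokenization for the per-100-token density.

```python
import re

HEDGING = ["wait", "hmm", "actually", "hold on", "let me reconsider",
           "I'm confused", "not sure", "on second thought"]
ABANDONMENT = ["this approach is wrong", "let me try another", "let's restart",
               "going back to", "scrapping this", "dead end", "alternatively"]
CONTRADICTION = ["contradicts the previous", "which is impossible",
                 "this is impossible", "can't be right", "that's not possible",
                 "inconsistent with", "but we just showed"]


def compile_phrases(phrases):
    """Build one case-insensitive alternation with word boundaries."""
    alternation = "|".join(re.escape(p) for p in phrases)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


HEDGING_RE = compile_phrases(HEDGING)
ABANDONMENT_RE = compile_phrases(ABANDONMENT)
CONTRADICTION_RE = compile_phrases(CONTRADICTION)


def count_matches(pattern, text):
    """Raw count per trace; curly apostrophes are normalized first."""
    return len(pattern.findall(text.replace("\u2019", "'")))


def hedging_density(text):
    """Hedging matches per 100 whitespace-delimited tokens (assumption)."""
    n_tokens = len(text.split())
    return 100.0 * count_matches(HEDGING_RE, text) / n_tokens if n_tokens else 0.0


if __name__ == "__main__":
    sample = ("Wait, actually this approach is wrong. Hmm, let me try another. "
              "That can't be right.")
    print(hedging_density(sample),
          count_matches(ABANDONMENT_RE, sample),
          count_matches(CONTRADICTION_RE, sample))
```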

##### S5: Recomputation\.

Numeric values extracted via `[-+]?\d+(?:\.\d+)?`\. Values appearing 3\+ times within 10\-token computation contexts \(near $=$, $+$, $-$, $\times$\) are flagged\. Count of unique repeated values\.
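One possible reading of the S5 rule is sketched below. The numeric pattern is the one given above; the rest is assumed for illustration: whitespace tokenization, operators appearing as standalone tokens, a context of 10 tokens on either side, and the hypothetical function name `recomputation_count`.

```python
import re
from collections import Counter

NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")
# Assumption: operators are separate tokens; ASCII and typographic variants.
OPERATORS = {"=", "+", "-", "\u2212", "\u00d7", "*"}


def recomputation_count(text, context=10, min_repeats=3):
    """Count distinct numeric values seen >= min_repeats times near an operator."""
    tokens = text.split()
    counts = Counter()
    for i, tok in enumerate(tokens):
        match = NUMBER.fullmatch(tok.strip(".,;:()"))
        if match is None:
            continue
        neighbours = tokens[max(0, i - context):i] + tokens[i + 1:i + context + 1]
        if any(t in OPERATORS for t in neighbours):
            counts[float(match.group())] += 1
    return sum(1 for c in counts.values() if c >= min_repeats)


if __name__ == "__main__":
    trace = ("9 + 15 = 24 so 24 is wrong . again 9 + 15 = 24 . "
             "once more 9 + 15 = 24")
    print(recomputation_count(trace))
```

Numbers that repeat outside any computation context, such as page or problem numbers, are ignored under this reading.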

##### S6: Length outlier\.

$z$\-score of thinking\-section length within the GRPO prompt group\. Flag threshold: $z>2.0$\. Score $=\min((z-2)/2,\ 1)$ for $z>2$, else $0$\.
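A minimal sketch of the S6 score follows, assuming the group's lengths are available as a list and that the population standard deviation is used (the text does not say which estimator applies). The function name is hypothetical.

```python
from statistics import mean, pstdev


def length_outlier_scores(lengths):
    """Per-trace outlier score from within-group z-scores of length."""
    mu = mean(lengths)
    sigma = pstdev(lengths)  # assumption: population standard deviation
    scores = []
    for n in lengths:
        z = (n - mu) / sigma if sigma > 0 else 0.0
        # Score = min((z - 2) / 2, 1) for z > 2, else 0.
        scores.append(min((z - 2.0) / 2.0, 1.0) if z > 2.0 else 0.0)
    return scores


if __name__ == "__main__":
    group = [1000] * 15 + [9000]  # one very long trace in a group of 16
    print(length_outlier_scores(group)[-1])
```

Because the score is computed within each prompt group, a uniformly long group (for example, a hard problem where every rollout is long) produces no outliers.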

##### Composite\.

Signals are normalized to $[0,1]$: $\tilde{s}_{1}=\max(s_{1}-0.2,0)/0.8$, $\tilde{s}_{2}=\min(s_{2}/3,1)$, $\tilde{s}_{3}=\min(s_{3}/3,1)$, $\tilde{s}_{4}=\min(s_{4}/3,1)$, $\tilde{s}_{5}=\min(s_{5}/5,1)$\. Composite $\omega=(\tilde{s}_{1}+\tilde{s}_{2}+\tilde{s}_{3}+\tilde{s}_{4}+\tilde{s}_{5})/5$, clamped to $[0,1]$\. Overthinking flag: $\omega>0.3$\.
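The normalization and averaging above translate directly into code. The sketch below follows the stated formulas term by term, including the fact that only S1 to S5 enter the composite as written; the function names are hypothetical.

```python
def composite_score(s1, s2, s3, s4, s5):
    """Composite overthinking score omega from raw signals S1..S5."""
    n1 = max(s1 - 0.2, 0.0) / 0.8   # repetition, offset by 0.2
    n2 = min(s2 / 3.0, 1.0)         # hedging density
    n3 = min(s3 / 3.0, 1.0)         # abandonment count
    n4 = min(s4 / 3.0, 1.0)         # contradiction count
    n5 = min(s5 / 5.0, 1.0)         # recomputation count
    omega = (n1 + n2 + n3 + n4 + n5) / 5.0
    return min(max(omega, 0.0), 1.0)


def is_overthinking(omega, threshold=0.3):
    """Flag a trace when omega exceeds the threshold."""
    return omega > threshold


if __name__ == "__main__":
    omega = composite_score(s1=0.6, s2=3.0, s3=3, s4=0, s5=0)
    print(omega, is_overthinking(omega))
```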

## Appendix B Signal Motivation and Prior Work

##### S1 \(Repetition\)\.

Reasoning models generate long chains of thought but often loop at low temperatures, repeating the same text \(Pipis et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib26)\)\. Duan et al\. \([2026](https://arxiv.org/html/2607.00482#bib.bib5)\) classify these as circular reasoning loops driven by self\-reinforcing attention mechanisms that prevent escape from local minima\.

##### S2 \(Hedging\) and S3 \(Abandonment\)\.

Zhou et al\. \([2026](https://arxiv.org/html/2607.00482#bib.bib43)\) examine negative flips, that is, cases where extended reasoning changes correct answers to incorrect ones\. They find that explicit reconsideration \(hedging\) precedes approach abandonment in over 67% of negative flip cases, and that abandonment rates increase monotonically with token count\. The “alternatively” marker alone is 4\.1–4\.3$\times$ more common in incorrect Nemotron traces across all scales tested\.

##### S4 \(Contradiction\)\.

Self\-contradiction is a prevalent LLM failure mode: Mündler et al\. \([2024](https://arxiv.org/html/2607.00482#bib.bib22)\) find contradictions in 17\.7% of all ChatGPT sentences\. Yang et al\. \([2026](https://arxiv.org/html/2607.00482#bib.bib38)\) note that extended reasoning chains specifically increase opportunities for self\-contradiction and degenerate outputs\.

##### S5 \(Recomputation\)\.

Duan et al\. \([2026](https://arxiv.org/html/2607.00482#bib.bib5)\) classify “numerical loops,” where models repeatedly derive the same constants, as a distinct loop category triggered by reasoning impasses\. Long et al\. \([2026](https://arxiv.org/html/2607.00482#bib.bib17)\) show through large\-scale analysis that models spend a substantial fraction of reasoning on confirmatory self\-verification that rarely catches errors; suppressing it reduces tokens by 20\.3%\.

##### S6 \(Length outlier\)\.

Our analysis of Nemotron\-4B across difficulty levels reveals a bimodal length distribution on hard problems: responses either solve in $<$2K tokens \(88% accuracy\) or spiral into a 10K–25K token dead zone \(0\.6% accuracy\)\. Within\-group normalization captures this pathology adaptively without penalizing legitimately long solutions\.

## Appendix C Full Overthinking Signals On Nemotron\-4B

![Refer to caption](https://arxiv.org/html/2607.00482v1/x4.png)

Figure 5: Full breakdown of overthinking signals in Nemotron\-4B reasoning traces on AIME 2024 and 2025 combined \(1,920 traces, 32 per problem\)\. Traces are grouped into quintiles by word count, with bars showing the mean signal value for correct \(green\) and incorrect \(red\) traces within each bucket\. Top row: density\-normalized signals \(counts per 100 words\), which control for the trivial effect of longer traces containing more text\. Bottom row: raw counts\. Across all four signal types \(hedging, approach abandonment, self\-contradiction, and numerical recomputation\), incorrect traces exhibit higher density than correct traces of comparable length, with the gap most pronounced for abandonment and contradiction\. Raw counts show that incorrect traces accumulate substantially more overthinking markers in absolute terms, consistent with their longer average length\.

Figure [5](https://arxiv.org/html/2607.00482#A3.F5) extends the main\-text analysis by showing all four count\-based linguistic signals on AIME 2024 and AIME 2025\. The top row reports density\-normalized counts within each length quintile, while the bottom row reports raw counts\. This separation is important: if overthinking were only a length artifact, the density\-normalized gaps between correct and incorrect traces should largely disappear once traces are compared within the same length bucket\. Instead, incorrect traces retain higher signal density across most buckets, especially for approach abandonment and self\-contradiction\. These are precisely the behaviors most associated with harmful post\-answer reflection: the model notices uncertainty, changes direction, and often fails to resolve the inconsistency before continuing\.

The raw\-count panels show the compounding effect of this behavior\. Longer incorrect traces do not merely contain more words; they accumulate more abandonment, contradiction, hedging, and recomputation events in absolute terms\. Thus the long\-tail failures are both quantitatively longer and qualitatively different from successful traces\. Hedging and recomputation are less individually diagnostic than abandonment, but their upward trend in incorrect traces suggests a common failure pattern: the model repeatedly re\-opens decisions it has already made rather than converting verification into commitment\. This supports the central motivation for DASH: reducing overthinking should not mean penalizing all long reasoning, but assigning credit according to whether reflection moves the answer trajectory toward resolution or back into search\.

## Appendix D Contradiction Signal: Discrimination Analysis

The raw s4 count conflates two phenomena:

- Productive contradiction \(predominant in DASH correct traces\): detect inconsistency $\to$ diagnose source $\to$ resolve $\to$ correct answer\.
- Unproductive contradiction \(predominant in GRPO incorrect traces\): notice error $\to$ fail to diagnose $\to$ silently abandon $\to$ repeat\.

| Metric \(mean per trace\) | GRPO Correct | GRPO Incorrect | DASH Correct | DASH Incorrect |
| --- | --- | --- | --- | --- |
| s4: Contradiction | 0\.47 | 2\.86 | 0\.92 | 3\.54 |
| s3: Abandonment | 1\.13 | 2\.50 | 1\.60 | 3\.77 |
| s4/s3 ratio | 0\.41 | 1\.14 | 0\.57 | 0\.94 |

| Metric \(per model\) | GRPO | DASH |
| --- | --- | --- |
| s4 discrim\. \(inc/cor\) | 6\.11$\times$ | 3\.86$\times$ |
| s3 discrim\. \(inc/cor\) | 2\.22$\times$ | 2\.35$\times$ |
| % correct w/ s4 | 26\.1% | 33\.1% |
| % incorrect w/ s4 | 64\.0% | 81\.6% |

Table 4: Contradiction as self\-monitoring vs\. overthinking \(AIME25, 240 traces/model\)\. DASH’s lower discrimination ratio \(3\.9$\times$ vs\. 6\.1$\times$\) indicates contradictions are less predictive of failure: correct traces actively use them for error\-monitoring \(0\.92 vs\. 0\.47/trace\)\. Higher s4/s3 ratio in correct traces \(0\.57 vs\. 0\.41\) shows more diagnosis per approach switch\.

Table [4](https://arxiv.org/html/2607.00482#A4.T4) shows that 33\.1% of DASH correct traces contain contradictions \(vs\. 26\.1% for GRPO\), averaging 0\.92 per trace, nearly double GRPO’s 0\.47\. This indicates DASH learned to use self\-contradiction as a reasoning tool rather than being derailed by it\.
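The discrimination ratio in Table 4 is the mean signal count in incorrect traces divided by the mean in correct traces. The short check below recomputes it from the rounded per-trace means reported in the table; the dictionary layout and function name are assumptions for illustration. Small differences from the reported ratios are expected, presumably because those were computed from unrounded means.

```python
# Rounded per-trace means as reported in Table 4.
MEANS = {
    "GRPO": {"s4": {"correct": 0.47, "incorrect": 2.86},
             "s3": {"correct": 1.13, "incorrect": 2.50}},
    "DASH": {"s4": {"correct": 0.92, "incorrect": 3.54},
             "s3": {"correct": 1.60, "incorrect": 3.77}},
}


def discrimination(signal_means):
    """Incorrect-to-correct ratio of a signal's mean count per trace."""
    return signal_means["incorrect"] / signal_means["correct"]


if __name__ == "__main__":
    for model, signals in MEANS.items():
        for name, signal_means in signals.items():
            print(f"{model} {name}: {discrimination(signal_means):.2f}x")
```

A ratio closer to 1 means the signal occurs at similar rates in correct and incorrect traces, so it says less about whether a trace will fail.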

### D\.1 Qualitative Example: Self\-Correction vs\. Spiraling

GRPO, INCORRECT \(61K chars; 8 abandonments, 0 contradictions\):

- \[44%\] We need integers satisfying these\. Let’s try possible pairs\. $b=36$, $d=-36$ … not satisfied\.
- \[51%\] Let’s try $b=72$, $d=72$ … $-864=-1728$\. Not true\. Next: $b=48$, $d=108$ …
- \[61%\] Let’s try $q=81$, $t=13$ … $-136\neq-1080$\. Next: $q=27$, $t=39$ … $p^{2}=-24$\. Impossible\.
- \[75%\] Let’s try $b=6$: $\neq 36$\. $b=4$: $\neq 36$\. $b=5$: $\neq 36$\.
- \[82%\] Let’s try $m=0.5$ … negative\. $m=0.8$ … positive\. Not helpful …
- \[100%\] \[truncates mid\-computation\]

DASH, CORRECT \(34K chars; 1 abandonment, 4 contradictions\):

- \[8%\] Not on the line connecting centers\. That can’t be\. Let me check: $\omega_{1}$ internally tangent at $B$, so centers must be collinear with $B$\.
- \[23%\] Distance $O_{2}$ to $B$ is 3, but $\omega_{2}$ has radius 15\. That can’t be\. If $\omega_{1}$ inside $\omega_{2}$, then $AO_{2}=15-6=9$ …
- \[37%\] $9+15=24$, but $AB=6$\. This is impossible\. Must re\-examine internal tangency …
- \[59%\] $OB=OA+AB=9+6=15=r_{2}$\. Resolved: $B$ is 6 units beyond $A$ from $O$\.
- \[80–100%\] $a=2b$, $5b^{2}=36$, area $=288/5$\. $\boxed{293}$

Figure 6: AIME25 Problem 20 \(answer = 293\)\. Left: GRPO cycles through 8 blind substitutions \(red\) without diagnosing the geometric misconfiguration, truncating at 61K chars\. Right: DASH encounters 4 contradictions \(blue\), reasons about each, resolves the tangency configuration at 59%, and reaches the answer in 34K chars \(43% shorter\)\.

Figure [6](https://arxiv.org/html/2607.00482#A4.F6) contrasts the two reasoning strategies on AIME25 Problem 20 \(circle geometry\)\. GRPO never identifies the root cause, a misunderstanding of internal tangency, and instead exhaustively substitutes values across 8 different parameterizations until truncation\. DASH explicitly names the geometric impossibility \(“that can’t be”\), iteratively narrows the misunderstanding, resolves it at 59% of the trace, and proceeds linearly to the correct computation\.

## Appendix E Training Dynamics

Figure [7](https://arxiv.org/html/2607.00482#A5.F7) plots KL divergence and entropy over training for all methods\. DASH maintains the lowest KL divergence throughout training, indicating that it achieves length reduction through targeted segment\-level credit assignment rather than aggressive policy deviation, while GRPO \+ Brevity Bonus suffers from entropy collapse \($\to 0.21$\), suggesting a loss of output diversity\.

![Refer to caption](https://arxiv.org/html/2607.00482v1/x5.png)

Figure 7: KL divergence and entropy over training\. DASH maintains the lowest KL divergence, achieving length reduction through targeted credit assignment rather than aggressive policy deviation\. GRPO \+ Brevity Bonus exhibits entropy collapse \($\to 0.21$\); standard GRPO shows gradually increasing entropy and KL\.

## Appendix F Training Configuration

All runs share:

- GRPO with group size $n=16$, learning rate $3\times 10^{-6}$, grad clip 1\.0
- KL penalty: $\beta_{\text{KL}}=10^{-4}$
- Temperature 0\.9, max response length 15,821 tokens
- 4 nodes $\times$ 8 H100 GPUs, 1 epoch

##### DASH configuration\.

- Advantage estimator: `grpo_drift_hybrid`
- Positive/negative scales: $\alpha_{+}=1.0$, $\alpha_{-}=1.0$
- Length penalty: $\alpha=3.0$, $w_{\text{max}}=3.0$
- Neutral mode: `conditional`, $\alpha_{n}=0.1$
- Segment length penalty: enabled

## Appendix G Extended Related Work

### G\.1 Adaptive Reasoning Depth

Fang et al\. \([2025](https://arxiv.org/html/2607.00482#bib.bib6)\) approach overthinking from a model\-selection perspective: their Thinkless framework trains a model to adaptively choose between short\-form and long\-form reasoning via control tokens and a Decoupled GRPO \(DeGRPO\) objective, learning *when* to engage in extended reasoning at all\. Our work differs in granularity: rather than gating reasoning depth at the response level, we shape credit *within* a single reasoning trace\.

### G\.2 Additional Length Control Methods

##### Global vs\. local normalization\.

REINFORCE\+\+ \(Hu, [2025](https://arxiv.org/html/2607.00482#bib.bib11)\) replaces GRPO’s group\-level advantage normalization with global batch\-level normalization, arguing that the former is a biased estimator that interacts poorly with length variation\.

##### Value\-based methods\.

VAPO \(Yue et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib42)\) addresses the length problem within a value\-based PPO framework, introducing Decoupled GAE to handle heterogeneous sequence lengths and a length\-adaptive discount to prevent long traces from dominating value estimates\. While VAPO tackles length heterogeneity through the value function, our method instead operates directly on the policy gradient by reshaping advantages based on detected drift\.

### G\.3 Token\-Level Credit Assignment, Continued

TEMPO \(Tran et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib31)\) builds prefix trees for nonparametric token\-level credit\. SPO \(Guo et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib8)\) bridges token\- and trajectory\-level feedback through mid\-grained, segment\-level advantage estimation using flexible cutpoint\- or tree\-based partitions evaluated via Monte Carlo sampling without a critic model\. These methods provide general\-purpose credit assignment without specifically targeting drift; our approach is complementary in that we use the structure of intermediate answers to assign credit based on the model’s own reasoning trajectory, requiring no additional models, sampling, or learned components\.

### G\.4 Inference\-Time Early Exit, Continued

Certaindex \(Fu et al\., [2026](https://arxiv.org/html/2607.00482#bib.bib7)\) uses answer stability for early stopping, DEER \(Yang et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib39)\) terminates based on confidence, and ROM \(Wang et al\., [2026](https://arxiv.org/html/2607.00482#bib.bib33)\) monitors for real\-time overthinking indicators\. Muennighoff et al\. \([2025](https://arxiv.org/html/2607.00482#bib.bib21)\) take a simpler approach with budget forcing, which controls test\-time compute by forcefully terminating or extending the model’s thinking process via appended tokens\.

## Appendix H Artifact Licenses

We list below the licenses of the scientific artifacts used in this work\. Our use of all artifacts is restricted to non\-commercial research on language\-model reasoning, which is consistent with the intended use specified by each artifact’s authors\.

- Llama\-3\.1\-Nemotron\-Nano\-4B\-v1\.1 \(NVIDIA, [2025](https://arxiv.org/html/2607.00482#bib.bib23)\): released by NVIDIA under the NVIDIA Open Model License, with additional terms from the Llama 3\.1 Community License Agreement\.
- veRL \(Sheng et al\., [2025](https://arxiv.org/html/2607.00482#bib.bib28)\): released by ByteDance Seed under the Apache License 2\.0\.
- OpenR1\-Math\-220K \(Open R1 Team, [2025](https://arxiv.org/html/2607.00482#bib.bib24)\): released by the Hugging Face Open\-R1 team under the Apache License 2\.0\.
- OlympiadBench \(He et al\., [2024](https://arxiv.org/html/2607.00482#bib.bib9)\): released by OpenBMB under the MIT License\.
- AMC 2023, AIME 2024, AIME 2025: these are public mathematics competition problems originally published by the Mathematical Association of America \(MAA\)\. We use them solely as held\-out evaluation benchmarks, consistent with established practice in the reasoning\-LLM literature\.
