Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

arXiv cs.AI paper

Summary

This paper introduces a prefix-level trajectory evaluation protocol to distinguish harmful overthinking from verbose but harmless overthinking in large reasoning models, showing that continued reasoning after reaching the correct answer can destabilize performance. The authors find that stopping at the first correct prefix improves accuracy by up to 21% on multimodal benchmarks, whereas common early-stopping strategies shorten traces without mitigating harmful overthinking. They identify logical drift and visual reinterpretation as the main causes of correctness deviations.

arXiv:2606.02835v1 Announce Type: new

# Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models
Source: [https://arxiv.org/html/2606.02835](https://arxiv.org/html/2606.02835)
Simone Caldarella¹, Davide Talon³, Rahaf Aljundi², Elisa Ricci¹,³, Massimiliano Mancini¹

¹University of Trento, ²Toyota Motor Europe, ³Fondazione Bruno Kessler

###### Abstract

Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test-time compute, yet the assumption that longer reasoning is consistently beneficial remains under-examined. While recent evidence shows that additional reasoning can lead models to overthink, we ask: “*Once a model has reached the correct answer, does further reasoning refine the solution, or deviate from it?*” To study the dynamics after correctness, we introduce a prefix-level trajectory evaluation protocol grounded in reasoning sufficiency, defining the minimum reasoning budget required for a model to first generate the correct answer. This allows us to disentangle *verbose* overthinking, where additional reasoning is redundant but harmless, from *harmful* overthinking, where continued reasoning destabilizes an already-correct trajectory. Starting from multimodal benchmarks, we find that many instances considered reasoning-intensive require surprisingly little reasoning. Moreover, stopping at the first correct prefix improves accuracy over standard reasoning by up to 21%, revealing that current models are limited not only by their ability to reason, but also by their inability to stop at the right time. Furthermore, while common efficiency strategies like early stopping substantially reduce verbose overthinking (up to 50%), they fail to mitigate harmful overthinking. Failure analysis reveals that correctness deviations are mainly driven by logical drift and visual reinterpretation. Finally, we show that our findings generalize to language-only reasoning benchmarks, highlighting harmful overthinking as a broader reliability risk. Code available at [https://simonecaldarella.github.io/thinking-past-the-answer](https://simonecaldarella.github.io/thinking-past-the-answer).

## 1 Introduction

Large Reasoning Models (LRMs), such as OpenAI’s o1 (Jaech et al., [2024](https://arxiv.org/html/2606.02835#bib.bib52)) and DeepSeek’s R1 (Guo et al., [2025](https://arxiv.org/html/2606.02835#bib.bib35)), have shown that allocating additional computation at test time can substantially improve performance on challenging tasks.¹ This paradigm, referred to as *test-time scaling* (Muennighoff et al., [2025](https://arxiv.org/html/2606.02835#bib.bib37)), improves performance by allowing models to produce longer and more deliberative reasoning traces, with gains observed in mathematical problems (Hendrycks et al., [2021](https://arxiv.org/html/2606.02835#bib.bib44); Cobbe et al., [2021](https://arxiv.org/html/2606.02835#bib.bib45)), code generation (Chen et al., [2021](https://arxiv.org/html/2606.02835#bib.bib46); Liu et al., [2023](https://arxiv.org/html/2606.02835#bib.bib47)), and multimodal reasoning (Lu et al., [2023](https://arxiv.org/html/2606.02835#bib.bib48); Wang et al., [2024a](https://arxiv.org/html/2606.02835#bib.bib49)). However, emerging evidence suggests that more reasoning is not always better: LRMs often exhibit systematic *overthinking*, generating reasoning traces substantially longer than necessary to solve a problem (Sui et al., [2025](https://arxiv.org/html/2606.02835#bib.bib34); Liu et al., [2025b](https://arxiv.org/html/2606.02835#bib.bib57); Chen et al., [2025](https://arxiv.org/html/2606.02835#bib.bib32)).

¹ Throughout this paper, we use the term *large reasoning models* to refer jointly to language-only and multimodal models trained to generate explicit intermediate reasoning traces.

Prior work has largely treated overthinking as an efficiency problem, aiming to reduce reasoning cost while preserving the accuracy of full-length chains of thought (CoT) (Shen et al., [2025](https://arxiv.org/html/2606.02835#bib.bib14); Zhang et al., [2025b](https://arxiv.org/html/2606.02835#bib.bib27); Liu et al., [2025a](https://arxiv.org/html/2606.02835#bib.bib26); Lin et al., [2025](https://arxiv.org/html/2606.02835#bib.bib40); Xiao and Gan, [2025](https://arxiv.org/html/2606.02835#bib.bib24); Wang et al., [2025b](https://arxiv.org/html/2606.02835#bib.bib19)). This perspective has also been studied mostly in language-only settings (Cuadron et al., [2025](https://arxiv.org/html/2606.02835#bib.bib54); Wang et al., [2026b](https://arxiv.org/html/2606.02835#bib.bib53)), leaving limited insight into multimodal LRMs, where continued reasoning can introduce visual misreadings or unsupported reinterpretations of the input. In this paper, we argue that this view is incomplete: overthinking is also a reliability problem. A model may reach the correct answer early, continue reasoning, and later revise, contradict, or overwrite that correct solution.

![Refer to caption](https://arxiv.org/html/2606.02835v1/x1.png)
Figure 1: Performance averaged on LRMs. *Actual Length* is the model’s default behavior, *No-CoT* disables intermediate reasoning, and *Instruct Model* is the pre-reasoning instruction-tuned model. Finally, *Optimal Length* stops at the first correct prefix. The gap between *Actual Length* and *Optimal Length* shows that models often reason past correctness, making additional reasoning harmful.

We study this phenomenon through the lens of *reasoning sufficiency*. For a given model and question, we define the question’s difficulty as the minimum reasoning budget required for the model to produce the correct answer. This differs from prior work that proxies difficulty using the average length of model-generated traces (Sui et al., [2025](https://arxiv.org/html/2606.02835#bib.bib34); Shen et al., [2025](https://arxiv.org/html/2606.02835#bib.bib14); Lin et al., [2025](https://arxiv.org/html/2606.02835#bib.bib40)), since trace length can itself be inflated by overthinking. Our formulation isolates the computation minimally required for correctness and separates two forms of overthinking: *verbose overthinking*, where the model reasons beyond the sufficient budget while preserving the correct answer, and *harmful overthinking*, where additional reasoning causes a trajectory that has already reached the correct answer to end with an incorrect final prediction. Under this view, test-time scaling is not monotonically beneficial; additional computation can destabilize an already-correct solution.

To measure these effects, we introduce a *prefix-level* trajectory evaluation protocol. Given a reasoning trace, we evaluate prefix-level performance by forcing the model to produce an answer from that partial trace. This lets us identify when the correct answer first becomes recoverable and whether continued reasoning preserves or loses correctness. Using this protocol, we find that overthinking is substantial and systematic across multimodal benchmarks. Many questions commonly viewed as reasoning-intensive can be solved with surprisingly few reasoning steps, yet models often continue far beyond the sufficient point. As shown in Fig. [1](https://arxiv.org/html/2606.02835#S1.F1), *Optimal Length*, which stops at the first correct prefix, outperforms the model’s default *Actual Length* behavior by nearly 10% on average; this gain exceeds the benefit from reasoning-oriented post-training over the corresponding instruct model. These results suggest that current LRMs are limited not only by whether they can reason, but also by whether they can stop reasoning at the right time.

We further show that harmful overthinking is not tied to a particular answer format or modality. Both multiple-choice and free-form questions exhibit harmful overthinking, with surprisingly stronger effects in the latter setting, where the less constrained output space makes it easier to drift away from a previously correct answer. Language-only experiments show that the same phenomenon also affects unimodal LRMs. Moreover, simply shortening traces is insufficient: early stopping, i.e., terminating the reasoning trace earlier, reduces average reasoning length but fails to mitigate harmful overthinking. Finally, an analysis of 4,842 harmful traces reveals that correctness deviations are dominated by visual and logical errors, while calculation errors account for only a small fraction.

In summary, the contributions of this paper are as follows:

1. We formalize overthinking via the minimum sufficient reasoning budget, disentangling verbose overthinking from harmful overthinking.
2. We introduce a prefix-level evaluation protocol that measures reasoning sufficiency and correctness instability along model trajectories.
3. We quantify harmful overthinking across multimodal and language-only benchmarks, showing that LRMs often drift from early correct answers to incorrect final predictions.
4. We categorize the sources of harmful overthinking and show that correctness deviations are driven mainly by logical and visual errors rather than arithmetic mistakes.

## 2 Formalizing Overthinking via Reasoning Sufficiency

In this section, we formalize overthinking through the lens of reasoning sufficiency. We first define question difficulty as the minimum reasoning budget required for a model to reach a correct answer. We then use this notion to distinguish *verbose* overthinking from *harmful* overthinking.

#### Setting and Notation.

Let $(x, y)$ be a sample with input $x \in \mathcal{X}$ (potentially multimodal) and ground-truth answer $y \in \mathcal{Y}$. We consider a large reasoning model as a generative framework $\mathcal{F}: \mathcal{X} \rightarrow \mathcal{T}$ that, given $x$, produces a reasoning trace $t = \mathcal{F}(x) \in \mathcal{T}$ that includes the predicted answer, where $\mathcal{T}$ denotes the space of possible traces. For consistent evaluation in cases where answer formatting is not followed, we rely on a fixed answer extraction protocol in which a language model $\mathcal{A}: \mathcal{T} \rightarrow \mathcal{Y}$ extracts the prediction from the provided trace, $\hat{y} = \mathcal{A}(t)$. The extractor is implemented as a separate model (Qwen3-4B; Yang et al., [2025a](https://arxiv.org/html/2606.02835#bib.bib65)) that operates solely on the generated reasoning trace.

### 2.1 Problem Difficulty

#### *What does it mean for a problem to be difficult?*

Prior work (Shen et al., [2025](https://arxiv.org/html/2606.02835#bib.bib14); Muennighoff et al., [2025](https://arxiv.org/html/2606.02835#bib.bib37); Lin et al., [2025](https://arxiv.org/html/2606.02835#bib.bib40)) often characterizes difficulty using aggregate proxies such as *pass@k* or average chain-of-thought length (Shen et al., [2025](https://arxiv.org/html/2606.02835#bib.bib14)). These proxies are confounded by decoding policy, sampling strategy, and verbosity, and therefore do not isolate the computation actually required for correctness. We instead define the empirical difficulty of an instance with respect to a model as the minimum reasoning budget (i.e., shortest CoT) sufficient for the model to obtain the correct answer. This separates required reasoning from redundant or harmful continuation.

Formally, we consider $t$ as a sequence of $N$ utterances, $t = (u_1, \dots, u_N)$, where each $u_i$ represents a semantically coherent reasoning step. We denote by $t_{\leq i} = (u_1, \dots, u_i)$ the prefix up to step $i$, with $t_{\leq 0} = \emptyset$ corresponding to no intermediate reasoning, i.e., the model can already answer without any reasoning. Each prefix induces a prediction $\hat{y}_i = \mathcal{A}(t_{\leq i})$. We define the *first correct index* as:

$$
\tau_y(x; \mathcal{F}) = \operatorname*{arg\,min}_{i \in \{0, \dots, N\}} b_i \quad \mathrm{s.t.} \quad \mathcal{A}(t_{\leq i}) = y, \tag{1}
$$

where $b_i$ is the computational budget associated with prefix $t_{\leq i}$, $i = 0, \dots, N$, and the empirical difficulty of the instance is $\hat{\kappa}(x, y; \mathcal{F}) = b_{\tau_y(x; \mathcal{F})}$. If no prefix yields the correct answer, we set $\tau_y = \infty$² and leave $\hat{\kappa}$ undefined for that trajectory. We emphasize that $\hat{\kappa}(x, y; \mathcal{F})$ is not an intrinsic property of the instance alone, but a *model-dependent* difficulty of the sample. This definition, invariant to overall length, captures the minimal computation required for the model to reach a correct answer: once a correct prefix has been reached, extending the reasoning does not change the difficulty. In practice, $\hat{\kappa}$ provides an *empirical* lower bound on the compute required to form the correct answer.

² Mathematicians may forgive us. In practice, when a trace does not reach the correct solution, we set the optimal length equal to the maximum length.
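The first correct index and the empirical difficulty can be sketched in a few lines. This is only an illustration, not the authors' code: it assumes the per-prefix predictions (one per prefix, starting with the empty prefix) have already been extracted, that answers are compared by exact string match, and that budgets are non-decreasing in the prefix index so that the first correct prefix is also the cheapest one. The names `first_correct_index` and `empirical_difficulty` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Union


def first_correct_index(prefix_answers: Sequence[str], gold: str) -> Union[int, float]:
    """Return tau_y: the smallest prefix index whose answer equals gold.

    prefix_answers[i] is the prediction extracted from the prefix with i
    utterances, so index 0 is the empty prefix (no reasoning at all).
    Returns math.inf when no prefix is correct.
    """
    for i, answer in enumerate(prefix_answers):
        if answer == gold:
            return i
    return math.inf


def empirical_difficulty(
    prefix_answers: Sequence[str],
    gold: str,
    budgets: Optional[List[float]] = None,
) -> Optional[float]:
    """Return the budget of the first correct prefix, or None if undefined.

    With budgets=None the budget of prefix i is i (number of utterances).
    Assumption: budgets are non-decreasing in i.
    """
    tau = first_correct_index(prefix_answers, gold)
    if math.isinf(tau):
        return None  # difficulty is left undefined for unsolved trajectories
    return tau if budgets is None else budgets[tau]
```

For a trajectory whose prefix answers are `["B", "A", "A", "C"]` with gold answer `"A"`, the first correct index is 1 even though the final answer is wrong, which is exactly the situation the later definitions separate from merely verbose traces.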

#### On Tokens vs. Utterances.

We instantiate the budget $b_i$ as the number of utterances in $t_{\leq i}$. Unlike token count, utterance-level budgets are less sensitive to formatting and verbosity, and better align with semantically coherent reasoning steps. In practice, we instantiate the reasoning steps by splitting traces at explicit delimiters (line breaks), which LRMs tend to use naturally. We use the generic $b_i$ to make clear that the definition is not tied to a particular notion of budget and the same definitions can be applied to token-level steps. Appendix [B.4](https://arxiv.org/html/2606.02835#A2.SS4) analyzes statistics on utterances and tokens.
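A minimal sketch of the utterance split and the two budget notions follows. The source only states that traces are split at line breaks; dropping blank lines and counting whitespace-separated words as a stand-in for model tokens are assumptions made here for illustration, and the function names are hypothetical.

```python
from typing import List


def split_utterances(trace: str) -> List[str]:
    """Split a reasoning trace into utterances at line breaks.

    Assumption: empty lines carry no reasoning step and are dropped.
    """
    return [line.strip() for line in trace.splitlines() if line.strip()]


def prefix_budgets(utterances: List[str], unit: str = "utterance") -> List[int]:
    """Return b_0..b_N, the budget of every prefix including the empty one.

    unit="utterance": b_i = i.
    unit="token": b_i = cumulative whitespace word count, a crude proxy
    for a real tokenizer (assumption).
    """
    if unit == "utterance":
        return list(range(len(utterances) + 1))
    if unit == "token":
        budgets = [0]
        for utterance in utterances:
            budgets.append(budgets[-1] + len(utterance.split()))
        return budgets
    raise ValueError(f"unknown budget unit: {unit}")


trace = "Look at the diagram.\n\nThe angle is 30 degrees.\nSo the answer is B."
steps = split_utterances(trace)
```

Either budget list can be indexed by the first correct index, which is why the definition of difficulty does not depend on the choice of unit.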

### 2.2 Disentangling Overthinking

#### Verbose vs. Harmful.

Given the first correct index $\tau_y$, we define *overthinking* as any continuation beyond the first correct prefix. That is, all steps $j > \tau_y$ correspond to computation that is not necessary to first obtain the correct answer. Then, by comparing the trace $t_{\leq \tau_y}$ with the full model trace $t_{\leq N}$, we distinguish two cases:

① *Verbose* overthinking corresponds to wasted computation: once the model reaches a correct intermediate state, further reasoning does not change the outcome,

$$
\mathcal{A}(t_{\leq \tau_y}) = y \;\land\; \mathcal{A}(t_{\leq N}) = y. \tag{2}
$$

Here, additional reasoning is redundant. The model has already solved the problem, but continues to generate unnecessary steps without affecting the final prediction.

② *Harmful* overthinking, in contrast, reflects a failure of the reasoning process itself: after reaching a correct answer, additional computation causes the model to deviate from correctness,

$$
\mathcal{A}(t_{\leq \tau_y}) = y \;\land\; \mathcal{A}(t_{\leq N}) \neq y. \tag{3}
$$

In this case, the model initially reaches the correct solution, but subsequent reasoning introduces errors that override it, making the model reply incorrectly. Rather than refining the answer, additional computation destabilizes an otherwise correct trajectory. Crucially, in Sec. [3.2](https://arxiv.org/html/2606.02835#S3.SS2.SSS0.Px4.2) we will show that reducing verbose overthinking does not reduce harmful overthinking, highlighting their orthogonality.
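The two cases can be told apart from the sequence of per-prefix answers alone. The sketch below is illustrative: `classify_overthinking` is a hypothetical helper, answers are compared by exact match, and the extra labels `"unsolved"` (no correct prefix) and `"none"` (the first correct prefix is the full trace, so there is no continuation) are additions made here to cover trajectories the two definitions do not address.

```python
from typing import Sequence


def classify_overthinking(prefix_answers: Sequence[str], gold: str) -> str:
    """Label a trajectory as 'verbose', 'harmful', 'none' or 'unsolved'.

    prefix_answers[i] is the answer extracted from the prefix with i
    utterances; the last entry corresponds to the full trace.
    """
    correct = [answer == gold for answer in prefix_answers]
    if not any(correct):
        return "unsolved"  # tau_y is infinite
    tau = correct.index(True)  # first correct index
    if not correct[-1]:
        return "harmful"  # correct at tau, incorrect at termination
    if tau == len(correct) - 1:
        return "none"  # nothing was generated after the first correct prefix
    return "verbose"  # correct at tau and still correct at termination
```

Note that a trajectory such as `["B", "A", "C", "A"]` is labeled verbose under these definitions, because only the first correct prefix and the final answer are compared; intermediate deviations are captured by the trajectory-level analysis instead.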

#### Harmful Overthinking as Trajectory Instability\.

Definition ② treats harmful overthinking as a binary event: after first reaching a correct answer, the model terminates with an incorrect one. To analyze this behavior along the trajectory, we define the correctness state of each prefix as

$$
z_i = \mathbf{1}[\mathcal{A}(t_{\leq i}) = y]. \tag{4}
$$

Under monotonic reasoning, correctness would be absorbing: once $z_i = 1$, all later states would remain correct. Harmful overthinking corresponds to a violation of this monotonicity.

We therefore define the event-level harmful overthinking indicator as

$$
h(x; \mathcal{F}) = \mathbf{1}\left[\tau_y < \infty \;\wedge\; z_N = 0\right]. \tag{5}
$$

Thus, $h$ captures whether the model reaches a correct prefix but loses correctness by termination. For a dataset $\mathcal{D}$, we report the harmful overthinking rate as the average of this indicator:

$$
H(\mathcal{D}; \mathcal{F}) = \frac{1}{|\mathcal{D}|} \sum_{(x, y) \in \mathcal{D}} h(x; \mathcal{F}). \tag{6}
$$

Sec. [3](https://arxiv.org/html/2606.02835#S3) further analyzes the reasoning trajectory through the probability of remaining correct after $\tau_y$.
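As a concrete reading of the indicator and the dataset-level rate, the sketch below computes both from per-prefix answers. It assumes exact-match comparison and represents each sample as a pair of prefix answers and gold answer; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def harmful_indicator(prefix_answers: Sequence[str], gold: str) -> int:
    """Return 1 if some prefix is correct but the final answer is not."""
    z = [int(answer == gold) for answer in prefix_answers]  # correctness states
    reached_correct = any(z)  # equivalent to a finite first correct index
    return int(reached_correct and z[-1] == 0)


def harmful_rate(dataset: List[Tuple[Sequence[str], str]]) -> float:
    """Average the indicator over all samples, solved or not."""
    return sum(harmful_indicator(p, g) for p, g in dataset) / len(dataset)


toy_dataset = [
    (["B", "A", "C"], "A"),  # harmful: correct at step 1, wrong at the end
    (["B", "A", "A"], "A"),  # verbose: stays correct
    (["C", "C", "C"], "A"),  # never correct
    (["A", "A"], "A"),       # correct without reasoning
]
rate = harmful_rate(toy_dataset)
```

The denominator is the full dataset rather than only the solved samples, matching the normalization by the dataset size in the rate definition.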

## 3 Overthinking in Large Reasoning Models

We now define the main experimental protocol and investigate how reasoning unfolds in practice, relative to the minimum reasoning budget\. Our analysis is guided by three questions: \(i\) how much reasoning is actually required to solve benchmark questions, \(ii\) what happens when models reason beyond this point, and \(iii\) whether reducing reasoning length mitigates potential failures\.

We begin by examining how correct solutions first emerge along the reasoning trajectory, with a focus on the challenging multimodal setting\. Building on this perspective, we then study harmful overthinking, focusing on how additional reasoning can affect correctness\. We further analyze how this phenomenon depends on the answer format, contrasting multiple\-choice and free\-form generation\. To better understand these effects, we adopt a prefix\-level trajectory view and study correctness transitions across reasoning steps, revealing the underlying dynamics of reasoning\. Finally, we evaluate whether reducing verbosity is sufficient to improve reliability, and assess the generality of these behaviors by extending the analysis to language\-only models\.

### 3.1 Experimental Setting

#### Models and Benchmarks.

Building on prior work on overthinking (Xiao and Gan, [2025](https://arxiv.org/html/2606.02835#bib.bib24); Lin et al., [2025](https://arxiv.org/html/2606.02835#bib.bib40)), we analyze recent LRMs for multimodal reasoning: MM-Eureka (Meng et al., [2025a](https://arxiv.org/html/2606.02835#bib.bib62)), R1-VL (Zhang et al., [2025c](https://arxiv.org/html/2606.02835#bib.bib63)), ThinkLite-VL (Wang et al., [2025d](https://arxiv.org/html/2606.02835#bib.bib64)), and VL-Rethinker (Wang et al., [2025a](https://arxiv.org/html/2606.02835#bib.bib41)). We evaluate these models on a diverse set of multimodal benchmarks spanning diagram understanding, visual grounding, mathematical reasoning, and multiple-choice vision-language QA: AI2D (Kembhavi et al., [2016](https://arxiv.org/html/2606.02835#bib.bib59)), MathVista (Lu et al., [2023](https://arxiv.org/html/2606.02835#bib.bib48)), MathVision (Wang et al., [2024a](https://arxiv.org/html/2606.02835#bib.bib49)), MathVerse (Zhang et al., [2024](https://arxiv.org/html/2606.02835#bib.bib42)), MMStar (Chen et al., [2024](https://arxiv.org/html/2606.02835#bib.bib60)), and VMCBench (Zhang et al., [2025e](https://arxiv.org/html/2606.02835#bib.bib61)). For language-only reasoning, we instead consider Qwen3 (Yang et al., [2025a](https://arxiv.org/html/2606.02835#bib.bib65)) and InternS1 (Bai et al., [2025](https://arxiv.org/html/2606.02835#bib.bib36)) on AIME2025 (Zhang and Math-AI, [2025](https://arxiv.org/html/2606.02835#bib.bib56)) and GPQA (Rein et al., [2023](https://arxiv.org/html/2606.02835#bib.bib55)).

#### Reasoning Strategies\.

We evaluate four strategies spanning lower and upper bounds on reasoning performance. *Instruct Model* is the base instruction-tuned model before reasoning-oriented post-training (Zhang et al., [2026](https://arxiv.org/html/2606.02835#bib.bib18)). *No-CoT* forces the reasoning model to answer immediately, without intermediate reasoning. *Actual Length* is the model’s default unconstrained CoT behavior. *Optimal Length* is an oracle strategy that stops at the first correct prefix $t_{\leq \tau_y}$. Since identifying this prefix requires ground-truth access, it is not deployable; rather, it quantifies the gain achievable by eliminating harmful overthinking.

#### Prefix\-Level Trajectory Evaluation\.

Inspired by previous work (Fu et al., [2025](https://arxiv.org/html/2606.02835#bib.bib28); Muennighoff et al., [2025](https://arxiv.org/html/2606.02835#bib.bib37)), we probe intermediate reasoning states by evaluating every utterance-level prefix $t_{\leq i}$ of a generated trace, including the empty prefix $t_{\leq 0}$. As reasoning models usually emit answers only at termination, we append a fixed termination template to each prefix,³ thus forcing the model to provide an answer at intermediate steps. This intervention lets us track correctness across the reasoning trajectory by testing whether each *partial* trace is sufficient to generate a correct answer. See Appendix [A](https://arxiv.org/html/2606.02835#A1.SS0.SSS0.Px2) and [C.1](https://arxiv.org/html/2606.02835#A3.SS1) for more details.

³ “Oh, I suddenly got the answer to the whole problem\. <answer\> \\n\\n \#\#\# Final Answer: \[boxed\{\.”
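The probing loop can be sketched as follows. This is a schematic under stated assumptions, not the authors' implementation: `generate` stands for any function that continues a prompt with the reasoning model, `extract` stands for the answer extractor, the prompt layout is invented for illustration, and the termination template is passed in as a parameter rather than hard-coded.

```python
from typing import Callable, List, Sequence


def probe_prefixes(
    question: str,
    utterances: Sequence[str],
    template: str,
    generate: Callable[[str], str],  # hypothetical model call
    extract: Callable[[str], str],   # hypothetical answer extractor
) -> List[str]:
    """Force an answer after every prefix, including the empty one.

    Returns one extracted answer per prefix, so the result has
    len(utterances) + 1 entries.
    """
    answers = []
    for i in range(len(utterances) + 1):
        partial_trace = "\n".join(utterances[:i])
        # Appending the template cuts the reasoning short and asks for an answer.
        prompt = f"{question}\n{partial_trace}\n{template}"
        forced_completion = generate(prompt)
        answers.append(extract(forced_completion))
    return answers


def fake_generate(prompt: str) -> str:
    """Stub model used only to exercise the loop."""
    return " A" if "30 degrees" in prompt else " B"


answers = probe_prefixes(
    "Q?",
    ["Look.", "The angle is 30 degrees.", "Double-check."],
    "<TERMINATION TEMPLATE>",
    fake_generate,
    str.strip,
)
```

The cost of this protocol grows with the number of utterances, since each prefix triggers its own forced generation.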

![Refer to caption](https://arxiv.org/html/2606.02835v1/x2.png)
Figure 2: Average number of utterances across five multimodal models under Actual Length and Optimal Length. Even on benchmarks typically considered challenging (e.g., MathVision; Wang et al., [2024a](https://arxiv.org/html/2606.02835#bib.bib49)), most solvable instances require little to no intermediate reasoning.

### 3.2 Results

#### How much reasoning is actually required?

We first examine where the optimal stopping point occurs along the reasoning trajectory. Fig. [2](https://arxiv.org/html/2606.02835#S3.F2) compares, for solved instances, the model’s actual reasoning length with the optimal length required to first reach the correct answer. Across benchmarks, optimal lengths are concentrated near the beginning of the trajectory, often at zero utterances, indicating that the model can answer correctly without generating an explicit chain of thought. This is also confirmed by the performance that *No-CoT* achieves across benchmarks (see Fig. [1](https://arxiv.org/html/2606.02835#S1.F1)). In contrast, actual traces extend substantially further. Notably, even on more challenging datasets such as MathVision and MathVerse, where traces are longer than on AI2D or VMCBench, the optimal length remains far below the model’s default reasoning length.

Takeaway A. Reasoning length is a poor proxy for difficulty: LRMs often solve the problem early, then keep generating long traces that are not required for correctness.

#### Reasoning beyond optimal\.

Table 1: Main multimodal results. We report accuracy (acc ↑), average utterance length (len ↓), and harmful-overthinking rate ($H$ ↓). *No-CoT* is a zero-reasoning diagnostic; bolding highlights the best nontrivial reasoning strategy. The gap between *Actual* and *Optimal* shows that LRMs often reason past correctness and degrade final performance. Each cell lists acc / len / $H$.

| Model | Strategy | VMCBench | MathVision | MathVista | MMStar | MathVerse | AI2D |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DualMind-VLM | No-CoT | 79.8 / 0.0 / 0.0 | 24.0 / 0.0 / 0.0 | 69.5 / 0.0 / 0.0 | 62.3 / 0.0 / 0.0 | 40.1 / 0.0 / 0.0 | 82.9 / 0.0 / 0.0 |
| DualMind-VLM | Actual | 80.9 / 5.6 / 4.4 | 26.3 / 18.0 / 21.1 | 74.7 / 11.2 / 6.6 | 64.4 / 6.1 / 7.1 | 49.7 / 19.5 / 11.3 | 83.3 / 3.7 / 3.9 |
| DualMind-VLM | Optimal | 85.3 / 1.8 / 0.0 | 47.4 / 11.6 / 0.0 | 81.3 / 3.9 / 0.0 | 71.5 / 2.2 / 0.0 | 60.9 / 11.3 / 0.0 | 87.2 / 0.6 / 0.0 |
| MM-Eureka | No-CoT | 75.8 / 0.0 / 0.0 | 25.3 / 0.0 / 0.0 | 67.9 / 0.0 / 0.0 | 60.3 / 0.0 / 0.0 | 38.6 / 0.0 / 0.0 | 82.1 / 0.0 / 0.0 |
| MM-Eureka | Actual | 76.4 / 19.6 / 9.6 | 32.9 / 34.0 / 13.5 | 72.8 / 20.0 / 9.6 | 64.0 / 13.1 / 7.6 | 48.2 / 11.9 / 11.3 | 82.8 / 8.6 / 5.0 |
| MM-Eureka | Optimal | 86.0 / 6.2 / 0.0 | 46.4 / 20.1 / 0.0 | 82.4 / 7.5 / 0.0 | 71.6 / 4.6 / 0.0 | 59.5 / 6.8 / 0.0 | 87.8 / 1.5 / 0.0 |
| ThinkLite-VL | No-CoT | 75.7 / 0.0 / 0.0 | 21.4 / 0.0 / 0.0 | 65.5 / 0.0 / 0.0 | 64.1 / 0.0 / 0.0 | 40.1 / 0.0 / 0.0 | 82.7 / 0.0 / 0.0 |
| ThinkLite-VL | Actual | 75.1 / 11.8 / 9.2 | 28.3 / 28.0 / 26.3 | 70.4 / 19.3 / 11.0 | 65.6 / 10.9 / 11.1 | 49.7 / 23.0 / 13.4 | 83.2 / 9.9 / 6.0 |
| ThinkLite-VL | Optimal | 84.3 / 3.2 / 0.0 | 54.6 / 14.9 / 0.0 | 81.4 / 6.3 / 0.0 | 76.7 / 3.6 / 0.0 | 63.0 / 12.2 / 0.0 | 89.2 / 1.5 / 0.0 |
| VL-Rethinker | No-CoT | 77.2 / 0.0 / 0.0 | 28.0 / 0.0 / 0.0 | 70.5 / 0.0 / 0.0 | 62.7 / 0.0 / 0.0 | 40.5 / 0.0 / 0.0 | 82.4 / 0.0 / 0.0 |
| VL-Rethinker | Actual | 79.2 / 19.8 / 7.8 | 33.9 / 36.3 / 24.7 | 73.0 / 26.2 / 11.9 | 63.0 / 17.2 / 13.7 | 51.2 / 29.9 / 12.1 | 83.6 / 15.0 / 7.1 |
| VL-Rethinker | Optimal | 87.0 / 4.4 / 0.0 | 58.6 / 19.5 / 0.0 | 84.9 / 6.1 / 0.0 | 76.7 / 5.1 / 0.0 | 63.3 / 15.2 / 0.0 | 90.7 / 1.9 / 0.0 |
| R1-VL | No-CoT | 70.1 / 0.0 / 0.0 | 26.0 / 0.0 / 0.0 | 52.9 / 0.0 / 0.0 | 54.5 / 0.0 / 0.0 | 26.2 / 0.0 / 0.0 | 79.6 / 0.0 / 0.0 |
| R1-VL | Actual | 71.1 / 19.8 / 8.8 | 26.6 / 45.0 / 23.4 | 62.2 / 38.9 / 14.2 | 58.5 / 15.4 / 12.3 | 52.8 / 46.4 / 15.7 | 80.3 / 11.8 / 7.8 |
| R1-VL | Optimal | 79.9 / 10.3 / 0.0 | 50.0 / 25.1 / 0.0 | 76.4 / 14.1 / 0.0 | 70.8 / 5.4 / 0.0 | 68.5 / 25.0 / 0.0 | 88.1 / 2.0 / 0.0 |

We next quantify the effect of reasoning beyond the first correct step $t_{\leq \tau_y}$. From Tab. [1](https://arxiv.org/html/2606.02835#S3.T1), *Optimal Length* consistently outperforms *Actual Length* across all models and benchmarks (e.g., +23.3% for R1-VL on MathVision and +7.8% for VL-Rethinker on AI2D). The largest gaps occur on harder, lower-accuracy benchmarks such as MathVision and MathVerse. This gap is not merely an efficiency loss. In many cases, the model has already reached the correct answer, but later reasoning causes it to deviate from that answer. Together with Fig. [1](https://arxiv.org/html/2606.02835#S1.F1), these results show that allocating the right amount of reasoning is often more important than simply enabling reasoning: the gap between *Optimal Length* and *Actual Length* is larger than the gain from reasoning-oriented post-training itself. See Appendix [B.5](https://arxiv.org/html/2606.02835#A2.SS5) for an analysis of verbose overthinking.
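The accuracy gap between *Optimal* and *Actual* in Tab. 1 tracks the $H$ column closely, and a short derivation suggests why. It is a sketch that assumes the oracle falls back to the model's own final answer when no prefix is correct (as the footnote on $\tau_y = \infty$ describes), and that probabilities denote empirical fractions over the dataset:

$$
\begin{aligned}
\mathrm{acc}_{\text{Optimal}} &= \Pr[\tau_y < \infty], \qquad \mathrm{acc}_{\text{Actual}} = \Pr[z_N = 1], \\
\{z_N = 1\} \subseteq \{\tau_y < \infty\} \;&\Rightarrow\; \mathrm{acc}_{\text{Optimal}} - \mathrm{acc}_{\text{Actual}} = \Pr[\tau_y < \infty \wedge z_N = 0] = H(\mathcal{D}; \mathcal{F}).
\end{aligned}
$$

For instance, VL-Rethinker on MathVision has $58.6 - 33.9 = 24.7$, which equals its reported $H$ of 24.7; the occasional difference of 0.1 in other cells (such as $60.9 - 49.7 = 11.2$ against $H = 11.3$ for DualMind-VLM on MathVerse) is consistent with rounding.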

Takeaway B. Current LRMs do not merely over-generate reasoning; instead, they frequently reason past correct intermediate states, making optimal stopping substantially more valuable than additional reasoning.

#### Multiple-choice vs. Free-form.

We next ask whether harmful overthinking depends on the answer format. Fig. [3](https://arxiv.org/html/2606.02835#S3.F3) compares multiple-choice (MC) and free-form (FF) questions, aggregated across benchmarks. Harmful overthinking is substantially higher in the free-form setting (0.11 for MC vs. 0.24 for FF), suggesting that early correctness and later deviations are not byproducts of a restricted answer space. If correctness deviations were primarily random answer fluctuations, one would expect multiple-choice tasks to exhibit higher, or at least comparable, early correctness and later answer instability. Surprisingly, we observe the opposite pattern. This suggests that (i) early correct answers are not a byproduct of randomness and (ii) correctness is more stable when the setting involves verification (MC) rather than exploration (FF), leaving correctness in the FF setting more vulnerable to unsupported revisions, reinterpretations, and reasoning drift.

Takeaway C. Free-form generation exposes harmful overthinking more sharply: without a fixed answer set, the unconstrained reasoning is more likely to deviate from correctness.

#### Reasoning Dynamics\.

The previous results measure whether harmful overthinking occurs. We now examine how it occurs by tracking correctness along the reasoning trajectory. At each prefix $t_{\leq i}$, the model is either correct or incorrect, inducing a binary state $z_i = \mathbf{1}[\mathcal{A}(t_{\leq i}) = y]$. If reasoning were monotonic, then reaching a correct state would be absorbing: once $z_i = 1$, subsequent prefixes would remain correct. Instead, Fig. [4](https://arxiv.org/html/2606.02835#S3.F4) shows that correctness is unstable under continued generation. After the first correct prefix $\tau_y$, the probability of remaining correct drops rapidly as additional reasoning steps are generated, plateauing around 0.2 after roughly 100 intermediate steps. This confirms the trajectory-instability view: reasoning does not simply accumulate evidence toward correctness, but can move the model both toward and away from the correct answer.
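One way to estimate such a stability curve from binary correctness states is sketched below. The source does not spell out the estimator, so this is an assumed reading: for each offset after the first correct prefix, it reports the fraction of trajectories that are correct at that offset, among solved trajectories long enough to have a prefix there. The name `stay_correct_curve` is hypothetical.

```python
from typing import List, Optional, Sequence


def stay_correct_curve(
    trajectories: Sequence[Sequence[int]], max_offset: int
) -> List[Optional[float]]:
    """Estimate the fraction still correct k steps after first correctness.

    Each trajectory is a list of 0/1 correctness states, one per prefix.
    Entry k of the result is None when no trajectory reaches offset k.
    """
    curve: List[Optional[float]] = []
    for k in range(max_offset + 1):
        states = []
        for z in trajectories:
            z = list(z)
            if 1 not in z:
                continue  # unsolved: no first correct prefix
            position = z.index(1) + k
            if position < len(z):
                states.append(z[position])
        curve.append(sum(states) / len(states) if states else None)
    return curve


curve = stay_correct_curve(
    [[0, 1, 1, 0], [1, 1, 1, 1], [0, 0, 1, 0], [0, 0, 0, 0]], max_offset=3
)
```

An alternative reading would require correctness at every step up to the offset (a survival curve); the two agree only when correctness, once lost, is never regained.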

Takeaway D. Reasoning trajectories are non-monotonic: after reaching correctness, the probability of staying correct drops rapidly as LRMs keep reasoning.

Table 2: Early stopping and efficient reasoning reduce verbosity, but do not consistently reduce harmful overthinking.

| Setting | Acc ↑ | Len ↓ | $H$ ↓ |
| --- | --- | --- | --- |
| Stopping@∞ | 66.5 | 17.2 | 10.1 |
| Stopping@5 | 64.3 | 8.9 | 18.4 |
| Stopping@2 | 63.2 | 5.1 | 14.1 |
| VL-Rethinker | 66.6 | 23.4 | 11.1 |
| DualMind-VLM | 66.3 | 11.2 | 8.6 |

![Refer to caption](https://arxiv.org/html/2606.02835v1/x3.png)
Figure 3: Distribution of overthinking types across response formats. Bars show the percentage of solved samples exhibiting verbose versus harmful overthinking for multiple-choice (MC) and free-form (FF) settings.

![Refer to caption](https://arxiv.org/html/2606.02835v1/x4.png)
Figure 4: Correctness stability. After first reaching a correct answer at $\tau_y$, the probability of remaining correct decreases sharply with additional reasoning, revealing diminishing reasoning value.

#### Can reducing verbosity mitigate harmful overthinking?

A natural hypothesis is that harmful overthinking is simply a consequence of verbosity: if a model reasons less, it should have fewer opportunities to leave a correct trajectory. We test this hypothesis with two forms of adaptive inference. First, we consider a training-free early-stopping baseline, applied to each model, inspired by prior work (Fu et al., [2025](https://arxiv.org/html/2606.02835#bib.bib28)): after each prefix, we extract the current answer and stop when the prediction remains unchanged for $K$ consecutive steps (i.e., Stopping@K). The setting $K = \infty$ recovers the model’s default behavior, i.e., *Actual Length*. Second, we compare VL-Rethinker (Wang et al., [2025a](https://arxiv.org/html/2606.02835#bib.bib41)), trained to encourage thinking, against DualMind-VLM (Lin et al., [2025](https://arxiv.org/html/2606.02835#bib.bib40)), which is explicitly trained to select whether to use reasoning or not.
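The stopping rule can be sketched over a sequence of per-prefix answers. The exact counting convention is not given in the text, so this assumes one plausible reading: a step counts as "unchanged" when its answer equals the previous prefix's answer, and generation stops once that has happened for the chosen number of consecutive steps. The name `stopping_at_k` is hypothetical.

```python
import math
from typing import Sequence, Union


def stopping_at_k(prefix_answers: Sequence[str], k: Union[int, float]) -> int:
    """Return the prefix index at which Stopping@K would halt.

    With k = math.inf the rule never fires and the full trace is used,
    which corresponds to the default behavior.
    """
    unchanged = 0
    for i in range(1, len(prefix_answers)):
        if prefix_answers[i] == prefix_answers[i - 1]:
            unchanged += 1
        else:
            unchanged = 0  # the answer changed, restart the patience counter
        if unchanged >= k:
            return i
    return len(prefix_answers) - 1


# Stopping early protects a correct answer that is later overwritten...
saved = ["B", "A", "A", "A", "C"]
# ...but can also freeze a wrong answer before the model recovers.
frozen = ["B", "B", "B", "A"]
```

With gold answer `"A"`, the first trace is rescued by a patience of 2 while the second is cut off on a stable wrong answer, which illustrates why shorter traces need not be more reliable.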

Tab. [2](https://arxiv.org/html/2606.02835#S3.T2) shows that both approaches substantially reduce reasoning length. Early stopping with smaller patience values cuts the average length from $17.2$ utterances at $K=\infty$ to $8.9$ for $K=5$ and $5.1$ for $K=2$. Similarly, DualMind-VLM produces much shorter traces than VL-Rethinker ($11.2$ vs. $23.4$ utterances) while maintaining comparable accuracy. However, this reduction in verbosity does not translate into a corresponding reduction in harmful overthinking. In fact, early stopping increases the harmful overthinking rate from $10.1$ at $K=\infty$ to $18.4$ at $K=5$ and $14.1$ at $K=2$, while DualMind-VLM still exhibits non-negligible harmful transitions despite its shorter traces. These results show that verbose and harmful overthinking are distinct failure modes. Reducing the amount of generated reasoning can remove wasted computation, but it does not necessarily make the remaining trajectory more stable. In some cases, aggressive stopping may even truncate useful recovery dynamics while leaving correctness deviations unresolved.

Takeaway E. Efficiency-oriented methods address verbose reasoning, but not correctness instability. *Harmful* overthinking must therefore be measured separately from *verbose* overthinking.
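To make the distinction between the two failure modes concrete, the following sketch labels a single trajectory from its per-prefix correctness flags. The label names, the function `classify_trajectory`, and the tie-breaking choices (for instance, treating a trajectory that wavers but ends correct as verbose) are assumptions of this example, not definitions taken from the paper.

```python
from typing import Sequence


def classify_trajectory(correct_at_prefix: Sequence[bool]) -> str:
    """Label one reasoning trajectory.

    correct_at_prefix[i] is True when the answer extracted from prefix i
    matches the gold answer; the last entry is the full trace.
    """
    flags = list(correct_at_prefix)
    if not flags:
        raise ValueError("empty trajectory")
    if True not in flags:
        return "unsolved"  # never reaches the correct answer
    if not flags[-1]:
        return "harmful"  # was correct at some prefix, ends incorrect
    if flags.index(True) == len(flags) - 1:
        return "sufficient"  # first correct only at the final step
    return "verbose"  # correct earlier and still correct at the end


labels = [
    classify_trajectory([False, False, False]),
    classify_trajectory([False, True, False]),
    classify_trajectory([False, False, True]),
    classify_trajectory([True, True, True]),
]
print(labels)
```

Under these labels, shortening a trace can move a sample from "verbose" to "sufficient" without affecting whether a sample is "harmful", which is one way to read the separation argued for above.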

Table 3: Language-only reasoning results. We report accuracy (acc ↑), average utterance length (len ↓), and harmful-overthinking rate (H ↓). *No-CoT* is a zero-reasoning diagnostic; bolding highlights the best nontrivial reasoning strategy. The pattern also holds for language-only models.

| Model | Strategy | GPQA acc ↑ | GPQA len ↓ | GPQA H ↓ | AIME2025 acc ↑ | AIME2025 len ↓ | AIME2025 H ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3 | No-CoT | 37.0 | 0.0 | 0.0 | 25.0 | 0.0 | 0.0 |
| Qwen3 | Actual | 55.8 | 125.5 | 22.1 | 58.3 | 372.5 | 33.3 |
| Qwen3 | Optimal | 77.9 | 28.9 | 0.0 | 91.7 | 29.9 | 0.0 |
| InternS1 | No-CoT | 37.3 | 0.0 | 0.0 | 11.1 | 0.0 | 0.0 |
| InternS1 | Actual | 64.4 | 177.1 | 20.3 | 38.9 | 514.2 | 33.3 |
| InternS1 | Optimal | 84.7 | 30.9 | 0.0 | 72.2 | 144.9 | 0.0 |

#### Language-Only Reasoning.

Finally, we verify that the pattern is not specific to multimodal reasoning. Tab. [3](https://arxiv.org/html/2606.02835#S3.T3) shows the same qualitative behavior for language-only LRMs on GPQA and AIME2025: default reasoning improves over *No-CoT*, but *Optimal Length* yields much larger gains. For Qwen3, optimal stopping improves accuracy from $55.8$ to $77.9$ on GPQA and from $58.3$ to $91.7$ on AIME2025. Similarly, InternS1 improves from $64.4$ to $84.7$ on GPQA and from $38.9$ to $72.2$ on AIME2025. These gains coincide with large reductions in reasoning length. For example, Qwen3 on AIME2025 drops from $372.5$ to $29.9$ utterances under *Optimal Length*. Thus, our experiments show that harmful overthinking is not an artifact of visual grounding but reflects a broader instability of the reasoning process. See Appendix [B.6](https://arxiv.org/html/2606.02835#A2.SS6) for more results on the language-only setup.
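The gains quoted above can be recomputed directly from the Table 3 entries. The short script below is only an arithmetic check; the dictionary layout and variable names are choices made for this example. It also shows that each accuracy gap matches the harmful-overthinking rate of the corresponding *Actual* row to within rounding, which is what one would expect if $H$ counts trajectories that were correct at some prefix but incorrect at the end. That reading is an inference from the numbers and is not stated in this section.

```python
# (accuracy, average length, harmful-overthinking rate) copied from Table 3.
table3 = {
    ("Qwen3", "GPQA"): {"actual": (55.8, 125.5, 22.1), "optimal": (77.9, 28.9, 0.0)},
    ("Qwen3", "AIME2025"): {"actual": (58.3, 372.5, 33.3), "optimal": (91.7, 29.9, 0.0)},
    ("InternS1", "GPQA"): {"actual": (64.4, 177.1, 20.3), "optimal": (84.7, 30.9, 0.0)},
    ("InternS1", "AIME2025"): {"actual": (38.9, 514.2, 33.3), "optimal": (72.2, 144.9, 0.0)},
}

summary = {}
for key, rows in table3.items():
    acc_actual, len_actual, h_actual = rows["actual"]
    acc_optimal, len_optimal, _ = rows["optimal"]
    summary[key] = {
        # Accuracy gain of Optimal Length over Actual Length, in points.
        "gain": round(acc_optimal - acc_actual, 1),
        # How many times shorter the optimal trace is.
        "length_ratio": round(len_actual / len_optimal, 1),
        # Gap between the gain and the reported H of the Actual row.
        "gain_minus_h": round(acc_optimal - acc_actual - h_actual, 1),
    }

for key, stats in summary.items():
    print(key, stats)
```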

Takeaway F. Harmful overthinking is not merely a byproduct of visual drift or instability in multimodal reasoning: similar patterns also appear in language-only models, even on math-heavy complex benchmarks.

## 4 Why Does Reasoning Become Harmful?

The previous section shows that harmful overthinking is a systematic failure mode. But what causes a model to transition from a correct answer to an incorrect one? In the following, we consider the multimodal setting and categorize the types of errors that arise when models reason beyond the optimal length.

#### Taxonomy\.

For each harmful overthinking trajectory, we now identify the last correct prefix $i^{\star}=\max\{i<N:\mathcal{A}(t_{\leq i})=y\}$ and compare the reasoning state at $t_{\leq i^{\star}}$ with the final trace $t_{\leq N}$. This isolates the segment of reasoning that turns a correct trajectory into an incorrect one. We identify three main failure modes:

① *Visual error.* The model introduces an error by misreading, inventing, or over-interpreting visual evidence. This includes incorrect object recognition, counts, spatial relations, labels, diagram structure, or geometric interpretation.

② *Calculation error.* The model perceives and approaches the problem correctly, but introduces an arithmetic, algebraic, unit-conversion, formula-selection, or numerical-computation error.

③ *Logical error.* The model changes its answer due to a non-visual and non-numerical reasoning failure. This includes unsupported conclusions, contradictions, irrelevant detours, answer-option mismatches, or answer revisions that are not justified by new visual or computational evidence.

Table 4: Failure-mode distribution by model and benchmark. Each triplet reports the percentage of valid harmful overthinking traces assigned to visual (V), calculation (C), or logical (L) errors (highest in bold).

| Model | VMCBench (V / C / L) | MathVision (V / C / L) | MathVista (V / C / L) | MMStar (V / C / L) | MathVerse (V / C / L) | AI2D (V / C / L) |
| --- | --- | --- | --- | --- | --- | --- |
| DualMind-VLM | **50.0** / 16.7 / 33.3 | **45.6** / 14.0 / 40.4 | 38.7 / 16.1 / **45.2** | **46.5** / 9.9 / 43.6 | 27.9 / 20.6 / **51.5** | **53.3** / 1.7 / 45.0 |
| MM-Eureka | 45.0 / 7.5 / **47.5** | **50.0** / 10.5 / 39.5 | 45.6 / 7.6 / **46.8** | **46.8** / 9.0 / 44.1 | 28.4 / 14.3 / **57.3** | 49.0 / 0.6 / **50.3** |
| ThinkLite-VL | **47.4** / 5.3 / **47.4** | **73.8** / 4.8 / 21.4 | **59.1** / 6.8 / 34.1 | **63.8** / 2.9 / 33.3 | 38.4 / 12.4 / **49.2** | **64.5** / 0.0 / 35.5 |
| VL-Rethinker | **46.7** / 8.0 / 45.3 | 41.5 / 9.2 / **49.2** | 39.8 / 14.2 / **46.0** | **50.5** / 3.5 / 46.0 | 25.3 / 16.1 / **58.6** | 45.3 / 0.9 / **53.7** |
| R1-VL | **46.9** / 12.5 / 40.6 | 37.2 / 9.3 / **53.5** | 33.3 / 19.0 / **47.6** | **52.2** / 3.3 / 44.4 | 18.0 / 11.1 / **70.9** | **51.7** / 0.7 / 47.7 |
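The last-correct-prefix definition used in this taxonomy, $i^{\star}=\max\{i<N:\mathcal{A}(t_{\leq i})=y\}$, can be sketched as a backward scan over prefixes. The code is a minimal illustration under stated assumptions: `extract_answer` stands in for the answer-extraction model $\mathcal{A}$, `toy_extractor` is a hypothetical string-matching substitute for it, and prefixes are indexed by the number of reasoning steps they contain (1 to $N-1$).

```python
from typing import Callable, List, Optional, Sequence


def last_correct_prefix(
    steps: Sequence[str],
    extract_answer: Callable[[Sequence[str]], Optional[str]],
    gold: str,
) -> Optional[int]:
    """Return i_star = max{i < N : A(t_<=i) == y}, or None if no such prefix.

    A prefix of length i consists of steps[:i]; N is len(steps).
    """
    n = len(steps)
    for i in range(n - 1, 0, -1):  # scan from the longest proper prefix down
        if extract_answer(steps[:i]) == gold:
            return i
    return None


def toy_extractor(prefix: Sequence[str]) -> Optional[str]:
    """Hypothetical extractor: the last 'answer=' value seen in the prefix."""
    found = None
    for step in prefix:
        if "answer=" in step:
            found = step.split("answer=")[1].strip()
    return found


steps: List[str] = [
    "count the bricks",
    "answer=6",
    "re-check the right side",
    "answer=5",
]
i_star = last_correct_prefix(steps, toy_extractor, gold="6")
deviating_segment = list(steps[i_star:]) if i_star is not None else []
print(i_star, deviating_segment)
```

The pair `(steps[:i_star], steps)` corresponds to the last correct prefix and the final trace that are compared in the analysis; in practice each call to the extractor would involve a model query rather than string matching.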

#### Evaluation Protocol\.

We perform this analysis on harmful overthinking cases from the same multimodal reasoning models and benchmarks considered in Sec. [3](https://arxiv.org/html/2606.02835#S3). For each harmful trajectory, we construct a pair consisting of the last correct prefix and the final incorrect trace. We then use an external judge model (Qwen3.6-35B) to label the dominant failure mode for each harmful trajectory and to provide *evidence* of the error from the original trace. Additional implementation details, including prompt templates and parsing rules, are provided in Appendix [C.2](https://arxiv.org/html/2606.02835#A3.SS2).

### 4.1 Results

#### Quantitative\.

Table [4](https://arxiv.org/html/2606.02835#S4.T4) reports the failure-mode decomposition for harmful overthinking across models and benchmarks. Calculation errors are rarely dominant: they are never the largest failure mode for any model–benchmark pair, and often remain below $10\%$. Instead, harmful overthinking is primarily driven by logical drift and visual reinterpretation. Logical errors are especially prominent on MathVerse and MathVista: on MathVerse, they exceed $50\%$ for four out of five models and reach $70.9\%$ for R1-VL; on MathVista, they are the leading failure mode for four out of five models. Visual errors dominate more strongly on visually grounded benchmarks such as MathVision, MMStar, and AI2D. For example, ThinkLite-VL reaches $73.8\%$ visual errors on MathVision and $64.5\%$ on AI2D, while visual errors are also the leading category for most models on MMStar. Thus, the table suggests two recurring mechanisms behind correctness deviation: logical drift on more abstract reasoning benchmarks, and visual reinterpretation on perception-heavy ones.
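The counts quoted in this paragraph can be checked mechanically against the Table 4 entries. The script below transcribes the table and recomputes them; the data layout and helper name `dominant_modes` are choices made for this example, and ties are treated as shared leadership, which is an assumption.

```python
BENCHMARKS = ["VMCBench", "MathVision", "MathVista", "MMStar", "MathVerse", "AI2D"]
MODES = ("visual", "calculation", "logical")

# (V, C, L) percentages per benchmark, in BENCHMARKS order, copied from Table 4.
TABLE4 = {
    "DualMind-VLM": [(50.0, 16.7, 33.3), (45.6, 14.0, 40.4), (38.7, 16.1, 45.2),
                     (46.5, 9.9, 43.6), (27.9, 20.6, 51.5), (53.3, 1.7, 45.0)],
    "MM-Eureka": [(45.0, 7.5, 47.5), (50.0, 10.5, 39.5), (45.6, 7.6, 46.8),
                  (46.8, 9.0, 44.1), (28.4, 14.3, 57.3), (49.0, 0.6, 50.3)],
    "ThinkLite-VL": [(47.4, 5.3, 47.4), (73.8, 4.8, 21.4), (59.1, 6.8, 34.1),
                     (63.8, 2.9, 33.3), (38.4, 12.4, 49.2), (64.5, 0.0, 35.5)],
    "VL-Rethinker": [(46.7, 8.0, 45.3), (41.5, 9.2, 49.2), (39.8, 14.2, 46.0),
                     (50.5, 3.5, 46.0), (25.3, 16.1, 58.6), (45.3, 0.9, 53.7)],
    "R1-VL": [(46.9, 12.5, 40.6), (37.2, 9.3, 53.5), (33.3, 19.0, 47.6),
              (52.2, 3.3, 44.4), (18.0, 11.1, 70.9), (51.7, 0.7, 47.7)],
}


def dominant_modes(triplet):
    """Return the set of failure modes that attain the maximum share."""
    top = max(triplet)
    return {mode for mode, value in zip(MODES, triplet) if value == top}


mathverse = BENCHMARKS.index("MathVerse")
mathvista = BENCHMARKS.index("MathVista")
mmstar = BENCHMARKS.index("MMStar")

calc_ever_top = any(
    "calculation" in dominant_modes(t) for rows in TABLE4.values() for t in rows
)
logic_over_50_mathverse = sum(rows[mathverse][2] > 50 for rows in TABLE4.values())
logic_leads_mathvista = sum(
    dominant_modes(rows[mathvista]) == {"logical"} for rows in TABLE4.values()
)
visual_leads_mmstar = sum(
    dominant_modes(rows[mmstar]) == {"visual"} for rows in TABLE4.values()
)
print(calc_ever_top, logic_over_50_mathverse, logic_leads_mathvista, visual_leads_mmstar)
```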

Takeaway G. Correctness deviations are mainly driven by logical drift and visual reinterpretation rather than arithmetic mistakes.

#### Qualitative\.

![Refer to caption](https://arxiv.org/html/2606.02835v1/x5.png)

Figure 5: Representative correctness deviations. Each example shows a trajectory that first reaches the correct answer at $\tau_y$, but later changes to an incorrect final answer $t_N$ through a perception, calculation, or logical error. Below each example, an evidence excerpt shows the mistaken step of the reasoning model.

We show representative examples of each failure mode in Fig. [5](https://arxiv.org/html/2606.02835#S4.F5). In the visual-error case, the model first reaches the correct count, $\hat{y}_{i^{\star}}=6$, but later changes its answer to $\hat{y}_{t_{\leq N}}=5$ after introducing the false observation that "on the right side, there are 2 bricks missing." The subsequent arithmetic is consistent, but the visual premise is wrong. In the calculation-error case, the model first gives the correct answer, $\hat{y}_{i^{\star}}=65^{\circ}$, but later outputs $\hat{y}_{t_{\leq N}}=61^{\circ}$. The added reasoning contains a direct numerical error:

$$2x=180^{\circ}-157^{\circ}=23^{\circ},\quad\text{so }x=61^{\circ}.$$

Here the failure is not perceptual but arithmetic, and it is introduced during the continuation. In the logical-error case, the model first correctly answers that krill would decrease if gulls disappeared, but later changes the answer to herring. This contradicts its own explanation, which states that herring would increase. The final answer is therefore unsupported by the model's causal reasoning. Overall, harmful overthinking is not a single error type: different failure modes can each corrupt an already-correct trajectory.

## 5 Related Work

#### Test\-Time Scaling and Reasoning\.

Recent reasoning models derive much of their performance from *test-time scaling*: allocating more inference-time compute via longer chains of thought or larger reasoning budgets often improves accuracy Muennighoff et al. ([2025](https://arxiv.org/html/2606.02835#bib.bib37)); Bai et al. ([2025](https://arxiv.org/html/2606.02835#bib.bib36)); Yang et al. ([2025b](https://arxiv.org/html/2606.02835#bib.bib15)); Guo et al. ([2025](https://arxiv.org/html/2606.02835#bib.bib35)). Similar trends hold in multimodal settings, where structured deliberative traces further boost performance Meng et al. ([2025b](https://arxiv.org/html/2606.02835#bib.bib39)); Zhang et al. ([2025d](https://arxiv.org/html/2606.02835#bib.bib16)); Wang et al. ([2025a](https://arxiv.org/html/2606.02835#bib.bib41), [c](https://arxiv.org/html/2606.02835#bib.bib33)). This line of work largely focuses on average gains from increased compute. In contrast, we study when additional reasoning is unnecessary or harmful, and when longer traces degrade rather than improve predictions.

#### Adaptive Thinking and Early Exit\.

Recent work shows that reasoning models often continue generating after reaching a correct solution, and may even revise correct intermediate states into incorrect answers Chen et al. ([2025](https://arxiv.org/html/2606.02835#bib.bib32)). Early-exit methods stop generation using intermediate predictions, confidence, or learned signals Zhang et al. ([2025a](https://arxiv.org/html/2606.02835#bib.bib31)); Yang et al. ([2026](https://arxiv.org/html/2606.02835#bib.bib30)); Dai et al. ([2025](https://arxiv.org/html/2606.02835#bib.bib29)); Fu et al. ([2025](https://arxiv.org/html/2606.02835#bib.bib28)), while adaptive-thinking methods allocate variable reasoning budgets across examples using proxies such as response length or confidence Shen et al. ([2025](https://arxiv.org/html/2606.02835#bib.bib14)); Zhang et al. ([2025b](https://arxiv.org/html/2606.02835#bib.bib27)); Liu et al. ([2025a](https://arxiv.org/html/2606.02835#bib.bib26)); Taubenfeld et al. ([2025](https://arxiv.org/html/2606.02835#bib.bib25)); Lin et al. ([2025](https://arxiv.org/html/2606.02835#bib.bib40)); Xiao and Gan ([2025](https://arxiv.org/html/2606.02835#bib.bib24)); Wang et al. ([2025b](https://arxiv.org/html/2606.02835#bib.bib19)). Both primarily target unnecessary computation. Our perspective is complementary: we separate *verbose* overthinking, which is wasteful but harmless, from *harmful* overthinking, which degrades correctness, showing that efficiency alone does not address reasoning failures. Closest to our work, Wu et al. ([2026](https://arxiv.org/html/2606.02835#bib.bib58)) show that longer CoTs do not consistently improve performance; we extend this analysis to state-of-the-art reasoning models across language and multimodal benchmarks.

#### Reasoning Compression and No\-Thinking Settings\.

A related line of work questions how much explicit reasoning is required. Prior studies show that reasoning traces can often be compressed, and in some cases removed entirely, without loss in performance Li et al. ([2026](https://arxiv.org/html/2606.02835#bib.bib10)); Ma et al. ([2025](https://arxiv.org/html/2606.02835#bib.bib23)); Li et al. ([2025](https://arxiv.org/html/2606.02835#bib.bib13)); Wang et al. ([2026a](https://arxiv.org/html/2606.02835#bib.bib22)). Our findings align with this view: the key issue is not whether models can reason longer, but whether additional reasoning is useful, redundant, or harmful.

## 6 Conclusion

Test\-time scaling rests on a simple premise: think longer, and performance should improve\. Our results show that this premise is incomplete, offering insights on the overlooked problem of harmful overthinking\. Across multimodal and language\-only benchmarks, LRMs often reach the correct answer before termination, continue generating, and then leave the correct trajectory\. We show that optimal stopping yields large gains, many solvable instances require little or no explicit reasoning, and shorter traces fail to reduce harmful transitions\. Failure analysis shows that these errors rarely stem from arithmetic limitations; they more often arise from logical drift or visual reinterpretation\. We believe these experimental results can stimulate future work on LRMs, focusing not only on making models reason more, but also on helping them*understand when reasoning is sufficient*\.

## Acknowledgments and Disclosure of Funding

The authors acknowledge the CINECA award under the ISCRA initiative for the availability of high performance computing resources and support\. This work was supported by the EU Horizon ELIAS \(No\. 101120237\), ELLIOT \(No\. 101214398\), and TURING \(No\. 101215032\) projects\.

## References

- \[1\] L. Bai, Z. Cai, Y. Cao, M. Cao, W. Cao, C. Chen, H. Chen, K. Chen, P. Chen, Y. Chen, et al. (2025) Intern-s1: a scientific multimodal foundation model. arXiv:2508.15763.
- \[2\] L. Chen, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, J. Wang, Y. Qiao, D. Lin, et al. (2024) Are we on the right way for evaluating large vision-language models? NeurIPS.
- \[3\] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021) Evaluating large language models trained on code. arXiv:2107.03374.
- \[4\] X. Chen, J. Xu, T. Liang, Z. He, J. Pang, D. Yu, L. Song, Q. Liu, M. Zhou, Z. Zhang, et al. (2025) Do not think that much for 2+ 3=? on the overthinking of o1-like llms. ICML.
- \[5\] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv:2110.14168.
- \[6\] A. Cuadron, D. Li, W. Ma, X. Wang, Y. Wang, S. Zhuang, S. Liu, L. G. Schroeder, T. Xia, H. Mao, et al. (2025) The danger of overthinking: examining the reasoning-action dilemma in agentic tasks. arXiv:2502.08235.
- \[7\] M. Dai, C. Yang, and Q. Si (2025) S-grpo: early exit via reinforcement learning in reasoning models. NeurIPS.
- \[8\] Y. Fu, J. Chen, S. Zhu, Z. Fu, Z. Dai, Y. Zhuang, Y. Ma, A. Qiao, T. Rosing, I. Stoica, et al. (2025) Efficiently scaling llm reasoning with certaindex. NeurIPS.
- \[9\] D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. Nature.
- \[10\] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. NeurIPS.
- \[11\] A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. (2024) Openai o1 system card. arXiv:2412.16720.
- \[12\] A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi (2016) A diagram is worth a dozen images. ECCV.
- \[13\] J. Li, R. Li, Y. Zhou, B. Ma, and J. Z. Pan (2026) Chain of thought compression: a theoritical analysis. arXiv:2601.21576.
- \[14\] M. Li, J. Zhong, S. Zhao, Y. Lai, H. Zhang, W. B. Zhu, and K. Zhang (2025) To think or not to think: a study of thinking in rule-based visual reinforcement fine-tuning. NeurIPS.
- \[15\] C. Lin, C. Chi, J. Wu, S. Li, and K. Zhou (2025) Learning to think fast and slow for visual language models. arXiv:2511.16670.
- \[16\] J. Liu, C. S. Xia, Y. Wang, and L. Zhang (2023) Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. NeurIPS.
- \[17\] W. Liu, J. Xu, F. Yu, Y. Lin, K. Ji, W. Chen, Y. Xu, Y. Wang, L. Shang, and B. Wang (2025) Qfft, question-free fine-tuning for adaptive reasoning. NeurIPS.
- \[18\] Y. Liu, J. Wu, Y. He, R. Gong, J. Xia, L. Li, H. Gao, H. Chen, B. Bi, J. Zhang, et al. (2025) Efficient inference for large reasoning models: a survey. arXiv:2503.23077.
- \[19\] P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K. Chang, M. Galley, and J. Gao (2023) Mathvista: evaluating mathematical reasoning of foundation models in visual contexts. ICLR.
- \[20\] W. Ma, J. He, C. Snell, T. Griggs, S. Min, and M. Zaharia (2025) Reasoning models can be effective without thinking. arXiv:2504.09858.
- \[21\] B. W. Matthews (1975) Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure.
- \[22\] F. Meng, L. Du, Z. Liu, Z. Zhou, Q. Lu, D. Fu, T. Han, B. Shi, W. Wang, J. He, et al. (2025) Mm-eureka: exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv:2503.07365.
- \[23\] F. Meng, L. Du, Z. Liu, Z. Zhou, Q. Lu, D. Fu, T. Han, B. Shi, W. Wang, J. He, K. Zhang, P. Luo, Y. Qiao, Q. Zhang, and W. Shao (2025) MM-eureka: exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv:2503.07365.
- \[24\] N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. B. Hashimoto (2025) S1: simple test-time scaling. EMNLP.
- \[25\] D. Rein, B. Hou, A. Stock, W. Liu, A. Mandlekar, A. Ghodsi, D. Bahri, F. Zhou, A. Mehra, E. Yiu, et al. (2023) GPQA: a graduate-level google-proof q&a benchmark. COLM.
- \[26\] Y. Shen, J. Zhang, J. Huang, S. Shi, W. Zhang, J. Yan, N. Wang, K. Wang, Z. Liu, and S. Lian (2025) Dast: difficulty-adaptive slow-thinking for large reasoning models. EMNLP, pp. 2322–2331.
- \[27\] C. Spearman (1961) The proof and measurement of association between two things.
- \[28\] Y. Sui, Y. Chuang, G. Wang, J. Zhang, T. Zhang, J. Yuan, H. Liu, A. Wen, S. Zhong, N. Zou, et al. (2025) Stop overthinking: a survey on efficient reasoning for large language models. arXiv:2503.16419.
- \[29\] A. Taubenfeld, T. Sheffer, E. Ofek, A. Feder, A. Goldstein, Z. Gekhman, and G. Yona (2025) Confidence improves self-consistency in llms. Findings-ACL 2025.
- \[30\] H. Wang, C. Qu, Z. Huang, W. Chu, F. Lin, and W. Chen (2025) VL-rethinker: incentivizing self-reflection of vision-language models with reinforcement learning. NeurIPS.
- \[31\] K. Wang, J. Pan, W. Shi, Z. Lu, H. Ren, A. Zhou, M. Zhan, and H. Li (2024) Measuring multimodal mathematical reasoning with math-vision dataset. NeurIPS.
- \[32\] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. (2024) Qwen2-vl: enhancing vision-language model's perception of the world at any resolution. arXiv:2409.12191.
- \[33\] X. Wang, S. Feng, Y. Li, P. Yuan, Y. Zhang, C. Tan, B. Pan, Y. Hu, and K. Li (2025) Make every penny count: difficulty-adaptive self-consistency for cost-efficient reasoning. Findings-NAACL.
- \[34\] X. Wang, N. Joshi, B. Plank, R. Angell, and H. He (2026) Is it thinking or cheating? detecting implicit reward hacking by measuring reasoning effort. ICLR.
- \[35\] X. Wang, N. Joshi, B. Plank, R. Angell, and H. He (2026) Is it thinking or cheating? detecting implicit reward hacking by measuring reasoning effort. ICLR.
- \[36\] X. Wang, Z. Yang, C. Feng, H. Lu, L. Li, C. Lin, K. Lin, F. Huang, and L. Wang (2025) Sota with less: mcts-guided sample selection for data-efficient visual reasoning self-improvement. NeurIPS.
- \[37\] X. Wang, Z. Yang, C. Feng, H. Lu, L. Li, C. Lin, K. Lin, F. Huang, and L. Wang (2025) SoTA with less: mcts-guided sample selection for data-efficient visual reasoning self-improvement. NeurIPS.
- \[38\] Y. Wu, Y. Wang, Z. Ye, T. Du, S. Jegelka, and Y. Wang (2026) When more is less: understanding chain-of-thought length in llms. ICLR.
- \[39\] W. Xiao and L. Gan (2025) Fast-slow thinking GRPO for large vision-language model reasoning. NeurIPS.
- \[40\] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv:2505.09388.
- \[41\] C. Yang, Q. Si, Y. Duan, Z. Zhu, C. Zhu, Q. Li, M. Chen, Z. Lin, and W. Wang (2026) Dynamic early exit in reasoning models. ICLR.
- \[42\] S. Yang, Y. Tong, X. Niu, G. Neubig, and X. Yue (2025) Demystifying long chain-of-thought reasoning. ICML.
- \[43\] A. Zhang, Y. Chen, J. Pan, C. Zhao, A. Panda, J. Li, and H. He (2025) Reasoning models know when they're right: probing hidden states for self-verification. COLM.
- \[44\] J. Zhang, N. Lin, L. Hou, L. Feng, and J. Li (2025) Adaptthink: reasoning models can learn when to think. EMNLP.
- \[45\] J. Zhang, J. Huang, H. Yao, S. Liu, X. Zhang, S. Lu, and D. Tao (2025) R1-vl: learning to reason with multimodal large language models via step-wise group relative policy optimization. ICCV.
- \[46\] J. Zhang, J. Huang, H. Yao, S. Liu, X. Zhang, S. Lu, and D. Tao (2025) R1-vl: learning to reason with multimodal large language models via step-wise group relative policy optimization. ICCV.
- \[47\] R. Zhang, D. Jiang, Y. Zhang, H. Lin, Z. Guo, P. Qiu, A. Zhou, P. Lu, K. Chang, Y. Qiao, et al. (2024) Mathverse: does your multi-modal llm truly see the diagrams in visual math problems? ECCV.
- \[48\] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li, R. Hu, T. Zhang, G. Wang, et al. (2026) Instruction tuning for large language models: a survey. ACM.
- \[49\] Y. Zhang and T. Math-AI (2025) American invitational mathematics examination (aime) 2025. Note: [https://huggingface.co/datasets/math-ai/aime25](https://huggingface.co/datasets/math-ai/aime25)
- \[50\] Y. Zhang, Y. Su, Y. Liu, X. Wang, J. Burgess, E. Sui, C. Wang, J. Aklilu, A. Lozano, A. Wei, et al. (2025) Automated generation of challenging multiple-choice questions for vision language model evaluation. CVPR.

## Supplementary Material Overview

This appendix is organized in four macro blocks complementing the discussion in the main paper. First, Appendix [A](https://arxiv.org/html/2606.02835#A1) provides robustness analyses for the proposed difficulty-based reasoning budget, testing the sensitivity of the first correct index estimation to sampling seeds, termination prompts, and answer extraction models. Second, Appendix [B](https://arxiv.org/html/2606.02835#A2) provides additional quantitative results, including token-level budget statistics, verbose overthinking analysis, and language-only evaluations. Third, Appendix [C](https://arxiv.org/html/2606.02835#A3) reports implementation and reproducibility details, such as prompt templates, parsing rules, compute accounting, and the failure-analysis categorization. Finally, Appendix [D](https://arxiv.org/html/2606.02835#A4) discusses the limitations and possible future work.

## Appendix A Robustness Study for Difficulty-Based Reasoning Budgets

The prefix\-level trajectory evaluation protocol estimates an example’s difficulty by identifying the earliest point in a model’s reasoning trace from which the correct answer can be recovered\. In this section, we test whether that estimate is robust to procedural choices\. In particular, we measure sensitivity to three factors: the sampling seed used to generate the trace, the termination prompt used for prefix\-level probing, and the answer\-extraction model used to parse the answer\.

#### Model and benchmark\.

We run the robustness study with VL\-Rethinker on MathVision\. For each condition, the model first generates a full reasoning trace for every benchmark example\. We use three random seeds to measure sensitivity to stochastic generation\. Raw generations are saved before answer extraction so that answer parsing can be repeated independently with different extraction models\.

#### Termination prompt variants\.

The difficulty pipeline probes partial reasoning traces by appending a termination prompt that asks the model to stop deliberating and provide a final answer. We compare two variants, shown in Fig. [14](https://arxiv.org/html/2606.02835#A3.F14): the default prompt used in our pipeline and a reworded version with the same intent and similar length. This tests whether the estimated difficulty is sensitive to a specific stop-and-answer phrase rather than reflecting the content of the reasoning trace.

#### Answer extraction variants\.

Because benchmark accuracy is computed from parsed final answers, we also vary the answer-extraction model $\mathcal{A}$. We compare `Qwen/Qwen3-4B-Instruct-2507` and `Qwen/Qwen3.5-4B` and report the answer extraction prompt in Fig. [15](https://arxiv.org/html/2606.02835#A3.F15). These models are used only after generation: first to parse the full CoT outputs, and later to parse the intermediate answers. This separation avoids loading the extractor during expensive VL-Rethinker generation runs and isolates parser-induced variance from reasoning-model variance.

#### Experimental design\.

The study uses three seeds, two termination prompts, and two answer extractors\. For each condition, we generate raw traces, apply the corresponding answer extractor, run prefix\-level difficulty probing, and evaluate correctness at each probed prefix\.
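The resulting grid of conditions, and the way pairs of conditions can be grouped by which factors differ, can be enumerated as follows. This sketch assumes a full factorial design ($3 \times 2 \times 2 = 12$ conditions), which the text suggests but does not state explicitly; the seed values and the prompt labels `"default"` and `"reworded"` are placeholders.

```python
from itertools import combinations, product

SEEDS = (0, 1, 2)  # placeholder values; only the number of seeds is given
PROMPTS = ("default", "reworded")  # labels for the two termination prompts
EXTRACTORS = ("Qwen/Qwen3-4B-Instruct-2507", "Qwen/Qwen3.5-4B")

# Every (seed, prompt, extractor) combination is one experimental condition.
conditions = list(product(SEEDS, PROMPTS, EXTRACTORS))


def differing_factors(a, b):
    """Names of the factors on which two conditions disagree."""
    names = ("seed", "prompt", "extractor")
    return tuple(name for name, x, y in zip(names, a, b) if x != y)


# Group all unordered pairs of conditions by the factors that changed.
groups = {}
for a, b in combinations(conditions, 2):
    groups.setdefault(differing_factors(a, b), []).append((a, b))

for factors, pairs in sorted(groups.items()):
    print(factors, len(pairs))
```

Grouping pairs this way makes it possible to report agreement separately for seed-only, prompt-only, and extractor-only changes, as well as for combined changes.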

#### Correlation analysis\.

To assess robustness, we compute pairwise agreement between conditions over the vector of first-correct budgets $\{b_{\tau_y}\}$ using Spearman correlation (Spearman, [1961](https://arxiv.org/html/2606.02835#bib.bib66)). High agreement across seeds indicates that the difficulty estimate is not dominated by sampling noise. High agreement across termination prompts indicates that the probing method is not overly sensitive to the exact stop-and-answer phrasing. High agreement across answer extractors indicates that the signal is not primarily an artifact of the parser. We also report agreement between runs on the answer extracted at the actual length, $\{z_N\}$, using the Matthews Correlation Coefficient (MCC) (Matthews, [1975](https://arxiv.org/html/2606.02835#bib.bib67)). High MCC indicates that the considered conditions tend to agree on which examples are answered correctly at the end of the full trace.
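Both agreement measures can be computed with the standard library alone. The sketch below is illustrative rather than the authors' implementation: the helper names are hypothetical, the data are toy values, and dropping examples that are unsolved in either run is an assumption, since the text does not say how such examples enter the correlation.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else float("nan")


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))


def drop_unsolved(a, b):
    """Assumption: keep only examples with a first-correct budget in both runs."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def mcc(a, b):
    """Matthews correlation between two boolean correctness vectors."""
    tp = sum(1 for x, y in zip(a, b) if x and y)
    tn = sum(1 for x, y in zip(a, b) if not x and not y)
    fp = sum(1 for x, y in zip(a, b) if not x and y)
    fn = sum(1 for x, y in zip(a, b) if x and not y)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy first-correct budgets for two conditions (None = unsolved).
budgets_a = [0, 3, 7, None, 2]
budgets_b = [1, 4, 9, 5, None]
rho = spearman(*drop_unsolved(budgets_a, budgets_b))

# Toy final-answer correctness for the same two conditions.
final_a = [True, True, False, False]
final_b = [True, False, False, False]
agreement = mcc(final_a, final_b)
```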

#### Interpretation\.

The robustness results in Fig. [6](https://arxiv.org/html/2606.02835#A1.F6) show that the difficulty-based budget estimate is highly stable across the considered procedural variations. The Spearman correlations for the estimated optimal budget remain consistently high across all comparison groups, indicating that examples are ranked similarly by difficulty even when changing the seed, termination prompt, or answer extractor. Varying only the random seed yields high agreement, while changing the termination prompt introduces the largest drop.

A similarly stable pattern is observed for final\-answer correctness at the actual reasoning length\. MCC values remain close to one across all conditions, meaning that the same examples tend to be classified as correct or incorrect at the end of the full trace\. The slightly lower agreement when both the answer extractor and the termination prompt change indicates that final correctness is more sensitive to parser choice and termination wording, but the effect remains small overall\.

Overall, these results support the reliability of the prefix\-level trajectory protocol\. The estimated first\-correct budgets are not artifacts of a particular sampling seed, stop\-and\-answer prompt, or extraction model\. Instead, the high correlations suggest that the measured reasoning sufficiency signal is largely tied to the underlying reasoning trajectory\.

![Refer to caption](https://arxiv.org/html/2606.02835v1/x6.png)

Figure 6: Robustness of the difficulty analysis across controlled sources of variation. The left panel reports Spearman correlation of the estimated optimal budget $b_{\tau_y}$. The right panel reports the Matthews Correlation Coefficient for correctness of the final predicted answer $z_N$. High optimal-budget correlations indicate that examples are ranked similarly by difficulty across conditions, while high final-correctness MCC indicates that the same examples tend to be correct or incorrect at the actual reasoning length. The comparison groups show the impact of jointly varying procedural factors.

## Appendix B Additional Analysis

We provide additional analyses that complement the main results and further characterize harmful overthinking and the minimum reasoning budget for a model to answer a question\.

### B.1 Correlation Between Optimal Length and No-CoT Among Models

![Refer to caption](https://arxiv.org/html/2606.02835v1/x7.png)

Figure 7: Cross-model Spearman correlation in estimated *Optimal Length*. Correlation on the exact optimal length is moderately high, indicating that different LRMs share a notion of problem difficulty.

Fig. [7](https://arxiv.org/html/2606.02835#A2.F7) studies whether estimated reasoning requirements are consistent across different LRMs. Spearman's correlation on *Optimal Length* is moderately high. This suggests that, despite model-specific differences in how long models reason, they often agree on the number of reasoning steps required to solve a problem. This supports our central claim that reasoning length is a poor proxy for benchmark difficulty: many examples that elicit long traces are nevertheless perceived by several models as solvable with little or no explicit reasoning.

### B.2 Optimal Length vs. Test-Time Scaling

![Refer to caption](https://arxiv.org/html/2606.02835v1/x8.png)

Figure 8: *Optimal Length* scaling compared with standard test-time (*Actual Length*) scaling and *Pass@K*. Increasing test-time compute improves performance, but remains below Optimal Length, which stops each trajectory at the first correct prefix. The gap shows that models often already contain the correct answer before termination, but fail to stop before later reasoning deviates from correctness.

![Refer to caption](https://arxiv.org/html/2606.02835v1/x9.png)

Figure 9: Prefix-level correctness transitions. Rows indicate whether the answer after prefix $i$ is correct or wrong, and columns indicate the answer after prefix $i+1$. The off-diagonal correct-to-wrong mass measures correctness deviations, showing that reasoning is not monotonic once a model has reached the correct answer.

Fig. [8](https://arxiv.org/html/2606.02835#A2.F8) contrasts *Optimal Length* with conventional test-time scaling. The test-time scaling curve, represented by *Actual Length*, improves as additional samples or longer computations are allocated, but remains below the oracle *Optimal Length* strategy, which stops each trajectory at its first correct prefix. This comparison shows that the limitation is not only whether the model can produce the correct answer at some point, but also whether it can preserve that answer until termination. *Pass@K* provides an intermediate diagnostic: the correct answer is often present in the trajectory before the considered average length, but not always at the final utterance, corroborating the findings in Sec. [3](https://arxiv.org/html/2606.02835#S3).
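The gap between the two strategies can be illustrated on toy data. In this sketch each trajectory is represented as a list of per-prefix correctness flags whose last entry stands for the answer at the actual length; this representation and the function names are assumptions made for illustration, and the numbers are not results from the paper.

```python
def actual_length_accuracy(trajectories):
    """Fraction of trajectories whose final (full-trace) answer is correct."""
    return sum(1 for t in trajectories if t[-1]) / len(trajectories)


def optimal_length_accuracy(trajectories):
    """Oracle accuracy: a trajectory counts if any of its prefixes is correct,
    because the oracle stops at the first correct prefix."""
    return sum(1 for t in trajectories if any(t)) / len(trajectories)


# Toy per-prefix correctness flags (index 0 is the empty prefix).
trajectories = [
    [False, True, True],    # solved early and preserved
    [False, True, False],   # solved, then lost: harmful overthinking
    [False, False, False],  # never solved
    [True, True, True],     # solved without any reasoning
]

actual = actual_length_accuracy(trajectories)
oracle = optimal_length_accuracy(trajectories)
gap = oracle - actual  # accuracy lost by reasoning past a correct prefix
```

By construction the oracle accuracy can never fall below the actual-length accuracy, since a correct final answer is itself a correct prefix.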

### B.3 Transition Matrix of Trajectories

The transition matrix in Fig. [9](https://arxiv.org/html/2606.02835#A2.F9) highlights the non-monotonic nature of reasoning trajectories moving from $t_{\leq i}$ to $t_{\leq i+1}$. If correctness were absorbing, then once a prefix was correct, later prefixes would almost always remain correct. Instead, a non-trivial number of trajectories transition from correct to wrong, showing that additional reasoning can mislead a correct intermediate solution. This is precisely the harmful-overthinking phenomenon studied in the main paper. The matrix also shows that trajectories are more likely to remain wrong than correct, further emphasizing the instability of reasoning once models leave the correct state.
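A transition matrix of this kind can be tallied from per-prefix correctness flags. The sketch below uses toy data and hypothetical function names; row normalization is an assumption, since the text does not say whether the figure reports counts or proportions.

```python
from collections import Counter


def transition_counts(trajectories):
    """Count (correct at prefix i, correct at prefix i+1) pairs."""
    counts = Counter()
    for flags in trajectories:
        for prev, nxt in zip(flags, flags[1:]):
            counts[(prev, nxt)] += 1
    return counts


def row_normalize(counts):
    """Turn counts into per-row proportions (assumed normalization)."""
    matrix = {}
    for prev in (True, False):
        total = counts[(prev, True)] + counts[(prev, False)]
        for nxt in (True, False):
            matrix[(prev, nxt)] = counts[(prev, nxt)] / total if total else 0.0
    return matrix


# Toy per-prefix correctness flags for three trajectories.
trajectories = [
    [False, True, True],
    [False, True, False],
    [False, False, False],
]

counts = transition_counts(trajectories)
matrix = row_normalize(counts)
correct_to_wrong = matrix[(True, False)]  # mass of correctness deviations
```

If correctness were absorbing, `matrix[(True, False)]` would be zero; any positive value indicates trajectories that abandon a correct answer.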

### B.4 Utterances and Tokens

![Refer to caption](https://arxiv.org/html/2606.02835v1/x10.png)
![Refer to caption](https://arxiv.org/html/2606.02835v1/x11.png)

Figure 10: Token-level statistics for utterance-based reasoning budgets. Left: distribution of the number of tokens per utterance, showing that most reasoning steps are short but that occasional long utterances create a heavy tail. Right: token-budget distributions under *Actual Length* and *Optimal Length*; actual traces consume substantially more tokens than the first-correct prefixes, confirming that the utterance-level overthinking effect also appears at the token level.

Our main analysis uses utterances rather than raw tokens as the unit of reasoning budget. An utterance is a semantically coherent logical step in the generated trace, obtained by splitting the trace along explicit line-break delimiters (`\n\n` and `\n`) that LRMs naturally use when producing multi-step reasoning. This choice makes the budget less sensitive to formatting artifacts, local verbosity, and tokenizer-specific conventions. For example, two models may express the same intermediate step with different numbers of tokens, while both still represent a single reasoning transition in the trajectory.
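One way to realize this split is sketched below. It is an assumption about the implementation, not the released code: each utterance keeps its trailing newlines so that concatenating a prefix of utterances reproduces the original text, which matches how prefixes are joined in Algorithm 1.

```python
import re


def split_utterances(trace):
    """Split a trace at line breaks, keeping trailing newlines on each piece.

    Each utterance is a run of non-newline characters followed by any
    newlines that come directly after it, so single and double line
    breaks both act as delimiters.
    """
    return re.findall(r"[^\n]+\n*", trace)


trace = (
    "Step 1: read the figure.\n\n"
    "Step 2: compute 3 + 4 = 7.\n"
    "So the answer is 7."
)
utts = split_utterances(trace)
print(len(utts))  # 3
```

Leading newlines, if any, are dropped by this pattern, and whitespace-only lines count as utterances; a production splitter may want to handle both cases differently.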

Fig. [10](https://arxiv.org/html/2606.02835#A2.F10) reports the relationship between utterance-level and token-level budgets. The left panel shows the distribution of tokens per utterance. Most utterances are short, but the distribution has a long tail, indicating that token count can be strongly affected by unusually verbose individual steps. The right panel compares token budgets under actual length and optimal length. The same qualitative pattern observed with utterances also appears at the token level: actual traces allocate substantially more computation than is required to first reach the correct answer. Thus, our conclusions are not an artifact of measuring compute in utterances. Utterances provide a cleaner trajectory step abstraction, while token statistics confirm that the gap between actual and sufficient reasoning remains visible under a lower-level compute measure.

### B.5 On Verbose Overthinking

The main paper separates harmful overthinking from verbose overthinking\. Harmful overthinking concerns correctness loss: the model reaches a correct prefix but terminates with an incorrect answer\. Verbose overthinking concerns wasted computation: the model has already reached a correct answer and continues reasoning without changing the final outcome\. In this Section we quantify the latter\.
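The two definitions translate directly into a small classifier over per-prefix correctness flags. This is an illustrative sketch: the list-of-flags representation, the function names, and the extra labels `"unsolved"` and `"none"` are assumptions introduced here, not terminology from the paper.

```python
def first_correct_prefix(flags):
    """Index of the first correct prefix, or None if no prefix is correct."""
    for i, ok in enumerate(flags):
        if ok:
            return i
    return None


def classify(flags):
    """Label one trajectory given per-prefix correctness flags.

    The last flag stands for the answer at the actual (full) length.
    """
    tau = first_correct_prefix(flags)
    if tau is None:
        return "unsolved"   # never reaches a correct prefix
    if not flags[-1]:
        return "harmful"    # was correct, terminates incorrect
    if tau < len(flags) - 1:
        return "verbose"    # correct early, kept reasoning, still correct
    return "none"           # first correct exactly at termination


labels = [
    classify([False, True, False]),
    classify([False, True, True]),
    classify([False, False, True]),
    classify([False, False, False]),
]
```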

For each trajectory that reaches a correct prefix, we define the wasted budget as the number of utterances generated after the first correct prefix:

$$w(x;F) = N - \tau_y(x;F),$$

where $N$ is the actual trace length and $\tau_y(x;F)$ is the first correct prefix. Large values indicate that the model solved the problem early but continued to spend inference compute.
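As a worked example of this definition, the sketch below computes the wasted budget per trajectory and its mean over trajectories that reach a correct prefix. The function names and the toy records are hypothetical.

```python
def wasted_budget(n_utterances, tau):
    """w = N - tau_y; undefined (None) when no prefix is correct."""
    if tau is None:
        return None
    return n_utterances - tau


def mean_wasted_budget(records):
    """Average w over trajectories that reach a correct prefix."""
    values = [wasted_budget(n, tau) for n, tau in records]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None


# Toy (N, tau_y) pairs; None marks a trajectory that is never correct.
records = [(12, 2), (8, 8), (20, None), (5, 0)]
average = mean_wasted_budget(records)  # (10 + 0 + 5) / 3
```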

Fig. [11](https://arxiv.org/html/2606.02835#A2.F11) reports average wasted budget across multimodal benchmarks and models. The figure shows substantial variation across models. DualMind-VLM, which is trained to decide whether to reason fast or slow, exhibits the smallest wasted budget, averaging roughly 5 unnecessary utterances. In contrast, R1-VL produces the largest wasted budget, averaging roughly 18 unnecessary utterances. However, a lower wasted budget should not be interpreted as eliminating harmful overthinking: as shown in the main results, models with shorter traces can still deviate from correct trajectories.

### B.6 Overthinking in Language Reasoning Models

The main paper shows that harmful overthinking is not restricted to multimodal reasoning. Fig. [12](https://arxiv.org/html/2606.02835#A2.F12) visualizes the same effect for language-only models by comparing actual and optimal utterance lengths on language benchmarks. Actual traces are extremely long, especially on mathematical reasoning tasks, whereas optimal prefixes are much shorter. This mirrors the multimodal setting: models often reach a correct solution far before their natural stopping point.

Fig. [13](https://arxiv.org/html/2606.02835#A2.F13) reports harmful overthinking by answer format for language-only benchmarks. The effect is again stronger in free-form settings than in multiple-choice settings. This is consistent with the multimodal results: when the output space is unconstrained, the model must preserve and express the correct answer throughout the remainder of the trace, making it more vulnerable to later revisions and contradictions.

![Refer to caption](https://arxiv.org/html/2606.02835v1/x12.png)

Figure 11: Average wasted budget in number of utterances per model per benchmark. DualMind-VLM Lin et al. ([2025](https://arxiv.org/html/2606.02835#bib.bib40)), a model trained to predict input difficulty and use budget accordingly, achieves the lowest wasted budget, with an average of 5 wasted utterances. R1-VL Zhang et al. ([2025d](https://arxiv.org/html/2606.02835#bib.bib16)), whose base model is Qwen2VL Wang et al. ([2024b](https://arxiv.org/html/2606.02835#bib.bib17)), is the least “optimized” model, having an average wasted budget equal to 15 utterances, while having lower base performance than all the other models (Table [1](https://arxiv.org/html/2606.02835#S3.T1)).

![Refer to caption](https://arxiv.org/html/2606.02835v1/x13.png)

Figure 12: Actual vs. optimal reasoning length for language-only LRMs across all models and benchmarks. Actual traces are substantially longer than the first-correct prefixes, showing that language-only models also reason far beyond the point at which the correct answer first becomes recoverable.

![Refer to caption](https://arxiv.org/html/2606.02835v1/x14.png)

Figure 13: Harmful and verbose overthinking by answer format in language-only benchmarks. Free-form tasks exhibit higher harmful-overthinking rates than multiple-choice tasks, confirming the trend shown in the multimodal setting.

## Appendix C Additional Details

Here, we provide the procedural details needed to reproduce our prefix\-level evaluation and failure analysis\. We describe the prefix\-level probing setup, the taxonomy for harmful\-overthinking cases, and the implementation details\.

### C.1 Prefix-Level Evaluation

Algorithm [1](https://arxiv.org/html/2606.02835#alg1) summarizes the prefix-level trajectory protocol to estimate the difficulty of a sample for a given model. For each input, the model first generates a full reasoning trace. The trace is then split into utterances, and every prefix, including the empty prefix, is evaluated by appending a termination template and extracting a final answer. The returned difficulty is the first utterance index that yields a correct answer. If no prefix yields the correct answer, the instance is treated as unsolved for that trajectory.

Early termination prompts

    P1 = "Oh, I suddenly got the answer to the whole problem.
    <answer> ### **Final Answer**: \[ boxed{"

    P2 = "I got it now. I can now give the final response.
    <answer> ### ***Final Answer**: \[ boxed{"

Figure 14: Termination prompts used for prefix-level probing. Each prompt is appended to a partial reasoning trace to force the model to stop deliberating and produce a final answer. The two variants preserve the same function while changing surface wording, allowing us to test whether estimated difficulty is sensitive to the exact probing phrase.

Answer extractor $\mathcal{A}$ prompt

    SYSTEM: You are a helpful assistant who extracts concise
    answers from text. Extract only the direct answer provided
    by the model, removing explanations.

    USER: Given the following reasoning trace, extract ONLY
    the final answer in a concise format.

    Model Answer: {model_trace}

    Extract the answer (just the answer itself, no explanations):

Figure 15: Prompt used by the answer extractor $\mathcal{A}$. The variable `model_trace` denotes the raw generation produced by the evaluated model, either at full length or after prefix-level probing. The extractor returns only the concise final answer used for benchmark verification.

### C.2 Taxonomy Experiment Details

The taxonomy experiment analyzes harmful-overthinking cases, i.e., trajectories that reach a correct answer at some prefix but terminate with an incorrect final prediction. For each case, we identify the last correct prefix and compare it with the full final trace, thereby isolating the additional reasoning segment responsible for the correctness deviation. Fig. [16](https://arxiv.org/html/2606.02835#A4.F16) summarizes the prompt configuration used to extract the category and supporting evidence.

We classify each harmful trajectory into one dominant failure mode: visual hallucination/perception error, calculation error, or logical error.

We use an external judge model, `Qwen3.6-35B`, to assign the label. The judge receives the last correct prefix, the final trace, the ground-truth metadata, and, when available, the image associated with the example. The prompt instructs the judge to compare only the reasoning added after the last correct prefix and to ignore the standardized forced-answer suffix used by the probing pipeline. The judge returns a compact JSON object containing the primary category, optional secondary categories, severity, a short explanation, evidence, and confidence. We parse only valid JSON outputs; malformed outputs are discarded or re-run under the same prompt configuration.
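The filtering of judge outputs can be sketched as follows. The category names come from the judge prompt configuration in Fig. 16; the function name and the extra check that the primary category belongs to the taxonomy are assumptions added for illustration, since the text only states that malformed JSON is discarded or re-run.

```python
import json

# Category names as listed in the judge prompt configuration (Fig. 16).
TAXONOMY = (
    "visual_hallucination_or_perception",
    "calculation_error",
    "logical_error",
)


def parse_judge_output(raw):
    """Return the judge's label dict, or None when the output is unusable.

    A None result signals that the sample should be discarded or re-run.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Assumed extra validation: require a JSON object with a known category.
    if not isinstance(obj, dict) or obj.get("category") not in TAXONOMY:
        return None
    return obj


good = parse_judge_output(
    '{"category": "logical_error", "secondary_categories": [], '
    '"severity": 40, "went_wrong": "Revised a correct answer.", '
    '"evidence": "Wait, maybe...", "example": "Wait, maybe...", '
    '"confidence": 0.8}'
)
bad = parse_judge_output("Sure! Here is the JSON you asked for:")
```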

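Algorithm 1 PyTorch-style code for $\hat{\kappa}(x;\mathcal{F})$

    def difficulty(x, F, A, y, T):
        # F: reasoning model, A: answer extractor, y: ground truth, T: termination prompt
        t = F.generate(x)  # full reasoning trace
        utts = split_utterances(t)  # helper: split the trace into utterances
        # Probe every prefix, including the empty one (i = 0)
        for i in range(len(utts) + 1):
            prefix = "".join(utts[:i])
            prompted = prefix + T  # force the model to stop and answer
            o_i = F.generate_from_prefix(x, prompted)
            y_hat_i = A(o_i)  # parse a concise answer
            if verify(y_hat_i, y):  # helper: check the parsed answer against y
                return i  # first utterance index that yields a correct answer
        return None  # unsolved for this trajectory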
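To make the control flow of Algorithm 1 concrete, the sketch below runs it end to end with stand-ins. Everything except the loop structure is hypothetical: `ToyModel` performs no inference, the extractor is the identity, `verify` is exact string match, and the termination prompt is a digit-free placeholder rather than the prompts in Fig. 14.

```python
import re


class ToyModel:
    """Stand-in for the reasoning model F; it performs no real inference."""

    def generate(self, x):
        # Fixed toy trace: correct after the second utterance, wrong at the end.
        return "Try 2 + 2.\n\nThat gives 4.\n\nWait, maybe it is 5."

    def generate_from_prefix(self, x, prompted):
        # Pretend the forced answer is the last number seen in the prefix.
        numbers = re.findall(r"\d+", prompted)
        return numbers[-1] if numbers else "?"


def split_utterances(trace):
    # Keep trailing newlines so that "".join(utts) rebuilds the trace.
    return re.findall(r"[^\n]+\n*", trace)


def verify(y_hat, y):
    # Toy verifier: exact match after trimming whitespace.
    return y_hat.strip() == y.strip()


def difficulty(x, F, A, y, T):
    utts = split_utterances(F.generate(x))
    for i in range(len(utts) + 1):
        o_i = F.generate_from_prefix(x, "".join(utts[:i]) + T)
        if verify(A(o_i), y):
            return i
    return None


T = "\nFinal answer: "  # digit-free placeholder termination prompt


def A(o):
    return o  # identity extractor for the toy setting


kappa = difficulty("What is 2 + 2?", ToyModel(), A, "4", T)
print(kappa)  # 2
```

In this toy trace the answer is recoverable after two utterances, yet the full trace ends on a wrong value, which is the harmful-overthinking pattern the protocol is designed to expose.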

### C.3 Implementation Details and Reproducibility

#### Evaluation pipeline.

We re-implement and re-run all benchmark evaluations from scratch using a unified LLM-based answer-extraction pipeline. Instead of relying on benchmark-specific regular expressions, we apply a fixed answer extractor $\mathcal{A}$ to each generated trace and use the extracted concise answer for verification. This design is important because reasoning models frequently deviate from requested answer formats, and prefix-level probing produces partial traces whose answers can appear in heterogeneous forms. Unless otherwise specified, we use `Qwen/Qwen3-4B-Instruct-2507` as the extractor. Appendix [A](https://arxiv.org/html/2606.02835#A1) repeats the difficulty-estimation analysis with `Qwen/Qwen3.5-4B` to measure sensitivity to the parser. The extractor prompt is shown in Fig. [15](https://arxiv.org/html/2606.02835#A3.F15).

#### Hyperparameters and Answer Extraction\.

For each evaluated reasoning model, we use the reference decoding configuration recommended by the corresponding model release whenever available, including temperature, top-$p$, maximum generation length, and image-processing settings. *Actual Length* denotes the model's natural termination behavior under this configuration. For prefix-level difficulty estimation, we first generate the complete reasoning trace, split it into utterances, and probe nested prefixes by appending a fixed termination template that asks the model to stop and provide a final answer. The failure-mode taxonomy in Appendix [C.2](https://arxiv.org/html/2606.02835#A3.SS2) is produced by a separate judge model, which compares the last correct prefix with the final incorrect trace and labels the newly introduced error as visual, calculation, or logical. The judge prompt explicitly instructs the model to ignore the artificial termination suffix used by the probing pipeline.

#### Compute\.

All experiments are run with vLLM for batched inference on machines equipped with four NVIDIA A100-64GB GPUs. We store raw generations before answer extraction, which allows parsing, verification, robustness checks, and failure analyses to be repeated without regenerating expensive model traces. We release the evaluation scripts, prompts, decoding configurations, intermediate generations, parsed predictions, and analysis code required to reproduce the reported results. A single benchmark evaluation takes between 1 and 4 hours, depending on the model and dataset size (around 1K samples on average in our setting).

#### Packages, versions, and licenses\.

Our implementation was developed in Python 3\.10\.19, distributed under the Python Software Foundation License Version 2\. We used vLLM v0\.20\.0 for efficient large language model inference, released under the Apache License 2\.0; PyTorch v2\.11\.0\+cu130 for tensor operations and GPU\-accelerated model execution, released under a BSD\-style license; and Hugging Face Transformers v5\.6\.2 for model and tokenizer interfaces, released under the Apache License 2\.0\.

## Appendix D Limitations and Future Work

#### Verifiable outputs\.

Our analysis is limited to settings where correctness can be automatically verified, which is necessary for estimating the first correct prefix and separating verbose from harmful overthinking\. The conclusions are therefore strongest for benchmarks with well\-defined ground\-truth answers, such as mathematical reasoning, visual reasoning, and scientific QA\. Open\-ended generation, tool use, and coding tasks may require different definitions of correctness\. For example, a program can be partially correct, fail hidden tests, or improve through later debugging\. Extension to execution\-based or subjective evaluation settings is an important direction for future work\.

#### Model\-dependent difficulty\.

The difficulty we estimate is not an intrinsic property of a problem alone, but a property of how a particular model processes that problem. We view this model dependence as a feature of the formulation rather than only a limitation. The same problem may be easy for one model and difficult for another, depending on the model's training data, post-training procedure, visual grounding ability, mathematical knowledge, reasoning shortcuts, and decoding policy. Accordingly, the empirical difficulty $\hat{\kappa}(x,y;F)$ should be interpreted as a model-conditioned quantity: it measures the minimum reasoning budget required by model $F$, on a sampled trajectory, to recover the correct answer. This is precisely the notion we aim to capture, since overthinking is also a property of a model's own reasoning dynamics rather than of the benchmark instance in isolation.

#### Compute accounting\.

The experiments require substantial inference compute because prefix-level probing evaluates many nested prefixes for each generated trace. We report the hardware used in Appendix [C.3](https://arxiv.org/html/2606.02835#A3.SS3) and save intermediate generations to avoid unnecessary regeneration.

#### Oracle stopping and deployability\.

*Optimal Length* is not a deployable inference method because it requires ground-truth access to identify the first correct prefix. It serves as an oracle measuring how much performance is lost when models continue reasoning after a correct answer has already become recoverable. Developing practical stopping policies that approximate this oracle without access to ground truth remains an important open problem. Future work will explore how to best leverage the empirical difficulty estimate $\hat{\kappa}(x,y;F)$. It provides a possible supervision signal: models could be rewarded for reaching correct answers with sufficient but non-redundant reasoning, rather than for producing longer traces. This could support explicit stopping policies, model-agnostic difficulty predictors, or training objectives that penalize reasoning beyond the first correct prefix. The taxonomy analysis also suggests targeted interventions: visual errors may require stronger grounding, calculation errors may benefit from symbolic verification, and logical drift may require consistency constraints that prevent unsupported answer revisions.

Failure-analysis judge prompt configuration

    TAXONOMY = [
        "visual_hallucination_or_perception",
        "calculation_error",
        "logical_error",
    ]

    PROBE_SUFFIX_INSTRUCTION = """Important probe artifact:
    - Ignore forced final-answer probe suffixes that start like
      "Oh, I suddenly/finally got the answer..." and lead into "\\boxed{".
    - Treat that text as evaluator scaffolding, not as reasoning produced by
      the model under test.
    - Do not classify a sample as a logical error only because this standard
      probe suffix appears.
    - Classify the drift using the substantive reasoning or final-answer
      change before/around that scaffold."""

    COMPACT_OUTPUT_INSTRUCTION = """Output style:
    - Return exactly one compact JSON object.
    - No analysis, markdown, prose, or preamble.
    - Use the metadata as ground truth for last/final predictions.
    - Keep went_wrong to one short sentence.
    - Use example for one minimal quote/paraphrase from this sample."""

    PROMPT = """
    You are analyzing overthinking in a nested difficulty reasoning trace.

    The first trace is the LAST prefix where the model's parsed answer was
    still correct. The second trace is the FINAL prefix with all retained
    utterances.

    Task:
    1. Compare only what changed after the last-correct prefix.
    2. Identify the main failure mode introduced by the final/full trace.
    3. If an image is provided, decide whether the added suffix hallucinates
       or misreads visual evidence.
    4. Choose the best available category even when the drift is ambiguous.
    5. Ignore the standard forced final-answer probe suffix.

    Allowed categories: {categories}

    Return only valid JSON:
    {
      "category": "one_allowed_category",
      "secondary_categories": ["zero_or_more_allowed_categories"],
      "severity": 0_to_100_integer,
      "went_wrong": "short explanation",
      "evidence": "short quote or paraphrase from the added/final trace",
      "example": "minimal quote or paraphrase illustrating the reason",
      "confidence": 0.0
    }
    """

Figure 16: Failure-analysis judge prompt. The judge compares the final incorrect trace against the last correct prefix and labels the dominant failure mode introduced by the additional reasoning. The prompt explicitly instructs the judge to ignore the standardized forced-answer suffix used by the probing pipeline.
