DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling
Summary
This paper introduces DyCon, a training-free framework that uses step-level embeddings to model evolving task difficulty and dynamically control reasoning depth in Large Reasoning Models, effectively reducing overthinking and improving efficiency without sacrificing accuracy.
Cached at: 06/08/26, 09:14 AM
# DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling
Source: [https://arxiv.org/html/2606.07108](https://arxiv.org/html/2606.07108)
Yulin Li$^{1}$, Huiling Zhen$^{3}$, Libo Qin$^{1}$, Zhoujun Wei$^{4}$, Jinghua Piao$^{2,5}$, Zhuotao Tian$^{1,4}$, Yong Li$^{2,5}$, Min Zhang$^{1,4}$
###### Abstract
Recent advances in Large Reasoning Models (LRMs) demonstrate remarkable performance improvements by iteratively reflecting, exploring, and executing complex tasks, yet suffer from inefficiencies due to redundant reasoning, known as “overthinking”. Existing methods to mitigate this issue either rely on static difficulty estimates or require task-specific training, and thus fail to adapt to the dynamic complexity during reasoning. In this work, we empirically show that problem difficulty evolves dynamically throughout the reasoning process and is linearly encoded in the LRM’s step-level embeddings. Building on this insight, we propose DyCon, a training-free framework that leverages latent step-level representations to explicitly model the evolving task difficulty, enabling dynamic control of reasoning depth to mitigate overthinking. Extensive experiments on four models ranging from 4B to 32B, across twelve benchmarks in math reasoning, general question answering, and coding, demonstrate that DyCon significantly enhances reasoning efficiency by reducing redundant steps without sacrificing accuracy or generalization. Project page and code are available at https://github.com/yu-lin-li/DyCon.
Machine Learning, ICML
$^{1}$Harbin Institute of Technology, Shenzhen; $^{2}$Zhongguancun Academy; $^{3}$Huawei Noah’s Ark Lab; $^{4}$Shenzhen Loop Area Institute; $^{5}$Tsinghua University
## 1 Introduction
Recent advances in Large Reasoning Models (LRMs) have shown strong performance on complex reasoning tasks such as mathematical problem-solving and code generation (Guo et al., [2025](https://arxiv.org/html/2606.07108#bib.bib1); Team, [2025](https://arxiv.org/html/2606.07108#bib.bib2); Yang et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib3)). These gains mainly arise from the models’ ability to iteratively reflect, explore, and execute during reasoning (Chen et al., [2025](https://arxiv.org/html/2606.07108#bib.bib30)). However, existing work reveals that while Chain-of-Thought (CoT) reasoning (Wei et al., [2022](https://arxiv.org/html/2606.07108#bib.bib7)) substantially boosts accuracy on difficult problems, current LRMs lack precise control over this mechanism. As a result, they often perform redundant reflection and exploration even on simple or already-solved tasks, a phenomenon termed “overthinking” (Chen et al., [2024](https://arxiv.org/html/2606.07108#bib.bib8)). This inefficiency unnecessarily lengthens reasoning traces and can introduce additional hallucinations (Sun et al., [2025](https://arxiv.org/html/2606.07108#bib.bib48)), posing a critical bottleneck for practical LRM deployment.
Figure 1: Quantitative comparison. Our method consistently outperforms prior approaches (Yang et al., [2025b](https://arxiv.org/html/2606.07108#bib.bib14); Wang et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib16); Ma et al., [2025](https://arxiv.org/html/2606.07108#bib.bib4)) across multiple mathematical reasoning benchmarks and four model architectures (4B–32B), while reducing token usage without sacrificing accuracy.

Addressing overthinking essentially involves terminating reasoning once sufficient exploration has been achieved. Although several methods have been proposed to identify suitable termination points, they typically fall short in adapting to varying problem difficulties. Specifically, TrimR (Lin et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib28)) and FlashThink (Jiang et al., [2025](https://arxiv.org/html/2606.07108#bib.bib29)) rely on external models to assess reasoning sufficiency. However, these strategies apply uniform criteria across all inputs, ignoring problem-specific difficulty and thus failing to adapt termination points accordingly. Alternative methods (Yang et al., [2025b](https://arxiv.org/html/2606.07108#bib.bib14); Fu et al., [2025](https://arxiv.org/html/2606.07108#bib.bib31)) leverage handcrafted metrics to gauge the model’s certainty and determine when to terminate reasoning. While intuitive, these methods depend heavily on human priors and empirical thresholds, limiting their generalizability across problems of varying complexity.
Figure 2: Dynamic evolution and latent encoding of problem difficulty during reasoning. (a) The dynamic evolution of self-assessed difficulty across normalized reasoning steps. The blue curves indicate mean difficulty ratings, while shaded areas represent standard deviations. Problem difficulty exhibits a consistent declining trend, confirming its dynamic nature throughout reasoning. (b) Linear regression predictions of normalized problem difficulty from step embeddings. With remaining reasoning length as the proxy for evolving difficulty, predictions closely match actual difficulty with high $R^2$ scores (i.e., the coefficient of determination in statistics), demonstrating a strong linear relationship and confirming that step embeddings encode latent difficulty knowledge.

Another direction (Zhang et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib52); Lou et al., [2025](https://arxiv.org/html/2606.07108#bib.bib53); Huang et al., [2025c](https://arxiv.org/html/2606.07108#bib.bib54)) employs Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL) with specially curated datasets to train models to implicitly infer problem difficulty and decide when to terminate reasoning. Despite their potential, such methods are sensitive to the quantity and quality of data and prone to mode collapse (Lou et al., [2025](https://arxiv.org/html/2606.07108#bib.bib53)). Hence, a key question arises: How can we explicitly model task difficulty to adaptively determine when to terminate or extend the reasoning process, thereby enhancing reasoning efficiency for simpler problems while ensuring comprehensive exploration for complex ones?
#### Key observations\.
Though recent works (Sheng et al., [2025](https://arxiv.org/html/2606.07108#bib.bib10); Nguyen et al., [2025](https://arxiv.org/html/2606.07108#bib.bib11); Zhao et al., [2025](https://arxiv.org/html/2606.07108#bib.bib55)) have attempted to estimate problem difficulty, they typically assign static difficulty scores before the reasoning process begins, based on embeddings derived from the initial question or the <think\> token. Consequently, these approaches are constrained to sample-level estimation and fail to capture how difficulty dynamically evolves throughout the reasoning process itself.
However, as illustrated in Fig\.[2](https://arxiv.org/html/2606.07108#S1.F2)\(a\), we observe that the problem difficulty is not static but evolves dynamically during reasoning\. When the reasoning path remains valid, the difficulty gradually decreases as the CoT progressively decomposes and clarifies the problem\. Conversely, if reasoning deviates, misleading or distracting CoT content causes difficulty to remain high or even increase\. This observation motivates us to explore a fine\-grained, step\-level metric capable of explicitly modeling and accurately capturing the dynamic variations in problem difficulty during reasoning\.
Furthermore, the results shown in Fig\.[2](https://arxiv.org/html/2606.07108#S1.F2)\(b\) indicate that the step\-level difficulty information in LRMs can be encoded within embeddings at each reasoning step, exhibiting a linear correlation with actual problem difficulty\. This suggests that LRMs inherently possess latent knowledge regarding dynamically evolving difficulty in their embedding spaces\. Inspired by this finding, we ask:Can this latent knowledge be leveraged to adaptively assess difficulty, both across different samples and throughout the reasoning process, thereby facilitating more efficient reasoning?
Figure 3: Overview of DyCon. (a) Explicit Modeling of Evolving Difficulty: In offline reasoning, step embeddings are extracted from model outputs to construct a fitting set with remaining length information. These lengths are log-transformed and normalized, creating a bounded difficulty target used to fit a linear regressor as the difficulty estimator. (b) Difficulty-Aware Dynamic Reasoning Control: During online reasoning, this estimator dynamically predicts step-level difficulty, guiding logit interventions to reduce the probabilities of reflection-related tokens based on evolving difficulty. This adaptive mechanism promotes deeper reasoning when difficulties are high and encourages early termination in simpler scenarios, optimizing the reasoning depth effectively.
#### Our Solution\.
In this work, we introduce DyCon, a training-free, evolving difficulty-aware mechanism for efficient reasoning. DyCon leverages latent knowledge in LRM representations to model both inter-sample and intra-reasoning difficulty dynamics. We fit a linear regressor on a small-scale seen dataset to map reasoning-step embeddings to problem difficulty. During inference, this regressor estimates difficulty at each reasoning step, capturing fine-grained complexity shifts. Guided by these estimates, DyCon dynamically adjusts the logits for reflection keywords. If the estimated difficulty is low, indicating adequate reasoning, logits of reflection keywords are reduced to expedite convergence. Conversely, if the estimated difficulty is high, these logits are increased to encourage deeper reflection. This mechanism enables dynamic, latent knowledge-guided control over reasoning length, improving reasoning efficiency on simpler tasks without compromising exploration on complex ones.
Extensive experiments across four models ranging from 4B to 32B, and on twelve benchmarks covering math reasoning, general question answering, and coding tasks, demonstrate the effectiveness and strong generalization capabilities of DyCon. To summarize, our contributions are as follows:
- We empirically verify that problem difficulty in LRMs evolves dynamically during reasoning. Our analysis reveals a linear correlation between step embeddings and step-level difficulty, indicating that LRMs inherently possess latent knowledge capable of explicitly modeling this evolving difficulty.
- To achieve dynamic control of the reasoning behavior, we propose DyCon, a training-free evolving difficulty-aware dynamic reasoning control mechanism. By employing a lightweight linear regressor to estimate difficulty from step embeddings, DyCon dynamically adjusts the logits of reflection-related keywords based on this latent knowledge, effectively balancing exploration and efficiency during reasoning.
- Extensive experiments across different models and tasks demonstrate that DyCon effectively reduces redundant reasoning without compromising accuracy, exhibiting strong generalizability and robustness across varying problem complexities and domains.
## 2 Background and Motivation
### 2.1 Preliminaries
This study addresses the problem of efficient reasoning by explicitly modeling step\-level difficulty, enabling adaptive adjustments in reasoning behavior to mitigate overthinking\. In this section, we introduce the preliminaries required to elaborate on the motivation and details of our method\.
#### Inference of LRMs\.
Given an input question $q$, a Large Reasoning Model (LRM) generates a sequence of tokens $\mathbf{y}=(y_1,\dots,y_T)$ autoregressively:

$$p_{\theta}(\mathbf{y}\mid q)=\prod_{t=1}^{T}p_{\theta}(y_t\mid q,y_{<t}),\tag{1}$$

where $p_{\theta}(y_t\mid q,y_{<t})=\mathrm{softmax}(\mathbf{z}_t)$ and $\mathbf{z}_t\in\mathbb{R}^{|\mathcal{V}|}$ denotes the pre-softmax logit vector over the vocabulary $\mathcal{V}$ at decoding step $t$. Let $z_{t,i}$ be the logit of token $i\in\mathcal{V}$ at step $t$. The average logit at step $t$ is given by:

$$\mu_t=\frac{1}{|\mathcal{V}|}\sum_{i\in\mathcal{V}}z_{t,i}.\tag{2}$$

Our study focuses on the reasoning part of the output, which is enclosed between the tokens <think\> and </think\>. Following Wang et al. ([2025a](https://arxiv.org/html/2606.07108#bib.bib16)), we consider each occurrence of \\n\\n as the boundary between steps. $t_s$ and $t_{\mathrm{end}}$ denote the token indexes of the $s$-th step boundary and the ending token </think\>, respectively.
#### Representations of reasoning steps\.
To enable fine-grained control over reasoning behavior, we investigate the latent representations of individual reasoning steps. Consider an LRM consisting of $L$ layers, where the $d$-dimensional hidden state at layer $\ell$ and token position $t$ is denoted as $\mathbf{h}^{(\ell)}_{t}\in\mathbb{R}^{d}$. Due to the causal attention mask employed during decoding, the hidden state $\mathbf{h}^{(\ell)}_{t_s}$ at each step boundary (i.e., \\n\\n) inherently encodes contextual information from preceding steps (Chen et al., [2025](https://arxiv.org/html/2606.07108#bib.bib30)). Therefore, we define the step embedding $\mathbf{e}^{(\ell)}_{s}$ for the $s$-th reasoning step at layer $\ell$ as follows:

$$\mathbf{e}^{(\ell)}_{s}:=\mathbf{h}^{(\ell)}_{t_s}.\tag{3}$$
#### Proxy for estimating step\-level difficulty\.
Harder tasks require deeper exploration, while simpler tasks benefit from quicker convergence. Prior work typically uses overall reasoning length as a proxy for task difficulty (Sheng et al., [2025](https://arxiv.org/html/2606.07108#bib.bib10); Su et al., [2025b](https://arxiv.org/html/2606.07108#bib.bib58)). However, difficulty often varies throughout the reasoning process, and different stages may present distinct challenges. Therefore, fine-grained control necessitates estimating difficulty at the step level. To achieve this, we propose a step-level proxy defined at each step boundary:

$$r_s:=t_{\mathrm{end}}-t_s,\tag{4}$$

where $t_s$ denotes the index of the $s$-th step boundary (i.e., \\n\\n) and $t_{\mathrm{end}}$ is the index of the </think\> token.

Intuitively, $r_s$ measures the remaining length from the current step boundary to the end of the reasoning trace. A larger $r_s$ indicates that substantial reasoning remains, suggesting a more challenging situation, while a smaller $r_s$ indicates that the reasoning process is closer to termination.
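As a rough illustration of the definitions in Eqs. (3) and (4), the sketch below locates step boundaries in a token sequence, picks the hidden state at each boundary as the step embedding, and computes the remaining length. The token ids, the assumption that the step separator is a single token, and all function names are hypothetical choices for this example and are not taken from the paper's code.

```python
def step_boundaries(token_ids, sep_id):
    """Indices t_s of every step-boundary token (assumed to be one token id)."""
    return [t for t, tok in enumerate(token_ids) if tok == sep_id]


def step_embeddings(hidden, boundaries):
    """Eq. (3): e_s is the hidden state at boundary index t_s (one fixed layer)."""
    return [hidden[t] for t in boundaries]


def remaining_lengths(boundaries, t_end):
    """Eq. (4): r_s = t_end - t_s for each boundary."""
    return [t_end - t for t in boundaries]


# Toy reasoning trace. Hypothetical ids: 7 is the step separator, 9 closes the trace.
tokens = [3, 4, 7, 5, 6, 2, 7, 8, 9]
t_end = tokens.index(9)
bounds = step_boundaries(tokens[:t_end], sep_id=7)

# Stand-in hidden states: one 4-dimensional vector per token position.
hidden = [[float(t)] * 4 for t in range(len(tokens))]
emb = step_embeddings(hidden, bounds)
r = remaining_lengths(bounds, t_end)
print(bounds, r)
```

In a real model the hidden states would come from the chosen layer of the LRM rather than from a toy list.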
### 2.2 Key Observations
Existing efficient reasoning methods (Lin et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib28); Yang et al., [2025b](https://arxiv.org/html/2606.07108#bib.bib14)) focus on identifying optimal termination points to avoid unnecessary reasoning steps. These methods assume that problem difficulty remains static throughout reasoning (Sheng et al., [2025](https://arxiv.org/html/2606.07108#bib.bib10); Zhao et al., [2025](https://arxiv.org/html/2606.07108#bib.bib55)). However, we observe that problem difficulty evolves dynamically during the reasoning process and find that LRMs inherently encode such evolving difficulty as latent knowledge within their internal representations. We detail our observations below.
#### Difficulty evolves with the reasoning progress\.
Theoretically, problem difficulty may decrease if the model follows a productive reasoning path, whereas ineffective paths could increase difficulty by introducing noise or confusion. To empirically validate this assumption, we conduct experiments on level 5 problems from the MATH-500 (Lightman et al., [2023](https://arxiv.org/html/2606.07108#bib.bib17)) benchmark, which typically demand extended CoT and thus enable fine-grained analysis.
Specifically, after each reasoning step, we prompt the model to self-assess current difficulty on a 3-point scale: 1 (almost solved), 2 (some uncertainty remains), or 3 (missing key insight) (see Appendix [D.6](https://arxiv.org/html/2606.07108#A4.SS6) for details). As shown in Fig. [2](https://arxiv.org/html/2606.07108#S1.F2)(a), the average self-assessed difficulty, normalized and aggregated across all samples, displays a clear decreasing trend with fluctuations. Notably, this phenomenon consistently emerges across four distinct model families (1.5B–32B parameters). Consequently, accurate identification of termination points requires careful monitoring of difficulty evolution. Practical exploitation of this phenomenon for reasoning control thus requires explicit, fine-grained difficulty modeling.
#### Latent knowledge encoded in step embeddings\.
Prior studies (Su et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib59)) suggest that internal reasoning states are reflected in hidden states of LRMs. We hypothesize that step embeddings similarly encode latent difficulty knowledge.
To investigate this, we take the remaining reasoning length as a difficulty proxy (Sec. [2.1](https://arxiv.org/html/2606.07108#S2.SS1)), draw 600 problems from the MATH (Hendrycks et al., [2021](https://arxiv.org/html/2606.07108#bib.bib12)) training set, and fit a linear regressor to predict normalized difficulty from the corresponding step embeddings (detailed in Sec. [3.2](https://arxiv.org/html/2606.07108#S3.SS2)). As illustrated in Fig. [2](https://arxiv.org/html/2606.07108#S1.F2)(b), predictions from the fitted regressor closely match the actual difficulty values across a held-out, unseen test set and three distinct model families ranging from 4B to 32B. The consistently high $R^2$ scores (i.e., the coefficient of determination in statistics) indicate that step embeddings effectively capture latent difficulty information, exhibiting a nearly linear relationship.
Consequently, the linear relationship between step embeddings and problem difficulty offers an effective foundation for explicit, fine\-grained modeling of difficulty evolution\. Leveraging this latent knowledge enables computationally efficient difficulty estimation, thus facilitating dynamic control over model reasoning behavior\.
## 3 Method
### 3.1 Overview
In this section, we introduce DyCon, a dynamic reasoning control mechanism guided by evolving difficulty estimation. Inspired by the observations described in Sec. [2.2](https://arxiv.org/html/2606.07108#S2.SS2), DyCon consists of two steps: (i) explicitly modeling step-level difficulty that evolves throughout the reasoning trajectory by leveraging latent knowledge captured within the hidden representations of the LRM (Sec. [3.2](https://arxiv.org/html/2606.07108#S3.SS2)); and (ii) dynamically adjusting the reasoning behavior based on estimated difficulty, thereby mitigating unnecessary exploration once sufficient reasoning depth has been achieved (Sec. [3.3](https://arxiv.org/html/2606.07108#S3.SS3)).
### 3.2 Explicit Modeling of Evolving Difficulty
As discussed in Sec. [2.2](https://arxiv.org/html/2606.07108#S2.SS2), step embeddings naturally encode evolving difficulty information. Therefore, DyCon introduces a lightweight difficulty estimator that maps hidden step embeddings directly to step-level difficulty. Crucially, DyCon does not alter the original LRM parameters $\theta$; instead, we fit a simple linear regressor on a small-scale seen dataset to decode the latent difficulty signals inherently captured by the model.
Table 1: Performance on math reasoning benchmarks. Following prior work (Jaech et al., [2024](https://arxiv.org/html/2606.07108#bib.bib34); Guo et al., [2025](https://arxiv.org/html/2606.07108#bib.bib1)), we evaluate our method on small-scale benchmarks using multiple independent sampling trials to assess stability; detailed results are provided in Appendix [B.5](https://arxiv.org/html/2606.07108#A2.SS5). Since TrimR (Lin et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib28)), FlashThink (Jiang et al., [2025](https://arxiv.org/html/2606.07108#bib.bib29)), and ThinkPilot (Li et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib13)) are not publicly released, we re-implemented these methods based on their published descriptions.

| Method | MATH-500 Pass@1↑ | MATH-500 #Tok↓ | AIME24 Pass@1↑ | AIME24 #Tok↓ | AIME25 Pass@1↑ | AIME25 #Tok↓ | GSM8K Pass@1↑ | GSM8K #Tok↓ | AMC23 Pass@1↑ | AMC23 #Tok↓ | MMLU$_{\text{algebra}}$ Pass@1↑ | MMLU$_{\text{algebra}}$ #Tok↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **DeepSeek-R1-Distill-Qwen-7B** | | | | | | | | | | | | |
| Baseline (Guo et al., [2025](https://arxiv.org/html/2606.07108#bib.bib1)) | 92.0 | 3955 | 50.0 | 13008 | 36.7 | 15245 | 90.6 | 1214 | 87.5 | 6193 | 90.0 | 2387 |
| CoD (Xu et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib5)) | 81.8 | 1976 | 53.3 | 11419 | 33.3 | 14333 | 85.4 | 301 | 80.0 | 4810 | 85.0 | 1091 |
| Nothinking (Ma et al., [2025](https://arxiv.org/html/2606.07108#bib.bib4)) | 80.0 | 1020 | 16.7 | 4222 | 23.3 | 4385 | 82.1 | 242 | 72.5 | 1141 | 74.0 | 760 |
| ThinkPilot (Li et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib13)) | 78.0 | 715 | 13.3 | 1229 | 10.0 | 1961 | 86.7 | 327 | 60.0 | 1042 | 74.0 | 705 |
| DEER (Yang et al., [2025b](https://arxiv.org/html/2606.07108#bib.bib14)) | 89.8 | 2143 | 49.2 | 9839 | 36.7 | 7257 | 90.6 | 917 | 85.0 | 4451 | 79.0 | 1493 |
| SEAL (Chen et al., [2025](https://arxiv.org/html/2606.07108#bib.bib30)) | 91.6 | 2943 | 43.3 | 11092 | 26.7 | 11092 | 88.8 | 889 | 77.5 | 5267 | 80.0 | 1507 |
| Manifold Steering (Huang et al., [2025d](https://arxiv.org/html/2606.07108#bib.bib15)) | 88.4 | 2239 | 53.3 | 8457 | – | – | 87.6 | 440 | 87.5 | 4440 | – | – |
| Controlling Thinking Speed (Lin et al., [2025b](https://arxiv.org/html/2606.07108#bib.bib33)) | 90.0 | 2818 | 50.0 | 12588 | 40.0 | 10997 | 86.4 | 478 | 82.5 | 5433 | 90.0 | 1719 |
| NoWait (Wang et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib16)) | 89.6 | 2702 | 40.0 | 7281 | 26.7 | 9302 | 89.1 | 794 | 85.0 | 4376 | 89.0 | 1347 |
| Ours | 92.0 | 3216 | 53.3 | 10906 | 36.7 | 12415 | 91.1 | 880 | 90.0 | 3801 | 91.0 | 1488 |
| Δ vs. Baseline | +0.0 | -18.7% | +3.3 | -16.2% | +0.0 | -18.6% | +0.5 | -27.5% | +2.5 | -38.6% | +1.0 | -37.7% |
| **Qwen3-4B-Thinking-2507** | | | | | | | | | | | | |
| Baseline (Yang et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib3)) | 96.2 | 6749 | 83.3 | 21493 | 76.7 | 22708 | 95.9 | 1494 | 100 | 11073 | 94.0 | 3496 |
| CoD (Xu et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib5)) | 95.6 | 4484 | 83.3 | 18652 | 80.0 | 21246 | 95.7 | 952 | 100 | 8973 | 95.0 | 3209 |
| ThinkPilot (Li et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib13)) | 88.6 | 2911 | 43.3 | 7913 | 30.0 | 8814 | 94.7 | 878 | 75.0 | 5085 | 83.0 | 1306 |
| Nothinking (Ma et al., [2025](https://arxiv.org/html/2606.07108#bib.bib4)) | 95.2 | 4362 | 73.3 | 16556 | 73.3 | 19177 | 95.0 | 1137 | 97.5 | 7738 | 94.0 | 2331 |
| DEER (Yang et al., [2025b](https://arxiv.org/html/2606.07108#bib.bib14)) | 94.6 | 5508 | 66.7 | 12728 | 70.0 | 13342 | 95.7 | 1037 | 100 | 9521 | 92.0 | 1945 |
| NoWait (Wang et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib16)) | 92.6 | 5062 | 53.3 | 12393 | 53.3 | 13322 | 94.8 | 1070 | 92.5 | 8204 | 95.0 | 2068 |
| Ours | 96.2 | 6092 | 86.7 | 18867 | 76.7 | 21100 | 95.7 | 1098 | 100 | 9162 | 95.0 | 2122 |
| Δ vs. Baseline | +0.0 | -9.7% | +3.4 | -12.2% | +0.0 | -7.1% | -0.2 | -26.5% | +0.0 | -17.3% | +1.0 | -39.3% |
| **QwQ-32B** | | | | | | | | | | | | |
| Baseline (Team, [2025](https://arxiv.org/html/2606.07108#bib.bib2)) | 96.0 | 4267 | 73.3 | 13364 | 60.0 | 16462 | 96.8 | 1505 | 97.5 | 7166 | 95.0 | 2133 |
| CoD (Xu et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib5)) | 94.8 | 3662 | 63.3 | 11029 | 46.7 | 13289 | 96.5 | 617 | 92.5 | 6321 | 97.0 | 1345 |
| Nothinking (Ma et al., [2025](https://arxiv.org/html/2606.07108#bib.bib4)) | 95.6 | 3989 | 66.7 | 11507 | 70.0 | 15312 | 96.5 | 1331 | 97.5 | 7472 | 96.0 | 1431 |
| DEER (Yang et al., [2025b](https://arxiv.org/html/2606.07108#bib.bib14)) | 94.6 | 3316 | 70.0 | 10087 | 50.0 | 11598 | 96.3 | 977 | 95.0 | 5782 | 96.0 | 1395 |
| FlashThink (Jiang et al., [2025](https://arxiv.org/html/2606.07108#bib.bib29)) | 93.2 | 3144 | 60.0 | 10034 | 40.0 | 11861 | 96.5 | 910 | 92.5 | 6702 | – | – |
| TrimR (Lin et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib28)) | 93.8 | 3830 | 56.7 | 8345 | 43.3 | 8827 | 93.7 | 1319 | 90.0 | 6055 | – | – |
| SEAL (Chen et al., [2025](https://arxiv.org/html/2606.07108#bib.bib30)) | 93.0 | 3667 | 63.3 | 12064 | 56.7 | 12089 | 96.3 | 1231 | 97.5 | 6448 | 95.0 | 1541 |
| NoWait (Wang et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib16)) | 93.6 | 2902 | 73.3 | 9405 | 56.7 | 11871 | 96.7 | 983 | 100.0 | 4536 | 95.0 | 1302 |
| Ours | 95.8 | 3345 | 73.3 | 12794 | 66.7 | 13640 | 96.8 | 995 | 100 | 5654 | 97.0 | 1266 |
| Δ vs. Baseline | -0.2 | -21.6% | +0.0 | -4.3% | +6.7 | -17.1% | +0.0 | -33.9% | +2.5 | -21.1% | +2.0 | -40.6% |
| **Qwen3-14B** | | | | | | | | | | | | |
| Baseline (Yang et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib3)) | 95.0 | 4962 | 76.7 | 12746 | 70.0 | 16613 | 96.3 | 1693 | 97.5 | 6671 | 96.0 | 2545 |
| CoD (Xu et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib5)) | 93.8 | 3535 | 63.3 | 11426 | 46.7 | 12391 | 96.2 | 670 | 92.5 | 6371 | 94.0 | 1381 |
| Nothinking (Ma et al., [2025](https://arxiv.org/html/2606.07108#bib.bib4)) | 87.4 | 940 | 30.0 | 5123 | 23.3 | 5115 | 94.9 | 260 | 75.0 | 1818 | 84.0 | 547 |
| ThinkPilot (Li et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib13)) | 86.8 | 854 | 26.7 | 7841 | 23.3 | 3488 | 94.9 | 274 | 72.5 | 1561 | 88.0 | 538 |
| Dynasor-CoT (Fu et al., [2025](https://arxiv.org/html/2606.07108#bib.bib31)) | 93.8 | 4023 | 73.3 | 10369 | 60.0 | 12159 | 95.6 | 1483 | 95.0 | 6582 | 91.0 | 1733 |
| DEER (Yang et al., [2025b](https://arxiv.org/html/2606.07108#bib.bib14)) | 94.0 | 3316 | 76.7 | 7619 | 66.7 | 11135 | 95.3 | 840 | 95.0 | 4763 | 87.0 | 1380 |
| NoWait (Wang et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib16)) | 94.6 | 3305 | 76.7 | 10181 | 60.0 | 12276 | 95.8 | 1125 | 97.5 | 4935 | 93.0 | 1729 |
| Ours | 95.0 | 3645 | 76.7 | 10536 | 70.0 | 14537 | 96.3 | 1166 | 97.5 | 5240 | 96.0 | 2073 |
| Δ vs. Baseline | +0.0 | -26.6% | +0.0 | -17.3% | +0.0 | -12.5% | +0.0 | -31.1% | +0.0 | -21.4% | +0.0 | -18.5% |
#### From remaining length to evolving difficulty\.
We randomly sample 600 instances from the MATH (Hendrycks et al., [2021](https://arxiv.org/html/2606.07108#bib.bib12)) training set. For each instance, we run the LRM to generate its Chain-of-Thought (CoT) output enclosed by <think\> $\cdots$ </think\>. Following the definitions in Sec. [2.1](https://arxiv.org/html/2606.07108#S2.SS1), at each step boundary (i.e., \\n\\n), we record: (i) the step embedding $\mathbf{e}_s$, and (ii) the corresponding remaining length $r_s$, forming a step-level fitting set:

$$\mathcal{D}=\left\{\big(\mathbf{e}_s,r_s\big)\right\}.\tag{5}$$

However, directly using $r_s$ as a regression target may be suboptimal because it typically exhibits a heavy-tailed distribution: a small number of steps can have extremely large remaining lengths, disproportionately influencing the regression (see Tab. [8](https://arxiv.org/html/2606.07108#A1.T8)). To mitigate this, we first apply a log-transform to compress the scale of remaining lengths, followed by normalization to derive a bounded difficulty target $d_s$ for fitting:

$$\tilde{r}_s=\ln(1+r_s),\qquad d_s=\frac{\tilde{r}_s-\tilde{r}_{\min}}{\tilde{r}_{\max}-\tilde{r}_{\min}}\in[0,1],\tag{6}$$

where $\tilde{r}_{\min}$ and $\tilde{r}_{\max}$ are computed over the fitting set. By construction, a larger $d_s$ corresponds to a more difficult reasoning step (indicating more reasoning remains), whereas a smaller $d_s$ indicates an easier step.
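To make the transform in Eq. (6) concrete, the following minimal sketch maps a list of remaining lengths to bounded difficulty targets. The function name and the toy values are illustrative assumptions; the handling of a degenerate fitting set (all lengths equal) is also an assumption, since the paper does not discuss that case.

```python
import math


def difficulty_targets(remaining):
    """Eq. (6): log-compress remaining lengths, then min-max normalize to [0, 1]."""
    logs = [math.log1p(r) for r in remaining]  # r~_s = ln(1 + r_s)
    lo, hi = min(logs), max(logs)              # r~_min and r~_max over the fitting set
    if hi == lo:
        # Assumed fallback for a degenerate fitting set; not specified in the paper.
        return [0.0 for _ in logs]
    return [(x - lo) / (hi - lo) for x in logs]


# Heavy-tailed toy lengths: the log step keeps the largest value from dominating.
r = [0, 9, 99, 9999]
d = difficulty_targets(r)
print([round(x, 3) for x in d])  # [0.0, 0.25, 0.5, 1.0]
```

With plain min-max scaling of the raw lengths, the first three values would all fall below 0.01, which illustrates why the log-transform is applied first.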
#### Linear decoding of latent difficulty knowledge\.
To leverage the linear encoding of evolving difficulty within the step embeddings (Sec. [2.2](https://arxiv.org/html/2606.07108#S2.SS2)), we fit a ridge regressor to estimate the difficulty based on the step embeddings. Specifically, with $\mathbf{e}_s\in\mathbb{R}^{d}$ denoting the extracted embedding of step $s$, we model the normalized step difficulty $d_s$ via a linear decoder that yields the estimated difficulty $\hat{d}_s$:

$$\hat{d}_s=f(\mathbf{e}_s)=\mathbf{w}^{\top}\mathbf{e}_s+b.\tag{7}$$

The learnable parameters $\mathbf{w}\in\mathbb{R}^{d}$ and $b\in\mathbb{R}$ are optimized via ridge regression:

$$\min_{\mathbf{w},b}\ \sum_{(\mathbf{e}_s,d_s)\in\mathcal{D}}\left(\hat{d}_s-d_s\right)^{2}+\alpha\lVert\mathbf{w}\rVert_2^{2},\tag{8}$$

where $\alpha\geq 0$ controls the strength of $\ell_2$ regularization. We note that we extract embeddings from a specific layer of the model, and both the embedding layer and the ridge regularization weight $\alpha$ are determined automatically by maximizing the $R^2$ score on a held-out validation set without manual tuning (see Appendix [A.4](https://arxiv.org/html/2606.07108#A1.SS4) for more details).
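The objective in Eq. (8) has a closed-form minimizer, which the sketch below computes with NumPy alongside the $R^2$ score used for model selection. This is only an illustration under stated assumptions: the function names are hypothetical, the data are synthetic stand-ins for step embeddings, and the paper's actual fitting code and its layer and $\alpha$ search are not reproduced here.

```python
import numpy as np


def fit_ridge(E, d, alpha):
    """Closed-form minimizer of Eq. (8); the bias b is left unpenalized."""
    E = np.asarray(E, dtype=float)
    d = np.asarray(d, dtype=float)
    e_mean, d_mean = E.mean(axis=0), d.mean()
    Ec, dc = E - e_mean, d - d_mean           # centering removes b from the solve
    A = Ec.T @ Ec + alpha * np.eye(E.shape[1])
    w = np.linalg.solve(A, Ec.T @ dc)
    b = d_mean - e_mean @ w
    return w, b


def r2_score(d_true, d_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    d_true = np.asarray(d_true, dtype=float)
    d_pred = np.asarray(d_pred, dtype=float)
    ss_res = np.sum((d_true - d_pred) ** 2)
    ss_tot = np.sum((d_true - d_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for step embeddings with an exactly linear target.
rng = np.random.default_rng(0)
E = rng.normal(size=(200, 8))
w_true = rng.normal(size=8)
d = E @ w_true + 0.3

w, b = fit_ridge(E, d, alpha=0.0)             # alpha = 0 reduces to least squares
score = r2_score(d, E @ w + b)                # Eq. (7) applied to every row
w_reg, _ = fit_ridge(E, d, alpha=10.0)        # larger alpha shrinks the weights
print(round(float(score), 6), float(np.linalg.norm(w_reg)) < float(np.linalg.norm(w)))
```

In practice one would loop over candidate layers and values of $\alpha$, keeping the pair with the highest validation $R^2$, as the text describes.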
#### Test\-time difficulty estimation\.
During test-time reasoning, whenever a new step boundary is generated, we compute its step embedding $\mathbf{e}_s$ and estimate the difficulty $\hat{d}_s=f(\mathbf{e}_s)$, which tracks the evolution of difficulty along the reasoning trajectory, enabling dynamic, difficulty-aware reasoning control.
### 3.3 Difficulty-Aware Dynamic Reasoning Control
With the step-level estimated difficulty $\hat{d}_s$ available during inference, DyCon dynamically controls the LRM’s reasoning behavior to mitigate overthinking. The control follows a simple yet effective principle: for steps identified as low-difficulty, the model is encouraged to terminate the reasoning; conversely, for steps assessed as high-difficulty, the model’s reasoning capacity should be preserved for deeper reflection and exploration.
Existing works (Wang et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib16)) terminate reasoning by suppressing the probabilities of reflection keywords. Inspired by them, to achieve difficulty-aware dynamic reasoning control, we propose reducing the token logits of the reflection keywords based on estimated difficulties. Specifically, we define a set $\mathcal{S}\subset\mathcal{V}$ of token IDs corresponding to reflection-related keywords, as detailed in Appendix [D.2](https://arxiv.org/html/2606.07108#A4.SS2). Then, at each decoding step $s$, we compute a difficulty-conditioned logit bias for each $i\in\mathcal{S}$, and subtract the bias from the logits of reflection triggers, i.e., the tokens belonging to the reflection keywords:

$$z^{\prime}_{t,i}=\begin{cases}z_{t,i}-\delta_{s,i},&i\in\mathcal{S},\\ z_{t,i},&\text{otherwise},\end{cases}\tag{9}$$

and sample the next token from the intervened distribution

$$y_t\sim\mathrm{softmax}(\mathbf{z}^{\prime}_t).\tag{10}$$

Different from prior studies (Lin et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib28); Yang et al., [2025b](https://arxiv.org/html/2606.07108#bib.bib14)), our strategy does not enforce termination. Instead, it dynamically reduces the probabilities of reflection triggers based on the reasoning depth, enabling fine-grained and adaptive control over the reasoning behavior of LRMs. Next, we need to derive the logit bias $\delta_{s,i}$ based on the estimated difficulty $\hat{d}_s$.
#### Difficulty\-aware logit bias\.
To generate the difficulty-aware logit bias $\delta_{s,i}$, given the logits $\mathbf{z}_t$ at the $s$-th step boundary, we first compute the mean logit $\mu_t$ as in Eq. (2) and define the positive margin $m_{t,i}$ as

$$m_{t,i}:=[z_{t,i}-\mu_t]_{+}=\max(z_{t,i}-\mu_t,0).\tag{11}$$

This formulation ensures the logit bias is applied only to reflection triggers whose logits exceed the average, thereby preserving normal reasoning patterns. Without this restriction, the reasoning cannot proceed, as shown in Appendix [B.4](https://arxiv.org/html/2606.07108#A2.SS4).

We then define the bias magnitude $\delta_{s,i}$ using a threshold $\tau$, which is consistent across all models and tasks:

$$\delta_{s,i}=(1-\hat{d}_s)\cdot\begin{cases}\sqrt{m_{t,i}},&\hat{d}_s\geq\tau,\\ m_{t,i},&\hat{d}_s<\tau.\end{cases}\tag{12}$$
Table 2: Generalization capabilities on non-mathematical benchmarks.

Figure 4: (a–b) Olympiad performance of (a) R1-Qwen-7B and (b) Qwen3-4B. (c) Early-exit evaluation on Math-500 for Qwen3-4B. (d) Early-exit evaluation on AIME2025 for Qwen3-4B.

Although our central objective is to mitigate overthinking, an essential challenge lies in removing redundant reflections without disrupting the model’s normal reasoning process, particularly when solving difficult problems that inherently require deeper reflection. Thus, to protect the integrity of normal reasoning, our formulation scales the bias magnitude by $1-\hat{d}_{s}$, ensuring weaker suppression for high-difficulty steps and stronger suppression for low-difficulty ones. Furthermore, when difficulty surpasses the threshold ($\hat{d}_{s}\geq\tau$), we take the square root of the margin $m_{t,i}$ to additionally reduce the bias magnitude. This design ensures gentler suppression in challenging scenarios, preserving essential reflective exploration without unintended interference. The sensitivity analysis and necessity of introducing the threshold $\tau$ are illustrated in Fig. [5](https://arxiv.org/html/2606.07108#S4.F5).
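The intervention in Eqs. (9), (11), and (12) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the mean logit is assumed to be taken over the whole vocabulary (Eq. (2) is not reproduced in this section), the difficulty estimate is assumed to lie in [0, 1], and the threshold value used in the note below is purely illustrative.

```python
import math


def apply_reflection_bias(logits, reflection_ids, d_hat, tau):
    """Return a copy of `logits` with reflection-trigger logits reduced.

    logits         -- list of floats, one per vocabulary token
    reflection_ids -- token IDs in the suppression set S
    d_hat          -- estimated difficulty of the current step (assumed in [0, 1])
    tau            -- difficulty threshold (hypothetical value chosen by the caller)
    """
    # Assumption: the mean logit is the plain average over the vocabulary.
    mu = sum(logits) / len(logits)
    adjusted = list(logits)
    for i in reflection_ids:
        # Eq. (11): only logits above the mean receive a non-zero margin.
        margin = max(logits[i] - mu, 0.0)
        # Eq. (12): square-root margin at or above the threshold, linear below it.
        base = math.sqrt(margin) if d_hat >= tau else margin
        # Eq. (9): subtract the difficulty-scaled bias from the trigger logit.
        adjusted[i] = logits[i] - (1.0 - d_hat) * base
    return adjusted


def softmax(logits):
    """Numerically stable softmax, the distribution sampled from in Eq. (10)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

For logits `[1.0, 1.0, 0.0, 6.0]` with the last token as the only reflection trigger, the mean is 2 and the margin is 4. With an assumed threshold of 0.5, a low difficulty of 0.25 gives a bias of 0.75 × 4 = 3, whereas a high difficulty of 0.75 gives 0.25 × √4 = 0.5, so the trigger is suppressed far less on the harder step.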
## 4 Experiment
Evaluation is conducted on benchmarks spanning multiple reasoning domains. Mathematical reasoning datasets: Math-500 (Lightman et al., [2023](https://arxiv.org/html/2606.07108#bib.bib17)), AIME2024 (AI-MO, [2024a](https://arxiv.org/html/2606.07108#bib.bib19)), AIME2025 (OpenCompass, [2025](https://arxiv.org/html/2606.07108#bib.bib20)), AMC23 (AI-MO, [2024b](https://arxiv.org/html/2606.07108#bib.bib21)), GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.07108#bib.bib18)), Olympiad Bench (He et al., [2024](https://arxiv.org/html/2606.07108#bib.bib22)), MMLU$_{\text{algebra}}$ (Hendrycks et al., [2020](https://arxiv.org/html/2606.07108#bib.bib27)). Scientific reasoning datasets: GPQA-Diamond (Rein et al., [2024](https://arxiv.org/html/2606.07108#bib.bib23)). Code reasoning datasets: LiveCodeBench (Jain et al., [2024](https://arxiv.org/html/2606.07108#bib.bib25)). Implicit reasoning datasets: StrategyQA (Geva et al., [2021](https://arxiv.org/html/2606.07108#bib.bib24)). Commonsense reasoning datasets: CommonSenseQA (Talmor et al., [2019](https://arxiv.org/html/2606.07108#bib.bib32)). Knowledge-intensive question answering datasets: TriviaQA (Joshi et al., [2017](https://arxiv.org/html/2606.07108#bib.bib26)). For each backbone, a regressor is fitted offline on 600 randomly sampled problems from Math (Hendrycks et al., [2021](https://arxiv.org/html/2606.07108#bib.bib12)) and remains fixed across all evaluations. A sensitivity analysis of this choice is reported in Fig. [5](https://arxiv.org/html/2606.07108#S4.F5)(c). Additional experimental settings and baseline details are provided in Appendix [D](https://arxiv.org/html/2606.07108#A4).
### 4.1 Main Results
As shown in Table [1](https://arxiv.org/html/2606.07108#S3.T1), Table [4](https://arxiv.org/html/2606.07108#S4.T4), Table [2](https://arxiv.org/html/2606.07108#S3.T2), and Fig. [4](https://arxiv.org/html/2606.07108#S3.F4)(a–b), our method consistently outperforms all baselines, achieving up to 40.6% token reduction and 6.7% absolute accuracy gains on mathematical benchmarks, and up to 52.5% token reduction and 8.6% absolute accuracy gains on non-mathematical benchmarks. The gains generalize beyond Qwen backbones to alternative architectures such as LLaMA family models (Dubey et al., [2024](https://arxiv.org/html/2606.07108#bib.bib60)), demonstrating strong cross-architecture effectiveness. Moreover, even when the regressor is fitted only on Math and applied to other datasets without adaptation, the method maintains high accuracy and strong efficiency gains. This suggests that the temporal evolution patterns of reasoning difficulty learned from mathematical trajectories are qualitatively similar across diverse reasoning domains, enabling effective transfer. Detailed regressor analysis is provided in Appendices [A.4](https://arxiv.org/html/2606.07108#A1.SS4) and [A.5](https://arxiv.org/html/2606.07108#A1.SS5). Further results on the domain generalizability of the regressor are reported in Appendix [A.6](https://arxiv.org/html/2606.07108#A1.SS6).


Figure 5: Detailed analysis of Qwen3-4B. (a) Hyperparameter sensitivity on MATH-500. (b) Comparison of different logits-statistic variants on AIME 2024. (c) Sensitivity of regressor fitting to sample size on AIME 2024. (d) Performance of the regressor fitted on data from different domains.

Table 3: Difficulty Awareness Ablation. R1-Qwen-7B.
### 4.2 Ablation Study
#### Importance of the regressor\.
Table [3](https://arxiv.org/html/2606.07108#S4.T3) shows that replacing adaptive difficulty awareness with a static coefficient leads to a substantial degradation in accuracy. Static suppression indiscriminately over-suppresses challenging instances, reducing the method to a conventional efficiency strategy that fails to balance accuracy and cost. While token-level entropy (as an alternative proxy) captures local uncertainty, it lacks a global view of the reasoning trajectory and fails to distinguish globally complex problems. In contrast, our trajectory-level representation enables a temporally consistent difficulty assessment, which is essential for reliable control during reasoning.

Table 4: Performance on R1-Llama-8B.

Table 5: Ablation on regressor type. Results with Qwen3-4B.

Table 6: Vocabulary Ablation. Qwen3-4B on Math-500.
#### Impact of an alternative efficient strategy\.
We further evaluate alternative efficiency strategies by integrating difficulty awareness into an early-exit mechanism (Fig. [4](https://arxiv.org/html/2606.07108#S3.F4)(c–d)). While this variant outperforms existing early-exit baselines, its reliance on discrete stopping decisions inherently limits the granularity of control. In contrast, our soft difficulty-aware mechanism provides continuous, trajectory-level control over computation, enabling finer-grained adjustment and consistently yielding a more favorable balance between efficiency and accuracy. See Appendix [B.1](https://arxiv.org/html/2606.07108#A2.SS1) for implementation details. Additional results on using a GRU-based policy to guide efficient reasoning and on the bidirectional DyCon strategy are discussed in Appendices [B.2](https://arxiv.org/html/2606.07108#A2.SS2) and [B.11](https://arxiv.org/html/2606.07108#A2.SS11), respectively.
#### Sensitivity to the hyperparameter\.
Fig. [5](https://arxiv.org/html/2606.07108#S4.F5)(a) analyzes the sensitivity to the hyperparameter that balances the linear and square-root distance terms. Increasing the weight on the square-root term leads to more conservative inference and higher token usage, whereas increasing the weight on the linear term improves efficiency with a modest reduction in accuracy.
#### Impact of aggregation operator choices\.
As shown in Fig. [5](https://arxiv.org/html/2606.07108#S4.F5)(b), we replace the mean with the median, trimmed mean, and winsorized mean for aggregating token-level states. The results show comparable performance across aggregation choices, indicating low sensitivity to the specific operator. Stability analysis is shown in Appendix [B.3](https://arxiv.org/html/2606.07108#A2.SS3).
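The four aggregation operators compared here can be sketched with the Python standard library alone. This is an illustrative sketch, not the authors' code: the helper names and the `trim_count` parameter (how many values are cut or clipped at each end) are hypothetical, and it assumes the aggregated quantity is the list of logits from which the reference statistic in the positive margin is computed.

```python
import statistics


def trimmed_mean(values, trim_count):
    """Mean after discarding the `trim_count` smallest and largest values."""
    if trim_count < 0 or 2 * trim_count >= len(values):
        raise ValueError("trim_count must leave at least one value")
    ordered = sorted(values)
    kept = ordered[trim_count:len(ordered) - trim_count]
    return statistics.fmean(kept)


def winsorized_mean(values, trim_count):
    """Mean after clipping the `trim_count` extreme values at each end
    to the nearest remaining value instead of discarding them."""
    if trim_count < 0 or 2 * trim_count >= len(values):
        raise ValueError("trim_count must leave at least one value")
    ordered = sorted(values)
    low, high = ordered[trim_count], ordered[-trim_count - 1]
    clipped = [min(max(v, low), high) for v in ordered]
    return statistics.fmean(clipped)


def aggregate(values, operator="mean", trim_count=1):
    """Dispatch to one of the four operators named in the ablation."""
    if operator == "mean":
        return statistics.fmean(values)
    if operator == "median":
        return statistics.median(values)
    if operator == "trimmed":
        return trimmed_mean(values, trim_count)
    if operator == "winsorized":
        return winsorized_mean(values, trim_count)
    raise ValueError(f"unknown operator: {operator}")
```

The operators differ mainly in how they treat outliers: for the values `[0.0, 1.0, 2.0, 3.0, 44.0]`, the mean is 10.0, while the median, the trimmed mean, and the winsorized mean (each with one value cut per side) are all 2.0.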
#### Impact of regressor data and model choice\.
Fig. [5](https://arxiv.org/html/2606.07108#S4.F5)(c–d) studies the effect of regressor fitting data scale and source, while Table [5](https://arxiv.org/html/2606.07108#S4.T5) compares different regressor architectures for difficulty prediction. We observe that insufficient fitting data substantially degrades predictive accuracy and downstream performance, whereas performance improvements largely saturate at around 300 samples. Moreover, regressors trained on GPQA exhibit strong cross-domain transferability, generalizing well to Math and other benchmarks. We further discuss the noise introduced by using reasoning length as a difficulty proxy in Appendix [B.8](https://arxiv.org/html/2606.07108#A2.SS8), where removing samples with redundant reasoning is shown to degrade DyCon’s performance.
Across regressor types, DyCon remains broadly robust, with Elastic Net yielding further improvements on Math-500. In contrast, Random Forest leads to degraded performance, consistent with its inferior predictive quality ($R^{2}=0.6398$ compared to approximately $0.8$ for other regressors). Overall, these results highlight that accurate difficulty regression is a key factor for reliable difficulty estimation and effective downstream control. Further analyses of regressor fitting, more complex nonlinear regressors such as MLPs, and additional experiments on iteratively refining the regressor with DyCon-generated trajectories are provided in Appendices [A.4](https://arxiv.org/html/2606.07108#A1.SS4), [B.12](https://arxiv.org/html/2606.07108#A2.SS12), and [B.10](https://arxiv.org/html/2606.07108#A2.SS10), respectively.
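For readers less familiar with the metric quoted above, the coefficient of determination can be computed as in the short sketch below. The function name and the toy numbers are hypothetical and unrelated to the paper's data; the sketch only shows the standard definition, one minus the ratio of the residual to the total sum of squares.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the regressor's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean target.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("targets are constant, so R^2 is undefined")
    return 1.0 - ss_res / ss_tot
```

With toy targets `[0, 1, 2, 3]` and predictions `[0, 1, 2, 4]`, the residual sum of squares is 1 and the total sum of squares is 5, giving a score of 0.8. A regressor that always predicts the mean target scores 0.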
#### Impact of the vocabulary design\.
Table [6](https://arxiv.org/html/2606.07108#S4.T6) shows that replacing our suppression vocabulary with the SEAL (Chen et al., [2025](https://arxiv.org/html/2606.07108#bib.bib30)) reflection list yields comparable or even superior performance. This result suggests that DyCon is largely insensitive to the exact choice of suppression vocabulary and remains effective as long as reflective terms are appropriately suppressed. More detailed analyses of vocabulary optimization and token sensitivity are provided in Appendix [B.7](https://arxiv.org/html/2606.07108#A2.SS7), and cross-lingual analyses are presented in Appendix [B.9](https://arxiv.org/html/2606.07108#A2.SS9).
## 5 Conclusion
This paper shows that LLMs continuously encode difficulty signals, which we leverage for adaptive inference. Our proposed method, DyCon, is training-free and improves efficiency while preserving performance. Extending DyCon to multi-modal scenarios is a promising future direction.
## Acknowledgements
This work was supported by the Shenzhen Science and Technology Program \(KJZD20240903102901003\), the Zhongguancun Academy under Grant No\. C20250201, and the National Natural Science Foundation of China \(NSFC\) via Grant No\. 92570120\.
## Impact Statement
This paper proposes DyCon, a training-free dynamic control mechanism for Large Reasoning Models to improve inference efficiency by reducing redundant reasoning while preserving accuracy. By modeling evolving problem difficulty from latent representations, our approach adaptively reallocates computation during reasoning, lowering inference-time cost and improving accessibility under constrained computational budgets.
Potential risks are similar to those of general\-purpose reasoning language models\. Increased efficiency may lower the cost of misuse, and latent difficulty estimation may be unreliable on out\-of\-distribution or adversarial inputs, potentially leading to premature termination or insufficient reasoning\. The method introduces no new data collection or training and inherits the biases and limitations of the underlying pretrained models\. Responsible deployment should rely on existing safety and moderation mechanisms\.
## References
- AI-MO (2024a) AIME 2024. External Links: [Link](https://huggingface.co/datasets/AI-MO/aimo-validation-aime)
- AI-MO (2024b) AMC 2023. External Links: [Link](https://huggingface.co/datasets/AI-MO/aimo-validation-amc)
- D. Arora and A. Zanette (2025) Training language models to reason efficiently. arXiv preprint arXiv:2502.04463.
- M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, et al. (2024) Graph of thoughts: solving elaborate problems with large language models. In Proceedings of the AAAI conference on artificial intelligence, Vol. 38, pp. 17682–17690.
- R. Chen, Z. Zhang, J. Hong, S. Kundu, and Z. Wang (2025) Seal: steerable reasoning calibration of large language models for free. arXiv preprint arXiv:2504.07986.
- X. Chen, J. Xu, T. Liang, Z. He, J. Pang, D. Yu, L. Song, Q. Liu, M. Zhou, Z. Zhang, et al. (2024) Do not think that much for 2+3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- J. Cui, S. Liu, Z. Tian, Z. Zhong, and J. Jia (2022) Reslt: residual learning for long-tailed recognition. IEEE transactions on pattern analysis and machine intelligence 45(3), pp. 3695–3706.
- J. Cui, Z. Zhong, Z. Tian, S. Liu, B. Yu, and J. Jia (2023) Generalized parametric contrastive learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(12), pp. 7463–7474.
- M. Ding, H. Liu, Z. Fu, J. Song, W. Xie, and Y. Zhang (2024) Break the chain: large language models can be shortcut reasoners. arXiv preprint arXiv:2406.06580.
- A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. (2024) The llama 3 herd of models. arXiv e-prints, pp. arXiv–2407.
- Y. Fu, J. Chen, Y. Zhuang, Z. Fu, I. Stoica, and H. Zhang (2025) Reasoning without self-doubt: more efficient chain-of-thought through certainty probing. In ICLR 2025 Workshop on Foundation Models in the Wild.
- M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, and J. Berant (2021) Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics 9, pp. 346–361.
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- M. Gurbuzbalaban, U. Simsekli, and L. Zhu (2021) The heavy-tail phenomenon in sgd. In International Conference on Machine Learning, pp. 3964–3975.
- C. He, R. Luo, Y. Bai, S. Hu, Z. L. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, et al. (2024) Olympiadbench: a challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008.
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2020) Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.
- J. Huang, X. Hu, B. Han, S. Shi, Z. Tian, T. He, and L. Jiang (2025a) Memory forcing: spatio-temporal memory for consistent scene generation on minecraft. arXiv preprint arXiv:2510.03198.
- J. Huang, X. Hu, S. Shi, Z. Tian, and L. Jiang (2025b) Edit360: 2d image edits to 3d assets from any angle. In ICCV.
- S. Huang, H. Wang, W. Zhong, Z. Su, J. Feng, B. Cao, and Y. R. Fung (2025c) AdaCtrl: towards adaptive and controllable reasoning via difficulty-aware budgeting. arXiv preprint arXiv:2505.18822.
- Y. Huang, H. Chen, S. Ruan, Y. Zhang, X. Wei, and Y. Dong (2025d) Mitigating overthinking in large reasoning models via manifold steering. arXiv preprint arXiv:2505.22411.
- A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. (2024) Gpt-4o system card. arXiv preprint arXiv:2410.21276.
- A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. (2024) Openai o1 system card. arXiv preprint arXiv:2412.16720.
- N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica (2024) Livecodebench: holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
- G. Jiang, G. Quan, Z. Ding, Z. Luo, D. Wang, and Z. Hu (2025) Flashthink: an early exit method for efficient reasoning. arXiv preprint arXiv:2505.13949.
- L. Jiang, S. Shi, Z. Tian, X. Lai, S. Liu, C. Fu, and J. Jia (2021) Guided point contrastive learning for semi-supervised point cloud semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 6423–6432.
- M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) Triviaqa: a large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
- D. Kahneman (2011) Thinking, fast and slow. Farrar, Straus and Giroux.
- Y. Kang, X. Sun, L. Chen, and W. Zou (2025) C3ot: generating shorter chain-of-thought without compromising effectiveness. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 24312–24320.
- J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- A. Kumar, V. Zhuang, R. Agarwal, Y. Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop, R. Roelofs, et al. (2024) Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917.
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pp. 611–626.
- X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia (2024a) Lisa: reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9579–9589.
- X. Lai, Z. Tian, Y. Chen, S. Yang, X. Peng, and J. Jia (2024b) Step-dpo: step-wise preference optimization for long-chain reasoning of llms. arXiv preprint arXiv:2406.18629.
- X. Lai, Z. Tian, L. Jiang, S. Liu, H. Zhao, L. Wang, and J. Jia (2021) Semi-supervised semantic segmentation with directional context-aware consistency. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1205–1214.
- S. Li, Z. Lin, S. Yang, J. Zhao, and W. Chen (2025a) ThinkPilot: steering reasoning models via automated think-prefixes optimization. arXiv preprint arXiv:2510.12063.
- Y. Li, T. Tu, L. Ding, J. Wang, H. Zhen, Y. Chen, Y. Li, and Z. Tian (2026) Efficient reasoning with balanced thinking. arXiv preprint arXiv:2603.12372.
- Y. Li, Z. Liu, Z. Li, X. Zhang, Z. Xu, X. Chen, H. Shi, S. Jiang, X. Wang, J. Wang, et al. (2025b) Perception, reason, think, and plan: a survey on large multimodal reasoning models. arXiv preprint arXiv:2505.04921.
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let’s verify step by step. In The Twelfth International Conference on Learning Representations.
- W. Lin, X. Li, Z. Yang, X. Fu, H. Zhen, Y. Wang, X. Yu, W. Liu, X. Li, and M. Yuan (2025a) TrimR: verifier-based training-free thinking compression for efficient test-time scaling. arXiv preprint arXiv:2505.17155.
- Z. Lin, Z. Fu, Z. Chen, C. Chen, L. Xie, W. Wang, D. Cai, Z. Wang, and J. Ye (2025b) Controlling thinking speed in reasoning models. arXiv preprint arXiv:2507.03704.
- A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. (2024) Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.
- C. Lou, Z. Sun, X. Liang, M. Qu, W. Shen, W. Wang, Y. Li, Q. Yang, and S. Wu (2025) AdaCoT: pareto-optimal adaptive chain-of-thought triggering via reinforcement learning. arXiv preprint arXiv:2505.11896.
- X. Luo, Z. Tian, T. Zhang, B. Yu, Y. Y. Tang, and J. Jia (2023) Pfenet++: boosting few-shot semantic segmentation with the noise-filtered context-aware prior mask. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(2), pp. 1273–1289.
- W. Ma, J. He, C. Snell, T. Griggs, S. Min, and M. Zaharia (2025) Reasoning models can be effective without thinking. arXiv preprint arXiv:2504.09858.
- T. Munkhbat, N. Ho, S. H. Kim, Y. Yang, Y. Kim, and S. Yun (2025) Self-training elicits concise reasoning in large language models. arXiv preprint arXiv:2502.20122.
- S. Nayab, G. Rossolini, M. Simoni, A. Saracino, G. Buttazzo, N. Manes, and F. Giacomelli (2024) Concise thoughts: impact of output length on llm reasoning and cost. arXiv preprint arXiv:2407.19825.
- B. Nguyen, H. T. Nguyen, R. She, X. Fu, and V. A. Nguyen (2025) Reasoning planning for language models. arXiv preprint arXiv:2511.00521.
- Z. Ning, Z. Tian, G. Lu, and W. Pei (2023) Boosting few-shot 3d point cloud segmentation via query-guided enhancement. In Proceedings of the 31st ACM international conference on multimedia, pp. 1895–1904.
- OpenCompass (2025) AIME 2025. External Links: [Link](https://huggingface.co/datasets/opencompass/AIME2025)
- B. Peng, Z. Tian, S. Liu, M. Yang, and J. Jia (2024a) Scalable language model with generalized continual learning. arXiv preprint arXiv:2404.07470.
- B. Peng, Z. Tian, X. Wu, C. Wang, S. Liu, J. Su, and J. Jia (2023) Hierarchical dense correlation distillation for few-shot segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 23641–23651.
- B. Peng, X. Wu, L. Jiang, Y. Chen, H. Zhao, Z. Tian, and J. Jia (2024b) Oa-cnns: omni-adaptive sparse cnns for 3d semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21305–21315.
- S. Peng, W. Wang, Z. Tian, S. Yang, X. Wu, H. Xu, C. Zhang, T. Isobe, B. Hu, and M. Zhang (2025a) Omni-dpo: a dual-perspective paradigm for dynamic preference learning of llms. arXiv preprint arXiv:2506.10054.
- S. Peng, S. Yang, L. Jiang, and Z. Tian (2025b) Mitigating object hallucinations via sentence-level early intervention. arXiv preprint arXiv:2507.12455.
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21(140), pp. 1–67.
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024) Gpqa: a graduate-level google-proof q&a benchmark. In First Conference on Language Modeling.
- M. Renze and E. Guven (2024) The benefits of a concise chain of thought on problem-solving in large language models. In 2024 2nd International Conference on Foundation and Large Language Models (FLLM), pp. 476–483.
- T. Shao, Z. Tian, H. Zhao, and J. Su (2024) Explore the potential of clip for training-free open vocabulary semantic segmentation. In European Conference on Computer Vision, pp. 139–156.
- Y. Shen, J. Zhang, J. Huang, S. Shi, W. Zhang, J. Yan, N. Wang, K. Wang, Z. Liu, and S. Lian (2025) Dast: difficulty-adaptive slow-thinking for large reasoning models. arXiv preprint arXiv:2503.04472.
- L. Sheng, A. Zhang, Z. Wu, W. Zhao, C. Shen, Y. Zhang, X. Wang, and T. Chua (2025) On reasoning strength planning in large reasoning models. arXiv preprint arXiv:2506.08390.
- F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, et al. (2022) Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057.
- D. Su, H. Zhu, Y. Xu, J. Jiao, Y. Tian, and Q. Zheng (2025a) Token assorted: mixing latent and text tokens for improved language model reasoning. arXiv preprint arXiv:2502.03275.
- J\. Su, J\. Healey, P\. Nakov, and C\. Cardie \(2025b\)Between underthinking and overthinking: an empirical study of reasoning length and correctness in llms\.arXiv preprint arXiv:2505\.00127\.Cited by:[§2\.1](https://arxiv.org/html/2606.07108#S2.SS1.SSS0.Px3.p1.4)\.
- Z\. Sun, Q\. Wang, H\. Wang, X\. Zhang, and J\. Xu \(2025\)Detection and mitigation of hallucination in large reasoning models: a mechanistic perspective\.arXiv preprint arXiv:2505\.12886\.Cited by:[§1](https://arxiv.org/html/2606.07108#S1.p1.1)\.
- A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant \(2019\)Commonsenseqa: a question answering challenge targeting commonsense knowledge\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),pp\. 4149–4158\.Cited by:[§D\.4](https://arxiv.org/html/2606.07108#A4.SS4.p12.1),[§4](https://arxiv.org/html/2606.07108#S4.p1.1)\.
- Q\. Team \(2025\)QwQ\-32b: embracing the power of reinforcement learning\.External Links:[Link](https://qwenlm.github.io/blog/qwq-32b/)Cited by:[§D\.1](https://arxiv.org/html/2606.07108#A4.SS1.p1.6),[§1](https://arxiv.org/html/2606.07108#S1.p1.1),[Table 1](https://arxiv.org/html/2606.07108#S3.T1.17.17.38.21.1)\.
- Z\. Tian, P\. Chen, X\. Lai, L\. Jiang, S\. Liu, H\. Zhao, B\. Yu, M\. Yang, and J\. Jia \(2022a\)Adaptive perspective distillation for semantic segmentation\.IEEE Transactions on Pattern Analysis and Machine Intelligence45\(2\),pp\. 1372–1387\.Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px1.p1.1)\.
- Z\. Tian, J\. Cui, L\. Jiang, X\. Qi, X\. Lai, Y\. Chen, S\. Liu, and J\. Jia \(2023\)Learning context\-aware classifier for semantic segmentation\.InProceedings of the AAAI conference on artificial intelligence,Vol\.37,pp\. 2438–2446\.Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px1.p1.1)\.
- Z\. Tian, X\. Lai, L\. Jiang, S\. Liu, M\. Shu, H\. Zhao, and J\. Jia \(2022b\)Generalized few\-shot semantic segmentation\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 11563–11572\.Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px1.p1.1)\.
- Z\. Tian, M\. Shu, P\. Lyu, R\. Li, C\. Zhou, X\. Shen, and J\. Jia \(2019\)Learning shape\-aware embedding for scene text detection\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 4234–4243\.Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px1.p1.1)\.
- Z\. Tian, H\. Zhao, M\. Shu, Z\. Yang, R\. Li, and J\. Jia \(2020\)Prior guided feature enrichment network for few\-shot segmentation\.IEEE transactions on pattern analysis and machine intelligence44\(2\),pp\. 1050–1065\.Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px1.p1.1)\.
- C\. Wang, L\. Jiang, X\. Wu, Z\. Tian, B\. Peng, H\. Zhao, and J\. Jia \(2024\)Groupcontrast: semantic\-aware self\-supervised representation learning for 3d understanding\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 4917–4928\.Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px1.p1.1)\.
- C\. Wang, Y\. Feng, D\. Chen, Z\. Chu, R\. Krishna, and T\. Zhou \(2025a\)Wait, we don’t need to “wait”\! removing thinking tokens improves reasoning efficiency\.arXiv preprint arXiv:2506\.08343\.Cited by:[§B\.7](https://arxiv.org/html/2606.07108#A2.SS7.SSS0.Px1.p2.1),[§D\.2](https://arxiv.org/html/2606.07108#A4.SS2.p1.1),[§D\.5](https://arxiv.org/html/2606.07108#A4.SS5.p1.1),[Table 37](https://arxiv.org/html/2606.07108#A4.T37),[Figure 1](https://arxiv.org/html/2606.07108#S1.F1),[Figure 1](https://arxiv.org/html/2606.07108#S1.F1.4.2.1),[§2\.1](https://arxiv.org/html/2606.07108#S2.SS1.SSS0.Px1.p1.13),[§3\.3](https://arxiv.org/html/2606.07108#S3.SS3.p2.3),[Table 1](https://arxiv.org/html/2606.07108#S3.T1.17.17.27.10.1),[Table 1](https://arxiv.org/html/2606.07108#S3.T1.17.17.35.18.1),[Table 1](https://arxiv.org/html/2606.07108#S3.T1.17.17.45.28.1),[Table 1](https://arxiv.org/html/2606.07108#S3.T1.17.17.54.37.1),[Table 6](https://arxiv.org/html/2606.07108#S4.T6.2.4.2.1)\.
- J\. Wang, B\. Chen, Y\. Li, B\. Kang, Y\. Chen, and Z\. Tian \(2025b\)Declip: decoupled learning for open\-vocabulary dense perception\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 14824–14834\.Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px1.p1.1)\.
- J\. Wang, K\. Chen, Y\. Li, B\. Chen, H\. Zhao, X\. Qi, and Z\. Tian \(2025c\)Generalized decoupled learning for enhancing open\-vocabulary dense perception\.arXiv preprint arXiv:2508\.11256\.Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px1.p1.1)\.
- Y\. Wang, Q\. Liu, J\. Xu, T\. Liang, X\. Chen, Z\. He, L\. Song, D\. Yu, J\. Li, Z\. Zhang,et al\.\(2025d\)Thoughts are all over the place: on the underthinking of o1\-like llms\.arXiv preprint arXiv:2501\.18585\.Cited by:[§A\.1](https://arxiv.org/html/2606.07108#A1.SS1.SSS0.Px1.p3.1),[§B\.2](https://arxiv.org/html/2606.07108#A2.SS2.SSS0.Px1.p8.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§A\.1](https://arxiv.org/html/2606.07108#A1.SS1.SSS0.Px1.p2.1),[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px1.p2.1),[§1](https://arxiv.org/html/2606.07108#S1.p1.1)\.
- T\. Wolf, L\. Debut, V\. Sanh, J\. Chaumond, C\. Delangue, A\. Moi, P\. Cistac, T\. Rault, R\. Louf, M\. Funtowicz,et al\.\(2019\)Huggingface’s transformers: state\-of\-the\-art natural language processing\.arXiv preprint arXiv:1910\.03771\.Cited by:[§D\.3](https://arxiv.org/html/2606.07108#A4.SS3.p1.1)\.
- X\. Wu, Z\. Tian, X\. Wen, B\. Peng, X\. Liu, K\. Yu, and H\. Zhao \(2024\)Towards large\-scale 3d representation learning with multi\-dataset point prompt training\.InCVPR,Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px1.p1.1)\.
- Y\. Wu, J\. Shi, B\. Wu, J\. Zhang, X\. Lin, N\. Tang, and Y\. Luo \(2025\)Concise reasoning, big gains: pruning long reasoning trace with difficulty\-aware prompting\.arXiv preprint arXiv:2505\.19716\.Cited by:[§B\.10](https://arxiv.org/html/2606.07108#A2.SS10.SSS0.Px1.p4.1)\.
- S\. Xu, W\. Xie, L\. Zhao, and P\. He \(2025a\)Chain of draft: thinking faster by writing less\.arXiv preprint arXiv:2502\.18600\.Cited by:[§B\.6](https://arxiv.org/html/2606.07108#A2.SS6.SSS0.Px1.p2.1),[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px3.p1.1),[§D\.5](https://arxiv.org/html/2606.07108#A4.SS5.p1.1),[Table 1](https://arxiv.org/html/2606.07108#S3.T1.17.17.20.3.1),[Table 1](https://arxiv.org/html/2606.07108#S3.T1.17.17.31.14.1),[Table 1](https://arxiv.org/html/2606.07108#S3.T1.17.17.39.22.1),[Table 1](https://arxiv.org/html/2606.07108#S3.T1.17.17.49.32.1)\.
- Y\. Xu, X\. Guo, Z\. Zeng, and C\. Miao \(2025b\)Softcot: soft chain\-of\-thought for efficient reasoning with llms\.arXiv preprint arXiv:2502\.12134\.Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px3.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025a\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[Table 7](https://arxiv.org/html/2606.07108#A1.T7.28.26.30.1.1),[Table 15](https://arxiv.org/html/2606.07108#A2.T15.1.1.3.2.1),[Table 16](https://arxiv.org/html/2606.07108#A2.T16.7.1.3.2.1),[§D\.1](https://arxiv.org/html/2606.07108#A4.SS1.p1.6),[§1](https://arxiv.org/html/2606.07108#S1.p1.1),[Table 1](https://arxiv.org/html/2606.07108#S3.T1.17.17.30.13.1),[Table 1](https://arxiv.org/html/2606.07108#S3.T1.17.17.48.31.1)\.
- C\. Yang, Q\. Si, Y\. Duan, Z\. Zhu, C\. Zhu, Q\. Li, M\. Chen, Z\. Lin, and W\. Wang \(2025b\)Dynamic early exit in reasoning models\.arXiv preprint arXiv:2504\.15895\.Cited by:[§B\.6](https://arxiv.org/html/2606.07108#A2.SS6.SSS0.Px1.p2.1),[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px3.p1.1),[§D\.5](https://arxiv.org/html/2606.07108#A4.SS5.p1.1),[Figure 1](https://arxiv.org/html/2606.07108#S1.F1),[Figure 1](https://arxiv.org/html/2606.07108#S1.F1.4.2.1),[§1](https://arxiv.org/html/2606.07108#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.07108#S2.SS2.p1.1),[§3\.3](https://arxiv.org/html/2606.07108#S3.SS3.p2.5),[Table 1](https://arxiv.org/html/2606.07108#S3.T1.17.17.23.6.1),[Table 1](https://arxiv.org/html/2606.07108#S3.T1.17.17.34.17.1),[Table 1](https://arxiv.org/html/2606.07108#S3.T1.17.17.41.24.1),[Table 1](https://arxiv.org/html/2606.07108#S3.T1.17.17.53.36.1)\.
- S\. Yang, Y\. Chen, Z\. Tian, C\. Wang, J\. Li, B\. Yu, and J\. Jia \(2025c\)Visionzip: longer is better but not necessary in vision language models\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 19792–19802\.Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px1.p1.1)\.
- S\. Yang, T\. Qu, X\. Lai, Z\. Tian, B\. Peng, S\. Liu, and J\. Jia \(2023\)Lisa\+\+: an improved baseline for reasoning segmentation with large language model\.arXiv preprint arXiv:2312\.17240\.Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px1.p1.1)\.
- S\. Yang, Z\. Tian, L\. Jiang, and J\. Jia \(2024\)Unified language\-driven zero\-shot domain adaptation\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 23407–23415\.Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px1.p1.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\)Tree of thoughts: deliberate problem solving with large language models\.Advances in neural information processing systems36,pp\. 11809–11822\.Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px1.p2.1)\.
- S\. Zhang, J\. Wu, J\. Chen, C\. Zhang, X\. Lou, W\. Zhou, S\. Zhou, C\. Wang, and J\. Wang \(2025a\)OThink\-r1: intrinsic fast/slow thinking mode switching for over\-reasoning mitigation\.arXiv preprint arXiv:2506\.02397\.Cited by:[§1](https://arxiv.org/html/2606.07108#S1.p3.1)\.
- Y\. Zhang, X\. Wu, Y\. Lao, C\. Wang, Z\. Tian, N\. Wang, and H\. Zhao \(2025b\)Concerto: joint 2d\-3d self\-supervised learning emerges spatial representations\.InNeurIPS,Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px1.p1.1)\.
- Z\. Zhang, X\. He, W\. Yan, A\. Shen, C\. Zhao, S\. Wang, Y\. Shen, and X\. E\. Wang \(2025c\)Soft thinking: unlocking the reasoning potential of llms in continuous concept space\.arXiv preprint arXiv:2505\.15778\.Cited by:[Appendix C](https://arxiv.org/html/2606.07108#A3.SS0.SSS0.Px3.p1.1)\.
- B\. Zhao, B\. Kapusuzoglu, K\. Balasubramaniam, S\. Sahu, S\. Chakraborty, and G\. I\. Winata \(2025\)Optimizing reasoning efficiency through prompt difficulty prediction\.arXiv preprint arXiv:2511\.03808\.Cited by:[§1](https://arxiv.org/html/2606.07108#S1.SS0.SSS0.Px1.p1.1),[§2\.2](https://arxiv.org/html/2606.07108#S2.SS2.p1.1)\.
## Contents
[E Case Study](https://arxiv.org/html/2606.07108#A5)
## Appendix A Further Discussion on Motivation
### A\.1 System 1 or System 2: Which Reasoning Mode Is Needed?
#### Summary\.
In this section, we examine whether reasoning\-oriented language models should uniformly rely on slow, deliberative System 2 reasoning, or instead adaptively switch between System 1–like and System 2–like reasoning modes according to problem difficulty\. Our validation study shows that explicit reasoning\-termination signals can substantially reduce token consumption, especially when injected during the reasoning process, but such compression also leads to notable accuracy degradation on challenging benchmarks such as AIME and Olympiad\. In contrast, simpler datasets such as GSM8K are much less affected, suggesting that easy problems often do not require extended deliberation, whereas hard problems depend critically on sustained reasoning\. These findings motivate difficulty\-adaptive inference: a reasoning model should estimate task difficulty either before or during generation, and dynamically allocate cognitive effort by using fast heuristic responses for simple instances while preserving deliberate reasoning for complex ones\.
The dual\-process theory of cognition distinguishes between fast, automatic System 1 processes and slow, deliberative System 2 reasoning \(Kahneman, [2011](https://arxiv.org/html/2606.07108#bib.bib6)\)\. Recent reasoning\-oriented language models draw inspiration from this framework by encouraging step\-by\-step deliberation through chain\-of\-thought supervision \(Wei et al\., [2022](https://arxiv.org/html/2606.07108#bib.bib7)\), thereby inducing behaviors characteristic of System 2 reasoning in large language models\.
Reasoning language models have achieved remarkable success in domains that require complex computation and multi\-step reasoning, such as mathematics and programming\. However, this paradigm also introduces new challenges, including overthinking \(Chen et al\., [2024](https://arxiv.org/html/2606.07108#bib.bib8)\), underthinking \(Wang et al\., [2025d](https://arxiv.org/html/2606.07108#bib.bib9)\), and reasoning drift\.
In essence, effective reasoning models should adaptively allocate cognitive effort: they should rely on System 1–like processing to produce fast and direct responses for simple problems, rather than repeatedly re\-evaluating trivial cases, while engaging System 2–like deliberation for complex problems to ensure correctness through careful and sustained reasoning\.
Motivated by the prevalence of overthinking, numerous studies have proposed methods to shorten the reasoning trajectories of reasoning\-oriented models\. Among these approaches, NoThinking \(Ma et al\., [2025](https://arxiv.org/html/2606.07108#bib.bib4)\) represents the most aggressive form of compression: it injects an explicit termination cue to halt the chain\-of\-thought prematurely, forcing a reasoning model to behave in a manner analogous to System 1 processing rather than System 2 deliberation\. While effective in reducing reasoning length, this strategy also incurs the largest accuracy degradation\. To systematically evaluate this trade\-off, we conduct a validation study comparing three settings: a standard baseline, NoThinking, and a variant \(NoThinking Variant\) that inserts the reasoning\-termination cue immediately after the first step of the model’s reasoning process\. The results are summarized in Table [7](https://arxiv.org/html/2606.07108#A1.T7)\.
We observe that both NoThinking and NoThinking Variant substantially reduce token consumption in reasoning models\. Notably, NoThinking Variant achieves a markedly stronger compression effect, reducing the average token usage by 64\.28% relative to the baseline\.
This result suggests that injecting an explicit reasoning\-termination signal during the reasoning process is more effective than introducing such a signal prior to the onset of reasoning, as it better preserves the model’s instruction\-following behavior while suppressing redundant deliberation\. At the same time, we observe a pronounced accuracy degradation on more challenging benchmarks, such as AIME and Olympiad, whereas performance on simpler datasets \(e\.g\., GSM8K\) remains largely unaffected\. This contrast suggests that complex reasoning problems critically rely on extended deliberative processes to maintain accuracy, while simpler problems can often be solved correctly with substantially reduced reasoning depth\.
These observations raise a fundamental question: can a reasoning model identify problem difficulty either before or during the reasoning process, and dynamically adapt its cognitive strategy accordingly, employing a fast, heuristic\-driven *System 1* mode for simpler problems while reserving more deliberate *System 2* reasoning for harder ones?
In response to the above question, we argue that a reasoning model does not need to commit to a single cognitive strategy throughout the entire inference process\. Instead, either *before* processing a problem or *during* reasoning, once the model judges the task to be sufficiently simple, it can directly switch to a fast, heuristic\-driven mode of reasoning\. As illustrated in Figure [6](https://arxiv.org/html/2606.07108#A1.F6), an explicit or implicit estimation of task difficulty enables *per\-question* adaptive inference, allowing the model to dynamically balance efficiency and accuracy by combining fast and slow thinking in a principled manner\.
Figure 6: Difficulty\-adaptive reasoning\. We illustrate the central hypothesis: a reasoning model may infer problem difficulty either *before* or *during* generation, and accordingly switch its cognitive mode, using a fast, heuristic *System 1* strategy for easy instances while allocating more deliberate *System 2* reasoning to hard ones\.

Table 7: Comparison of accuracy \(ACC\) and average token usage \(Tok\) across reasoning control strategies\. Reported $\Delta$ values indicate relative percentage changes with respect to the Baseline\. NoThinking inserts the explicit termination cue “Okay, I have finished thinking\.</think\>” at step 0, while NoThinking Variant inserts the same cue at step 1, allowing minimal initial deliberation before terminating the reasoning process\.
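The two termination strategies compared in Table 7 can be illustrated with a minimal string\-level sketch\. The cue text is taken from the table caption; the function names, the `<think>\n` opening tag, and the plain concatenation of prompt and reasoning are assumptions made for illustration and may differ from the chat template actually used\.

```python
TERMINATION_CUE = "Okay, I have finished thinking.</think>"
STEP_DELIMITER = "\n\n"


def nothinking_prefix(prompt: str) -> str:
    """Step-0 injection: close the thinking block before any reasoning is produced."""
    # Assumption: the thinking block is opened with "<think>\n".
    return f"{prompt}<think>\n{TERMINATION_CUE}"


def nothinking_variant_prefix(prompt: str, partial_reasoning: str) -> str:
    """Step-1 injection: keep only the first reasoning step, then append the cue."""
    # The first step is everything before the first "\n\n" delimiter.
    first_step = partial_reasoning.split(STEP_DELIMITER, 1)[0]
    return f"{prompt}<think>\n{first_step}{STEP_DELIMITER}{TERMINATION_CUE}"


if __name__ == "__main__":
    print(nothinking_prefix("Q: 1 + 1 = ?\n"))
    print(nothinking_variant_prefix("Q: 1 + 1 = ?\n", "Add the numbers.\n\nCheck again."))
```

In both cases the model would continue generating from the returned prefix, so the only difference between the two settings is whether one reasoning step is retained before the cue\.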
### A\.2 Who Decides Difficulty? A Model\-Centric Perspective
#### Summary\.
In this section, we argue that task difficulty should be understood from a model\-centric perspective rather than as a fixed, model\-agnostic property\. Since different models possess different capacities and reasoning abilities, the same problem may be difficult for a smaller model but easy for a stronger one\. Therefore, difficulty should be defined relative to the model’s own competence and internal uncertainty, and should be assessed dynamically during inference\. Building on this view, we investigate whether difficulty awareness emerges throughout the reasoning process rather than only before generation begins\. By segmenting model reasoning into discrete steps and analyzing step\-level hidden states on the MATH dataset, we find that difficulty\-related information is continuously encoded in the model’s internal representations across reasoning steps\. This suggests that models can maintain and update an intrinsic perception of task difficulty during reasoning, supporting the feasibility of dynamic, model\-aware difficulty estimation\.
A central question that follows is whether a model can meaningfully perceive problem difficulty\. We argue that difficulty assessment should be an intrinsic, model\-dependent process rather than being imposed by an external or universal discriminator\. Different models possess distinct capacities, inductive biases, and reasoning strengths; consequently, the same problem may require explicit multi\-step reasoning for a smaller model \(e\.g\., 1\.5B parameters\), while being solvable almost immediately by a larger or more capable one \(e\.g\., 32B parameters\)\. This heterogeneity implies that there is no single, model\-agnostic notion of difficulty\. Instead, difficulty should be understood as a relative concept, defined by the model’s own competence and internal uncertainty, and evaluated dynamically during inference\.

EPIC \(Nguyen et al\., [2025](https://arxiv.org/html/2606.07108#bib.bib11)\) employs a contrastive learning paradigm to select appropriate reasoning strategies for a given query\. Its learned mapper is able to separate hard and easy mathematical problems in the latent space, indicating that problem difficulty can be effectively encoded and distinguished at the representation level\. Sheng et al\. \([2025](https://arxiv.org/html/2606.07108#bib.bib10)\) further observe that the special indicator token <think\> at the onset of reasoning encodes the model’s internal perception of problem difficulty, suggesting that difficulty awareness is already present before or at the early stages of the reasoning process\. However, these studies assess task difficulty either *before* the model begins reasoning or at the very early stages of the reasoning process\. In contrast, human difficulty assessment is inherently *dynamic* and unfolds during reasoning: a problem that initially appears difficult may become easier as reasoning progresses, while a seemingly simple problem may later reveal unexpected complexity\.
This observation motivates a central question of our work: *does a model’s assessment of task difficulty also emerge and evolve during the reasoning process itself?*
Following prior work, we segment the reasoning process of a large language model into a sequence of discrete reasoning steps, each separated by the delimiter \\n\\n\. Formally, we denote the resulting sequence of reasoning steps as

$$\mathcal{S}=\{S_{0},S_{1},S_{2},\ldots,S_{n}\}, \tag{13}$$

where each $S_{s}$ represents the model’s intermediate reasoning state at step $s$\. The final answer is generated after completing the last reasoning step $S_{n}$\.
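The step segmentation described above amounts to splitting the generated reasoning text on blank lines\. A minimal sketch follows; the function name is hypothetical, and dropping empty chunks is an assumption, since the handling of consecutive delimiters is not specified here\.

```python
def split_reasoning_steps(reasoning: str, delimiter: str = "\n\n") -> list[str]:
    """Return the reasoning steps S_0, ..., S_n obtained by splitting on the delimiter."""
    # Assumption: chunks that are empty after stripping whitespace are not counted as steps.
    return [chunk.strip() for chunk in reasoning.split(delimiter) if chunk.strip()]


if __name__ == "__main__":
    trace = "Let x be the unknown.\n\nThen 2x = 10.\n\nSo x = 5."
    for s, step in enumerate(split_reasoning_steps(trace)):
        print(f"S_{s}: {step}")
```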
To investigate whether a model exhibits an awareness of task difficulty during the reasoning process, we conduct experiments on the MATH dataset \(Hendrycks et al\., [2021](https://arxiv.org/html/2606.07108#bib.bib12)\), which provides discrete difficulty annotations ranging from Level 1 to Level 5\. Following our step\-based formulation, we associate the hidden state extracted at each reasoning step \(segmented by the delimiter \\n\\n\) with the corresponding difficulty level of the problem\.

Formally, let $\ell\in\{1,2,3,4,5\}$ denote the ground\-truth difficulty level of a given problem, and let

$$\mathbf{h}_{s}\in\mathbb{R}^{d} \tag{14}$$

represent the hidden state at reasoning step $S_{s}$\. We analyze the relationship between $\mathbf{h}_{s}$ and $\ell$ across different reasoning steps $s$\.
As illustrated in Figure [7](https://arxiv.org/html/2606.07108#A1.F7), we find that difficulty\-related information is not only encoded before or at the very beginning of the reasoning process, but is instead continuously embedded in the model’s hidden representations throughout reasoning\.
Figure 7: t\-SNE visualization of hidden representations colored by difficulty level\. From left to right, each panel shows the t\-SNE projection of the hidden states extracted at the first, second, and third reasoning steps \(defined by the delimiter \\n\\n\), respectively\. Colors indicate the ground\-truth difficulty level \(Level 1–Level 5\)\. We observe that such difficulty information is continuously encoded throughout reasoning, and this phenomenon consistently holds across different model families\.
### A\.3 Generation Length as a Generalizable Difficulty Indicator
#### Summary\.
In this section, we show that remaining generation length provides a continuous, fine\-grained, and generalizable indicator of the model’s perceived task difficulty\. Unlike discrete difficulty annotations, which are often unavailable outside specific benchmarks, reasoning length naturally reflects the amount of computation a model allocates to solving a problem\. Through hidden state visualizations, we observe that representations associated with shorter remaining lengths largely align with low\-difficulty regions, while harder instances requiring longer reasoning trajectories occupy distinct regions of the representation space\. This structure remains stable as diverse mathematical datasets and the non\-mathematical GPQA benchmark are progressively incorporated, suggesting that remaining\-length encoding captures a robust, task\-agnostic difficulty signal rather than a dataset\-specific artifact\. Building on this property, we further demonstrate that a simple unsupervised difficulty classifier trained from this signal can produce intuitive difficulty distributions across datasets, supporting its potential use for difficulty estimation, dataset characterization, and future curriculum design\.
Given that the model exhibits a continuous perception of difficulty throughout the reasoning process, a natural question is whether this perceived difficulty evolves along the reasoning trajectory\. From a latent\-variable perspective, for challenging problems, the inferred difficulty may decrease as intermediate reasoning states accumulate sufficient evidence toward a solution, whereas for simpler problems, the difficulty may be assessed as low from the outset\. Such a trajectory\-dependent, continuous notion of difficulty offers a more principled and expressive representation than discrete multi\-class formulations\.
This naturally leads to a new question: how can one construct a continuous metric that reflects the model’s perceived task difficulty? While explicit difficulty annotations are available for certain mathematical benchmarks, such labels are absent in most real\-world datasets, posing a challenge for generalization\. A practical and broadly applicable solution is to use the model’s reasoning length as a proxy for difficulty\. Reasoning length is a continuous variable, and for problems of a similar type, more difficult instances tend to induce longer reasoning trajectories, whereas easier instances require substantially fewer reasoning steps\. For example, on relatively simple benchmarks such as GSM8K, the average reasoning length is around 1,000 tokens, whereas on more challenging benchmarks such as AIME, it approaches 20,000 tokens\.
As shown in Fig\.[8](https://arxiv.org/html/2606.07108#A1.F8), we color the hidden states of Qwen3\-4B\-Thinking\-2507 using two different criteria: difficulty level and remaining generation length\. We observe that the model not only continuously encodes difficulty\-related information, but also captures signals associated with the remaining generation length\. Notably, regions corresponding to low\-difficulty instances exhibit a substantial overlap with those associated with shorter remaining generation lengths, which aligns well with intuitive expectations\. This observation suggests that remaining generation length can serve as a more continuous and fine\-grained proxy for difficulty awareness, providing a principled and effective signal for modeling task difficulty\.
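To make the notion of remaining generation length concrete, the sketch below computes it at every reasoning break from per\-step token counts\. It assumes that the remaining length at step $s$ is the number of tokens generated after step $s$ ends; whether the final answer tokens are included is not stated here, so the function name and this convention are illustrative only\.

```python
def remaining_lengths(step_token_counts: list[int]) -> list[int]:
    """For each step s, return the number of tokens generated after step s ends."""
    total = sum(step_token_counts)
    remaining = []
    consumed = 0
    for count in step_token_counts:
        consumed += count
        # Tokens still to be generated once this step is complete.
        remaining.append(total - consumed)
    return remaining


if __name__ == "__main__":
    # Three reasoning steps of 10, 20, and 30 tokens.
    print(remaining_lengths([10, 20, 30]))
```

Under this convention an easy problem yields small values from the first break onward, whereas a hard problem starts with a large value that shrinks as reasoning progresses\.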
Figure 8: Visualization of Layer\-28 hidden states at the first reasoning break for Qwen3\-4B\-Thinking\-2507 on Math\-500\. Left: colored by difficulty level; Right: colored by remaining generation length \(tokens\)\.

As shown in Fig\. [9](https://arxiv.org/html/2606.07108#A1.F9), we investigate whether this property is specific to Math\-500 or persists under broader data distributions\. We find that as datasets are progressively and cumulatively incorporated, the model’s ability to encode remaining generation length remains consistently observable across all mathematical benchmarks\.
In particular, difficult instances from Olympiad and AIME2025 concentrate in the same region of the representation space, while easier instances from GSM8K and Math\-500 cluster in a distinct and aligned region, exhibiting a clear directional separation\.
Furthermore, we extend this analysis to a non\-mathematical benchmark, GPQA, and observe that difficult GPQA instances are mapped to the same region as difficult Olympiad problems\. These results indicate that the encoding of remaining generation length reflects a persistent, task\-agnostic property of the model, rather than a dataset\-specific artifact, thereby providing a principled foundation for the strong generalization capability of our method\.
Figure 9: Progressive generalization of remaining\-length encoding across cumulatively added datasets\. Hidden states of Qwen3\-4B\-Thinking\-2507 at the first reasoning break \(Layer 28\) are colored by remaining generation length\. \(a\) Math\-500; \(b\) Math\-500 \+ GSM8K; \(c\) Math\-500 \+ GSM8K \+ Olympiad; \(d\) Math\-500 \+ GSM8K \+ Olympiad \+ AIME2025; \(e\) Math\-500 \+ GSM8K \+ Olympiad \+ AIME2025 \+ AMC23; \(f\) Math\-500 \+ GSM8K \+ Olympiad \+ AIME2025 \+ AMC23 \+ GPQA\. The overall geometric structure remains stable as additional datasets are incorporated, indicating that remaining generation length captures a robust and highly transferable signal\.

Building on this representational property, we show that it is possible to train a difficulty classifier in an unsupervised manner, without relying on any explicit difficulty annotations, and to assign difficulty labels to large\-scale datasets\. Specifically, we fit a simple binary logistic regression classifier on mathematical benchmarks and find that it can effectively annotate difficulty information across diverse datasets\.
As illustrated in Fig\. [10](https://arxiv.org/html/2606.07108#A1.F10), the resulting difficulty distributions are highly intuitive\. Math\-500, which is designed to be difficulty\-balanced, yields an approximately balanced distribution between easy and hard instances\. In contrast, GSM8K, a relatively simple benchmark, is dominated by instances classified as easy, while Olympiad, a substantially more challenging benchmark, exhibits a distribution heavily skewed toward the hard class\. This property suggests that the learned difficulty signal can be leveraged to support large\-scale dataset characterization, and may further serve as a useful signal for model pretraining or curriculum design in future work\.
Figure 10: Unsupervised difficulty classification results across datasets using a logistic regression classifier\.
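One way such an annotation\-free classifier could be set up is sketched below\. The text does not state how training labels are derived, so the sketch assumes pseudo\-labels from a median split on remaining generation length; the hidden states are synthetic, and the plain gradient\-descent fit and all function names are illustrative rather than the procedure used in our experiments\.

```python
import numpy as np


def fit_logistic(H, labels, lr=0.1, epochs=500):
    """Fit binary logistic regression with batch gradient descent."""
    H = np.asarray(H, dtype=float)
    y = np.asarray(labels, dtype=float)
    w = np.zeros(H.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(H @ w + b)))  # predicted P(hard)
        grad = p - y                            # gradient of the log-loss w.r.t. the logits
        w -= lr * (H.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict_hard(H, w, b):
    """Return 1 for instances classified as hard, 0 for easy."""
    return (np.asarray(H, dtype=float) @ w + b > 0).astype(int)


# Synthetic stand-in for hidden states: remaining length depends on one latent direction.
rng = np.random.default_rng(0)
direction = rng.normal(size=16)
direction /= np.linalg.norm(direction)
H = rng.normal(size=(200, 16))
remaining = np.exp(3.0 * (H @ direction) + 7.0)

# Assumption: pseudo-labels come from a median split on remaining length (1 = hard).
labels = (remaining > np.median(remaining)).astype(int)

w, b = fit_logistic(H, labels)
accuracy = (predict_hard(H, w, b) == labels).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Applying `predict_hard` to hidden states from a new dataset and counting the two classes would yield a per\-dataset difficulty distribution of the kind shown in Figure 10\.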
### A\.4 Fitting a Regressor for Continuous Difficulty Estimation
#### Summary\.
In this section, we fit a lightweight regressor to estimate continuous task difficulty from hidden\-state representations using the model’s remaining generation length as supervision\. To obtain a stable regression target, we apply a logarithmic transformation followed by min–max normalization, which effectively mitigates the heavy right\-tailed distribution of generation lengths\. We then train Ridge regressors on step\-level hidden states and find that difficulty\-related signals become stronger in intermediate\-to\-late layers across multiple reasoning models\. Empirically, the log\-transformed and normalized target consistently outperforms raw or directly normalized remaining length, with the best models achieving strong validation performance\. Finally, we introduce an automatic validation\-based procedure for selecting both the optimal hidden layer and regularization strength, ensuring robust regressor fitting without manual tuning or test\-set leakage\.
Building on the properties of large language models discussed above, we fit a regressor to predict a continuous difficulty-related signal. Our regression target is derived from the model's remaining generation length. Specifically, given the raw remaining length $y$, we first apply a logarithmic transformation followed by min–max normalization:

$$\tilde{y}=\frac{\log(1+y)-y_{\min}}{y_{\max}-y_{\min}},\quad y_{\min}\triangleq\min_{i}\log(1+y_{i}),\; y_{\max}\triangleq\max_{i}\log(1+y_{i}), \tag{15}$$

where $y_{\min}$ and $y_{\max}$ denote the minimum and maximum log-transformed remaining lengths observed in the dataset, respectively.
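As a minimal sketch of the target construction in Eq. (15), the snippet below fits the two log-space bounds on a small list of remaining lengths and normalizes each value. The function names and the example lengths are illustrative assumptions, not taken from the released code.

```python
import math

def fit_log_minmax(lengths):
    """Return (y_min, y_max): the min and max of log(1 + y) over the fitting set."""
    logs = [math.log1p(y) for y in lengths]
    return min(logs), max(logs)

def normalize(y, y_min, y_max):
    """Map a raw remaining length y to the normalized target of Eq. (15)."""
    return (math.log1p(y) - y_min) / (y_max - y_min)

# Hypothetical remaining lengths (in tokens) with a heavy right tail.
lengths = [120, 450, 900, 3200, 28000]
y_min, y_max = fit_log_minmax(lengths)
targets = [normalize(y, y_min, y_max) for y in lengths]
```

By construction, the shortest length maps to 0 and the longest to 1, and the log transform keeps the single very long output from dominating the target range.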
This transformation is motivated by the empirical distribution of generation lengths, which is strongly right-skewed and contains a small number of extremely long outputs. Applying a logarithmic transform effectively compresses the tail of the distribution, yielding a more stable and well-conditioned regression target. We fit a Ridge regression model to predict the normalized remaining-length signal from hidden-state representations. Let $\mathbf{h}_{i}\in\mathbb{R}^{d}$ denote the hidden state extracted from the selected layer at the first reasoning break for sample $i$, and let $\tilde{y}_{i}$ be the corresponding regression target defined in Eq. (15). The Ridge regressor is trained by minimizing the following objective:
$$\mathbf{w}^{\ast}=\arg\min_{\mathbf{w}}\sum_{i=1}^{N}\left(\mathbf{w}^{\top}\mathbf{h}_{i}-\tilde{y}_{i}\right)^{2}+\lambda\lVert\mathbf{w}\rVert_{2}^{2}, \tag{16}$$

where $\mathbf{w}\in\mathbb{R}^{d}$ denotes the regression weights and $\lambda$ is the regularization coefficient. The $\ell_{2}$ regularization term mitigates overfitting and stabilizes training when the hidden representations are high-dimensional and potentially correlated.
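The objective in Eq. (16) has a closed-form minimizer, $\mathbf{w}^{\ast}=(\mathbf{H}^{\top}\mathbf{H}+\lambda\mathbf{I})^{-1}\mathbf{H}^{\top}\tilde{\mathbf{y}}$, where the rows of $\mathbf{H}$ are the hidden states. The sketch below illustrates this with NumPy on synthetic data. The random features stand in for real hidden states, and `fit_ridge` is a hypothetical helper rather than the authors' implementation.

```python
import numpy as np

def fit_ridge(H, y, lam):
    """Closed-form minimizer of Eq. (16) (no intercept term, as in the equation)."""
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ y)

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 16))        # synthetic stand-in for hidden states
w_true = rng.normal(size=16)          # synthetic ground-truth weights
y = H @ w_true + 0.01 * rng.normal(size=200)

w_small = fit_ridge(H, y, lam=1e-3)   # weak regularization
w_large = fit_ridge(H, y, lam=1e3)    # strong regularization shrinks the weights
```

A larger `lam` shrinks the weight norm, which is the stabilizing effect described above for high-dimensional, correlated features.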
As shown in Fig\.[11](https://arxiv.org/html/2606.07108#A1.F11), we fit a regressor to predict the remaining length from the hidden states at each layer\. We observe that difficulty\-related signals are progressively strengthened with increasing depth, and reach their maximum at intermediate\-to\-late layers\.
Figure 11: Layer-wise validation $R^{2}$ of the remaining-length regressor across different models. Panels (a)–(c) correspond to DeepSeek-R1-Distill-Qwen-7B, QwQ-32B, and Qwen3-14B, respectively. For each layer, the best ridge regularization strength is selected based on validation performance.

As shown in Table [8](https://arxiv.org/html/2606.07108#A1.T8), we report detailed regression performance under different target normalization strategies. We find that applying a logarithmic transformation followed by min–max normalization consistently yields the best performance, which is consistent with the heavy right-tailed distribution of model-generated remaining lengths. The best models achieve an $R^{2}$ of approximately 0.8, indicating that difficulty-related signals are strongly encoded in the hidden representations. This provides a stable and reliable signal source for downstream difficulty-aware control.
Table 8: Summary of the remaining-length regressor across models. For each model, we report the best-performing layer, the target normalization range, and the corresponding regression performance. Normalization options include Log1p + Min–Max, Min–Max, and Raw (no transform).
#### Automatic layer and hyperparameter selection\.
To avoid manual tuning, we automatically select both the hidden layer and the regressor hyperparameter via a validation\-based grid search\. Specifically, we first identify the set of*common layers*that are available across all sampled trajectories, ensuring a consistent hidden dimensionality\. For each candidate layer, we extract the corresponding hidden states and train a Ridge regression model to predict the \(optionally normalized\) remaining length\.
We perform a grid search over the Cartesian product of candidate layers and Ridge regularization strengths, and evaluate each configuration on a held-out validation split. The best configuration is selected by maximizing the validation $R^{2}$, with validation MAE used as a tie-breaker. After selecting the optimal layer and regularization strength, we refit the regressor on the combined training and validation set and report performance on a held-out test set. This procedure enables data-driven selection of both the representational layer and model capacity, while preventing test-set leakage.
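The selection procedure above can be sketched as a small grid search. This is an illustrative NumPy version under stated assumptions: hidden states are supplied as dictionaries mapping a layer index to an `(N, d)` array, the candidate grid is made up, and all helper names are hypothetical. The final refit on train plus validation and the test-set evaluation are omitted for brevity.

```python
import numpy as np

def fit_ridge(H, y, lam):
    """Closed-form ridge solution without an intercept."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

def r2_and_mae(H, y, w):
    """Validation R^2 and mean absolute error of the linear predictor."""
    pred = H @ w
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot, np.mean(np.abs(y - pred))

def select_layer_and_lambda(train, val, y_train, y_val, lambdas):
    """Grid search over layers x lambdas; maximize R^2, break ties by lower MAE."""
    best_score, best_cfg = None, None
    # Simplification: only layers present in both splits are considered.
    for layer in sorted(set(train) & set(val)):
        for lam in lambdas:
            w = fit_ridge(train[layer], y_train, lam)
            r2, mae = r2_and_mae(val[layer], y_val, w)
            score = (r2, -mae)
            if best_score is None or score > best_score:
                best_score, best_cfg = score, (layer, lam)
    return best_cfg

# Synthetic data: layer 0 is pure noise, layer 1 carries the signal.
rng = np.random.default_rng(0)
signal = rng.normal(size=(300, 8))
y = signal @ rng.normal(size=8)
layers = {0: rng.normal(size=(300, 8)),
          1: signal + 0.05 * rng.normal(size=(300, 8))}
train = {k: v[:200] for k, v in layers.items()}
val = {k: v[200:] for k, v in layers.items()}
best_layer, best_lam = select_layer_and_lambda(
    train, val, y[:200], y[200:], lambdas=[0.1, 1.0, 10.0])
```

Because only the validation split drives the choice of layer and regularization strength, the held-out test set stays untouched until the final report.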
### A\.5Trend Analysis of Difficulty Estimation
#### Summary\.
In this section, we show that the difficulty estimator learned from MATH transfers well to unseen reasoning benchmarks and preserves meaningful dataset\-level difficulty trends\. Across different base models, the regressor consistently assigns lower difficulty scores to simpler benchmarks such as GSM8K, moderate scores to Math\-500, and higher scores to AIME\- and Olympiad\-style benchmarks\. This alignment with ground\-truth difficulty indicates that the regressor captures generalizable temporal patterns of reasoning difficulty rather than merely memorizing the training distribution, further supporting its reliability as the difficulty signal used by DyCon\.
In this section, we further examine whether the learned difficulty estimator captures meaningful difficulty trends across benchmarks\. While the main experiments demonstrate that DyCon can improve reasoning efficiency and accuracy, it is also important to verify that the underlying difficulty regressor produces reasonable estimates beyond the training distribution\. To this end, we fit the regressor on the MATH dataset and evaluate it on multiple benchmarks with different levels of reasoning complexity\. The results show that the predicted difficulty scores are closely aligned with the corresponding ground\-truth difficulty scores, and that the regressor can effectively recover the expected dataset\-level difficulty ordering\.
The difficulty estimator in DyCon is designed to provide an online estimate of the current reasoning difficulty during generation\. Therefore, a desirable property is that its predictions should not only be accurate at the instance level, but should also preserve meaningful aggregate trends across datasets\. In particular, benchmarks that typically require longer and more complex reasoning, such as AIME and Olympiad\-style problems, are expected to receive higher difficulty scores, while relatively simpler grade\-school arithmetic problems, such as GSM8K, should receive lower scores\.
To analyze this property, we train the regressor on the MATH dataset and then evaluate it on five benchmarks: Math\-500, AIME24, AIME25, GSM8K, and Olympiad\. These benchmarks cover a broad spectrum of mathematical reasoning difficulty\. For each benchmark, we compute the mean predicted difficulty score produced by the regressor and compare it with the corresponding ground\-truth difficulty score\. Higher scores indicate higher estimated reasoning difficulty\.
As shown in Table[9](https://arxiv.org/html/2606.07108#A1.T9), the regressor produces predictions that are highly consistent with the ground\-truth difficulty values across different base models\. For Qwen3\-4B\-Thinking\-2507, the predicted scores almost exactly match the ground\-truth scores on Math\-500 and AIME24, and remain very close on AIME25, GSM8K, and Olympiad\. Similar patterns can also be observed for DeepSeek\-R1\-Distill\-Qwen\-7B, Qwen3\-14B, and QwQ\-32B\. Across all models, GSM8K consistently receives the lowest difficulty score, Math\-500 receives a moderate score, and AIME/Olympiad benchmarks receive substantially higher scores\.
This trend is important because it suggests that the regressor is not merely memorizing superficial properties of the training set\. Instead, it generalizes to unseen benchmarks and preserves the relative difficulty structure among datasets\. In particular, the estimated ordering broadly follows the expected pattern: Olympiad and AIME\-style benchmarks are more difficult than Math\-500, while GSM8K is the easiest among the evaluated datasets\. The close agreement between regressor predictions and ground\-truth scores further supports the reliability of the learned difficulty estimator used by DyCon\.
Table 9: Trend analysis of difficulty estimation across benchmarks. We report the mean regressor-predicted difficulty scores and the corresponding ground-truth difficulty scores. Higher values indicate higher estimated reasoning difficulty.

Overall, these results provide additional evidence that the learned difficulty estimator captures meaningful reasoning difficulty rather than dataset-specific artifacts. The estimator can recover both fine-grained numerical scores and coarse-grained benchmark-level trends, which makes it suitable for dynamically controlling the reasoning behavior of DyCon during inference.
### A\.6Domain Generalizability of Difficulty Estimation
#### Summary\.
In this section, we evaluate whether the difficulty estimator learned from mathematical reasoning can generalize to non\-math domains\. We find that a regressor fitted only on MATH already transfers reasonably well to CommonsenseQA and GPQA, indicating that step\-level hidden representations encode difficulty\-related signals beyond mathematical tasks\. However, its calibration degrades on domains with substantially different interaction patterns, such as MultiChallenge\. To improve robustness, we refit the regressor on a balanced mixture of MATH, CommonsenseQA, GPQA, and MultiChallenge\. The refitted estimator achieves much closer alignment with ground\-truth difficulty across both math and non\-math benchmarks, while also generalizing to the unseen C4 domain\. These results suggest that diverse\-domain fitting strengthens the calibration and transferability of difficulty estimation, supporting DyCon as a general difficulty\-aware control mechanism rather than a math\-specific method\.
The difficulty estimator in DyCon is designed to estimate the model’s reasoning difficulty during generation\. Since our main experiments fit the regressor on mathematical reasoning data, it is important to test whether the learned signal is specific to math problems or transferable to other reasoning domains\. Non\-math benchmarks introduce different forms of difficulty: CommonsenseQA relies more on implicit world knowledge, GPQA requires expert\-level scientific reasoning, and MultiChallenge evaluates realistic multi\-turn conversation abilities such as context tracking and instruction following\.
To evaluate this transferability, we apply the MATH\-fitted regressor to CommonsenseQA, GPQA, and MultiChallenge using Qwen3\-4B\-Thinking\-2507 as the base model\. Table[10](https://arxiv.org/html/2606.07108#A1.T10)reports the mean regressor\-predicted difficulty scores and the corresponding ground\-truth difficulty scores\. The regressor provides reasonably aligned estimates on CommonsenseQA and GPQA, suggesting that the difficulty signal learned from math data contains transferable information\. However, the gap becomes larger on MultiChallenge, where the predicted difficulty is noticeably higher than the ground\-truth value\. This indicates that single\-domain fitting can generalize to some extent, but may not fully capture difficulty patterns in domains with very different interaction structures\.
Table 10: Out-of-domain evaluation of the difficulty regressor fitted on the MATH dataset using Qwen3-4B-Thinking-2507. We report the mean regressor-predicted difficulty scores and the corresponding ground-truth difficulty scores. Higher values indicate higher estimated reasoning difficulty.

The results in Table [10](https://arxiv.org/html/2606.07108#A1.T10) motivate a more diverse fitting strategy. If different domains express reasoning difficulty through different generation behaviors, then exposing the regressor to multiple reasoning distributions should improve its calibration. Therefore, we refit the regressor on a balanced dataset spanning MATH, CommonsenseQA, GPQA, and MultiChallenge. The refitting split is separated from the final evaluation split, so the reported results are not obtained by evaluating on the same examples used for fitting.
Table [11](https://arxiv.org/html/2606.07108#A1.T11) presents the results of the refitted regressor. Compared with the MATH-fitted setting in Table [10](https://arxiv.org/html/2606.07108#A1.T10), the refitted regressor achieves much closer alignment with ground-truth difficulty scores on CommonsenseQA, GPQA, and MultiChallenge. Importantly, this improvement does not come at the cost of performance on mathematical benchmarks. The refitted regressor remains well aligned with the ground-truth scores on Math-500, AIME2024, AIME2025, AMC23, GSM8K, and Olympiad. It also generalizes well to C4 (Raffel et al., [2020](https://arxiv.org/html/2606.07108#bib.bib62)), which is not included in the refitting mixture, suggesting that diverse fitting can improve the robustness of difficulty estimation beyond the training domains.
Table 11: Domain generalization of the refitted difficulty regressor using Qwen3-4B-Thinking-2507. The regressor is refitted on a balanced mixture of MATH, CommonsenseQA, GPQA, and MultiChallenge, and evaluated across both math and non-math benchmarks. We report the mean regressor-predicted difficulty scores and the corresponding ground-truth difficulty scores. Higher values indicate higher estimated reasoning difficulty.

Overall, the two experiments lead to complementary conclusions. First, the MATH-fitted regressor already transfers reasonably well to several non-math domains, showing that step-level representations contain difficulty-related signals beyond mathematical reasoning. Second, fitting the regressor on diverse reasoning distributions further improves its calibration and robustness, especially for domains whose reasoning patterns differ from math. These results support the use of DyCon as a general difficulty-aware control mechanism rather than a method restricted to mathematical benchmarks.
### A\.7Recovering Token\-Space Performance and Cross\-Distribution Generalization
#### Summary\.
In this section, we further evaluate whether the remaining\-length regressor preserves its predictive utility after mapping normalized predictions back to the original token space\. The results show that the regressor can estimate token\-level remaining length with relatively small errors on simpler datasets such as GSM8K, while larger absolute errors arise on harder benchmarks such as AIME2025 and Olympiad due to both the intrinsic uncertainty of difficult reasoning and the compression effect of the logarithmic transformation\. Nevertheless, the regressor still captures the coarse scale of reasoning effort and becomes increasingly accurate in the later stages of difficult problems, supporting the view that hard instances gradually transition into easier regimes as reasoning progresses\.
A key question is whether the strong performance observed in the transformed target space can be retained after mapping predictions back to the original token space. Given the estimated $y_{\min}$ and $y_{\max}$, we invert the normalization to recover predictions in terms of the original remaining-token counts:

$$\hat{y}=\exp\!\Big(\hat{\tilde{y}}\,(y_{\max}-y_{\min})+y_{\min}\Big)-1. \tag{17}$$
We then evaluate the regressor trained on MATH on other mathematical datasets with different distributions to assess its cross-distribution generalization. As shown in Table [12](https://arxiv.org/html/2606.07108#A1.T12), we evaluate the regressor in the original token space. For simpler datasets such as GSM8K, the prediction error is on the order of $\sim$100 tokens, indicating a high level of accuracy. In contrast, for more challenging datasets such as AIME2025 and Olympiad, the prediction error increases, although the regressor still captures the coarse scale of the remaining length. From a modeling perspective, this behavior is partly attributable to the logarithmic transformation, which naturally compresses large values and thus leads to larger absolute errors for long outputs after inverse transformation. From an intuitive perspective, this trend is also expected: for simple problems, the model can reliably anticipate how many tokens are required, whereas for difficult problems, the model primarily recognizes that the problem is hard, but cannot precisely predict the exact number of tokens needed to reach a solution.
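The compression effect mentioned above can be illustrated with a short sketch of the inverse mapping in Eq. (17). The log-space bounds and the error size `eps` below are hypothetical values chosen only to show that the same error in normalized space becomes a much larger token-space error at long remaining lengths.

```python
import math

def denormalize(y_tilde_hat, y_min, y_max):
    """Eq. (17): map a normalized prediction back to a remaining-token count."""
    return math.exp(y_tilde_hat * (y_max - y_min) + y_min) - 1.0

# Hypothetical bounds: remaining lengths between 50 and 30000 tokens.
y_min, y_max = math.log1p(50), math.log1p(30000)

eps = 0.02  # identical normalized-space error at two operating points
short_err = denormalize(0.2 + eps, y_min, y_max) - denormalize(0.2, y_min, y_max)
long_err = denormalize(0.9 + eps, y_min, y_max) - denormalize(0.9, y_min, y_max)
```

Here `long_err` exceeds `short_err` because the exponential in the inverse map amplifies errors in proportion to the predicted length.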
Table 12: Token-space evaluation of the remaining-length regressor for Qwen3-4B-Thinking-2507 across mathematical datasets. We report the average ground-truth remaining length, the average prediction, and the corresponding absolute and percentage errors (lower is better).

As shown in Figure [12](https://arxiv.org/html/2606.07108#A1.F12), we randomly sample one instance from each dataset for visualization. We observe that the regressor can continuously track the problem difficulty and provide relatively accurate token-length predictions, particularly for simple problems. For challenging problems, the predictions are less precise, consistent with our earlier quantitative results. However, in the later stages of difficult problem solving, the regressor becomes increasingly accurate, indicating that once a difficult problem enters its closing phase, it effectively transitions into an easier regime. This phenomenon further corroborates our earlier hypothesis that hard problems tend to become easy in the later stages of reasoning.
Figure 12:Token\-space visualization of remaining\-length prediction for Qwen3\-4B\-Thinking\-2507 across datasets\.
### A\.8From Instruct\-Style to Reasoning\-Style: Difficulty\-Adaptive Generation via a Regressor
#### Summary\.
In this section, we introduce a regressor-guided adaptive termination mechanism that converts the estimated reasoning difficulty into a continuous logit bias on the </think\> token. This design selectively increases the probability of terminating reasoning when the regressor predicts low necessity for continued deliberation, while leaving the rest of the token distribution unchanged. As a result, the model can adapt its generation behavior according to task difficulty: on easier benchmarks such as GSM8K, it shifts toward instruct-style fast generation with substantially reduced token usage, whereas on harder benchmarks such as AIME2024, it largely preserves reasoning-style deliberation and maintains accuracy. These results suggest that difficulty-aware control enables reasoning models to dynamically switch between System 1–like and System 2–like behaviors. However, aggressive termination may amplify regressor errors or the model's intrinsic underthinking, causing premature stopping and accuracy degradation. Therefore, we adopt a soft control strategy in the main paper to balance efficiency gains with reasoning reliability.
Leveraging the regressor’s ability to estimate task difficulty, we model the decision of whether to terminate reasoning as a continuous, difficulty\-aware logit bias applied to the reasoning termination token:
$$\Delta\ell_{\langle/\text{think}\rangle}(t)=\lambda\cdot f\!\left(1-r(h_{t})\right),\qquad\lambda\geq 0. \tag{18}$$

Here, $h_{t}\in\mathbb{R}^{d}$ denotes the hidden state at the $t$-th reasoning checkpoint, and $r(h_{t})\in[0,1]$ is the regressor's prediction of the necessity to continue reasoning. The monotonic mapping $f(\cdot):[0,1]\rightarrow[0,1]$ transforms the regressor output into a normalized control signal, while $\lambda$ controls the maximum strength of the termination bias. The resulting $\Delta\ell_{\langle/\text{think}\rangle}(t)$ is added to the logit of the </think\> token at the next generation step.
During decoding, the model’s conditional distribution is modified by injecting the difficulty\-aware logit bias into the reasoning termination token:
$$p(y_{t}\mid y_{<t})=\mathrm{Softmax}\!\Big(\ell_{t}+\Delta\ell_{\langle/\text{think}\rangle}(t)\,e_{\langle/\text{think}\rangle}\Big), \tag{19}$$

where $\ell_{t}$ denotes the original pre-softmax logits at step $t$, and $e_{\langle/\text{think}\rangle}$ is a one-hot vector with value 1 at the position corresponding to the </think\> token and 0 elsewhere. This formulation ensures that the difficulty-aware control signal directly modifies only the logit of the termination token, while leaving the logits of all other tokens unchanged (their probabilities are only rescaled by the softmax normalization).
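A minimal sketch of Eqs. (18) and (19) on a toy vocabulary is given below. The identity choice for $f$, the token index `END`, and the numeric values are assumptions made for illustration. The text only requires $f$ to be monotonic on $[0,1]$.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def biased_distribution(logits, think_end_id, r_t, lam, f=lambda u: u):
    """Add lam * f(1 - r_t) to the </think> logit only, then apply softmax.

    f defaults to the identity, which is an assumed choice for this sketch.
    """
    shifted = logits.copy()
    shifted[think_end_id] += lam * f(1.0 - r_t)
    return softmax(shifted)

logits = np.array([2.0, 1.0, 0.5, -1.0])  # toy vocabulary of four tokens
END = 3                                    # hypothetical index of </think>

p_base = softmax(logits)
p_easy = biased_distribution(logits, END, r_t=0.10, lam=4.0)  # low need to continue
p_hard = biased_distribution(logits, END, r_t=0.95, lam=4.0)  # high need to continue
```

When the regressor predicts little need to continue (`r_t` near 0), the termination probability rises sharply. For `r_t` near 1 the distribution stays close to the base model. In both cases the odds between any two non-termination tokens are the same as in `p_base`.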
Table 13: Adaptive behavior switching on Qwen3-4B-Thinking-2507. On GSM8K, our method induces instruct-style generation with a large reduction in token usage, while on AIME2024 it preserves reasoning-style behavior.

As shown in Table [13](https://arxiv.org/html/2606.07108#A1.T13), the model adaptively adjusts its generation behavior based on the regressor-assisted estimation of task difficulty. On easier benchmarks, the model exhibits instruct-style behavior, while on more challenging benchmarks it preserves reasoning-style generation. This emergent behavior is particularly encouraging, as it suggests that the reasoning model acquires the ability to autonomously switch between *System 1* (fast, shallow) and *System 2* (slow, deliberate) modes of reasoning.
However, we observe that under any form of hard or aggressive control, the model tends to suffer a non\-negligible accuracy drop on easy datasets\. For this reason, we adopt a*soft*control mechanism in the main paper\. We attribute the observed performance degradation to multiple factors\. First, the regressor is not perfectly accurate and may occasionally misclassify hard or medium\-difficulty problems as easy, leading to premature termination of reasoning and consequent accuracy loss\. Second, the base model itself may exhibit an*underthinking*phenomenon, where an instance is initially judged as easy due to insufficient early deliberation or overconfidence\. In such cases, the regressor may further amplify this underthinking behavior, exacerbating premature stopping and increasing the likelihood of errors\.
### A\.9Temporal Dynamics and Non\-Stationarity of Difficulty Signals
#### Summary\.
In this section, we analyze the temporal dynamics of the regressor\-predicted difficulty signal during reasoning\. Rather than treating difficulty as a static property, we examine whether the estimated necessity of continued reasoning evolves across reasoning steps\. Through ADF and KPSS stationarity tests, we find that the predicted difficulty signals are predominantly first\-order non\-stationary across multiple reasoning models, indicating the presence of systematic temporal trends rather than stationary fluctuations around a fixed mean\. This provides mechanistic evidence that the model’s perceived difficulty is dynamically updated throughout the reasoning process, supporting our view that difficulty\-aware control should operate as an online, trajectory\-dependent mechanism rather than a one\-shot static decision\.
Beyond aggregate performance metrics, we seek to understand the temporal behavior of the regressor\-predicted difficulty signal during the reasoning process\. Since the regressor is designed to estimate the necessity of continued reasoning at each step, this signal is inherently dynamic and may evolve as the model progressively refines its internal understanding of the problem\. To characterize this temporal structure, we analyze the stationarity properties of the predicted difficulty signal across reasoning steps\.
As shown in Table[14](https://arxiv.org/html/2606.07108#A1.T14), we conduct time\-series stationarity tests using ADF and KPSS on the regressor’s per\-step difficulty predictions\. We find that for the vast majority of trajectories, the predicted signal exhibits first\-order non\-stationarity with a systematic trend, rather than stationary fluctuations around a constant mean\. This observation provides mechanistic evidence that the model’s perceived task difficulty evolves systematically over the course of reasoning, reflecting dynamic updates of its internal assessment as reasoning progresses\.
Table 14:Stationarity order distribution of the regressor\-predicted difficulty signal across reasoning trajectories \(with a maximum differencing order of 6\)\. Percentages are computed over all trajectories\.
### A\.10Analysis of Overthinking Behavior in LLMs
#### Summary\.
In this section, we perform a step\-level analysis of overthinking by identifying the earliest reasoning step at which the correct answer first appears, and using this to define the early\-correctness ratio and the corresponding overthinking ratio\. The results show that, across multiple reasoning models, correct answers typically emerge well before the end of the full reasoning trajectory, indicating that a substantial proportion of later steps are potentially redundant\. This provides direct empirical evidence that overthinking is a systematic and widespread phenomenon in reasoning\-oriented language models\. At the same time, the distributions reveal clear model\-dependent differences in the severity of overthinking, as well as substantial instance\-level variability in when correctness first arises\. These findings suggest that overthinking cannot be effectively characterized or controlled by a single fixed stopping threshold, and instead motivate a more robust, distribution\-aware strategy for adaptive reasoning control\.
As shown in Fig\.[13](https://arxiv.org/html/2606.07108#A1.F13), we conduct a systematic step\-level analysis of the overthinking phenomenon in LLMs\. Specifically, we employ an LLM\-based judge to identify the earliest step at which the correct answer first appears in the generated reasoning\. Based on this, we first define the*early\-correctness ratio*as
$$r_{\text{early}}=\frac{N_{\text{earliest}}}{N_{\text{total}}}, \tag{20}$$

where $N_{\text{earliest}}$ denotes the earliest step at which the correct answer is identified and $N_{\text{total}}$ is the total number of reasoning steps.

Intuitively, a smaller $r_{\text{early}}$ indicates that correctness is achieved earlier in the reasoning process, implying that a larger fraction of subsequent steps are potentially redundant. Accordingly, we define the *overthinking ratio* as

$$r_{\text{overthink}}=1-r_{\text{early}}=\frac{N_{\text{total}}-N_{\text{earliest}}}{N_{\text{total}}}, \tag{21}$$

which directly quantifies the proportion of reasoning steps generated after correctness is already achieved. This metric therefore serves as a proxy for the degree of redundant reasoning, and hence the extent of overthinking exhibited by the model.
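The two ratios in Eqs. (20) and (21) are straightforward to compute once the earliest correct step is known. The sketch below assumes 1-indexed step counts and uses a made-up trajectory. The function names are illustrative.

```python
def early_correctness_ratio(earliest_step, total_steps):
    """Eq. (20): fraction of the trajectory needed to first reach the correct answer."""
    if not 1 <= earliest_step <= total_steps:
        raise ValueError("expected 1 <= earliest_step <= total_steps")
    return earliest_step / total_steps

def overthinking_ratio(earliest_step, total_steps):
    """Eq. (21): fraction of steps generated after correctness is first reached."""
    return 1.0 - early_correctness_ratio(earliest_step, total_steps)

# Hypothetical trajectory: the correct answer first appears at step 12 of 40.
r_early = early_correctness_ratio(12, 40)
r_overthink = overthinking_ratio(12, 40)
```

For this example, `r_early` is 0.3 and `r_overthink` is 0.7, meaning 28 of the 40 steps come after the answer was already correct.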
Fig. [13](https://arxiv.org/html/2606.07108#A1.F13) visualizes the empirical distributions of $r_{\text{early}}$ for three representative models. Across all models, the median $r_{\text{early}}$ values are well below $0.5$, indicating that correct answers typically emerge in the first third to first half of the reasoning process, after which a substantial fraction of generated steps are potentially redundant. This provides direct empirical evidence that overthinking is a systematic phenomenon rather than an isolated case.

Moreover, we observe clear model-dependent differences. In particular, QwQ-32B exhibits the earliest correctness (median $r_{\text{early}}=0.327$), suggesting the most severe overthinking behavior, while Qwen3-4B-Thinking-2507 reaches correctness later on average (median $r_{\text{early}}=0.427$), implying relatively milder overthinking. R1-7B lies between these two models.
Importantly, all three models display wide interquartile ranges, reflecting substantial instance\-level variability in the stage at which correctness first appears\. This distributional spread suggests that overthinking does not occur at a fixed step or ratio, but rather varies significantly across instances\. These observations naturally motivate a robust, distribution\-aware hyperparameter design, instead of relying on a single fixed threshold for early stopping or suppression\.
Figure 13: Kernel density of earliest correctness emergence. The distribution of $r=\mathrm{earliest\_step}/\mathrm{num\_steps}$ (identified by an LLM judge) shows substantial variability across instances, indicating that the correct answer can emerge at markedly different reasoning stages.
## Appendix BAdditional Experimental Results and Ablations
### B\.1Alternative Efficient Reasoning Strategies
#### Summary\.
In this section, we discuss alternative strategies for efficient reasoning and show that difficulty awareness can serve as a transferable control signal beyond our soft suppression framework\. We evaluate both classifier\-based and regressor\-based early\-exit methods, where reasoning is explicitly terminated once the predicted difficulty falls below a predefined threshold\. The results demonstrate that difficulty\-aware early exit can substantially reduce token consumption while largely maintaining accuracy across multiple benchmarks\. However, compared with our soft control strategy, hard early\-exit mechanisms provide coarser control and are more prone to suboptimal termination, leading to weaker overall trade\-offs between efficiency and accuracy\. These findings suggest that continuous difficulty\-aware modulation offers a more fine\-grained and reliable approach, while also highlighting promising future directions such as integrating difficulty prediction with steering\-based reasoning control\.
Efficient reasoning can be achieved through a wide range of strategies\. Beyond our soft suppression of reflective transition terms, existing approaches include early exit, steering mechanisms, and prompt\-based methods\. As a transferable component, difficulty awareness can be naturally integrated into these alternative strategies\.
As shown in Table[15](https://arxiv.org/html/2606.07108#A2.T15), we further evaluate early\-exit strategies guided by difficulty awareness\. In addition to the regressor\-based design, we also fit a classifier to predict whether the current problem instance can be safely terminated\. In both cases, when the predicted difficulty falls below a predefined threshold, we explicitly terminate the reasoning process by appending the text sequence “</think\>”\. The detailed experimental results are summarized in Table[15](https://arxiv.org/html/2606.07108#A2.T15)\. We observe that early\-exit methods augmented with difficulty awareness can substantially reduce the number of generated tokens while largely preserving accuracy\. However, their performance remains inferior to our soft control approach\. We attribute this gap to the finer granularity of soft modulation, which enables more precise control over the reasoning process and better preserves accuracy\. As a promising direction for future work, difficulty\-aware steering frameworks warrant further investigation\. Since the strength of steering is inherently governed by a tunable parameter, it can be naturally coupled with difficulty prediction: the steering strength can be reduced when the model identifies a problem as difficult and increased when the problem is deemed easy\. This opens several avenues for future research\.
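The hard early-exit rule described above can be sketched in a few lines. The function name, the threshold value, and the example text are hypothetical. Only the rule itself (append the termination string once the predicted difficulty falls below a threshold) comes from the description.

```python
END_OF_THINKING = "</think>"

def apply_hard_early_exit(reasoning_so_far, predicted_difficulty, threshold):
    """Return (text, exited): append </think> if the predicted difficulty is below threshold."""
    if predicted_difficulty < threshold:
        return reasoning_so_far + END_OF_THINKING, True
    return reasoning_so_far, False

# Hypothetical checkpoints with an assumed threshold of 0.2.
text_easy, exited_easy = apply_hard_early_exit("2 + 2 = 4.", 0.05, threshold=0.2)
text_hard, exited_hard = apply_hard_early_exit("Consider the case n = 3.", 0.70, threshold=0.2)
```

The decision here is binary at each checkpoint, in contrast to the continuous modulation of soft control, which matches the coarser granularity noted above.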
Table 15:Comparison of classifier\-based and regressor\-based early\-exit strategies onQwen3\-4B\-Thinking\-2507\.
### B.2 Direct Earliest-Correctness Modeling with GRU and Underthinking
#### Summary\.
In this section, we investigate whether the earliest step at which a model reaches the correct answer can be directly predicted from its hidden\-state trajectory\. We formulate earliest\-correctness prediction as a sequence labeling problem and train a GRU\-based model to identify whether each reasoning step occurs after the first correct solution point\. Although the GRU achieves high prediction accuracy and can guide early exit with substantial token reductions, further analysis reveals an important limitation: it often learns surface\-level conclusion patterns, such as “Final answer” or “In conclusion,” rather than the true point at which correctness is achieved\. As a result, the model may terminate prematurely when an initial conclusion is incorrect but would have been corrected through later self\-reflection, leading to underthinking\. These findings suggest that direct earliest\-correctness modeling is trainable but unreliable as a standalone control mechanism, motivating our preference for softer, distribution\-aware difficulty control rather than hard termination based on a single predicted exit point\.
Given that we can identify the earliest step at which the model first produces the ground\-truth \(GT\) answer, and that hidden states are shown to encode evolving difficulty\-related signals, we ask whether the hidden states also encode sufficient information to predict the step at which the model solves the problem\.
Let $\mathbf{H}_{0:t}=\{\mathbf{h}_{0},\mathbf{h}_{1},\ldots,\mathbf{h}_{t}\}$ denote the sequence of hidden states up to step $t$. Given the earliest correct step $i^{\star}$, we define the binary supervision as

$$y_{i}=\mathbb{I}[i\geq i^{\star}], \tag{22}$$

where $\mathbb{I}[\cdot]$ is the indicator function.

We parameterize a sequence model $f_{\theta}$ with a GRU to predict the earliest-correctness signal:

$$\mathbf{s}_{i}=\mathrm{GRU}(\mathbf{h}_{i},\mathbf{s}_{i-1}), \tag{23}$$

$$\hat{y}_{i}=\sigma(\mathbf{w}^{\top}\mathbf{s}_{i}+b), \tag{24}$$

where $\sigma(\cdot)$ denotes the sigmoid function.

The model is trained with a sequence-wise binary cross-entropy loss:

$$\mathcal{L}=\frac{1}{t+1}\sum_{i=0}^{t}\mathrm{BCE}(\hat{y}_{i},y_{i}). \tag{25}$$
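The supervision in Eq. (22) and the loss in Eq. (25) can be illustrated with a small standard-library sketch. The GRU itself is omitted: `y_hat` stands in for its per-step sigmoid outputs, and the function names and the `eps` clipping are assumptions made for this example.

```python
import math
from typing import List, Sequence


def earliest_correct_labels(num_steps: int, i_star: int) -> List[int]:
    """Eq. (22): y_i = 1 if i >= i_star, else 0, for i = 0..num_steps-1."""
    return [1 if i >= i_star else 0 for i in range(num_steps)]


def sequence_bce(y_hat: Sequence[float], y: Sequence[int], eps: float = 1e-12) -> float:
    """Eq. (25): binary cross-entropy averaged over all steps of one sequence."""
    if len(y_hat) != len(y) or len(y) == 0:
        raise ValueError("y_hat and y must be non-empty and of equal length")
    total = 0.0
    for p, target in zip(y_hat, y):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(target * math.log(p) + (1 - target) * math.log(1.0 - p))
    return total / len(y)
```

For a five-step trace whose first correct step is index 2, the labels are `[0, 0, 1, 1, 1]`, so every step after the first correct one is also treated as a valid exit point.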
As shown in Table [17](https://arxiv.org/html/2606.07108#A2.T17) and Figure [14](https://arxiv.org/html/2606.07108#A2.F14), we find that this formulation is indeed trainable in practice. The GRU achieves relatively high accuracy in predicting the model’s earliest exit point. We then use this trained GRU to guide the execution of the early-exit algorithm.

Figure 14: Training curves of the GRU-based earliest-correctness predictor.

Table 16: Comparison between baseline and GRU-based early exit on Qwen3-4B-Thinking-2507.

As shown in Table [16](https://arxiv.org/html/2606.07108#A2.T16), the GRU achieves strong performance on several benchmarks and attains state-of-the-art results on some datasets. We further observe that the steps selected by the GRU are often highly precise, frequently occurring immediately after conclusion-related patterns such as “Final answer” or “In conclusion.”
Compared to approaches that rely on large-model-based judges, such as FlashThinking (Jiang et al., [2025](https://arxiv.org/html/2606.07108#bib.bib29)) and TrimR (Lin et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib28)), the GRU-based method achieves superior performance. However, we also observe substantial accuracy degradation on relatively simple datasets. We attribute this to the fact that the appearance of conclusion-style phrases does not necessarily indicate that the ground-truth answer has been correctly derived. In many cases, the model may first produce an incorrect conclusion and subsequently enter a self-reflection or correction phase. In such cases, the GRU may incorrectly identify these premature conclusion steps as valid exit points, leading to underthinking (Wang et al., [2025d](https://arxiv.org/html/2606.07108#bib.bib9)).
These results suggest that the GRU primarily learns surface\-level conclusion patterns rather than the true optimal point of correctness\. As a result, it fails to reliably capture the genuine earliest\-correctness step\. This limitation motivates our use of averaging over optimal points, as there exists no stable and learnable pattern that consistently corresponds to the true optimal solving step\.
Table 17:Summary of the GRU\-based earliest\-correctness predictor\. We report the best\-performing layer and the corresponding accuracy and loss for each model\.
### B.3 Stability Analysis of the Distance-Based Suppression Signal
#### Summary\.
In this section, we define and validate a logit\-gap distance for measuring the relative dominance of targeted transition tokens in the model’s output distribution\. By computing mean and maximum positive gaps with respect to the vocabulary\-wide mean logit, we show that the proposed distance yields a stable and well\-calibrated scale across different reasoning models, while its extreme\-value structure reflects meaningful model\-dependent capability differences\. After normalizing by the global logit standard deviation, the distance remains tightly structured and scale\-invariant, indicating that it naturally adapts to each model’s intrinsic uncertainty without requiring manually tuned bias terms\. We further show that our adjustment precisely re\-centers the targeted token subset around the global mean, reducing over\-dominant transition probabilities without hard blocking or suppressing model expressivity\. Finally, global drift analysis demonstrates that this local correction does not distort the full\-vocabulary logit distribution, preserving global scale, extrema, and distributional structure\. These results support the proposed logit\-gap distance as a robust, localized, and cross\-model generalizable control metric\.
#### Logit\-Gap Distance Definition\.
Let $z_{v}\in\mathbb{R}$ denote the logit of token $v$ at a given decoding step, and let

$$\mu=\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}z_{v} \tag{26}$$

be the vocabulary-wide mean logit, where $\mathcal{V}$ is the vocabulary. For a token subset $\mathcal{B}\subseteq\mathcal{V}$, we define the per-token positive mean-centered logit gap as

$$\delta_{v}=\bigl(z_{v}-\mu\bigr)_{+}=\max\bigl(z_{v}-\mu,\;0\bigr),\qquad v\in\mathcal{B}. \tag{27}$$
#### Aggregated Gap Statistics\.
Based on $\{\delta_{v}\}_{v\in\mathcal{B}}$, we summarize the overall gap strength with two statistics:

$$d_{m}=\frac{1}{|\mathcal{B}|}\sum_{v\in\mathcal{B}}\delta_{v}, \tag{28}$$

which we refer to as the *mean positive gap* (average positive advantage), and

$$d_{M}=\max_{v\in\mathcal{B}}\delta_{v}, \tag{29}$$

which we refer to as the *max positive gap* (maximum positive advantage).
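As a concrete reading of Eqs. (26) to (29), the sketch below computes the two gap statistics for a toy vocabulary. The dictionary representation and the function name `logit_gap_stats` are illustrative assumptions; a real implementation would operate on the model's logit tensor.

```python
from typing import Dict, Iterable, Tuple


def logit_gap_stats(logits: Dict[str, float], subset: Iterable[str]) -> Tuple[float, float]:
    """Return (d_m, d_M): mean and max positive gap of subset logits over the global mean."""
    mu = sum(logits.values()) / len(logits)           # Eq. (26): vocabulary-wide mean
    gaps = [max(logits[v] - mu, 0.0) for v in subset]  # Eq. (27): positive part only
    if not gaps:
        raise ValueError("subset must contain at least one token")
    d_mean = sum(gaps) / len(gaps)                     # Eq. (28)
    d_max = max(gaps)                                  # Eq. (29)
    return d_mean, d_max
```

With logits `{"wait": 4.0, "hmm": 1.0, "the": 0.0, "a": -1.0}` the global mean is 1.0, so the subset `["wait", "hmm"]` has gaps 3.0 and 0.0, giving a mean positive gap of 1.5 and a max positive gap of 3.0.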
As shown in Table[18](https://arxiv.org/html/2606.07108#A2.T18), the distance scale remains highly consistent across different models\. This indicates that the proposed logit\-gap distance defines a well\-calibrated scale in logit space and does not require additional normalization\.
Moreover, we observe that the extreme\-value structure systematically strengthens with increasing model capacity, suggesting a meaningful monotonic relationship between the proposed distance and model capability\. This property constitutes a key advantage of our distance over fixed bias\-based heuristics, as it eliminates the need to manually tune model\-specific bias terms\.
Overall, the distance exhibits a two\-scale structure, characterized by a stable central tendency and an expressive heavy tail, enabling strong cross\-model generalization\. These observations provide critical evidence supporting the validity and robustness of the proposed distance metric\.
Table 18: Raw logit-gap distance statistics across models. Summary statistics of the mean positive gap $d_{m}$ and the max positive gap $d_{M}$ computed over the tracked token subset.

We define $\sigma_{0}$ as the standard deviation of the full-vocabulary logit distribution at each decoding step, which provides an intrinsic measure of the model’s instantaneous uncertainty. Formally, let $\{z_{v}\}_{v\in\mathcal{V}}$ denote the raw logits over the full vocabulary at a given step, and let

$$\sigma_{0}=\sqrt{\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}\bigl(z_{v}-\mu_{0}\bigr)^{2}}, \tag{30}$$

where $\mu_{0}=\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}z_{v}$ is the mean logit. As shown in Table [19](https://arxiv.org/html/2606.07108#A2.T19), after accounting for the step-wise global uncertainty, the proposed distance remains well-structured and tightly concentrated, without exhibiting collapse or explosion. This demonstrates strong stability across decoding steps and varying uncertainty levels. Except for DeepSeek (which exhibits distinct distillation-specific characteristics), the remaining model families show highly consistent normalized distances, indicating that the proposed metric naturally adapts to each model’s intrinsic uncertainty. This property constitutes a key advantage over fixed bias-based heuristics, as it eliminates the need for manual tuning of model-specific bias terms.
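The normalization by $\sigma_{0}$ amounts to expressing a gap in units of the step's global logit spread. A minimal sketch, assuming the logits are given as a plain list and using the population standard deviation of Eq. (30); the helper names are illustrative:

```python
import math
from typing import Sequence


def global_logit_std(logits: Sequence[float]) -> float:
    """Eq. (30): population standard deviation of the full-vocabulary logits."""
    mu0 = sum(logits) / len(logits)
    return math.sqrt(sum((z - mu0) ** 2 for z in logits) / len(logits))


def normalized_gap(gap: float, logits: Sequence[float]) -> float:
    """Express a logit gap (d_m or d_M) in units of the global logit std."""
    sigma0 = global_logit_std(logits)
    if sigma0 == 0.0:
        raise ValueError("logits are constant; normalized gap is undefined")
    return gap / sigma0
```

For example, a raw gap of 3.0 at a step whose logits have standard deviation 2.0 corresponds to a 1.5-sigma advantage.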
Table 19: Logit-gap distance normalized by global logit scale. Summary statistics of the mean and max positive gaps normalized by the global logit standard deviation $\sigma_{0}$. The results demonstrate scale invariance of the proposed distance and confirm that large gaps correspond to statistically significant ($\sigma$-level) advantages.

Let $\mathcal{B}\subseteq\mathcal{V}$ denote the targeted token subset and let $\mu$ denote the mean of the full-vocabulary logits at a given decoding step. We define the subset mean logits before and after adjustment as

$$\bar{z}_{\mathcal{B}}^{\text{pre}}=\frac{1}{|\mathcal{B}|}\sum_{v\in\mathcal{B}}z_{v}^{\text{pre}},\qquad\bar{z}_{\mathcal{B}}^{\text{post}}=\frac{1}{|\mathcal{B}|}\sum_{v\in\mathcal{B}}z_{v}^{\text{post}}. \tag{31}$$

We then define the relative subset offsets as

$$\Delta_{\mathcal{B}}^{\text{pre}}=\bar{z}_{\mathcal{B}}^{\text{pre}}-\mu,\qquad\Delta_{\mathcal{B}}^{\text{post}}=\bar{z}_{\mathcal{B}}^{\text{post}}-\mu. \tag{32}$$

Finally, we define the fraction of subset logits above the global mean as

$$\rho_{\mathcal{B}}^{\text{pre}}=\frac{1}{|\mathcal{B}|}\sum_{v\in\mathcal{B}}\mathbb{I}[z_{v}^{\text{pre}}>\mu],\qquad\rho_{\mathcal{B}}^{\text{post}}=\frac{1}{|\mathcal{B}|}\sum_{v\in\mathcal{B}}\mathbb{I}[z_{v}^{\text{post}}>\mu]. \tag{33}$$
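The re-centering diagnostics in Eqs. (31) to (33) can be computed with one helper applied to the pre- and post-adjustment logits. The sketch assumes the same global mean `mu` is used for both, as in the definitions above; the function name and the toy numbers are illustrative, and no particular adjustment rule is implied.

```python
from typing import Dict, Iterable, Tuple


def subset_offset_and_fraction(
    logits: Dict[str, float], subset: Iterable[str], mu: float
) -> Tuple[float, float]:
    """Return (Delta_B, rho_B): subset mean offset from mu and fraction of logits above mu."""
    values = [logits[v] for v in subset]
    if not values:
        raise ValueError("subset must contain at least one token")
    offset = sum(values) / len(values) - mu                    # Eq. (32)
    fraction = sum(1 for z in values if z > mu) / len(values)  # Eq. (33)
    return offset, fraction
```

With a global mean of 1.0, pre-adjustment subset logits of 4.0 and 2.0 give an offset of 2.0 with both logits above the mean, while post-adjustment logits of 1.5 and 0.5 give an offset of 0.0 with half of them above the mean.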
Table 20: Targeted-subset re-centering across models. We report full distribution statistics for the subset offsets $\Delta_{\mathcal{B}}^{\text{pre}}$ and $\Delta_{\mathcal{B}}^{\text{post}}$, as well as the fractions of subset logits above the global mean, $\rho_{\mathcal{B}}^{\text{pre}}$ and $\rho_{\mathcal{B}}^{\text{post}}$.

As shown in Table [20](https://arxiv.org/html/2606.07108#A2.T20), all models exhibit a substantial positive subset offset prior to control, indicating that the targeted subset is consistently ranked above the global mean. After applying our adjustment, the subset is precisely re-centered around zero. Importantly, this is not achieved through a global shift, but via fine-grained local correction.

The targeted tokens are still permitted to appear; however, their relative ranking is softened such that their average probability mass is reduced from being significantly above $0.5$ to approximately neutral. Compared to hard semantic suppression or blocking, our approach constitutes a substantially softer intervention that preserves model expressivity while effectively controlling over-dominant transitions.
Next, we investigate whether such adjustments induce any unintended changes to the global logit distribution. Let $\mathbf{z}_{t}^{(0)}=\{z_{t,v}^{(0)}\}_{v=1}^{V}$ and $\mathbf{z}_{t}^{(1)}=\{z_{t,v}^{(1)}\}_{v=1}^{V}$ denote the full-vocabulary logits at decoding step $t$ before and after adjustment, respectively. We define the global mean and standard deviation as

$$\mu_{t}^{(k)}=\frac{1}{V}\sum_{v=1}^{V}z_{t,v}^{(k)},\qquad\sigma_{t}^{(k)}=\sqrt{\frac{1}{V}\sum_{v=1}^{V}\bigl(z_{t,v}^{(k)}-\mu_{t}^{(k)}\bigr)^{2}},\quad k\in\{0,1\}. \tag{34}$$

We further define the global extrema as

$$m_{t}^{(k)}=\min_{v}z_{t,v}^{(k)},\qquad M_{t}^{(k)}=\max_{v}z_{t,v}^{(k)},\quad k\in\{0,1\}. \tag{35}$$

The corresponding drift components are defined by

$$\Delta\mu_{t}=\mu_{t}^{(1)}-\mu_{t}^{(0)},\qquad\Delta\sigma_{t}=\sigma_{t}^{(1)}-\sigma_{t}^{(0)},\qquad\Delta m_{t}=m_{t}^{(1)}-m_{t}^{(0)},\qquad\Delta M_{t}=M_{t}^{(1)}-M_{t}^{(0)}. \tag{36}$$

As a scalar summary of potential global distributional side effects, we define the global drift score as

$$\boxed{D_{t}^{\mathrm{global}}=\bigl|\Delta\mu_{t}\bigr|+\bigl|\Delta\sigma_{t}\bigr|}. \tag{37}$$
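The drift components of Eqs. (34) to (37) reduce to comparing four summary statistics of the logits before and after adjustment. A minimal sketch on plain lists, with illustrative function and key names:

```python
import math
from typing import Dict, Sequence, Tuple


def _global_stats(logits: Sequence[float]) -> Tuple[float, float, float, float]:
    """Mean, population std, min, and max of one step's full-vocabulary logits."""
    mean = sum(logits) / len(logits)
    std = math.sqrt(sum((z - mean) ** 2 for z in logits) / len(logits))
    return mean, std, min(logits), max(logits)


def global_drift(pre: Sequence[float], post: Sequence[float]) -> Dict[str, float]:
    """Drift components (Eq. 36) and the composite score |d_mu| + |d_sigma| (Eq. 37)."""
    mu0, sd0, lo0, hi0 = _global_stats(pre)
    mu1, sd1, lo1, hi1 = _global_stats(post)
    d_mu, d_sigma = mu1 - mu0, sd1 - sd0
    return {
        "d_mu": d_mu,
        "d_sigma": d_sigma,
        "d_min": lo1 - lo0,
        "d_max": hi1 - hi0,
        "score": abs(d_mu) + abs(d_sigma),
    }
```

A pure translation of all logits by 1.0 yields a score of 1.0 (mean shifts, spread unchanged), which is the kind of global shift the analysis is designed to detect.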
Table 21: Global logit distribution drift across models. Full distribution statistics for drift components ($\Delta\mu_{t},\Delta\sigma_{t},\Delta m_{t},\Delta M_{t}$) and the composite score $D_{t}^{\mathrm{global}}=|\Delta\mu_{t}|+|\Delta\sigma_{t}|$.

As shown in Table [21](https://arxiv.org/html/2606.07108#A2.T21), our method does not induce harmful changes to the global logit distribution. The extrema structure is preserved, with no evidence of scale collapse or explosion, and no global translation of the distribution. In contrast, for the targeted tokens subject to local adjustment, the proposed distance effectively reduces their relative advantage in ranking. Overall, these results demonstrate that our approach performs safe, localized control with strong generalization across model architectures and data distributions.

### B.4 Ablation on $\mu_{t}$
#### Summary\.
In this section, we ablate the vocabulary-wide mean logit term $\mu_{t}$ to examine its role in defining a calibrated suppression distance. When $\mu_{t}$ is removed, the distance signal is computed directly from raw logits, effectively measuring each targeted token’s distance to zero rather than its relative advantage over the global logit distribution. This produces an overly large and poorly calibrated suppression signal. Even when modulated by the difficulty regressor, the resulting control remains too aggressive, substantially reducing token usage but causing clear accuracy degradation, especially on harder benchmarks such as AIME2024 and AIME2025. These results demonstrate that the $\mu_{t}$ term is essential for converting raw logits into a relative, distribution-aware distance, enabling localized and appropriately scaled suppression rather than crude truncation.

As shown in Table [22](https://arxiv.org/html/2606.07108#A2.T22), we ablate the entire $\mu_{t}$ term and retain only the raw logits as the distance signal, with a regressor used to modulate the suppression strength. The results indicate that this distance is overly large, as it corresponds to the linear distance of the logits from their original values to zero. Even with regressor-based scaling, the resulting suppression remains excessively strong, causing the model to degenerate toward conventional efficient reasoning methods with aggressive truncation.

Table 22: Ablation on $\mu_{t}$ for DeepSeek-R1-Distill-Qwen-7B.
### B.5 Avg@K Performance Analysis
#### Summary\.
In this section, we evaluate the robustness of our method on small\-scale mathematical reasoning benchmarks using Avg@30 with standard deviations\. The results show that our approach consistently reduces token consumption across AIME2024, AIME2025, and AMC23 while maintaining or even improving average accuracy for both DeepSeek\-R1\-Distill\-Qwen\-7B and Qwen3\-4B\-Thinking\-2507\. These gains indicate that the proposed control mechanism improves reasoning efficiency without sacrificing performance, even under repeated sampling evaluation\. At the same time, the relatively large standard deviations observed on small benchmarks highlight the importance of Avg@30 evaluation, as single\-run results may be unstable and insufficient for reliably characterizing performance on limited test sets\.
As shown in Table[23](https://arxiv.org/html/2606.07108#A2.T23), we further evaluate the Avg@30 performance of our method on small\-scale benchmarks\. We find that our method is able to consistently reduce token consumption while preserving, and in some cases even improving, accuracy\. We also observe relatively large standard deviations on small datasets, which highlights the necessity of Avg@30 evaluation for providing a more reliable assessment\.
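For reference, an Avg@K score with its spread can be computed as below. This sketch assumes that Avg@K is the mean of per-run accuracies over K independently sampled runs and that the reported standard deviation is taken across those runs (population form); the text does not state the exact convention, so both are assumptions.

```python
import statistics
from typing import Sequence, Tuple


def avg_at_k(run_correct: Sequence[Sequence[bool]]) -> Tuple[float, float]:
    """run_correct[r][p] is True if sampled run r solved problem p.

    Returns (mean accuracy across runs, std of per-run accuracy).
    """
    per_run_acc = [sum(run) / len(run) for run in run_correct]
    mean_acc = statistics.fmean(per_run_acc)
    std_acc = statistics.pstdev(per_run_acc) if len(per_run_acc) > 1 else 0.0
    return mean_acc, std_acc
```

On a 30-problem benchmark, a single problem changes per-run accuracy by more than three points, which is why run-to-run spread matters on small test sets.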
Table 23: Avg@30 performance with standard deviation on mathematical reasoning benchmarks.

| Model | Method | AIME2024 Avg@30 ↑ | Std | AIME2024 #Tok ↓ | Std | AIME2025 Avg@30 ↑ | Std | AIME2025 #Tok ↓ | Std | AMC23 Avg@30 ↑ | Std | AMC23 #Tok ↓ | Std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | Baseline | 53.1 | 0.0643 | 13358 | 1205 | 37.8 | 0.0433 | 14471 | 1175 | 90.1 | 0.0313 | 6243 | 646 |
| DeepSeek-R1-Distill-Qwen-7B | Ours | 54.7 | 0.04605 | 10183 | 981 | 37.9 | 0.0459 | 11239 | 971 | 91.2 | 0.0294 | 3741 | 480 |
| Qwen3-4B-Thinking-2507 | Baseline | 83.1 | 0.0384 | 21279 | 657 | 80.2 | 0.0501 | 22553 | 958 | 99.8 | 0.0062 | 11145 | 391 |
| Qwen3-4B-Thinking-2507 | Ours | 85.3 | 0.0372 | 18632 | 641 | 82.2 | 0.0398 | 19820 | 863 | 99.9 | 0.0046 | 9034 | 381 |

### B.6 Time Latency Analysis
#### Summary\.
In this section, we analyze the inference efficiency of our method on MMLU using DeepSeek\-R1\-Distill\-Qwen\-7B\. As a lightweight and training\-free approach, our method achieves strong latency efficiency while maintaining high throughput comparable to prompt\-based baselines\. Unlike rollback\- or trial\-and\-error\-based methods such as DEER and Dynasor\-CoT, our approach avoids repeated decoding, leading to more efficient inference\. It also does not rely on auxiliary models, unlike TrimR and FlashThinking, thereby avoiding additional memory overhead\. Empirically, our method reduces average token usage and latency relative to the baseline, while achieving the best Pass@1 among compared methods\. These results demonstrate that our approach offers a favorable balance between accuracy, token efficiency, latency, and deployment simplicity\.
We analyze the inference latency of our method on DeepSeek-R1-Distill-Qwen-7B evaluated on MMLU. As shown in Table [24](https://arxiv.org/html/2606.07108#A2.T24), our lightweight, training-free method achieves throughput comparable to prompt-based baselines, while substantially outperforming lightweight alternatives such as CoD (Xu et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib5)) in terms of accuracy. In contrast to methods that rely on iterative rollback or trial-and-error strategies (e.g., DEER (Yang et al., [2025b](https://arxiv.org/html/2606.07108#bib.bib14)) and Dynasor-CoT (Fu et al., [2025](https://arxiv.org/html/2606.07108#bib.bib31))), our approach avoids repeated decoding and therefore yields significantly better latency efficiency. Moreover, unlike TrimR (Lin et al., [2025a](https://arxiv.org/html/2606.07108#bib.bib28)) and FlashThinking (Jiang et al., [2025](https://arxiv.org/html/2606.07108#bib.bib29)), our method does not require any additional auxiliary models, and thus incurs no extra memory overhead.

Table 24: Overall comparison of accuracy and inference efficiency on MMLU using DeepSeek-R1-Distill-Qwen-7B. Lower is better for time and tokens, while higher is better for Pass@1 and throughput.

### B.7 Analysis of Reflection Token Sensitivity
#### Summary\.
In this section, we further investigate the choice of reflection\-token vocabulary\. Our goal is to examine whether DyCon depends on a particular manually predefined token list, or whether its effectiveness is preserved under alternative choices of reflection\-related tokens\. We show that DyCon is not tied to a specific token set: replacing the original vocabulary with the token set proposed by SEAL\(Chenet al\.,[2025](https://arxiv.org/html/2606.07108#bib.bib30)\)leads to comparable performance across models and benchmarks\. We further study whether the reflection\-token vocabulary can be optimized in a model\-specific manner, and find that an evolutionary refinement strategy can yield additional improvements in both accuracy and inference efficiency\.
The predefined reflection\-token list used in our main experiments is not an essential component of DyCon\. Instead, it serves as a conventional instantiation following prior work on manipulating thinking or reflection\-related tokens\(Wanget al\.,[2025a](https://arxiv.org/html/2606.07108#bib.bib16)\)\. Conceptually, DyCon only requires a set of tokens that approximately correspond to reflective reasoning behaviors, since its control mechanism operates by dynamically modulating the logits of such tokens\. Therefore, the method does not rely on any particular handcrafted vocabulary, but rather on the broader principle that reflection\-related token probabilities can be adjusted to regulate the model’s reasoning behavior\.
To examine the sensitivity of DyCon to different token\-set choices, we replace our initial reflection\-token list with the token set proposed by SEAL\(Chenet al\.,[2025](https://arxiv.org/html/2606.07108#bib.bib30)\)\. As shown in Table[25](https://arxiv.org/html/2606.07108#A2.T25), DyCon achieves comparable performance under this alternative vocabulary across different models and benchmarks\. In several cases, the SEAL\-based token set even leads to slightly higher Pass@1 or lower average token consumption\. These results suggest that DyCon is robust to reasonable choices of reflection\-token vocabularies, and that its gains do not come from overfitting to a particular manually selected token list\.
Table 25: Robustness of DyCon under different reflection-token vocabularies. We report Pass@1 and average output tokens in the format of Pass@1 / Tok.

Beyond robustness to existing token sets, we further investigate whether a more suitable reflection-token vocabulary can be automatically identified for a specific model. This is motivated by the observation that different reasoning models may express reflection through slightly different lexical patterns. Therefore, while a general reflection-token list is sufficient for DyCon to be effective, a model-adaptive token set may further improve the controllability of the reasoning process.
Specifically, we adopt an evolutionary strategy initialized with frequently occurring reflection\-related tokens\. Inspired by dynamic context learning\(Liet al\.,[2025a](https://arxiv.org/html/2606.07108#bib.bib13)\), the search process iteratively applies token selection, mutation, and crossover\. We use model accuracy on a held\-out Math validation set as the optimization signal, so that the resulting token set is selected according to its downstream effect on reasoning performance rather than by manual inspection alone\.
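The selection, mutation, and crossover loop can be sketched generically as follows. This is not the authors' search procedure: the operators, population size, and the `fitness` callable (which in the setting above would be validation accuracy obtained with a candidate token set) are all assumptions made for illustration.

```python
import random
from typing import Callable, FrozenSet, List, Sequence

TokenSet = FrozenSet[str]


def evolve_token_set(
    candidates: Sequence[str],
    seed_set: Sequence[str],
    fitness: Callable[[TokenSet], float],
    generations: int = 20,
    population_size: int = 8,
    rng_seed: int = 0,
) -> TokenSet:
    """Evolve a token set that maximizes fitness, starting from seed_set."""
    rng = random.Random(rng_seed)
    pool = sorted(set(candidates) | set(seed_set))

    def mutate(tokens: TokenSet) -> TokenSet:
        # Toggle membership of one randomly chosen candidate token.
        flipped = set(tokens) ^ {rng.choice(pool)}
        return frozenset(flipped) if flipped else tokens

    def crossover(a: TokenSet, b: TokenSet) -> TokenSet:
        # Keep shared tokens; inherit each non-shared token with probability 0.5.
        child = set(a & b)
        for token in sorted(a ^ b):
            if rng.random() < 0.5:
                child.add(token)
        return frozenset(child) if child else a

    population: List[TokenSet] = [frozenset(seed_set)]
    while len(population) < population_size:
        population.append(mutate(population[0]))

    for _ in range(generations):
        # Selection: the top half survives, so the best set is never lost.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, population_size // 2)]
        children: List[TokenSet] = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        population = parents + children
    return max(population, key=fitness)
```

Because each fitness evaluation would require running the model on the validation set, a practical version would cache scores per token set and keep the population and generation counts small.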
As reported in Table[26](https://arxiv.org/html/2606.07108#A2.T26), the optimized reflection\-token set further improves DyCon on Qwen3\-4B\-Thinking\-2507\. Compared with the baseline, DyCon\-Optimized improves Pass@1 on Math\-500, AIME24, and AIME25, while also reducing the average number of generated tokens\. These results indicate that although DyCon is already robust to different reasonable token vocabularies, model\-specific token refinement can provide additional benefits\. This also suggests a promising direction for future work: instead of relying on manually designed reflection\-token lists, one can develop more principled optimization objectives and search strategies to automatically discover effective control vocabularies\.
Table 26: Performance of DyCon with an optimized reflection-token set using Qwen3-4B-Thinking-2507. We report Pass@1 and average output tokens in the format of Pass@1 / Tok.

The optimized token set further improves both accuracy and reasoning efficiency compared with the original list. This suggests that although DyCon is not sensitive to a specific predefined vocabulary, model-specific token refinement can still provide additional benefits. Designing more principled optimization objectives and more advanced token-selection strategies remains a promising direction for future work.

### B.8 Analysis of Noisy Difficulty Proxy
#### Summary\.
In this section, we provide a detailed analysis of the difficulty proxy used in DyCon\. Since fine\-grained dynamic difficulty labels are generally unavailable, DyCon uses generation length as a practical proxy for model\-perceived reasoning difficulty\. We first analyze the stability of this proxy through an outlier study over generation lengths across different difficulty levels\. As shown in Table[27](https://arxiv.org/html/2606.07108#A2.T27), length\-based outliers account for only a small proportion of samples, suggesting that the overall distributional signal is stable\. We then conduct a complementary Pass@1\-based analysis by grouping samples according to output length and measuring the correlation between group\-level length and accuracy\. As reported in Table[28](https://arxiv.org/html/2606.07108#A2.T28), longer generations are consistently associated with lower Pass@1, providing further evidence that generation length captures meaningful difficulty\-related information\.
Reasoning difficulty is inherently dynamic during generation\. A problem may appear easy at the beginning but become harder when the model encounters intermediate uncertainty, or conversely become easier after a key reasoning step is resolved\. However, most existing datasets only provide static and coarse\-grained difficulty annotations, such as the five discrete difficulty levels in MATH\. These annotations are useful for interpretability, but they cannot fully describe the model’s evolving perception of difficulty during the reasoning process\. Therefore, rather than relying on exact difficulty labels, DyCon exploits a statistically meaningful proxy that can be observed during generation\.
We use output length as such a proxy\. The intuition is that when a model perceives a problem as more difficult, it typically spends more tokens exploring intermediate steps, verifying partial results, correcting mistakes, or searching for alternative reasoning paths\. This does not imply that every long response is necessarily difficult or every short response is necessarily easy\. Instead, the claim is distributional: across a sufficiently large set of samples, generation length provides a useful signal for estimating model\-perceived difficulty\.
To quantify the stability of this signal, we conduct an outlier analysis based on generation length. For each model and each MATH difficulty level, we compute the mean generation length, the first quartile $q_{25}$, the third quartile $q_{75}$, and the interquartile range. We define length outliers as samples whose generation length lies outside the standard IQR interval:

$$[q_{25}-1.5\cdot\mathrm{IQR},\;q_{75}+1.5\cdot\mathrm{IQR}],\quad\text{where}\quad\mathrm{IQR}=q_{75}-q_{25}. \tag{38}$$

Table [27](https://arxiv.org/html/2606.07108#A2.T27) reports the statistics for four representative reasoning models. Across models and difficulty levels, the outlier ratio remains relatively small, indicating that the length distribution is not dominated by rare abnormal generations. More importantly, the mean generation length generally increases with the annotated difficulty level, supporting the use of length as a stable aggregate signal.
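The IQR rule of Eq. (38) can be applied to a list of generation lengths as in the sketch below. The quartile interpolation method is not specified in the text, so the use of `statistics.quantiles` with `method="inclusive"` is an assumption; other conventions may shift the quartiles slightly.

```python
import statistics
from typing import Sequence


def iqr_outlier_ratio(lengths: Sequence[float]) -> float:
    """Fraction of samples outside [q25 - 1.5*IQR, q75 + 1.5*IQR] (Eq. 38)."""
    q25, _, q75 = statistics.quantiles(lengths, n=4, method="inclusive")
    iqr = q75 - q25
    lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
    outliers = sum(1 for x in lengths if x < lower or x > upper)
    return outliers / len(lengths)
```

For the lengths `[10, 12, 14, 16, 18, 20, 22, 24, 200]`, the quartiles are 14 and 22, the interval is [2, 34], and only the 200-token sample is flagged, giving a ratio of 1/9.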
Table 27: Outlier analysis of generation length across MATH difficulty levels. For each model and difficulty level, we report the mean generation length, the first quartile, the third quartile, and the outlier ratio computed using the IQR rule.

The results in Table [27](https://arxiv.org/html/2606.07108#A2.T27) suggest that generation length provides a stable distribution-level signal. Nevertheless, correlation with the manually annotated MATH difficulty levels is limited by the coarse granularity of the labels. The MATH dataset uses only five discrete levels, whereas reasoning length is a continuous variable with substantial natural variance. As a result, a moderate Spearman correlation with static difficulty labels does not necessarily imply that generation length is a weak proxy. It may instead reflect a mismatch between a coarse human annotation scheme and the model’s continuous, instance-specific perception of difficulty.
To obtain a more direct measure of difficulty, we further analyze the relationship between generation length and Pass@1. Pass@1 reflects whether the model solves a problem correctly, and thus provides a performance-based view of problem difficulty. We group samples by output length and compute the average length and Pass@1 within each group. For the $k$-th length group $B_{k}$, we compute:

$$\bar{l}_{k}=\frac{1}{|B_{k}|}\sum_{i\in B_{k}}l_{i},\quad\mathrm{Pass@1}(B_{k})=\frac{1}{|B_{k}|}\sum_{i\in B_{k}}y_{i}, \tag{39}$$

where $l_{i}$ denotes the output length of sample $i$, and $y_{i}$ is a binary correctness indicator. We then compute the Pearson and Spearman correlations between $\bar{l}_{k}$ and $\mathrm{Pass@1}(B_{k})$. For the overall results, we merge samples from all evaluated datasets and compute the correlation on the combined set.

As shown in Table [28](https://arxiv.org/html/2606.07108#A2.T28), the correlation between grouped output length and Pass@1 is consistently negative across datasets, models, and different group sizes. This indicates that longer generations are generally associated with lower accuracy, which is consistent with the interpretation that longer reasoning often reflects higher model-perceived difficulty. The correlations remain strong under different choices of $|B_{k}|$, showing that the trend is not an artifact of a particular grouping resolution.
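The grouped analysis of Eq. (39) can be reproduced with the standard library alone. The sketch assumes groups are formed by sorting samples by length and chunking them into consecutive groups of a fixed size, which is one plausible reading of "different choices of $|B_{k}|$"; the helper names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def grouped_length_accuracy(
    lengths: Sequence[float], correct: Sequence[int], group_size: int
) -> Tuple[List[float], List[float]]:
    """Sort samples by length, chunk into groups, return per-group mean length and Pass@1."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    groups = [order[s:s + group_size] for s in range(0, len(order), group_size)]
    mean_len = [sum(lengths[i] for i in g) / len(g) for g in groups]
    pass_at_1 = [sum(correct[i] for i in g) / len(g) for g in groups]
    return mean_len, pass_at_1


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Both coefficients are computed over groups rather than individual samples, so their magnitude depends on the number of groups as well as on the underlying trend.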
Table 28: Correlation between grouped output length and Pass@1. Samples are grouped by generation length, and Pearson/Spearman correlations are computed between the average length and Pass@1 of each group. Negative values indicate that longer generations are associated with lower accuracy.

Overall, the analyses in Table [27](https://arxiv.org/html/2606.07108#A2.T27) and Table [28](https://arxiv.org/html/2606.07108#A2.T28) provide complementary evidence for using generation length as a proxy for model-perceived difficulty. The outlier analysis shows that the signal is stable at the distribution level, while the Pass@1-based analysis shows that the signal is strongly associated with actual model performance. These results support the design choice of DyCon: when explicit dynamic difficulty annotations are unavailable, generation length provides a practical, observable, and empirically grounded supervision signal for learning difficulty-aware control.
This analysis also clarifies the role of the length\-based proxy\. DyCon does not assume that output length is a perfect difficulty label for every individual sample\. Instead, it uses length as a scalable statistical signal that reflects the model’s reasoning effort in aggregate\. Developing more precise supervision for dynamic difficulty estimation remains an important direction for future work, but the current evidence suggests that generation length is already a reliable and useful proxy for difficulty\-aware reasoning control\.
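The grouped correlation in Eq. (39) can be illustrated with a minimal, standard-library sketch. It assumes that groups $B_k$ are formed by sorting samples by output length and cutting them into consecutive chunks of equal size; the text only states that samples are grouped by generation length, so this chunking rule, the function names, and the toy data are illustrative assumptions rather than the authors' implementation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

def grouped_length_accuracy(lengths, correct, group_size):
    """Sort samples by length, cut into groups B_k (assumed chunking rule),
    and return the mean length and Pass@1 of each group."""
    pairs = sorted(zip(lengths, correct))
    mean_len, pass1 = [], []
    for start in range(0, len(pairs), group_size):
        group = pairs[start:start + group_size]
        mean_len.append(sum(l for l, _ in group) / len(group))
        pass1.append(sum(y for _, y in group) / len(group))
    return mean_len, pass1

# Toy data (made up for illustration): longer outputs are more often wrong.
lengths = [100, 200, 300, 400, 500, 600, 700, 800]
correct = [1, 1, 1, 1, 1, 0, 0, 0]
mean_len, pass1 = grouped_length_accuracy(lengths, correct, group_size=2)
print(pearson(mean_len, pass1), spearman(mean_len, pass1))
```

On this toy input both coefficients are negative, which mirrors the direction of the trend reported above; the magnitudes carry no meaning beyond the example.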
### B\.9 Analysis of Cross\-Lingual Generalization
#### Summary\.
In this section, we analyze whether DyCon can be applied across different languages\. Although DyCon modulates reflection\-related tokens during generation, the method does not require a fixed English\-only token list\. For each target language, we replace the original reflection\-token vocabulary with a concise set of reflection\-related tokens in that language, while keeping the difficulty regressor unchanged\. We evaluate this setting on MGSM \(Shi et al\., [2022](https://arxiv.org/html/2606.07108#bib.bib63)\) using Qwen3\-4B\-Thinking\-2507\. As shown in Table [29](https://arxiv.org/html/2606.07108#A2.T29), DyCon consistently reduces token usage across English, Chinese, French, German, and Japanese, while maintaining comparable or slightly improved accuracy\. We further examine whether an English\-fitted difficulty regressor produces similar difficulty estimates on non\-English inputs\. As reported in Table [30](https://arxiv.org/html/2606.07108#A2.T30), the predicted difficulty scores on English and Chinese are close to their corresponding ground\-truth scores, suggesting that similar difficulty\-estimation behavior can emerge across languages\.
DyCon contains two components that are relevant to cross\-lingual transfer: the difficulty estimator and the reflection\-token vocabulary\. The difficulty estimator predicts the model’s current reasoning difficulty from internal representations, while the reflection\-token vocabulary determines which token logits are modulated during generation\. The second component is naturally language\-dependent, since different languages express reflective reasoning through different surface forms\. Therefore, when applying DyCon to a new language, we only replace the reflection\-token list with a small set of language\-specific reflection\-related tokens, without modifying the difficulty regressor\.
This setting allows us to test whether DyCon can retain its efficiency benefits under multilingual generation\. Importantly, we do not perform additional tuning, refitting, or language\-specific calibration for the regressor\. The only adaptation is the substitution of the reflection\-token vocabulary\. Therefore, the results reflect whether the original difficulty estimator can provide a useful control signal when paired with appropriate reflection\-token mappings in different languages\.
Table [29](https://arxiv.org/html/2606.07108#A2.T29) reports the multilingual results on MGSM\. Across all evaluated languages, DyCon consistently reduces the average number of generated tokens\. For English, DyCon improves Pass@1 from 95\.6 to 96\.8 while reducing the average token count from 1483 to 1116\. For Chinese, French, German, and Japanese, DyCon preserves the baseline accuracy while substantially reducing token usage\. These results indicate that DyCon can be effectively extended to multilingual reasoning tasks by adapting the reflection\-token vocabulary\.
Table 29: Cross\-lingual evaluation of DyCon on MGSM using Qwen3\-4B\-Thinking\-2507\. We report Pass@1 and average output tokens in the format of Pass@1 / Tok\.

To further understand this behavior, we separately analyze the difficulty estimator\. Specifically, we evaluate whether a regressor fitted on English data can produce reasonable difficulty estimates when applied to Chinese inputs\. Table [30](https://arxiv.org/html/2606.07108#A2.T30) compares the regressor\-predicted difficulty scores with the corresponding ground\-truth difficulty scores on English and Chinese\. The predicted scores are 0\.50 for English and 0\.47 for Chinese, which closely match the ground\-truth scores of 0\.51 and 0\.46, respectively\. This suggests that, at least empirically, the English\-fitted regressor can produce difficulty estimates on Chinese that are similar to the corresponding ground\-truth difficulty values\.
Table 30: Cross\-lingual evaluation of the English\-fitted difficulty regressor\. We report the mean regressor\-predicted difficulty scores and the corresponding ground\-truth difficulty scores on English and Chinese\.

Overall, Table [29](https://arxiv.org/html/2606.07108#A2.T29) shows that DyCon can reduce reasoning length across multiple languages without degrading accuracy, while Table [30](https://arxiv.org/html/2606.07108#A2.T30) provides preliminary evidence that English and Chinese exhibit similar difficulty\-estimation behavior under the same regressor\. We do not claim that the underlying difficulty estimator is theoretically language\-agnostic\. Rather, the empirical results suggest that there may exist a cross\-lingual correspondence between difficulty representations in different languages, and that an appropriate reflection\-token mapping may be sufficient for DyCon to transfer across languages in practice\. Formalizing this correspondence and developing a principled theory of cross\-lingual difficulty estimation remain promising directions for future research\.
### B\.10 Analysis of Regressor Refinement
#### Summary\.
In this section, we analyze the effect of the fit–refine–refit procedure on difficulty estimation and downstream DyCon performance\. The key question is whether the improvement from refinement simply comes from removing redundant or noisy trajectories\. Our results suggest that this is not the case\. As shown in Table [31](https://arxiv.org/html/2606.07108#A2.T31), directly removing length\-based outlier trajectories does not improve performance and can even degrade accuracy on challenging benchmarks\. In contrast, moderate trajectory refinement improves both regressor quality and downstream inference efficiency, as shown in Table [32](https://arxiv.org/html/2606.07108#A2.T32)\. However, this improvement is not monotonic: excessive refinement further increases the regressor’s $R^{2}$ but hurts downstream performance\. These results indicate that refinement should be understood as a controlled reshaping of the reasoning\-trajectory distribution rather than simple denoising\.
The difficulty regressor in DyCon is fitted on reasoning trajectories generated by the base model\. Therefore, the quality and distribution of these trajectories directly influence what kind of difficulty signal the regressor learns\. A natural hypothesis is that long or atypical trajectories may introduce noise, and that removing such trajectories should improve difficulty estimation\. However, this interpretation is overly simplistic\. Reasoning trajectories with unusually long generations are not necessarily invalid or harmful; they may correspond to harder problems, failed attempts, self\-corrections, or atypical reasoning patterns that are important for modeling the full behavior of the base model\.
To test whether removing such trajectories is beneficial, we conduct an IQR\-based outlier removal experiment\. Specifically, we remove length\-based outlier trajectories before fitting the difficulty regressor and then evaluate DyCon on downstream benchmarks\. Table [31](https://arxiv.org/html/2606.07108#A2.T31) reports the results\. Compared with standard DyCon, removing outliers slightly reduces the average token count on Math\-500 but lowers Pass@1\. More importantly, it substantially degrades performance on AIME24, where Pass@1 drops from 86\.7 to 80\.0\. This suggests that outlier trajectories are not merely noise\. Instead, they may contain informative examples of complex or atypical reasoning behavior\. This observation is also consistent with prior findings that abnormal or heavy\-tailed samples can carry important learning signals rather than being reducible to simple noise \(Gurbuzbalaban et al\., [2021](https://arxiv.org/html/2606.07108#bib.bib64)\)\.
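For concreteness, length-based IQR filtering can be sketched as follows. The text does not specify the fence multiplier or the quantile convention, so the conventional Tukey factor `k=1.5` and linear-interpolation quantiles used here are assumptions, as are the function names and the toy lengths.

```python
def quantile(sorted_vals, q):
    """Quantile of a pre-sorted list using linear interpolation."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def iqr_keep_mask(lengths, k=1.5):
    """True for trajectories whose length lies inside [Q1 - k*IQR, Q3 + k*IQR].
    k=1.5 is the conventional choice, not a value taken from the paper."""
    s = sorted(lengths)
    q1, q3 = quantile(s, 0.25), quantile(s, 0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [low <= l <= high for l in lengths]

# Toy trajectory lengths (made up): one unusually long generation.
lengths = [900, 1000, 1100, 1200, 1300, 9000]
mask = iqr_keep_mask(lengths)
kept = [l for l, keep in zip(lengths, mask) if keep]
print(kept)
```

In this sketch the single very long trajectory is the one discarded, which is exactly the kind of sample the experiment above suggests may still carry useful signal.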
Table 31: Effect of removing length\-based outlier trajectories before fitting the difficulty regressor\. We report Pass@1 and average output tokens in the format of Pass@1 / Tok\. Removing outliers does not consistently improve performance and can hurt accuracy on challenging benchmarks\.

The results in Table [31](https://arxiv.org/html/2606.07108#A2.T31) show that refinement should not be viewed as a procedure for simply discarding noisy samples\. Instead, the fit–refine–refit procedure modifies the structure of the reasoning trajectories while preserving their connection to the original model behavior\. After an initial DyCon pass, the generated trajectories tend to become more concise and structured\. Such trajectories may provide a clearer supervision signal for fitting the difficulty regressor, because the remaining reasoning steps are less dominated by unnecessary repetition while still reflecting the model’s problem\-solving process\. This is consistent with prior work showing that shorter but complete reasoning traces can serve as effective learning signals \(Wu et al\., [2025](https://arxiv.org/html/2606.07108#bib.bib65)\)\.
To further understand this effect, we evaluate multiple refinement iterations\. Table [32](https://arxiv.org/html/2606.07108#A2.T32) reports the regressor $R^{2}$ and downstream performance after successive refinement rounds\. The second iteration improves the regressor $R^{2}$ from 0\.8008 to 0\.9073 and also improves downstream results: Math\-500 increases from 96\.2 to 96\.6, AIME25 increases from 76\.7 to 80\.0, and token usage is further reduced on all three benchmarks\. This indicates that moderate refinement can improve the quality of the fitted difficulty estimator and make DyCon more efficient\.
However, the third iteration reveals an important limitation\. Although the regressor $R^{2}$ further increases to 0\.9276, downstream performance does not continue to improve\. In particular, AIME25 drops from 80\.0 to 73\.3, and token usage increases compared with the second iteration\. This discrepancy indicates that a higher $R^{2}$ on refined trajectories does not necessarily imply better inference\-time control\. The reason is that excessive refinement can shift the fitting distribution away from the base model’s original reasoning distribution\. Since DyCon is ultimately applied during the model’s actual inference process, the regressor must remain aligned with the trajectories that the model naturally produces\. When refinement becomes too strong, the regressor may fit the refined data better while becoming less suitable for controlling the original inference behavior\.
Table 32: Effect of iterative fit–refine–refit\. We report the regressor $R^{2}$, Pass@1, and average output tokens\. Moderate refinement improves both regressor quality and downstream performance, while excessive refinement increases $R^{2}$ but degrades generalization\.

Overall, Table [31](https://arxiv.org/html/2606.07108#A2.T31) and Table [32](https://arxiv.org/html/2606.07108#A2.T32) together suggest that the benefit of regressor refinement does not come from simply removing noisy or redundant reasoning\. Direct outlier removal can discard useful atypical trajectories and harm downstream performance\. In contrast, moderate refinement improves the structure of reasoning trajectories while maintaining sufficient alignment with the base model’s inference distribution\. Excessive refinement, however, can introduce a distribution mismatch: the regressor becomes better at fitting the refined trajectories, but less effective for controlling the model under its natural inference behavior\.
These findings position iterative refinement as an optional enhancement to DyCon rather than a necessary correction to the original pipeline\. The standard DyCon procedure already provides strong performance, while one additional refinement round can further improve efficiency and accuracy when the refined trajectories remain close to the original model distribution\. Designing principled criteria for determining when to stop refinement is an interesting direction for future work\.
### B\.11 Analysis of Unidirectional Logit Suppression
#### Summary\.
In this section, we analyze the effect of the modulation direction in DyCon\. The main version of DyCon adopts a unidirectional logit\-suppression strategy, where reflection\-related token logits are selectively suppressed according to the estimated difficulty\. To better understand this design choice, we compare it with a bidirectional variant that can both suppress and amplify reflection\-token logits\. As shown in Table [33](https://arxiv.org/html/2606.07108#A2.T33), bidirectional modulation can further improve accuracy on several benchmarks, but it also substantially increases the number of generated tokens\. In contrast, the original unidirectional DyCon achieves a more favorable efficiency–accuracy trade\-off by preserving comparable accuracy while producing significantly shorter reasoning outputs\.
The goal of DyCon is not to maximize accuracy at any computational cost, but to improve reasoning efficiency while maintaining or improving task performance\. This objective motivates the use of unidirectional suppression\. When the estimated difficulty is low, suppressing reflection\-related tokens encourages the model to avoid unnecessary continuation and terminate reasoning more efficiently\. When the estimated difficulty is high, the suppression is weakened, allowing the model to preserve sufficient reasoning capacity\. This design provides a conservative form of control: it primarily reduces excessive reasoning rather than actively forcing the model to reason more\.
A natural alternative is bidirectional modulation, where the method suppresses reflection\-related tokens under low estimated difficulty and amplifies them under high estimated difficulty\. This variant can encourage more exploration on difficult instances and may therefore improve accuracy\. This bidirectional control resembles the design philosophy of ReBalance \(Li et al\., [2026](https://arxiv.org/html/2606.07108#bib.bib66)\), which also adjusts reasoning behavior in both directions to balance performance and reasoning cost\. However, such bidirectional modulation can also increase the tendency of the model to generate longer reasoning traces, especially when the difficulty estimator assigns high scores\. We evaluate this bidirectional variant to understand whether the additional accuracy gain justifies the extra token cost\.
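The difference between the two modulation directions can be made concrete with a small sketch. This section does not give the exact modulation formula, so the linear shift `alpha * (difficulty - tau)`, the strength `alpha`, the neutral point `tau`, and the function name are hypothetical choices made only to illustrate the idea.

```python
def modulate_logits(logits, reflection_ids, difficulty,
                    alpha=4.0, tau=0.5, bidirectional=False):
    """Shift the logits of reflection-related tokens based on difficulty in [0, 1].

    Assumed form (not from the paper): shift = alpha * (difficulty - tau).
    Unidirectional mode clips the shift at zero, so logits are only ever
    suppressed, and the suppression vanishes once difficulty reaches tau.
    Bidirectional mode also allows a positive shift (amplification).
    """
    shift = alpha * (difficulty - tau)
    if not bidirectional:
        shift = min(shift, 0.0)  # suppress only, never amplify
    out = list(logits)
    for token_id in reflection_ids:
        out[token_id] += shift
    return out

logits = [1.0, 2.0, 3.0]   # toy vocabulary of three tokens
reflection_ids = [1]       # pretend token 1 is a reflection token

easy_uni = modulate_logits(logits, reflection_ids, difficulty=0.25)
hard_uni = modulate_logits(logits, reflection_ids, difficulty=0.75)
hard_bi = modulate_logits(logits, reflection_ids, difficulty=0.75,
                          bidirectional=True)
print(easy_uni, hard_uni, hard_bi)
```

Under these assumptions, a low-difficulty state lowers the reflection logit in both modes, while a high-difficulty state leaves it untouched in the unidirectional mode and raises it in the bidirectional one, which is the mechanism by which the bidirectional variant can lengthen reasoning traces.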
Table [33](https://arxiv.org/html/2606.07108#A2.T33) reports the comparison between the original DyCon and the bidirectional variant on DeepSeek\-R1\-Distill\-Qwen\-7B and Qwen3\-4B\-Thinking\-2507\. The results show a clear trade\-off\. For DeepSeek\-R1\-Distill\-Qwen\-7B, Bidirectional\-DyCon improves Pass@1 from 92\.0 to 92\.6 on Math\-500, from 53\.3 to 56\.7 on AIME24, and from 36\.7 to 40\.0 on AIME25 compared with standard DyCon\. However, it also generates more tokens on all three benchmarks\. A similar pattern is observed for Qwen3\-4B\-Thinking\-2507: Bidirectional\-DyCon improves accuracy, especially on AIME25, but its token usage becomes much closer to the baseline\.
Table 33: Comparison between unidirectional DyCon and a bidirectional modulation variant\. We report Pass@1 and average output tokens in the format of Pass@1 / Tok\. Bidirectional modulation improves accuracy in several cases but requires longer reasoning outputs, while the original unidirectional DyCon provides a stronger efficiency–accuracy trade\-off\.

The results in Table [33](https://arxiv.org/html/2606.07108#A2.T33) indicate that the modulation direction directly controls the trade\-off between accuracy and efficiency\. Bidirectional modulation is more aggressive: by amplifying reflection\-related tokens on difficult instances, it can increase the chance of solving challenging problems, but this often comes with longer reasoning trajectories\. Unidirectional suppression is more efficiency\-oriented: it mainly removes unnecessary reflection when the model is estimated to be in a low\-difficulty state, while avoiding excessive intervention on difficult problems\.
Therefore, the original design of DyCon prioritizes reasoning efficiency under controlled accuracy constraints\. This choice is aligned with the central goal of the method: reducing overthinking without substantially sacrificing task performance\. The bidirectional variant is still useful as an alternative when the application prioritizes accuracy over token efficiency, but the unidirectional version offers a more balanced default setting for efficient inference\.
### B\.12 Analysis of Regressor Complexity
#### Summary\.
In this section, we analyze whether using a more complex difficulty regressor improves DyCon\. The main implementation of DyCon adopts a simple linear regressor, which provides an efficient and stable way to decode difficulty from hidden states\. To examine whether this design is overly restrictive, we replace the default linear regressor with a two\-layer MLP whose hidden dimensions are 1024 and 512\. As shown in Table [34](https://arxiv.org/html/2606.07108#A2.T34), the MLP achieves competitive performance and can further improve accuracy in some cases\. However, it does not consistently provide a clearly superior efficiency–accuracy trade\-off over the simple linear regressor\. This suggests that the difficulty signal used by DyCon is already largely accessible through a simple linear readout from model representations\.
The difficulty estimator in DyCon maps intermediate hidden states to a scalar difficulty score\. A natural question is whether this mapping requires a more expressive nonlinear model\. In principle, a deeper regressor may capture more complex interactions among hidden dimensions and thus fit the training trajectories more accurately\. However, increased regressor complexity may also introduce additional sensitivity to the fitting distribution, increase implementation cost, and provide limited benefit if the relevant difficulty information is already well organized in the representation space\.
To study this question, we compare the default ordinary least squares \(OLS\) regressor with a two\-layer MLP regressor\. The MLP uses hidden dimensions of 1024 and 512, while all other components of DyCon remain unchanged\. Table [34](https://arxiv.org/html/2606.07108#A2.T34) reports the downstream performance on Math\-500, AIME2024, and AIME2025 using Qwen3\-4B\-Thinking\-2507\.
Table 34: Effect of regressor complexity on DyCon\. We compare the base model, DyCon with the default OLS regressor, and DyCon with a two\-layer MLP regressor\. We report Pass@1 and average output tokens in the format of Pass@1 / Tok\.

The results in Table [34](https://arxiv.org/html/2606.07108#A2.T34) show that the MLP regressor is effective: it improves Math\-500 from 96\.2 to 96\.6, reduces token usage from 6092 to 5505 compared with the OLS version, and improves AIME2025 from 76\.7 to 80\.0 while also reducing the number of generated tokens\. These results indicate that nonlinear regressors can serve as a valid alternative within the DyCon framework\.
At the same time, the gains from the MLP are moderate rather than transformative\. The simple OLS regressor already improves AIME2024 accuracy from 83\.3 to 86\.7 and substantially reduces token usage across all benchmarks compared with the base model\. Moreover, on AIME2024, the MLP obtains the same Pass@1 as OLS, with the main difference being a further reduction in token count\. This suggests that most of the useful difficulty signal can already be extracted by a lightweight linear readout\.
Overall, Table [34](https://arxiv.org/html/2606.07108#A2.T34) supports two conclusions\. First, DyCon is robust to the choice of regressor: replacing the linear regressor with a more expressive MLP preserves, and in some cases improves, downstream performance\. Second, the strong performance of OLS suggests that the model’s hidden states already encode difficulty\-related information in a largely linearly decodable form\. Therefore, we use the simple linear regressor as the default choice because it is lightweight, stable, and sufficient for obtaining strong efficiency–accuracy trade\-offs\. More complex regressors remain a possible extension, especially when additional validation data are available for controlling overfitting and distribution sensitivity\.
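To make the notion of a linear readout concrete, the sketch below fits an OLS regressor from hidden-state vectors to scalar difficulty targets via the normal equations and reports $R^2$. It is a toy, pure-Python illustration: the two-dimensional "hidden states", the synthetic targets, and all function names are assumptions, and a practical implementation would use a numerical library on real step-level embeddings.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_ols(H, d):
    """Fit weights (bias stored last) by solving (X^T X) w = X^T d."""
    X = [h + [1.0] for h in H]  # append a constant bias feature
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    Xtd = [sum(x[i] * y for x, y in zip(X, d)) for i in range(p)]
    return solve_linear(XtX, Xtd)

def predict(w, h):
    """Linear readout: dot(w[:-1], h) + bias."""
    return sum(wi * hi for wi, hi in zip(w, h)) + w[-1]

def r_squared(w, H, d):
    """Coefficient of determination of the fitted readout."""
    mean_d = sum(d) / len(d)
    ss_res = sum((y - predict(w, h)) ** 2 for h, y in zip(H, d))
    ss_tot = sum((y - mean_d) ** 2 for y in d)
    return 1.0 - ss_res / ss_tot

# Synthetic data: difficulty is an exact linear function of the features.
H = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
d = [0.1 + 0.2 * h[0] + 0.3 * h[1] for h in H]
w = fit_ols(H, d)
print(w, r_squared(w, H, d))
```

Because the synthetic targets are exactly linear, the fit recovers the generating weights up to floating-point error; on real hidden states the fit would be imperfect, and $R^2$ quantifies how much of the difficulty signal is linearly decodable.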
### B\.13 Analysis of Effectiveness on Non\-Reasoning Models
#### Summary\.
In this section, we analyze the behavior of DyCon\-related difficulty estimation on non\-reasoning instruction\-tuned models\. DyCon is primarily designed to mitigate overthinking in reasoning\-oriented models, where excessive reflection and unnecessarily long reasoning traces are common\. In contrast, non\-reasoning models such as Qwen2\.5\-Instruct typically generate much shorter outputs and often do not exhibit the same degree of redundant reasoning\. As shown in Table [35](https://arxiv.org/html/2606.07108#A2.T35), Qwen2\.5\-7B\-Instruct produces substantially shorter outputs than reasoning models, but its accuracy is also much lower on challenging mathematical benchmarks\. This suggests that the main limitation of such models is often insufficient reasoning rather than excessive reasoning\. Nevertheless, the difficulty\-estimation component of DyCon remains meaningful: as shown in Table [36](https://arxiv.org/html/2606.07108#A2.T36), regressors trained on Math can still recover reasonable dataset\-level difficulty trends for non\-reasoning models\. These results suggest that while reflection suppression is less beneficial for non\-reasoning models, difficulty estimation may still be useful for adaptive model routing or compute allocation\.
DyCon targets the overthinking phenomenon in reasoning models\. In such models, the generation process often contains long reflective segments, repeated verification, backtracking, and redundant intermediate reasoning\. Suppressing reflection\-related tokens under low estimated difficulty can therefore reduce unnecessary computation while preserving, or even improving, accuracy\. This setting is different for non\-reasoning instruction\-tuned models\. These models usually produce shorter answers and may not generate sufficiently detailed reasoning traces in the first place\. Therefore, there is less redundant reflection to suppress\.
Table [35](https://arxiv.org/html/2606.07108#A2.T35) illustrates this behavior using Qwen2\.5\-7B\-Instruct\. Compared with reasoning models evaluated in the main experiments, Qwen2\.5\-7B\-Instruct uses far fewer tokens on Math\-500, AIME2024, and AIME2025\. However, its Pass@1 is also much lower, especially on AIME2024 and AIME2025\. This indicates that short generation alone is not necessarily desirable: for non\-reasoning models, shorter outputs often reflect incomplete reasoning rather than efficient reasoning\. Consequently, directly applying reflection suppression to such models is expected to bring limited benefit, because their primary bottleneck is not overthinking but under\-reasoning\.
Table 35: Behavior of a non\-reasoning instruction\-tuned model on mathematical reasoning benchmarks\. We report Pass@1 and average output tokens in the format of Pass@1 / Tok\. The model produces short outputs but achieves much lower accuracy on challenging benchmarks, suggesting that insufficient reasoning is the main bottleneck\.

Although suppression is less suitable for non\-reasoning models, the underlying difficulty\-estimation assumption still holds to a meaningful extent\. Specifically, we examine whether hidden states from non\-reasoning models encode information about remaining generation length, which serves as a proxy for model\-perceived difficulty\. Regressors trained on Math achieve stable fitting quality, with $R^{2}\approx 0.64$ and $\mathrm{MAE}\approx 0.06$–$0.07$\. This indicates that even when the model does not produce long reasoning traces, its hidden representations still contain useful signals related to expected reasoning effort\.
To further evaluate this point, we compare the predicted difficulty scores with ground\-truth difficulty scores across datasets\. Table [36](https://arxiv.org/html/2606.07108#A2.T36) reports the dataset\-level difficulty trends for Qwen2\.5\-7B\-Instruct and Qwen2\.5\-1\.5B\-Instruct\. For both models, the predicted scores are closely aligned with the ground\-truth values on Math\-500 and GSM8K, and also reasonably track the higher difficulty of AIME2024 and AIME2025\. This suggests that the regressor can still recover meaningful relative difficulty information from non\-reasoning models\.
Table 36: Difficulty\-estimation trends on non\-reasoning instruction\-tuned models\. We report mean regressor\-predicted difficulty scores and the corresponding ground\-truth difficulty scores\. Higher values indicate higher estimated reasoning difficulty\.

Overall, Table [35](https://arxiv.org/html/2606.07108#A2.T35) and Table [36](https://arxiv.org/html/2606.07108#A2.T36) show that non\-reasoning models differ from reasoning models in two important ways\. First, they already generate relatively short outputs, so reflection suppression has limited room to reduce redundant reasoning\. Second, despite their shorter and often incomplete reasoning traces, their hidden states still encode useful difficulty\-related information\. Therefore, while DyCon’s suppression mechanism is most effective for reasoning models with pronounced overthinking, its difficulty estimator can still be valuable for non\-reasoning models\.
One potential application is adaptive routing\. For example, a lightweight non\-reasoning model could first estimate the difficulty of an input; if the estimated difficulty is low, the system may allow the non\-reasoning model to answer directly, while high\-difficulty cases can be routed to a stronger reasoning model or allocated more inference compute\. In this sense, DyCon’s difficulty\-estimation component may serve as a general lightweight signal for adaptive inference, even when reflection suppression itself is not the primary intervention\.
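The routing idea described above can be sketched in a few lines. The threshold value, the backend labels, and the function names are all hypothetical; the paper proposes routing only as a possible application and does not specify a decision rule.

```python
def route_by_difficulty(difficulty, threshold=0.5):
    """Pick a backend from a difficulty score in [0, 1].
    The threshold and the backend labels are illustrative assumptions."""
    return "reasoning_model" if difficulty >= threshold else "instruct_model"

def route_batch(scores, threshold=0.5):
    """Route a batch and report the share sent to the reasoning model."""
    routes = [route_by_difficulty(s, threshold) for s in scores]
    share = routes.count("reasoning_model") / len(routes)
    return routes, share

# Made-up difficulty scores, e.g. produced by a lightweight estimator.
routes, share = route_batch([0.2, 0.45, 0.8, 0.9])
print(routes, share)
```

In practice the threshold would need to be tuned against an accuracy and compute budget, since it directly sets how much traffic reaches the more expensive model.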
## Appendix C Related Work
#### From Parameter Scaling to Reasoning Scaling\.
Classical scaling laws establish that model performance follows power\-law relationships with model size, data, and compute \(Kaplan et al\., [2020](https://arxiv.org/html/2606.07108#bib.bib37)\)\. Following this paradigm, recent large models, such as GPT\-4o \(Hurst et al\., [2024](https://arxiv.org/html/2606.07108#bib.bib36)\) and DeepSeek\-V3 \(Liu et al\., [2024](https://arxiv.org/html/2606.07108#bib.bib35)\), have achieved remarkable success largely through massive parameter and compute scaling\. This scaling momentum has also propagated beyond text\-only NLP into multimodal and vision\-language domains, reshaping tasks from reasoning segmentation, open\-vocabulary perception, and language\-driven adaptation to multimodal reasoning, visual\-token compression, scene generation, and intervention\-based reliability improvement \(Lai et al\., [2024a](https://arxiv.org/html/2606.07108#bib.bib67); Yang et al\., [2023](https://arxiv.org/html/2606.07108#bib.bib79); Shao et al\., [2024](https://arxiv.org/html/2606.07108#bib.bib83); Yang et al\., [2024](https://arxiv.org/html/2606.07108#bib.bib84); Wang et al\., [2025b](https://arxiv.org/html/2606.07108#bib.bib90), [c](https://arxiv.org/html/2606.07108#bib.bib91); Li et al\., [2025b](https://arxiv.org/html/2606.07108#bib.bib85); Yang et al\., [2025c](https://arxiv.org/html/2606.07108#bib.bib77); Huang et al\., [2025a](https://arxiv.org/html/2606.07108#bib.bib92); Peng et al\., [2025b](https://arxiv.org/html/2606.07108#bib.bib89)\)\.
Meanwhile, foundation model designs have further influenced visual perception and representation learning, including semantic segmentation, few\-shot segmentation, scene text detection, long\-tailed recognition, 3D understanding, contrastive learning, point prompt learning, and 2D\-3D representation learning \(Tian et al\., [2020](https://arxiv.org/html/2606.07108#bib.bib68); Lai et al\., [2021](https://arxiv.org/html/2606.07108#bib.bib69); Tian et al\., [2019](https://arxiv.org/html/2606.07108#bib.bib70); Cui et al\., [2022](https://arxiv.org/html/2606.07108#bib.bib71); Jiang et al\., [2021](https://arxiv.org/html/2606.07108#bib.bib72); Peng et al\., [2023](https://arxiv.org/html/2606.07108#bib.bib73); Tian et al\., [2022b](https://arxiv.org/html/2606.07108#bib.bib75); Cui et al\., [2023](https://arxiv.org/html/2606.07108#bib.bib76); Luo et al\., [2023](https://arxiv.org/html/2606.07108#bib.bib78); Peng et al\., [2024b](https://arxiv.org/html/2606.07108#bib.bib80); Tian et al\., [2022a](https://arxiv.org/html/2606.07108#bib.bib81), [2023](https://arxiv.org/html/2606.07108#bib.bib82); Wang et al\., [2024](https://arxiv.org/html/2606.07108#bib.bib87); Ning et al\., [2023](https://arxiv.org/html/2606.07108#bib.bib88); Wu et al\., [2024](https://arxiv.org/html/2606.07108#bib.bib93); Zhang et al\., [2025b](https://arxiv.org/html/2606.07108#bib.bib94); Huang et al\., [2025b](https://arxiv.org/html/2606.07108#bib.bib95)\)\.
However, the marginal gains from parameter scaling often come with prohibitive computational costs\. Consequently, a complementary paradigm, reasoning scaling, has emerged, which improves model capability by expanding the depth and structure of inference\-time reasoning rather than merely increasing model width\. Starting from Chain\-of\-Thought prompting \(Wei et al\., [2022](https://arxiv.org/html/2606.07108#bib.bib7)\), this trajectory has evolved into more structured reasoning and search mechanisms, such as self\-correction \(Kumar et al\., [2024](https://arxiv.org/html/2606.07108#bib.bib38)\), Tree\-of\-Thoughts \(Yao et al\., [2023](https://arxiv.org/html/2606.07108#bib.bib39)\), and Graph\-of\-Thoughts \(Besta et al\., [2024](https://arxiv.org/html/2606.07108#bib.bib40)\), and has been further refined by preference optimization and continual adaptation techniques for long\-chain reasoning \(Lai et al\., [2024b](https://arxiv.org/html/2606.07108#bib.bib74); Peng et al\., [2025a](https://arxiv.org/html/2606.07108#bib.bib96), [2024a](https://arxiv.org/html/2606.07108#bib.bib86)\)\. While these methods enable smaller models to approach the performance of larger ones, they often introduce excessive inference overhead due to overthinking\. To address this issue, we propose DyCon\. Instead of further enlarging models or blindly extending reasoning chains, DyCon dynamically estimates residual reasoning demand from hidden states and adaptively regulates reasoning termination, reducing unnecessary thinking tokens while preserving answer quality\.
#### Large Reasoning Models\.
Building on this line of work, a new class of large reasoning models has recently emerged, including the DeepSeek\-R1 series \(Guo et al\., [2025](https://arxiv.org/html/2606.07108#bib.bib1)\) and OpenAI o1 series \(Jaech et al\., [2024](https://arxiv.org/html/2606.07108#bib.bib34)\)\. These models generate explicit intermediate reasoning before producing final answers, enabling iterative deliberation and improved problem decomposition\. As a result, they achieve substantially improved performance on complex reasoning tasks\.
#### Efficient Reasoning\.
Despite their strong reasoning capability, large reasoning models \(LRMs\) still face notable challenges\. In particular, excessively long reasoning processes \(often referred to as *overthinking*\) introduce substantial computational overhead\. A central question is therefore how to preserve strong reasoning ability while reducing reasoning length to improve efficiency\. This motivates the line of work on *efficient reasoning*\. Among existing approaches, the most direct and widely adopted strategy is prompt\-based control of reasoning behavior, including static prompt designs such as BTC \(Ding et al\., [2024](https://arxiv.org/html/2606.07108#bib.bib43)\), CoD \(Xu et al\., [2025a](https://arxiv.org/html/2606.07108#bib.bib5)\), CCoT \(Renze and Guven, [2024](https://arxiv.org/html/2606.07108#bib.bib41)\), CCoT\-2\-45 \(Nayab et al\., [2024](https://arxiv.org/html/2606.07108#bib.bib42)\), and NoThinking \(Ma et al\., [2025](https://arxiv.org/html/2606.07108#bib.bib4)\), as well as dynamic prompt methods such as ThinkPilot \(Li et al\., [2025a](https://arxiv.org/html/2606.07108#bib.bib13)\)\. Beyond prompt\-based methods, training\-based approaches also constitute an important direction for efficient reasoning\. These methods leverage supervised fine\-tuning \(SFT\) or reinforcement learning \(RL\) to explicitly encourage shorter chains of thought while preserving reasoning accuracy\. Representative work along this line includes C3oT \(Kang et al\., [2025](https://arxiv.org/html/2606.07108#bib.bib44)\), as well as SFT\- and RL\-based approaches for chain\-of\-thought compression and distillation \(Arora and Zanette, [2025](https://arxiv.org/html/2606.07108#bib.bib45); Munkhbat et al\., [2025](https://arxiv.org/html/2606.07108#bib.bib47); Shen et al\., [2025](https://arxiv.org/html/2606.07108#bib.bib46)\)\.
In addition, leveraging latent reasoning constitutes another important direction for efficient reasoning\. Rather than explicitly generating full chains of thought, these methods operate on latent or implicit reasoning representations, aiming to reduce token\-level reasoning overhead while retaining reasoning capability\. Representative approaches include SoftThinking \(Zhang et al\., [2025c](https://arxiv.org/html/2606.07108#bib.bib49)\) and SoftCoT \(Xu et al\., [2025b](https://arxiv.org/html/2606.07108#bib.bib50)\)\. Early\-exit methods constitute another active direction for efficient reasoning\. For example, TrimR \(Lin et al\., [2025a](https://arxiv.org/html/2606.07108#bib.bib28)\) and FlashThinking \(Jiang et al\., [2025](https://arxiv.org/html/2606.07108#bib.bib29)\) employ external large models to monitor the reasoning process and trigger early termination\. In contrast, DEER \(Yang et al\., [2025b](https://arxiv.org/html/2606.07108#bib.bib14)\) leverages the model’s internal confidence signals to decide when to exit, while Dynasor\-CoT \(Fu et al\., [2025](https://arxiv.org/html/2606.07108#bib.bib31)\) uses agreement across multiple sampled answers to guide early termination\. These methods demonstrate the effectiveness of early\-exit strategies for reducing reasoning cost\.
## Appendix D Details On Experimental Settings
### D\.1 Decoding and Sampling Settings
To ensure optimal model performance, we follow the original model configurations and experimental settings adopted in \(Guo et al\., [2025](https://arxiv.org/html/2606.07108#bib.bib1); Team, [2025](https://arxiv.org/html/2606.07108#bib.bib2); Yang et al\., [2025a](https://arxiv.org/html/2606.07108#bib.bib3)\)\. For the Qwen3\-4B\-Thinking\-2507 model, we set the temperature to 0\.6, Top\-p to 0\.95, Top\-k to 20, and Min\-p to 0, with the maximum output length fixed at 81,920 tokens\. For the DeepSeek\-R1\-Distill\-Qwen\-7B, QwQ\-32B, Qwen3\-8B, and Qwen3\-14B models, we adopt the same sampling configuration \(temperature = 0\.6, Top\-p = 0\.95, Top\-k = 20, Min\-p = 0\), while setting the maximum output length to 32,768 tokens\. All experiments are conducted with a fixed random seed of 42 to ensure reproducibility\.
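The settings above can be collected into a small configuration object. The following is a minimal sketch, not the authors' code: the names `DecodingConfig`, `decoding_config`, `LONG_OUTPUT_MODELS`, and `max_new_tokens` are hypothetical, and only the numeric values come from the text. In practice the fields would be mapped onto the sampling arguments of whichever inference library is used.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DecodingConfig:
    """Sampling settings reported in D.1 (field names are illustrative)."""
    temperature: float = 0.6
    top_p: float = 0.95
    top_k: int = 20
    min_p: float = 0.0
    max_new_tokens: int = 32_768
    seed: int = 42


# Per the text, only Qwen3-4B-Thinking-2507 uses the longer output budget.
LONG_OUTPUT_MODELS = {"Qwen3-4B-Thinking-2507"}


def decoding_config(model_name: str) -> DecodingConfig:
    """Return the D.1 settings for a model; only the output budget varies."""
    budget = 81_920 if model_name in LONG_OUTPUT_MODELS else 32_768
    return DecodingConfig(max_new_tokens=budget)


if __name__ == "__main__":
    for name in ("Qwen3-4B-Thinking-2507", "QwQ-32B"):
        print(name, asdict(decoding_config(name)))
```

Keeping the shared values as dataclass defaults makes the single per-model difference (the maximum output length) explicit.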
### D\.2 Token Lists Used for Suppression
For reproducibility, we adopt the same predefined token lists for suppression as used in NoWait \(Wang et al\., [2025a](https://arxiv.org/html/2606.07108#bib.bib16)\), as summarized in Table [37](https://arxiv.org/html/2606.07108#A4.T37), enabling a direct and fair comparison without introducing additional design choices\.
Table 37: Predefined token phrases used for suppression, following NoWait \(Wang et al\., [2025a](https://arxiv.org/html/2606.07108#bib.bib16)\)\.

| Predefined Token Phrases |
| --- |
| wait, alternatively, hmm, but, however, alternative, another, check, double\-check, oh, maybe, verify, other, again, now, ah, any |
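To illustrate how a phrase list like this can be turned into a decoding-time constraint, the sketch below masks the logits of matching vocabulary entries. It is an illustration of generic logit suppression under stated assumptions, not the NoWait or DyCon implementation: the helper names are hypothetical, the toy vocabulary is made up, and matching is by exact equality after stripping subword space markers and lowercasing. The matching rule used in the original work may differ, and multi-token phrases such as "double-check" would need separate handling.

```python
import math

# Phrase list copied from Table 37.
SUPPRESSED_PHRASES = [
    "wait", "alternatively", "hmm", "but", "however", "alternative",
    "another", "check", "double-check", "oh", "maybe", "verify",
    "other", "again", "now", "ah", "any",
]


def normalize(token: str) -> str:
    """Drop common subword space markers, surrounding whitespace, and case."""
    return token.replace("Ġ", "").replace("▁", "").strip().lower()


def suppressed_ids(vocab: dict[str, int], phrases=SUPPRESSED_PHRASES) -> set[int]:
    """Collect ids of vocabulary tokens that equal a listed phrase (assumed rule)."""
    targets = set(phrases)
    return {tid for tok, tid in vocab.items() if normalize(tok) in targets}


def suppress(logits: list[float], banned: set[int]) -> list[float]:
    """Set banned positions to -inf so they can never be sampled."""
    return [-math.inf if i in banned else x for i, x in enumerate(logits)]


if __name__ == "__main__":
    toy_vocab = {"Ġwait": 0, "Wait": 1, "the": 2, "▁Hmm": 3, "answer": 4}
    banned = suppressed_ids(toy_vocab)
    print(sorted(banned))
    print(suppress([1.0, 2.0, 3.0, 4.0, 5.0], banned))
```

With a real tokenizer, the same mask would typically be applied inside a logits-processing hook at every decoding step.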
### D\.3 Implementation Details
We implement our method using both the native HuggingFace Transformers library \(Wolf et al\., [2019](https://arxiv.org/html/2606.07108#bib.bib56)\) and vLLM \(Kwon et al\., [2023](https://arxiv.org/html/2606.07108#bib.bib57)\)\. Unless otherwise stated, all experimental results reported in this paper are based on the HuggingFace Transformers implementation\.
### D\.4 Details on Benchmarks
Math\-500 \(Lightman et al\., [2023](https://arxiv.org/html/2606.07108#bib.bib17)\): A difficulty\-balanced mathematical reasoning benchmark comprising 500 problems, with each instance labeled according to a five\-level difficulty hierarchy \(Level 1 to Level 5\)\.
GSM8K \(Cobbe et al\., [2021](https://arxiv.org/html/2606.07108#bib.bib18)\): A grade\-school mathematics reasoning benchmark comprising 1,319 problems, on which most instruction\-tuned models already achieve high accuracy\.
AIME2024 \(AI\-MO, [2024a](https://arxiv.org/html/2606.07108#bib.bib19)\): A set of 30 challenging problems from the American Invitational Mathematics Examination, with difficulty substantially exceeding that of the AMC series and typically requiring extended multi\-step reasoning\.
AIME2025 \(OpenCompass, [2025](https://arxiv.org/html/2606.07108#bib.bib20)\): A collection of 30 challenging problems from the American Invitational Mathematics Examination, commonly regarded as an extension of AIME2024 and similarly demanding complex, multi\-step reasoning\.
AMC23 \(AI\-MO, [2024b](https://arxiv.org/html/2606.07108#bib.bib21)\): Problems from the AMC \(American Mathematics Competition\), one of the most influential pre\-college mathematics competitions worldwide, consisting of 40 problems and typically regarded as lower in difficulty compared to AIME\-level benchmarks\.
Olympiad Bench \(He et al\., [2024](https://arxiv.org/html/2606.07108#bib.bib22)\): A collection of 675 challenging Olympiad\-style problems drawn from international mathematical olympiad competitions, typically requiring deep and rigorous multi\-step reasoning\.
MMLU \(Hendrycks et al\., [2020](https://arxiv.org/html/2606.07108#bib.bib27)\): A large\-scale, multi\-task benchmark consisting of multiple\-choice questions drawn from a wide range of knowledge domains\. The benchmark spans the humanities, social sciences, and the hard sciences\. In this work, we adopt the abstract mathematics subset to evaluate models’ mathematical reasoning abilities, comprising 100 problems of relatively low difficulty\.
GPQA Diamond \(Rein et al\., [2024](https://arxiv.org/html/2606.07108#bib.bib23)\): A challenging scientific multiple\-choice benchmark comprising 198 questions authored by domain experts in biology, physics, and chemistry\.
LiveCodeBench \(Jain et al\., [2024](https://arxiv.org/html/2606.07108#bib.bib25)\): A code evaluation benchmark consisting of 400 programming problems drawn from diverse sources, including LeetCode, AtCoder, and Codeforces\. We use version v1 in our experiments\.
StrategyQA \(Geva et al\., [2021](https://arxiv.org/html/2606.07108#bib.bib24)\): A creative and diverse yes\-no question benchmark that requires implicit multi\-step reasoning\. The dataset contains 2,290 questions and is generally of low difficulty\.
TriviaQA \(Joshi et al\., [2017](https://arxiv.org/html/2606.07108#bib.bib26)\): A reading comprehension benchmark composed of question\-answer\-evidence triples\. In this work, we disable retrieval\-augmented generation \(RAG\) to assess the model’s intrinsic knowledge and reasoning capabilities\. From the original test split, we randomly sample 20% of the examples for evaluation, resulting in a subset of 3,589 knowledge\-oriented questions of moderate difficulty\.
CommonSenseQA \(Talmor et al\., [2019](https://arxiv.org/html/2606.07108#bib.bib32)\): A multiple\-choice question answering benchmark that requires diverse types of commonsense knowledge to identify the correct answer\. Each instance consists of one correct option and four distractors, with a total of 1,221 questions\.
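The 20% TriviaQA subsample described above can be made reproducible by drawing indices from a seeded generator. The sketch below shows one way to do this. The function name is hypothetical, and using seed 42 and rounding to the nearest integer are assumptions: the text states the fraction and the global experiment seed, but not how the subset was actually drawn.

```python
import random


def subsample_indices(n_examples: int, fraction: float = 0.2, seed: int = 42) -> list[int]:
    """Pick a fixed random subset of example indices (assumed procedure)."""
    k = round(n_examples * fraction)
    rng = random.Random(seed)  # local generator, does not touch global state
    return sorted(rng.sample(range(n_examples), k))


if __name__ == "__main__":
    idx = subsample_indices(1000)
    print(len(idx), idx[:5])
```

A local `random.Random` instance keeps the selection independent of any other random calls made during evaluation.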
### D\.5 Details of Baseline Methods
In our performance comparison, we evaluate the proposed method against a broad range of representative efficient reasoning approaches across multiple paradigms\. Specifically, we consider: \(1\) *steering\-based* methods, including SEAL \(Chen et al\., [2025](https://arxiv.org/html/2606.07108#bib.bib30)\), Controlling Thinking Speed \(Lin et al\., [2025b](https://arxiv.org/html/2606.07108#bib.bib33)\), and Manifold Steering \(Huang et al\., [2025d](https://arxiv.org/html/2606.07108#bib.bib15)\); \(2\) *prompt\-based* methods, including CoD \(Xu et al\., [2025a](https://arxiv.org/html/2606.07108#bib.bib5)\), NoThinking \(Ma et al\., [2025](https://arxiv.org/html/2606.07108#bib.bib4)\), and ThinkPilot \(Li et al\., [2025a](https://arxiv.org/html/2606.07108#bib.bib13)\); \(3\) *early\-exit\-based* methods, including DEER \(Yang et al\., [2025b](https://arxiv.org/html/2606.07108#bib.bib14)\), TrimR \(Lin et al\., [2025a](https://arxiv.org/html/2606.07108#bib.bib28)\), Dynasor\-CoT \(Fu et al\., [2025](https://arxiv.org/html/2606.07108#bib.bib31)\), and FlashThinking \(Jiang et al\., [2025](https://arxiv.org/html/2606.07108#bib.bib29)\); and \(4\) *output\-based* methods, represented by NoWait \(Wang et al\., [2025a](https://arxiv.org/html/2606.07108#bib.bib16)\)\.
### D\.6 Details on Prompts\.
Math\-500, AIME2024, AIME2025, AMC23, GSM8K, Olympiad\-Bench, and MMLU: <\|System\|\> Please reason step by step, and place the final answer inside \\boxed\{\}\. <\|User\|\> \[question\]
GPQA Diamond, CommonSenseQA: <\|System\|\> Please reason step by step, and place the final answer inside \\boxed\{\}\. <\|User\|\> \[question\] Answer with the choice letter only, in \\boxed\{\}\. Do not include option text\.
StrategyQA: <\|System\|\> You answer binary commonsense questions\. Think step by step, then output exactly one final line: \\boxed\{Yes\} or \\boxed\{No\}\. <\|User\|\> \[question\] Answer with \\boxed\{Yes\} or \\boxed\{No\} only\.
LiveCodeBench: <User\> \#\#\#\#\#\# Instruction: You will be given a question \(problem specification\) and will generate a correct Python program that matches the specification and passes all tests\. You will NOT return anything except for the program\. Question: \[problem\] Ensure that when the python program runs, it reads the inputs, runs the algorithm and writes output to STDOUT\. python \#\# YOUR CODE HERE \#\#\#\#\#\# Response:<\|im\_end\|\><\|im\_start\|\>assistant<\|think\|
TriviaQA: <\|System\|\> Please answer the question\. Directly provide the final answer inside <answer\> and </answer\>, without any explanation or additional text\. Example: <answer\> London </answer\> <\|User\|\> \[question\]
Step\-wise Difficulty Self\-Assessment Prompt: Let me quickly rate this problem’s difficulty \(1=almost solved, 2=some uncertainty remains, 3=missing key step\) based on the reasoning so far\. Difficulty =
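The prompts above fix three output conventions: a final `\boxed{...}` answer, an `<answer>...</answer>` span for TriviaQA, and a 1 to 3 rating that continues the text `Difficulty =`. A minimal sketch of how such outputs could be parsed follows. All function names are hypothetical, and the parsing rules (take the last occurrence, accept only the digits 1 to 3) are assumptions rather than the evaluation code used in the paper.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    # Return the content of the last \boxed{...}, tracking nested braces
    # so that answers such as \boxed{\frac{1}{2}} are kept whole.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # braces never closed


def extract_answer_tag(text: str) -> Optional[str]:
    # Return the content of the last <answer>...</answer> span, if any.
    matches = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def parse_difficulty(continuation: str) -> Optional[int]:
    # Read the rating the model writes right after "Difficulty =".
    m = re.match(r"\s*([123])", continuation)
    return int(m.group(1)) if m else None


if __name__ == "__main__":
    print(extract_boxed("Thus the result is \\boxed{\\frac{1}{2}}."))
    print(extract_answer_tag("<answer> London </answer>"))
    print(parse_difficulty(" 2 (some uncertainty remains)"))
```

Taking the last match guards against intermediate answers that appear earlier in a long reasoning trace.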
### D\.7 Hardware Configuration\.
All experiments were performed on NVIDIA RTX PRO 6000 \(Blackwell Server Edition\) GPUs to ensure a consistent hardware environment\.
## Appendix E Case Study
Figure 15: Qualitative case study on an easy GSM8K problem for Qwen3\-4B\-Thinking\-2507\. The difficulty regressor stays low from the beginning and further decreases as the core computation is completed, yielding a short, stable reasoning trajectory\.
Figure 16: Qualitative case study on a hard AIME problem for Qwen3\-4B\-Thinking\-2507\. The figure shows the step\-wise reasoning transcript with difficulty regressor annotations\. The regressor remains near 1\.0 for most of the trajectory and only drops to ∼0\.5 after a late key insight, indicating that the model resolves the core difficulty only near the end of reasoning\.