CAT: Confidence-Adaptive Thinking for Efficient Reasoning of Large Reasoning Models
Summary
CAT introduces a framework that leverages model self-certainty signals to autonomously adjust reasoning length based on problem difficulty, reducing overthinking and improving inference efficiency for large reasoning models.
# CAT: Confidence-Adaptive Thinking for Efficient Reasoning of Large Reasoning Models
Source: [https://arxiv.org/html/2607.00862](https://arxiv.org/html/2607.00862)
Qizhi Jiang1, Shuo Wang1, Pei Ke1,2,∗, Yuhang Song1, Ke Qin1,2
1Laboratory of Intelligent Collaborative Computing, University of Electronic Science and Technology of China, Chengdu, China
2Ubiquitous Intelligence and Trusted Services Key Laboratory of Sichuan Province
\{jiangqizhi, 202422900227\}@std\.uestc\.edu\.cn, kepei@uestc\.edu\.cn, songyuhang@std\.uestc\.edu\.cn, qinke@uestc\.edu\.cn
###### Abstract
Large Reasoning Models \(LRMs\) have achieved remarkable success on complex tasks by leveraging long chain\-of\-thought \(CoT\) trajectories, yet they frequently exhibit overthinking on simple queries, resulting in significant token overhead and reduced inference efficiency\. However, existing compression methods predominantly apply uniform length reduction or rely on coarse\-grained difficulty estimation, often leading to performance degradation on difficult problems\. To address this limitation, we propose Confidence\-Adaptive Thinking \(CAT\), a framework that incorporates the model’s intrinsic self\-certainty signals as confidence into the preference optimization process, which autonomously modulates reasoning lengths based on problem difficulty\. Experimental results show that CAT consistently outperforms state\-of\-the\-art baselines on reasoning accuracy across multiple benchmarks on different base models\. Our work enables LRMs to effectively compress confident responses while deliberating on uncertain ones, offering a potentially robust solution for balancing accuracy and latency in practical industrial scenarios\.
∗Corresponding author\.
## 1 Introduction
Recently, large reasoning models \(LRMs\) have rapidly emerged and made substantial progress on complex natural language processing \(NLP\) tasks, as exemplified by OpenAI\-o1 OpenAI \([2024](https://arxiv.org/html/2607.00862#bib.bib1)\) and DeepSeek\-R1 DeepSeek\-AI \([2025](https://arxiv.org/html/2607.00862#bib.bib2)\)\. These models are equipped with the ability to generate long reasoning chains, demonstrating strong potential on challenging reasoning problems such as mathematical competitions Xu et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib31)\)\. However, while LRMs heavily rely on long chain\-of\-thought \(CoT\) traces to perform well on difficult tasks, they tend to produce redundant reasoning and self\-reflection for simple inputs, incurring pronounced overthinking and token overhead Chen et al\. \([2024](https://arxiv.org/html/2607.00862#bib.bib4)\); Feng et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib7)\); Liu et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib5)\); Sui et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib6)\)\. This behavior leads to verbose thought chains that increase computation cost and reduce overall inference efficiency\. Accordingly, how to enable LRMs to dynamically adjust token consumption based on the input difficulty has attracted increasing attention, determining the practical industrial usability of LRMs in terms of the balance between accuracy and latency Shen et al\. \([2025a](https://arxiv.org/html/2607.00862#bib.bib14)\)\.
Most existing approaches focus on reasoning compression and length control: they treat shortening reasoning chains as the primary objective Qu et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib8)\) and apply a uniform reduction of reasoning tokens to all queries Xia et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib9)\); Chen et al\. \([2024](https://arxiv.org/html/2607.00862#bib.bib4)\); Ma et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib10)\); Munkhbat et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib11)\)\. While such methods can substantially decrease generation length, they often incur non\-trivial performance degradation on difficult problems, since complex tasks still require sufficient reasoning depth and length to sustain accurate answers Muennighoff et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib13)\); Zeng et al\. \([2024](https://arxiv.org/html/2607.00862#bib.bib12)\)\. Another line of work resorts to difficulty\-adaptive reasoning to mitigate the imbalance between overthinking on easier instances and underthinking on harder ones\. This category of methods tends to dynamically adjust the budget of output tokens based on the model performance Shen et al\. \([2025a](https://arxiv.org/html/2607.00862#bib.bib14)\)\.
However, existing works on adaptive reasoning still face a severe challenge of coarse\-grained difficulty estimation\. Current methods utilize the accuracy of model outputs to measure the problem difficulty and roughly determine the output length Shen et al\. \([2025a](https://arxiv.org/html/2607.00862#bib.bib14)\)\. We argue that this coarse\-grained estimation heavily relies on external labels and provides a partial assessment merely on the answer, rather than measuring the quality of the whole reasoning chains generated by LRMs\.
To address this limitation, we propose CAT \(Confidence\-Adaptive Thinking\), an adaptive reasoning framework driven by the model’s intrinsic confidence\. Inspired by recent works on quality estimation from the model’s internal token distributions Fu et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib19)\); Geng et al\. \([2024](https://arxiv.org/html/2607.00862#bib.bib20)\); Fadeeva et al\. \([2024](https://arxiv.org/html/2607.00862#bib.bib21)\), our main idea is to leverage self\-certainty Kang et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib18)\) as an intrinsic fine\-grained indicator to distinguish high\-quality reasoning trajectories from erroneous ones\. Firstly, CAT employs self\-certainty as the model’s intrinsic confidence metric to estimate the quality of generated reasoning trajectories, which reflects the problem difficulty\. Based on the separation of confidence and lengths between different trajectories, we further construct preference data so that the model captures the relationship between problem difficulty and output length\. Secondly, we devise a confidence\-weighted preference optimization \(CWPO\) method, which weights the vanilla preference optimization objective with confidence\. This encourages the model to compress reasoning steps under high confidence while retaining necessary exploration otherwise, thereby mitigating overthinking for simple cases and maintaining reasoning performance, especially for hard ones\.
In summary, our main contributions are as follows \(our code is available at [https://github\.com/Jiang9732/CAT\-code](https://github.com/Jiang9732/CAT-code)\):
- We introduce the confidence\-adaptive thinking \(CAT\) framework that shifts the paradigm of efficient reasoning from external supervision to intrinsic confidence awareness\. CAT enables reasoning models to autonomously perceive problem difficulty and modulate their thinking depth\.
- We propose the confidence\-weighted preference optimization \(CWPO\) objective that dynamically weights the vanilla objective based on the calibration ratio of confidence to length\. CWPO mitigates overthinking while preserving the model’s ability to explore complex reasoning paths if necessary\.
- We conduct extensive experiments across three challenging benchmarks and show superior performance of CAT over state\-of\-the\-art baselines on the balance between inference efficiency and reasoning accuracy\.
## 2 Related Work
Efficient Reasoning in LRMs\. Recent studies have increasingly focused on the phenomenon of overthinking in large reasoning models Sui et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib6)\); Wu et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib28)\); Wang et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib29)\)\. Existing efficient reasoning methods can generally be categorized into two streams\. The first involves training strategies to equip LRMs with the ability to generate concise reasoning chains, spanning from supervised fine\-tuning Cui et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib30)\); Xia et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib9)\) to reinforcement learning Shen et al\. \([2025a](https://arxiv.org/html/2607.00862#bib.bib14)\); Aggarwal and Welleck \([2025](https://arxiv.org/html/2607.00862#bib.bib32)\); Luo et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib33)\); Yu et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib34)\)\. The second category comprises inference\-time methods, including prompting Han et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib35)\); Renze and Guven \([2024](https://arxiv.org/html/2607.00862#bib.bib36)\); Nayab et al\. \([2024](https://arxiv.org/html/2607.00862#bib.bib37)\), task routing Chuang et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib38)\); Ong et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib39)\), latent space compression Hao et al\. \([2024](https://arxiv.org/html/2607.00862#bib.bib41)\); Shen et al\. \([2025b](https://arxiv.org/html/2607.00862#bib.bib40)\), and dynamic decoding Sun et al\. \([2024](https://arxiv.org/html/2607.00862#bib.bib42)\); Zhang \([2025](https://arxiv.org/html/2607.00862#bib.bib44)\)\.
Compared with existing works on training methods for efficient reasoning, our work utilizes the model’s confidence as the estimate of problem difficulty, instead of solely depending on external reward models and extrinsic metrics\. This makes full use of the model’s intrinsic properties to achieve adaptive reasoning\.
Confidence Utilization in LRMs\. Recent works have shown that model confidence can potentially indicate the quality of reasoning chains Fu et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib19)\); Geng et al\. \([2024](https://arxiv.org/html/2607.00862#bib.bib20)\); Kang et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib18)\); Fadeeva et al\. \([2024](https://arxiv.org/html/2607.00862#bib.bib21)\)\. As one of the representative metrics to reflect confidence, self\-certainty Kang et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib18)\) has been primarily applied to Best\-of\-N selection Fu et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib19)\)\. In contrast, our work uses self\-certainty as the model’s confidence to self\-evaluate the quality of generated reasoning chains, which guides adaptive thinking via preference optimization instead of merely injecting it into the inference stage\.
Figure 1: Overview of the CAT framework\.
## 3 Methodology
### 3\.1 Task Definition and Method Overview
Given an input question $x$, our goal is to acquire a reasoning trajectory $y$ that contains a multi\-step reasoning process and a final answer\. Under the precondition of accuracy, $y$ is required to be short for simple problems while being long for hard ones if necessary\.
An overview of our framework is presented in Figure[1](https://arxiv.org/html/2607.00862#S2.F1)\. Firstly, we sample multiple reasoning trajectories for each question and compute their path\-level self\-certainty scores as confidence via a dedicated forward pass \(Section[3\.2\.1](https://arxiv.org/html/2607.00862#S3.SS2.SSS1)\)\. Secondly, we construct preference pairs based on the confidence and lengths, and apply dynamic selection to prioritize more informative supervision \(Section[3\.2\.1](https://arxiv.org/html/2607.00862#S3.SS2.SSS1)\)\. Finally, we fine\-tune the base LRM with a confidence\-weighted preference optimization objective, which incorporates confidence and lengths to further modulate the preference strength, achieving conditional length regulation \(Section[3\.2\.2](https://arxiv.org/html/2607.00862#S3.SS2.SSS2)\)\.
### 3\.2 Confidence\-Adaptive Thinking
Our confidence\-adaptive thinking framework consists of two stages: confidence\-aware preference labeling and confidence\-weighted preference optimization\. The first stage incorporates confidence as an intrinsic signal to construct fine\-grained preference pairs, while the second stage uses confidence to reweight the preference optimization objective\.
#### 3\.2\.1 Confidence\-Aware Preference Labeling
To build the preference dataset, we first sample $K$ reasoning trajectories $\{y^{(k)}\}_{k=1}^{K}$ for the question $x$ from the base reasoning model, each of which is a token sequence $y^{(k)} = \left(y^{(k)}_{1}, \dots, y^{(k)}_{n_k}\right)$ of length $n_k$\. The goal of this stage is to construct a preference dataset $\mathcal{D} = \{(x, y_w, y_l, s)\}$, where $y_w$ and $y_l$ denote the winning and losing trajectories for the same input $x$, and $s$ indicates the confidence\-calibrated preference score\.
##### Self\-Certainty as Intrinsic Confidence\.
To capture the model’s intrinsic confidence during reasoning, we follow Kang et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib18)\) to employ self\-certainty, which can also serve as a trajectory\-level quality measure\. Formally, assuming that $\bm{p}_{\theta}(\cdot \mid x, y_{\leq i})$ denotes the next\-token distribution at the $i$\-th position, $V$ indicates the vocabulary size, and $\mathcal{U}$ represents the uniform distribution over the $V$ vocabulary entries, self\-certainty \(SC\) can be computed as follows:
$$\text{SC}(x,y) \;=\; -\frac{1}{nV}\sum_{i=1}^{n}\sum_{j=1}^{V}\log\!\Big(V\cdot\bm{p}_{\theta}(j\mid x,y_{\leq i})\Big) \tag{1}$$
which corresponds to measuring the KL divergence $D_{\mathrm{KL}}\!\bigl(\mathcal{U}\,\|\,p_{\theta}(\cdot\mid x,y_{\leq i})\bigr)$ and averaging this quantity over $i$\. Intuitively, a larger divergence from the uniform distribution implies a more peaked \(and thus more certain\) predictive distribution, leading to higher SC\. Conversely, a distribution closer to uniform is flatter, indicating greater uncertainty and yielding lower SC\.
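As a minimal sketch of Eq. (1), the following pure-Python snippet computes self-certainty from per-position logits. The function names and the toy four-token vocabulary are illustrative assumptions, not part of the released code; a practical implementation would operate on batched model logits from the dedicated forward pass.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def self_certainty(step_logits):
    """Eq. (1): average over positions of KL(U || p_theta(. | x, y_<=i)).

    step_logits is a list with one list of vocabulary logits per generated
    position (hypothetical input format chosen for this sketch).
    """
    per_position = []
    for logits in step_logits:
        vocab_size = len(logits)
        log_probs = log_softmax(logits)
        # -(1/V) * sum_j log(V * p_j), using log(V * p_j) = log V + log p_j
        kl = -sum(math.log(vocab_size) + lp for lp in log_probs) / vocab_size
        per_position.append(kl)
    return sum(per_position) / len(per_position)

# Toy comparison: a peaked distribution should score higher than a flat one.
peaked_sc = self_certainty([[5.0, 0.0, 0.0, 0.0]])
flat_sc = self_certainty([[1.0, 0.0, 0.0, 0.0]])
print(peaked_sc > flat_sc)
```

Working in log-space avoids taking the logarithm of probabilities that underflow to zero, which matters for realistic vocabulary sizes.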
##### Preference Pair Construction\.
We consider three important factors of each trajectory to construct the preference dataset: \(i\) the correctness of the answer, \(ii\) the length, and \(iii\) the intrinsic confidence based on SC in Eq\. \([1](https://arxiv.org/html/2607.00862#S3.E1)\)\. We emphasize that SC is complementary to external factors: it estimates fine\-grained trajectory\-level quality and determines the strength of pairwise preferences\.
Inspired by Shen et al\. \([2025a](https://arxiv.org/html/2607.00862#bib.bib14)\), we categorize preference pairs into two types: Conciseness Pairs \(CPs\), formed by two correct trajectories where the preferred one is shorter; and Deliberation Pairs \(DPs\), formed by two incorrect traces where the preferred one is longer\. Unlike prior approaches that calibrate preference strength using per\-question fixed budgets or external difficulty estimation, CAT uses only model\-internal evidence to modulate pairwise preference scores $s$\.
For each input question $x$ and its $K$ candidate reasoning paths, we consider the margins in both lengths and self\-certainty to acquire the preference score $s$\. Specifically, given a candidate pair $(x, y_w, y_l)$, we first compute the margins in terms of self\-certainty, lengths, and correctness:
$$\begin{aligned}\Delta r &\;=\; r(y_w) - r(y_l) \\ \Delta\mathrm{SC} &\;=\; \mathrm{SC}(x,y_w) - \mathrm{SC}(x,y_l)\end{aligned} \tag{2}$$
where $\mathrm{SC}(\cdot)$ can be acquired by Eq\. \([1](https://arxiv.org/html/2607.00862#S3.E1)\) and $r(\cdot)$ is a factor with respect to reasoning lengths and correctness:
$$r(y)=\begin{cases}+\frac{1}{|y|} & \text{if $y$ is correct}\\ -\frac{1}{|y|} & \text{if $y$ is incorrect}\end{cases} \tag{3}$$
This design assigns the highest reward to short, correct paths while imposing the lightest penalty on long, incorrect paths\. Conversely, short but incorrect paths receive the most severe penalty\.
For CPs, our intent is to favor short and confident solutions and reject long and unconfident ones\. We therefore multiply the margins of $r$ and $\mathrm{SC}$ so that a pair receives stronger strength precisely when the winning path $y_w$ is not only much more efficient but also more internally decisive:
$$s_{\text{CP}}(x,y_w,y_l) \;=\; \Delta r\cdot\Delta\mathrm{SC} \tag{4}$$
For DPs, we want to prefer long and unconfident attempts over short and confident failures, discouraging premature yet decisive mistakes\. Accordingly, we reverse the confidence term, making wrong trajectories with larger certainty receive stronger penalties:
$$s_{\text{DP}}(x,y_w,y_l) \;=\; \Delta r\cdot\big(-\Delta\mathrm{SC}\big) \tag{5}$$
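The scoring rules in Eqs. (2) to (5) can be sketched in a few lines. The `Trajectory` container, the function names, and the numeric values below are hypothetical and chosen only for illustration; the pair type is inferred here from the correctness of both trajectories, which is one plausible reading of the CP/DP definitions.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    length: int      # |y|, number of tokens in the trajectory
    correct: bool    # whether the final answer was verified as correct
    sc: float        # path-level self-certainty from Eq. (1)

def length_reward(y: Trajectory) -> float:
    """Eq. (3): +1/|y| for correct paths, -1/|y| for incorrect paths."""
    sign = 1.0 if y.correct else -1.0
    return sign / y.length

def preference_score(winner: Trajectory, loser: Trajectory) -> float:
    """Eqs. (2), (4), (5): confidence-calibrated score for one pair."""
    delta_r = length_reward(winner) - length_reward(loser)
    delta_sc = winner.sc - loser.sc
    if winner.correct and loser.correct:
        # Conciseness pair: shorter and more confident winner scores higher.
        return delta_r * delta_sc
    if not winner.correct and not loser.correct:
        # Deliberation pair: the confidence term is reversed.
        return delta_r * (-delta_sc)
    raise ValueError("CPs need two correct paths, DPs two incorrect paths")

# Illustrative values only (not taken from the paper's data).
cp_score = preference_score(Trajectory(200, True, 2.0), Trajectory(800, True, 1.5))
dp_score = preference_score(Trajectory(1000, False, 1.0), Trajectory(250, False, 1.8))
print(cp_score, dp_score)
```

In both illustrative pairs the score is positive: the conciseness winner is shorter and more certain, and the deliberation winner is longer and less certain than the short, confident failure.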
In both cases, larger scores of preference pairs indicate potentially stronger and more discriminative preference signals for subsequent optimization\. Thus, we devise a Dynamic Pruning strategy to select the preference optimization dataset based on $s$\. Concretely, for each query $x$, we rank the CP and DP sets by their scores $s$ in descending order, respectively, and retain only those pairs whose scores fall within the top three highest score levels\. We then pool candidates from all the queries and sort them globally based on the preference scores $s$, truncating the list by removing the bottom $\tau$ fraction, where $\tau$ denotes the truncation ratio\. Finally, to prevent over\-representing queries that produce many high\-scoring pairs, we enforce a per\-query cap and retain at most one CP and one DP per query in the final preference dataset\.
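The three pruning steps above can be sketched as follows. Several details are assumptions made for this sketch rather than facts from the paper: "score levels" is read as distinct score values, the bottom $\tau$ fraction is rounded down, the per-query cap keeps the highest-scoring pair of each type, and the dictionary keys (`query`, `kind`, `score`) are hypothetical.

```python
import math
from collections import defaultdict

def dynamic_pruning(pairs, tau=0.15, levels=3):
    """Select preference pairs by per-query ranking, global truncation, and a cap.

    pairs: list of dicts with keys 'query', 'kind' ('CP' or 'DP'), and 'score'.
    """
    # Step 1: per query and pair type, keep pairs within the top `levels`
    # distinct score values (assumed meaning of "score levels").
    groups = defaultdict(list)
    for pair in pairs:
        groups[(pair["query"], pair["kind"])].append(pair)
    pooled = []
    for group in groups.values():
        top_scores = sorted({p["score"] for p in group}, reverse=True)[:levels]
        cutoff = top_scores[-1]
        pooled.extend(p for p in group if p["score"] >= cutoff)

    # Step 2: sort globally and drop the bottom tau fraction (rounded down).
    pooled.sort(key=lambda p: p["score"], reverse=True)
    n_drop = math.floor(tau * len(pooled))
    pooled = pooled[: len(pooled) - n_drop]

    # Step 3: keep at most one CP and one DP per query (highest score first).
    selected, seen = [], set()
    for pair in pooled:
        key = (pair["query"], pair["kind"])
        if key not in seen:
            seen.add(key)
            selected.append(pair)
    return selected

# Toy candidates (scores are made up for illustration).
candidates = (
    [{"query": "q1", "kind": "CP", "score": s} for s in (0.9, 0.8, 0.7, 0.6, 0.5)]
    + [{"query": "q1", "kind": "DP", "score": 0.4}]
    + [{"query": "q2", "kind": "CP", "score": s} for s in (0.3, 0.2)]
    + [{"query": "q2", "kind": "DP", "score": 0.05}]
)
for pair in dynamic_pruning(candidates):
    print(pair)
```

Because the global truncation runs before the per-query cap, a query whose only deliberation pair scores poorly can lose that pair entirely, as happens to `q2` in the toy data.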
Table 1: Accuracy \(Acc\), the mean response length over all trajectories \(Len\) and over trajectories with correct final answers \(C\-Len\), the percentage reduction in Len relative to the base model \(CR\), and the percentage reduction in C\-Len relative to the base model \(C\-CR\) on three benchmark datasets, respectively\.
#### 3\.2\.2 Confidence\-Weighted Preference Optimization \(CWPO\)
To adjust the model’s reasoning depth conditionally on its internal certainty, rather than applying a uniform length bias to all the samples, we propose the Confidence\-Weighted Preference Optimization \(CWPO\) objective, which uses intrinsic self\-certainty directly within the alignment loss\. Compared with the vanilla SimPO objective Meng et al\. \([2024](https://arxiv.org/html/2607.00862#bib.bib17)\), we dynamically modulate the scaling factors of the winning and losing terms\. Formally, the CWPO loss is computed as:
$$\mathcal{L}_{\text{CWPO}}(\pi_{\theta}) = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\Bigl[\log\sigma\Bigl(\tfrac{\beta_w}{|y_w|}\log\pi_{\theta}(y_w|x) - \tfrac{\beta_l}{|y_l|}\log\pi_{\theta}(y_l|x) - \gamma\Bigr)\Bigr] \tag{6}$$
where the dynamic weights $\beta_w$ and $\beta_l$ are acquired by the original $\beta_{base}$ in SimPO and a calibration ratio \($\rho$\) based on self\-certainty and lengths:
$$\rho(x,y) \;=\; \frac{\text{SC}(x,y)}{|y|^{\alpha}} \tag{7}$$
where $\alpha\in(0,1)$ is a length\-aware exponent to keep $\mathrm{SC}$ and $|y|$ on a comparable scale for numerical stability\. This ratio imposes an additional tunable length penalty so that confidence\-guided scaling can better align gradient allocation with efficiency\.
The CWPO loss sets different weights for conciseness pairs \(CPs\) and deliberation pairs \(DPs\)\. For CPs, where the model compares two correct responses, we define $\beta_w = \beta_{base}\cdot\sigma(\rho(x,y_w))$, while symmetrically scaling the loser’s weight using the inverse ratio $\beta_l = \beta_{base}\cdot\sigma(\rho(x,y_l)^{-1})$\. This specifically incentivizes the model to commit to reasoning paths that are both correct and concise\. Conversely, for DPs, we focus on penalizing short and erroneous answers with unearned confidence\. We set the penalty weight $\beta_l = \beta_{base}\cdot\sigma(\rho(x,y_l))$ and the winner’s weight $\beta_w = \beta_{base}\cdot\sigma(\rho(x,y_w)^{-1})$\. By integrating these internal signals, CWPO moves beyond static length penalties, allowing the model to autonomously judge when to compress reasoning and when to deliberate, achieving a balance between efficiency and accuracy\.
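To make Eqs. (6) and (7) and the pair-specific weights concrete, here is a minimal per-pair sketch in pure Python. The function names are hypothetical, and the values used for $\beta_{base}$, $\gamma$, and $\alpha$ are placeholders for illustration, not the hyperparameters used in the paper. The sketch assumes SC is strictly positive so that the inverse ratio is defined; a real training loop would compute this on tensors with automatic differentiation.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def calibration_ratio(sc, length, alpha):
    """Eq. (7): rho = SC / |y|^alpha (assumes sc > 0)."""
    return sc / (length ** alpha)

def cwpo_weights(kind, sc_w, len_w, sc_l, len_l, beta_base, alpha):
    """Return (beta_w, beta_l) for a conciseness ('CP') or deliberation ('DP') pair."""
    rho_w = calibration_ratio(sc_w, len_w, alpha)
    rho_l = calibration_ratio(sc_l, len_l, alpha)
    if kind == "CP":
        return beta_base * sigmoid(rho_w), beta_base * sigmoid(1.0 / rho_l)
    if kind == "DP":
        return beta_base * sigmoid(1.0 / rho_w), beta_base * sigmoid(rho_l)
    raise ValueError("kind must be 'CP' or 'DP'")

def cwpo_pair_loss(logp_w, logp_l, len_w, len_l, beta_w, beta_l, gamma):
    """Eq. (6) for one pair: -log sigmoid(length-normalized margin - gamma)."""
    margin = beta_w * logp_w / len_w - beta_l * logp_l / len_l - gamma
    # -log(sigmoid(m)) = log(1 + exp(-m)), computed stably for either sign.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Placeholder hyperparameters and sequence log-probabilities (illustrative only).
beta_w, beta_l = cwpo_weights("CP", 2.0, 200, 1.5, 800, beta_base=2.0, alpha=0.5)
loss = cwpo_pair_loss(-100.0, -800.0, 200, 800, beta_w, beta_l, gamma=0.5)
print(beta_w, beta_l, loss)
```

Since $\rho > 0$ whenever SC is positive, every weight lies between $\beta_{base}/2$ and $\beta_{base}$, so the calibration rescales the SimPO terms rather than switching them off.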
## 4 Experiments
### 4\.1 Settings
Models and Datasets\. We conduct comparative experiments on three LRMs: DeepSeek\-R1\-Distill\-Qwen\-7B \(R1\-7B\) and DeepSeek\-R1\-Distill\-Qwen\-1\.5B \(R1\-1\.5B\) DeepSeek\-AI \([2025](https://arxiv.org/html/2607.00862#bib.bib2)\), and Qwen3\-8B Yang et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib22)\)\. For the training dataset, following Qiao et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib15)\), we randomly select 2,000 questions from the MATH training set Hendrycks et al\. \([2021](https://arxiv.org/html/2607.00862#bib.bib23)\), maintaining diversity in both difficulty and response length\.
Benchmarks\. We follow Shen et al\. \([2025a](https://arxiv.org/html/2607.00862#bib.bib14)\) to select three benchmarks, including MATH\-500 Lightman et al\. \([2024](https://arxiv.org/html/2607.00862#bib.bib24)\), AIME24 MAA \([2024](https://arxiv.org/html/2607.00862#bib.bib45)\), and GPQA Rein et al\. \([2023](https://arxiv.org/html/2607.00862#bib.bib25)\)\.
Baselines\. We select several state\-of\-the\-art methods for efficient reasoning as baselines, including OverThink Chen et al\. \([2024](https://arxiv.org/html/2607.00862#bib.bib4)\), DAST Shen et al\. \([2025a](https://arxiv.org/html/2607.00862#bib.bib14)\), and ConCISE Qiao et al\. \([2025](https://arxiv.org/html/2607.00862#bib.bib15)\)\. For ConCISE, we choose the best\-performing variant, ConCISE$_{\text{SimPO}}$, as the comparison baseline\. All of these methods follow the paradigm of preference optimization with the SimPO objective\.
Implementation Details\. Following Shen et al\. \([2025a](https://arxiv.org/html/2607.00862#bib.bib14)\), we generate 20 candidate responses per question in our training set, and set the maximum sequence length to 4,096 tokens\. Based on the hyperparameter analysis in Appendix [B\.2](https://arxiv.org/html/2607.00862#A2.SS2), the truncation ratio $\tau$ is set to 0\.15\. The preference optimization is conducted within the SimPO framework Meng et al\. \([2024](https://arxiv.org/html/2607.00862#bib.bib17)\)\. We adopt low\-rank adaptation \(LoRA\) Hu et al\. \([2022](https://arxiv.org/html/2607.00862#bib.bib47)\) with a rank of $r=32$, a LoRA scaling factor of $\alpha=64$, and a dropout rate of 0\.05\. We train for 1 epoch with a batch size of 16\. The learning rate is 5e\-5 for DeepSeek\-R1\-Distill\-Qwen\-7B and Qwen3\-8B, and 5e\-6 for DeepSeek\-R1\-Distill\-Qwen\-1\.5B, as the weaker 1\.5B backbone yields more DPs after confidence\-aware preference labeling \(CAPL\) and thus benefits from more conservative optimization\. All experiments are conducted on 2 NVIDIA A800 GPUs\. More training details are provided in Appendix [A\.1](https://arxiv.org/html/2607.00862#A1.SS1)\. Decoding is executed using the OpenR1 evaluation scripts Hugging Face \([2025](https://arxiv.org/html/2607.00862#bib.bib46)\), with comprehensive decoding details provided in Appendix [A\.2](https://arxiv.org/html/2607.00862#A1.SS2)\. The experimental results are reported as mean values over 3 runs\.
\(a\) The distribution of SC for correct and incorrect responses\.
\(b\) Box plots of SC across varying response lengths\.
Figure 2: Analysis of Self\-Certainty \(SC\) distributions regarding response correctness and robustness to length on the MATH dataset \(Level 4\), derived from 20 reasoning paths per question generated by Qwen3\-8B \($L_{max} = 4096$\)\.
### 4\.2 Results and Analysis
#### 4\.2\.1 Overall Results
The results in Table [1](https://arxiv.org/html/2607.00862#S3.T1) show that CAT achieves the highest accuracy \(exceeding the backbone model\) on all three benchmarks while maintaining an acceptable compression rate \(CR\), suggesting that CAT can allocate reasoning steps adaptively to obtain better performance\. Although OverThink and ConCISE attain the most substantial compression rates, they incur a loss in accuracy relative to the backbone model\. DAST and CAT exhibit similar balancing trends between task performance and compression, as both aim to achieve adaptive compression while preserving the model’s reasoning capability\. Compared with DAST, CAT delivers higher accuracy for all the base models, and achieves higher CR and C\-CR in most settings\. These results suggest that CAT is more effective at adaptive reasoning, supporting the effectiveness of our proposed confidence\-aware adaptive reasoning approach based on the model’s intrinsic signals\.
Table 2: Ablation study of Confidence\-Aware Preference Labeling and Confidence\-Weighted Preference Optimization on DeepSeek\-R1\-Distill\-Qwen\-7B\.
#### 4\.2\.2 Ablation Study
To assess the key components in CAT, including Confidence\-Aware Preference Labeling \(CAPL\) and Confidence\-Weighted Preference Optimization \(CWPO\), we conduct a detailed ablation study by removing either CAPL \(w/o CAPL, where preferences are scored only by $\Delta r$\) or CWPO \(w/o CWPO, where we replace CWPO with vanilla SimPO\)\. The results in Table [2](https://arxiv.org/html/2607.00862#S4.T2) show that both components contribute to the final performance\. We observe that constructing preference pairs solely based on length differences \(w/o CAPL\) yields a higher compression ratio but leads to a larger degradation in reasoning performance on most tasks\. This observation highlights the importance of high\-quality training data that provides difficulty\-adaptive reasoning signals\.
Due to the page limit, we further explore the effect of Self\-Certainty on CAT in Appendix [B\.1](https://arxiv.org/html/2607.00862#A2.SS1)\.
Figure 3: Case study on DeepSeek\-R1\-Distill\-Qwen\-7B\. All three methods solve the problem correctly\. DAST shortens the reasoning chain relative to the backbone model but lowers self\-certainty, whereas CAT further reduces the reasoning length and yields higher self\-certainty\.
#### 4\.2\.3 Analysis of Self\-Certainty
To better understand how self\-certainty helps the model balance reasoning accuracy and length, we conduct a detailed analysis of the reasoning trajectories generated by Qwen3\-8B on the MATH dataset\.
SC effectively distinguishes correct and incorrect reasoning paths\. In Figure [2\(a\)](https://arxiv.org/html/2607.00862#S4.F2.sf1), we analyze the distribution of self\-certainty for correct and incorrect responses on MATH \(Level 4\)\. The distributions for correct and incorrect responses concentrate around distinct means, with correct responses exhibiting a higher mean\. This suggests that SC can effectively distinguish correct from incorrect reasoning trajectories and is strongly correlated with response quality\.
SC is robust to reasoning lengths\. We analyze self\-certainty across responses of varying lengths\. As illustrated in Figure [2\(b\)](https://arxiv.org/html/2607.00862#S4.F2.sf2), SC is not noticeably affected by response length: across the entire length range, the median \(blue line\) shows only a very slight downward trend\. This trend is largely attributable to the fact that shorter responses contain a higher proportion of correct trajectories, indicating that SC is stable regardless of reasoning length\.
#### 4\.2\.4 Generalization Across Preference Optimization Methods
To test the generalization ability of our method, we further apply our method to DPO in addition to SimPO, and assess the performance on DeepSeek\-R1\-Distill\-Qwen\-7B using the same preference pairs constructed by CAPL\. The DPO\-version CWPO objective \(denoted as CWPO$_{\text{DPO}}$\) is slightly different from vanilla CWPO, which is detailed in Appendix [C](https://arxiv.org/html/2607.00862#A3)\. The results in Table [3](https://arxiv.org/html/2607.00862#S4.T3) indicate that CWPO$_{\text{DPO}}$ beats standard DPO in most of the metrics on three benchmarks, demonstrating the promising generalization ability to different preference optimization methods\.
Table 3: Results of DPO and CWPO$_{\text{DPO}}$ on DeepSeek\-R1\-Distill\-Qwen\-7B\.
#### 4\.2\.5 Case Study
To intuitively illustrate how CAT affects reasoning behaviors, we present a case study on DeepSeek\-R1\-Distill\-Qwen\-7B in Figure[3](https://arxiv.org/html/2607.00862#S4.F3)\. We observe that all the three methods reach the correct answer but exhibit different reasoning lengths and self\-certainty\. The backbone model repeatedly verifies the same inequalities and explicitly checks invalid alternatives\. DAST reduces the reasoning length with lower self\-certainty, while still retaining additional verification beyond the core derivation\. In comparison, CAT achieves the shortest reasoning chain with higher self\-certainty\. It does not eliminate reflection entirely: after deriving the feasible interval, it only keeps a brief validity check rather than exploring invalid candidates\. This qualitative case supports the design of CAT, which incorporates self\-certainty together with correctness and length signals to favor concise and confident correct reasoning paths\.
## 5 Conclusion
This work proposes confidence\-adaptive thinking \(CAT\), which addresses the pronounced overthinking and token overhead in large reasoning models through intrinsic confidence awareness\. CAT integrates self\-certainty as LRMs’ intrinsic confidence to enable them to compress confident responses while deliberating on uncertain ones\. Extensive experiments demonstrate that CAT consistently achieves a superior balance between inference efficiency and accuracy\.
## Limitations
While CAT demonstrates a superior balance between reasoning accuracy and efficiency, we identify the following areas for future improvement:
##### Path\-Level Aggregation\.
Our current framework utilizes path\-level Self\-Certainty to score reasoning traces\. While this metric effectively differentiates high\-quality responses, aggregating token\-level signals into a single scalar for the entire sequence may overlook variations in confidence at specific reasoning steps\. Future work could explore integrating token positions with their specific Self\-Certainty scores to enable more precise step\-level compression\.
##### Domain\-Specific Evaluation\.
Our experiments focus on STEM disciplines such as mathematics and physics that allow rigorous correctness verification\. Although Self\-Certainty is an intrinsic signal independent of ground truth, our preference labeling strategy currently utilizes verification results\. We aim to extend this approach to open\-ended generation tasks where Self\-Certainty can guide alignment without reliance on external answers\.
##### Offline Optimization Paradigm\.
CAT employs Confidence\-Weighted Preference Optimization on static datasets constructed from pre\-sampled trajectories\. This offline setting limits the ability of the policy to dynamically update its confidence estimates during the training process\. Future research will investigate transitioning from offline optimization to online reinforcement learning variants, allowing the model to iteratively refine its reasoning efficiency through continuous interaction\.
## Acknowledgments
This work was supported by Sichuan Science and Technology Program \(2025ZNSFSC1488\), Noncommunicable Chronic Diseases\-National Science and Technology Major Project \(2023ZD0501806\), Fundamental Research Funds for the Central Universities \(ZYGX2025XJ041\), and CIPS\-SMP\-Zhipu Large Model Fund \(CIPS\-SMP20250314\)\.
## References
- L1: controlling how long A reasoning model thinks with reinforcement learning\.CoRRabs/2503\.04697\.External Links:[Link](https://doi.org/10.48550/arXiv.2503.04697),[Document](https://dx.doi.org/10.48550/ARXIV.2503.04697),2503\.04697Cited by:[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- X\. Chen, J\. Xu, T\. Liang, Z\. He, J\. Pang, D\. Yu, L\. Song, Q\. Liu, M\. Zhou, Z\. Zhang, R\. Wang, Z\. Tu, H\. Mi, and D\. Yu \(2024\)Do NOT think that much for 2\+3=? on the overthinking of o1\-like llms\.CoRRabs/2412\.21187\.External Links:[Link](https://doi.org/10.48550/arXiv.2412.21187),[Document](https://dx.doi.org/10.48550/ARXIV.2412.21187),2412\.21187Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p1.1),[§1](https://arxiv.org/html/2607.00862#S1.p2.1),[§4\.1](https://arxiv.org/html/2607.00862#S4.SS1.p3.1)\.
- Y\. Chuang, L\. Yu, G\. Wang, L\. Zhang, Z\. Liu, X\. Cai, Y\. Sui, V\. Braverman, and X\. B\. Hu \(2025\)Confident or seek stronger: exploring uncertainty\-based on\-device LLM routing from benchmarking to generalization\.CoRRabs/2502\.04428\.External Links:[Link](https://doi.org/10.48550/arXiv.2502.04428),[Document](https://dx.doi.org/10.48550/ARXIV.2502.04428),2502\.04428Cited by:[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- Y\. Cui, P\. He, J\. Zeng, H\. Liu, X\. Tang, Z\. Dai, Y\. Han, C\. Luo, J\. Huang, Z\. Li, S\. Wang, Y\. Xing, J\. Tang, and Q\. He \(2025\)Stepwise perplexity\-guided refinement for efficient chain\-of\-thought reasoning in large language models\.InFindings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 \- August 1, 2025,pp\. 18581–18597\.External Links:[Link](https://aclanthology.org/2025.findings-acl.956/)Cited by:[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- DeepSeek\-AI \(2025\)DeepSeek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.CoRRabs/2501\.12948\.External Links:[Link](https://doi.org/10.48550/arXiv.2501.12948),[Document](https://dx.doi.org/10.48550/ARXIV.2501.12948),2501\.12948Cited by:[§A\.2](https://arxiv.org/html/2607.00862#A1.SS2.p1.1),[§1](https://arxiv.org/html/2607.00862#S1.p1.1),[§4\.1](https://arxiv.org/html/2607.00862#S4.SS1.p1.1)\.
- E\. Fadeeva, A\. Rubashevskii, A\. Shelmanov, S\. Petrakov, H\. Li, H\. Mubarak, E\. Tsymbalov, G\. Kuzmin, A\. Panchenko, T\. Baldwin, P\. Nakov, and M\. Panov \(2024\)Fact\-checking the output of large language models via token\-level uncertainty quantification\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 9367–9385\.External Links:[Link](https://aclanthology.org/2024.findings-acl.558/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.558)Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p4.1),[§2](https://arxiv.org/html/2607.00862#S2.p3.1)\.
- S\. Feng, G\. Fang, X\. Ma, and X\. Wang \(2025\)Efficient reasoning models: A survey\.Trans\. Mach\. Learn\. Res\.2025\.External Links:[Link](https://openreview.net/forum?id=sySqlxj8EB)Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p1.1)\.
- Y\. Fu, X\. Wang, Y\. Tian, and J\. Zhao \(2025\)Deep think with confidence\.CoRRabs/2508\.15260\.External Links:[Link](https://doi.org/10.48550/arXiv.2508.15260),[Document](https://dx.doi.org/10.48550/ARXIV.2508.15260),2508\.15260Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p4.1),[§2](https://arxiv.org/html/2607.00862#S2.p3.1)\.
- J\. Geng, F\. Cai, Y\. Wang, H\. Koeppl, P\. Nakov, and I\. Gurevych \(2024\)A survey of confidence estimation and calibration in large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), NAACL 2024, Mexico City, Mexico, June 16\-21, 2024,pp\. 6577–6595\.External Links:[Link](https://doi.org/10.18653/v1/2024.naacl-long.366),[Document](https://dx.doi.org/10.18653/V1/2024.NAACL-LONG.366)Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p4.1),[§2](https://arxiv.org/html/2607.00862#S2.p3.1)\.
- T\. Han, Z\. Wang, C\. Fang, S\. Zhao, S\. Ma, and Z\. Chen \(2025\)Token\-budget\-aware LLM reasoning\.InFindings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 \- August 1, 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Findings of ACL, Vol\.ACL 2025,pp\. 24842–24855\.External Links:[Link](https://aclanthology.org/2025.findings-acl.1274/)Cited by:[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- S\. Hao, S\. Sukhbaatar, D\. Su, X\. Li, Z\. Hu, J\. Weston, and Y\. Tian \(2024\)Training large language models to reason in a continuous latent space\.CoRRabs/2412\.06769\.External Links:[Link](https://doi.org/10.48550/arXiv.2412.06769),[Document](https://dx.doi.org/10.48550/ARXIV.2412.06769),2412\.06769Cited by:[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the MATH dataset\.InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual,J\. Vanschoren and S\. Yeung \(Eds\.\),External Links:[Link](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html)Cited by:[§4\.1](https://arxiv.org/html/2607.00862#S4.SS1.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)LoRA: low\-rank adaptation of large language models\.InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25\-29, 2022,External Links:[Link](https://openreview.net/forum?id=nZeVKeeFYf9)Cited by:[§4\.1](https://arxiv.org/html/2607.00862#S4.SS1.p4.3)\.
- Hugging Face \(2025\)Open r1: a fully open reproduction of deepseek\-r1\.External Links:[Link](https://github.com/huggingface/open-r1)Cited by:[§4\.1](https://arxiv.org/html/2607.00862#S4.SS1.p4.3)\.
- Z\. Kang, X\. Zhao, and D\. Song \(2025\)Scalable best\-of\-n selection for large language models via self\-certainty\.CoRRabs/2502\.18581\.External Links:[Link](https://doi.org/10.48550/arXiv.2502.18581),[Document](https://dx.doi.org/10.48550/ARXIV.2502.18581),2502\.18581Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p4.1),[§2](https://arxiv.org/html/2607.00862#S2.p3.1),[§3\.2\.1](https://arxiv.org/html/2607.00862#S3.SS2.SSS1.Px1.p1.5)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2024\)Let’s verify step by step\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,External Links:[Link](https://openreview.net/forum?id=v8L0pN6EOi)Cited by:[§4\.1](https://arxiv.org/html/2607.00862#S4.SS1.p2.1)\.
- Y\. Liu, J\. Wu, Y\. He, H\. Gao, H\. Chen, B\. Bi, J\. Zhang, Z\. Huang, and B\. Hooi \(2025\)Efficient inference for large reasoning models: A survey\.CoRRabs/2503\.23077\.External Links:[Link](https://doi.org/10.48550/arXiv.2503.23077),[Document](https://dx.doi.org/10.48550/ARXIV.2503.23077),2503\.23077Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p1.1)\.
- H\. Luo, L\. Shen, H\. He, Y\. Wang, S\. Liu, W\. Li, N\. Tan, X\. Cao, and D\. Tao \(2025\)O1\-pruner: length\-harmonizing fine\-tuning for o1\-like reasoning pruning\.ArXivabs/2501\.12570\.External Links:[Link](https://api.semanticscholar.org/CorpusID:275790112)Cited by:[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- X\. Ma, G\. Wan, R\. Yu, G\. Fang, and X\. Wang \(2025\)CoT\-valve: length\-compressible chain\-of\-thought tuning\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2025, Vienna, Austria, July 27 \- August 1, 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),pp\. 6025–6035\.External Links:[Link](https://aclanthology.org/2025.acl-long.300/)Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p2.1)\.
- C\. MAA \(2024\)American invitational mathematics examination\-aime 2024\.Cited by:[§4\.1](https://arxiv.org/html/2607.00862#S4.SS1.p2.1)\.
- Y\. Meng, M\. Xia, and D\. Chen \(2024\)SimPO: simple preference optimization with a reference\-free reward\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/e099c1c9699814af0be873a175361713-Abstract-Conference.html)Cited by:[§3\.2\.2](https://arxiv.org/html/2607.00862#S3.SS2.SSS2.p1.1),[§4\.1](https://arxiv.org/html/2607.00862#S4.SS1.p4.3)\.
- N\. Muennighoff, Z\. Yang, W\. Shi, X\. L\. Li, L\. Fei\-Fei, H\. Hajishirzi, L\. Zettlemoyer, P\. Liang, E\. J\. Candès, and T\. Hashimoto \(2025\)S1: simple test\-time scaling\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4\-9, 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),pp\. 20275–20321\.External Links:[Link](https://doi.org/10.18653/v1/2025.emnlp-main.1025),[Document](https://dx.doi.org/10.18653/V1/2025.EMNLP-MAIN.1025)Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p2.1)\.
- T\. Munkhbat, N\. Ho, S\. H\. Kim, Y\. Yang, Y\. Kim, and S\. Yun \(2025\)Self\-training elicits concise reasoning in large language models\.InFindings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 \- August 1, 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Findings of ACL, Vol\.ACL 2025,pp\. 25127–25152\.External Links:[Link](https://aclanthology.org/2025.findings-acl.1289/)Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p2.1)\.
- S\. Nayab, G\. Rossolini, G\. C\. Buttazzo, N\. Manes, and F\. Giacomelli \(2024\)Concise thoughts: impact of output length on LLM reasoning and cost\.CoRRabs/2407\.19825\.External Links:[Link](https://doi.org/10.48550/arXiv.2407.19825),[Document](https://dx.doi.org/10.48550/ARXIV.2407.19825),2407\.19825Cited by:[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- I\. Ong, A\. Almahairi, V\. Wu, W\. Chiang, T\. Wu, J\. E\. Gonzalez, M\. W\. Kadous, and I\. Stoica \(2025\)RouteLLM: learning to route llms from preference data\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=8sSqNntaMr)Cited by:[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- OpenAI \(2024\)OpenAI o1 system card\.CoRRabs/2412\.16720\.External Links:[Link](https://doi.org/10.48550/arXiv.2412.16720),[Document](https://dx.doi.org/10.48550/ARXIV.2412.16720),2412\.16720Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p1.1)\.
- Z\. Qiao, Y\. Deng, J\. Zeng, D\. Wang, L\. Wei, G\. Wang, F\. Meng, J\. Zhou, J\. Ren, and Y\. Zhang \(2025\)ConCISE: confidence\-guided compression in step\-by\-step efficient reasoning\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4\-9, 2025,pp\. 8010–8029\.External Links:[Link](https://doi.org/10.18653/v1/2025.emnlp-main.405),[Document](https://dx.doi.org/10.18653/V1/2025.EMNLP-MAIN.405)Cited by:[§4\.1](https://arxiv.org/html/2607.00862#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2607.00862#S4.SS1.p3.1)\.
- X\. Qu, Y\. Li, Z\. Su, W\. Sun, J\. Yan, D\. Liu, G\. Cui, D\. Liu, S\. Liang, J\. He, P\. Li, W\. Wei, J\. Shao, C\. Lu, Y\. Zhang, X\. Hua, B\. Zhou, and Y\. Cheng \(2025\)A survey of efficient reasoning for large reasoning models: language, multimodality, and beyond\.CoRRabs/2503\.21614\.External Links:[Link](https://doi.org/10.48550/arXiv.2503.21614),[Document](https://dx.doi.org/10.48550/ARXIV.2503.21614),2503\.21614Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p2.1)\.
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2023\)GPQA: A graduate\-level google\-proof q&a benchmark\.CoRRabs/2311\.12022\.External Links:[Link](https://doi.org/10.48550/arXiv.2311.12022),[Document](https://dx.doi.org/10.48550/ARXIV.2311.12022),2311\.12022Cited by:[§4\.1](https://arxiv.org/html/2607.00862#S4.SS1.p2.1)\.
- M\. Renze and E\. Guven \(2024\)The benefits of a concise chain of thought on problem\-solving in large language models\.In2nd International Conference on Foundation and Large Language Models, FLLM 2024, Dubai, United Arab Emirates, November 26\-29, 2024,pp\. 476–483\.External Links:[Link](https://doi.org/10.1109/FLLM63129.2024.10852493),[Document](https://dx.doi.org/10.1109/FLLM63129.2024.10852493)Cited by:[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- Y\. Shen, J\. Zhang, J\. Huang, S\. Shi, W\. Zhang, J\. Yan, N\. Wang, K\. Wang, Z\. Liu, and S\. Lian \(2025a\)DAST: difficulty\-adaptive slow\-thinking for large reasoning models\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 \- Industry Track, Suzhou, China, November 4\-9, 2025,pp\. 2322–2331\.External Links:[Link](https://doi.org/10.18653/v1/2025.emnlp-industry.160),[Document](https://dx.doi.org/10.18653/V1/2025.EMNLP-INDUSTRY.160)Cited by:[§A\.1](https://arxiv.org/html/2607.00862#A1.SS1.SSS0.Px2.p1.3),[§1](https://arxiv.org/html/2607.00862#S1.p1.1),[§1](https://arxiv.org/html/2607.00862#S1.p2.1),[§1](https://arxiv.org/html/2607.00862#S1.p3.1),[§2](https://arxiv.org/html/2607.00862#S2.p1.1),[§3\.2\.1](https://arxiv.org/html/2607.00862#S3.SS2.SSS1.Px2.p2.1),[§4\.1](https://arxiv.org/html/2607.00862#S4.SS1.p2.1),[§4\.1](https://arxiv.org/html/2607.00862#S4.SS1.p3.1),[§4\.1](https://arxiv.org/html/2607.00862#S4.SS1.p4.3)\.
- Z\. Shen, H\. Yan, L\. Zhang, Z\. Hu, Y\. Du, and Y\. He \(2025b\)CODI: compressing chain\-of\-thought into continuous space via self\-distillation\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4\-9, 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),pp\. 677–693\.External Links:[Link](https://doi.org/10.18653/v1/2025.emnlp-main.36),[Document](https://dx.doi.org/10.18653/V1/2025.EMNLP-MAIN.36)Cited by:[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- Y\. Sui, Y\. Chuang, G\. Wang, J\. Zhang, T\. Zhang, J\. Yuan, H\. Liu, A\. Wen, S\. Zhong, N\. Zou, H\. Chen, and X\. Hu \(2025\)Stop overthinking: A survey on efficient reasoning for large language models\.Trans\. Mach\. Learn\. Res\.2025\.External Links:[Link](https://openreview.net/forum?id=HvoG8SxggZ)Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p1.1),[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- H\. Sun, M\. Haider, R\. Zhang, H\. Yang, J\. Qiu, M\. Yin, M\. Wang, P\. L\. Bartlett, and A\. Zanette \(2024\)Fast best\-of\-n decoding via speculative rejection\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/3950f6bf5c2eb7435ecf58eaa85cc8c2-Abstract-Conference.html)Cited by:[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- Y\. Wang, Q\. Liu, J\. Xu, T\. Liang, X\. Chen, Z\. He, L\. Song, D\. Yu, J\. Li, Z\. Zhang, R\. Wang, Z\. Tu, H\. Mi, and D\. Yu \(2025\)Thoughts are all over the place: on the underthinking of o1\-like llms\.CoRRabs/2501\.18585\.External Links:[Link](https://doi.org/10.48550/arXiv.2501.18585),[Document](https://dx.doi.org/10.48550/ARXIV.2501.18585),2501\.18585Cited by:[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- Y\. Wu, Y\. Wang, T\. Du, S\. Jegelka, and Y\. Wang \(2025\)When more is less: understanding chain\-of\-thought length in llms\.CoRRabs/2502\.07266\.External Links:[Link](https://doi.org/10.48550/arXiv.2502.07266),[Document](https://dx.doi.org/10.48550/ARXIV.2502.07266),2502\.07266Cited by:[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- H\. Xia, C\. T\. Leong, W\. Wang, Y\. Li, and W\. Li \(2025\)TokenSkip: controllable chain\-of\-thought compression in LLMs\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 3351–3363\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.165/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.165),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p2.1),[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- F\. Xu, Q\. Hao, C\. Shao, Z\. Zong, Y\. Li, J\. Wang, Y\. Zhang, J\. Wang, X\. Lan, J\. Gong, T\. Ouyang, F\. Meng, Y\. Yan, Q\. Yang, Y\. Song, S\. Ren, X\. Hu, J\. Feng, C\. Gao, and Y\. Li \(2025\)Toward large reasoning models: a survey of reinforced reasoning with large language models\.Patterns6\(10\),pp\. 101370\.External Links:ISSN 2666\-3899,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.patter.2025.101370),[Link](https://www.sciencedirect.com/science/article/pii/S2666389925002181)Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.CoRRabs/2505\.09388\.External Links:[Link](https://doi.org/10.48550/arXiv.2505.09388),[Document](https://dx.doi.org/10.48550/ARXIV.2505.09388),2505\.09388Cited by:[§4\.1](https://arxiv.org/html/2607.00862#S4.SS1.p1.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, T\. Fan, G\. Liu, L\. Liu, X\. Liu, H\. Lin, Z\. Lin, B\. Ma, G\. Sheng, Y\. Tong, C\. Zhang, M\. Zhang, W\. Zhang, H\. Zhu, J\. Zhu, J\. Chen, J\. Chen, C\. Wang, H\. Yu, W\. Dai, Y\. Song, X\. Wei, H\. Zhou, J\. Liu, W\. Ma, Y\. Zhang, L\. Yan, M\. Qiao, Y\. Wu, and M\. Wang \(2025\)DAPO: an open\-source LLM reinforcement learning system at scale\.CoRRabs/2503\.14476\.External Links:[Link](https://doi.org/10.48550/arXiv.2503.14476),[Document](https://dx.doi.org/10.48550/ARXIV.2503.14476),2503\.14476Cited by:[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
- Z\. Zeng, Q\. Cheng, Z\. Yin, B\. Wang, S\. Li, Y\. Zhou, Q\. Guo, X\. Huang, and X\. Qiu \(2024\)Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective\.CoRRabs/2412\.14135\.External Links:[Link](https://doi.org/10.48550/arXiv.2412.14135),[Document](https://dx.doi.org/10.48550/ARXIV.2412.14135),2412\.14135Cited by:[§1](https://arxiv.org/html/2607.00862#S1.p2.1)\.
- J\. Zhang \(2025\)Confidence\-aware reasoning: optimizing self\-guided thinking trajectories in large reasoning models\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 \- Industry Track,pp\. 2081–2095\.External Links:[Link](https://doi.org/10.18653/v1/2025.emnlp-industry.146),[Document](https://dx.doi.org/10.18653/V1/2025.EMNLP-INDUSTRY.146)Cited by:[§2](https://arxiv.org/html/2607.00862#S2.p1.1)\.
## Appendix A Experimental Details
Figure 4: Difficulty and length distributions illustrating the diversity of the question set\. \(a\) Difficulty diversity of training set\. \(b\) Length diversity of training set\.
### A\.1 Training details
##### Training Set\.
While the construction methodology of the training set is detailed in Section 4, Figures [4\(a\)](https://arxiv.org/html/2607.00862#A1.F4.sf1) and [4\(b\)](https://arxiv.org/html/2607.00862#A1.F4.sf2) illustrate its diversity in terms of difficulty and length\. Specifically, the difficulty distribution aligns with the Level metric of the 2000 selected questions from the MATH dataset\. The length distribution reflects the reasoning chains generated by DeepSeek\-R1\-Distill\-Qwen\-7B, obtained by sampling 20 paths per question with a temperature of 0\.6 and a top\_p of 0\.95\.
##### Training Configurations\.
Following Dynamic Pruning with a truncation ratio of $\tau=0.15$, our constructed preference pairs for R1\-7B yielded 1,765 Conciseness Pairs \(CPs; 93\.1%\) and 130 Deliberation Pairs \(DPs; 6\.9%\), a distribution similar to that reported in DAST \(Shen et al\., [2025a](https://arxiv.org/html/2607.00862#bib.bib14)\)\. Similarly, for R1\-1\.5B, we identified 1,742 CPs \(88\.8%\) and 221 DPs \(11\.2%\)\. For Qwen3\-8B, the resulting dataset comprised 1,517 CPs \(97\.4%\) and 40 DPs \(2\.6%\)\. These results indicate that for models with stronger reasoning capabilities, such as Qwen3\-8B, Confidence\-Aware Preference Labeling generates a higher proportion of CPs to facilitate the learning of conciseness\. Conversely, for models with weaker reasoning abilities, such as R1\-1\.5B, the method produces more DPs to encourage cautious exploration\. Furthermore, following DAST, the original SimPO hyperparameters were set to $\beta=200$ and $\gamma=1$ for all three models\. All baselines use comparable data budgets to CAT\.
### A\.2 Evaluation Details
In our evaluation setup, we use a unified decoding configuration for all experiments, with temperature = 0\.6 and top\_p = 0\.95 \(DeepSeek\-AI, [2025](https://arxiv.org/html/2607.00862#bib.bib2)\)\. The maximum generation length is capped at 32,768 tokens for all models\.
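For readers less familiar with these decoding parameters, the following is a minimal sketch of how temperature scaling and nucleus \(top\_p\) filtering shape the next\-token distribution\. It is a generic illustration rather than the authors' evaluation code: the function name `nucleus_distribution` and the toy logits are hypothetical, and only the values 0\.6 and 0\.95 come from the text above\.

```python
import math


def nucleus_distribution(logits, temperature=0.6, top_p=0.95):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens so the result is a valid distribution.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical values for a 4-token vocabulary
sharp = nucleus_distribution(toy_logits, temperature=0.6, top_p=0.95)
flat = nucleus_distribution(toy_logits, temperature=1.0, top_p=0.95)
print(sorted(sharp), sorted(flat))
```

On these toy logits, the lower temperature concentrates probability mass, so fewer candidate tokens survive the same top\_p cutoff\.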
Figure 5: Hyperparameter analysis\. \(a\) Impact of varying the Length\-Aware Exponent $\alpha$ on model performance\. \(b\) Impact of varying the Truncation Ratio $\tau$ on model performance\.

Figure 6: Performance comparison on MATH\-500 across different difficulty levels\. \(a\) Accuracy Performance on Difficulty Levels\. \(b\) Average Token Length on Difficulty Levels\.
## Appendix B Additional Experiments
### B\.1 Ablation Study on Self\-Certainty
Table 4: Ablation study of Self\-Certainty \(SC\) on DeepSeek\-R1\-Distill\-Qwen\-7B\.

To further investigate the effect of Self\-Certainty \(SC\) in CAT, we conduct an additional ablation study that removes SC from CWPO only, or from both CAPL and CWPO\. Note that removing SC from CAPL only is equivalent to the w/o CAPL setting in Table [2](https://arxiv.org/html/2607.00862#S4.T2), so it is not analyzed again in this section\.
Table [4](https://arxiv.org/html/2607.00862#A2.T4) shows that SC matters in both data construction and model optimization\. Although removing SC generally yields a higher compression rate, reasoning accuracy degrades substantially on all benchmarks\. This directly supports the claim that SC contributes non\-redundant value to our method\.
Figure 7: Average Self\-Certainty \(SC\) across difficulty levels on the training set, derived from 20 reasoning paths per question generated by Qwen3\-8B\. A clear downward trend in SC is evident for all, correct, and incorrect trajectories as the difficulty level increases from Level 1 to Level 5\. This indicates that SC effectively captures problem difficulty beyond correctness, serving as a reliable signal for difficulty\-aware control\.
### B\.2 Analysis of Hyperparameters
To examine the effects of the Length\-Aware Exponent \($\alpha$\) and the Truncation Ratio \($\tau$\), we performed a grid search on MATH\-500 using the R1\-7B model for validation\. As shown in Figure [5](https://arxiv.org/html/2607.00862#A1.F5), R1\-7B achieves the optimal results when $\alpha=0.5$ and $\tau=0.15$\. In both single\-parameter sweeps, these settings yield the best Acc and compression ratio\. For the other backbone models, we conducted the same hyperparameter search and selected the final hyperparameters accordingly\.
### B\.3 Validation of SC as a Difficulty Indicator
Because MATH Level 4 does not provide a sufficiently diverse difficulty distribution, we analyze SC across different difficulty levels on the training set\. As shown in Figure [7](https://arxiv.org/html/2607.00862#A2.F7), as the difficulty level increases, the average SC over all trajectories \(including both correct and incorrect ones\) exhibits a clear downward trend\. This suggests that, beyond separating correct from incorrect trajectories, SC also captures differences in problem difficulty\. This property is central to CAT: it enables difficulty\-aware control\.
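The aggregation behind this analysis can be sketched as a simple group\-by over sampled trajectories\. The snippet below is an assumption about how such averages might be computed, not the authors' script: the record layout `(level, correct, sc)`, the function name `mean_sc_by_level`, and the toy SC values are all hypothetical placeholders and are not measurements from the paper\.

```python
from collections import defaultdict


def mean_sc_by_level(records):
    """Average SC per (difficulty level, group), with group in all/correct/incorrect."""
    sums = defaultdict(lambda: [0.0, 0])  # (level, group) -> [running sum, count]
    for level, correct, sc in records:
        for group in ("all", "correct" if correct else "incorrect"):
            acc = sums[(level, group)]
            acc[0] += sc
            acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Placeholder trajectories: (MATH level, answer verified correct?, path-level SC).
toy_records = [
    (1, True, 3.0),
    (1, False, 2.0),
    (5, True, 2.0),
    (5, False, 1.0),
]
means = mean_sc_by_level(toy_records)
for key in sorted(means):
    print(key, means[key])
```

Reporting the correct and incorrect groups separately, alongside the pooled average, is what allows a difficulty trend to be distinguished from a correctness effect\.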
### B\.4 Performance Analysis Across Difficulty Levels
We analyze the performance of CAT, DAST, and CONCISE on the MATH\-500 dataset using the R1\-7B model across varying difficulty levels\. As shown in Figure [6](https://arxiv.org/html/2607.00862#A1.F6), CAT achieves the highest accuracy in the two most challenging difficulty tiers, demonstrating a substantial advantage over the baselines and highlighting its robust reasoning capabilities for complex problems\. While DAST also exhibits difficulty adaptability, it underperforms CAT in both accuracy and length compression\. Furthermore, although CONCISE achieves the largest length reduction, its reasoning performance deteriorates sharply as problem difficulty increases, creating a marked gap compared to the other methods and indicating limited capability in handling complex reasoning tasks\.
## Appendix C Details of CWPO$_{\text{DPO}}$ Objective
To examine whether the benefits of CWPO\-style SC weighting can generalize beyond the original CWPO setting, we incorporate the same SC\-based dynamic weighting strategy into the DPO objective\. The resulting objective, denoted as CWPO$_{\text{DPO}}$, is formulated as follows:

$$\mathcal{L}_{\text{CWPO}_{\text{DPO}}}(\pi_{\theta};\pi_{\text{ref}})=-\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}}\Bigl[\log\sigma\Bigl(\beta_{w}\log\frac{\pi_{\theta}(y_{w}|x)}{\pi_{\text{ref}}(y_{w}|x)}-\beta_{l}\log\frac{\pi_{\theta}(y_{l}|x)}{\pi_{\text{ref}}(y_{l}|x)}\Bigr)\Bigr]\tag{8}$$

Here, we use the same definitions of $\beta_{w}$ and $\beta_{l}$ as in Section [3\.2\.2](https://arxiv.org/html/2607.00862#S3.SS2.SSS2)\.
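To make the CWPO$_{\text{DPO}}$ objective in Equation \(8\) concrete, the following is a minimal numerical sketch\. It treats the per\-sample weights $\beta_{w}$ and $\beta_{l}$ as given inputs, since their SC\-based definitions live in Section 3\.2\.2 and are not reproduced here\. The function names, the dictionary keys, and the toy log\-probabilities are hypothetical and chosen only for illustration\.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def cwpo_dpo_loss(batch):
    """Mean of -log sigma(beta_w * r_w - beta_l * r_l) over preference pairs.

    r_w and r_l are the policy-vs-reference log-ratios of the chosen (w)
    and rejected (l) responses; beta_w and beta_l are supplied per sample.
    """
    total = 0.0
    for ex in batch:
        ratio_w = ex["logp_w"] - ex["ref_logp_w"]
        ratio_l = ex["logp_l"] - ex["ref_logp_l"]
        margin = ex["beta_w"] * ratio_w - ex["beta_l"] * ratio_l
        total += -log_sigmoid(margin)
    return total / len(batch)


# Toy pair: sequence log-probabilities and weights are placeholder values.
pair = {
    "logp_w": -10.0, "ref_logp_w": -10.0,
    "logp_l": -12.0, "ref_logp_l": -12.0,
    "beta_w": 0.2, "beta_l": 0.1,
}
improved = dict(pair, logp_w=-8.0)  # policy now prefers the chosen response more
print(cwpo_dpo_loss([pair]), cwpo_dpo_loss([improved]))
```

When $\beta_{w}=\beta_{l}$, this expression reduces to the standard DPO loss, so the only change relative to DPO is that the two log\-ratios are scaled by separate weights\.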