Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models
Summary
This paper identifies Calibration Drift Under Reasoning (CDUR), where increasing chain-of-thought reasoning budgets causes LLMs to become systematically overconfident in incorrect answers, and proposes a Hypothesis Lock-In model and a calibration-aware stopping rule (CABStop) to mitigate the issue.
View Cached Full Text
Cached at: 06/11/26, 01:35 PM
# Contents
Source: [https://arxiv.org/html/2606.11211](https://arxiv.org/html/2606.11211)
![[Uncaptioned image]](https://arxiv.org/html/2606.11211v1/AOE.jpeg)Calibration Under Reasoning
Calibration Drift Under Reasoning: How Chain\-of\-Thought Budgets Induce Overconfidence in Large Language Models
Prakul Sunil Hiremath hiremathprakul\.aoe@gmail\.com
Harshit R Hiremath hiremathharshit\.aoe@gmail\.com
Department of Computer Science and Engineering Visvesvaraya Technological University, Belagavi Department of Computer Science and Business System SG Balekundri Institute of Technology, Belagavi
###### Abstract
The ability of large language models \(LLMs\) to express calibrated uncertainty is a prerequisite for their safe deployment\. Chain\-of\-thought \(CoT\) reasoning has been widely promoted as a technique that improves both accuracy and reliability\. We argue that this picture is incomplete: in at least some model scales and problem types, increasing the reasoning budget beyond a problem\-specific threshold can cause models to become systematically*overconfident*—reporting high verbalized probabilities for answers that are incorrect\. We term this phenomenonCalibration Drift Under Reasoning\(cdur\) and study it formally and empirically\.
Formally, we define a reasoning budgetBBand analyze conditions under which the Expected Calibration ErrorECE\(B\)\\mathrm\{ECE\}\(B\)traces a non\-monotone trajectory inBB: initially falling as reasoning corrects surface errors, then rising as extended chains produce internally consistent but factually wrong trajectories\. We introduce a*Hypothesis Lock\-In Model*grounded in autoregressive generation to explain this mechanism\.
Empirically, we evaluate Llama\-3\.1\-8B and Llama\-3\.3\-70B on 47 reasoning\-trap questions across four reasoning budgets and three seeds \(1,368 API calls; 574 valid responses\)\. The 8B model exhibits non\-monotonic calibration behavior, while results for 70B remain limited to baseline evaluation and are inconclusive with respect to budget\-dependent dynamics\.
We introduceCABStop, a calibration\-aware stopping rule that halts reasoning when confidence diverges from an auxiliary accuracy estimate\. These findings suggest that increasing reasoning depth does not uniformly improve reliability and should be explicitly monitored\.
Keywords:calibration, chain\-of\-thought reasoning, Expected Calibration Error, overconfidence, large language models, reasoning budgets
###### Contents
1. [1Introduction](https://arxiv.org/html/2606.11211#S1)1. [1\.1The Core Phenomenon](https://arxiv.org/html/2606.11211#S1.SS1) 2. [1\.2Why This Matters](https://arxiv.org/html/2606.11211#S1.SS2) 3. [1\.3Contributions](https://arxiv.org/html/2606.11211#S1.SS3)
2. [2Background and Related Work](https://arxiv.org/html/2606.11211#S2)1. [2\.1Calibration in Machine Learning](https://arxiv.org/html/2606.11211#S2.SS1) 2. [2\.2Chain\-of\-Thought Reasoning](https://arxiv.org/html/2606.11211#S2.SS2) 3. [2\.3Overconfidence in LLMs](https://arxiv.org/html/2606.11211#S2.SS3) 4. [2\.4Stopping Rules and Optimal Inference](https://arxiv.org/html/2606.11211#S2.SS4)
3. [3Formalizing Calibration Drift Under Reasoning](https://arxiv.org/html/2606.11211#S3)1. [3\.1Reasoning Budget](https://arxiv.org/html/2606.11211#S3.SS1) 2. [3\.2Calibration as a Function of Reasoning Budget](https://arxiv.org/html/2606.11211#S3.SS2) 3. [3\.3Definition of CDUR](https://arxiv.org/html/2606.11211#S3.SS3) 4. [3\.4Propositions](https://arxiv.org/html/2606.11211#S3.SS4)
4. [4Mechanistic Model: Hypothesis Lock\-In](https://arxiv.org/html/2606.11211#S4)1. [4\.1The Commitment Model](https://arxiv.org/html/2606.11211#S4.SS1) 2. [4\.2Connection to Autoregressive Generation](https://arxiv.org/html/2606.11211#S4.SS2) 3. [4\.3Connection to RLHF Reward Shaping](https://arxiv.org/html/2606.11211#S4.SS3) 4. [4\.4Empirical Signatures of Hypothesis Lock\-In](https://arxiv.org/html/2606.11211#S4.SS4) 5. [4\.5Hypothesis Lock\-In Diagram](https://arxiv.org/html/2606.11211#S4.SS5)
5. [5Experimental Setup](https://arxiv.org/html/2606.11211#S5)1. [5\.1Models](https://arxiv.org/html/2606.11211#S5.SS1) 2. [5\.2Dataset Construction](https://arxiv.org/html/2606.11211#S5.SS2) 3. [5\.3Validity Filtering and Potential Bias](https://arxiv.org/html/2606.11211#S5.SS3) 4. [5\.4Reasoning Budgets](https://arxiv.org/html/2606.11211#S5.SS4) 5. [5\.5Confidence Elicitation and Its Limitations](https://arxiv.org/html/2606.11211#S5.SS5) 6. [5\.6Metrics](https://arxiv.org/html/2606.11211#S5.SS6)
6. [6Results](https://arxiv.org/html/2606.11211#S6)1. [6\.1Main Result: Non\-Monotonic Calibration Dynamics](https://arxiv.org/html/2606.11211#S6.SS1) 2. [6\.2Interpreting the 8B Calibration Arc](https://arxiv.org/html/2606.11211#S6.SS2) 3. [6\.3The 70B Results Gap](https://arxiv.org/html/2606.11211#S6.SS3) 4. [6\.4Wrong \+ Confident Analysis \(Smoking Gun\)](https://arxiv.org/html/2606.11211#S6.SS4) 5. [6\.5Confidence–Accuracy Scatter: Limitations of Aggregate ECE](https://arxiv.org/html/2606.11211#S6.SS5) 6. [6\.6Statistical Uncertainty and Variance](https://arxiv.org/html/2606.11211#S6.SS6) 7. [6\.7Confidence–Accuracy Decoupling](https://arxiv.org/html/2606.11211#S6.SS7) 8. [6\.8Error Persistence Under Extended Reasoning](https://arxiv.org/html/2606.11211#S6.SS8) 9. [6\.9When More Reasoning Helps vs\. Hurts](https://arxiv.org/html/2606.11211#S6.SS9) 10. [6\.10Trap Category Analysis](https://arxiv.org/html/2606.11211#S6.SS10)
7. [7The CABStop Algorithm](https://arxiv.org/html/2606.11211#S7)1. [7\.1Motivation](https://arxiv.org/html/2606.11211#S7.SS1) 2. [7\.2Formulation as an Optimal Stopping Problem](https://arxiv.org/html/2606.11211#S7.SS2) 3. [7\.3Algorithm](https://arxiv.org/html/2606.11211#S7.SS3) 4. [7\.4TikZ: CABStop Mechanism](https://arxiv.org/html/2606.11211#S7.SS4) 5. [7\.5Discussion](https://arxiv.org/html/2606.11211#S7.SS5)
8. [8Threats to Validity](https://arxiv.org/html/2606.11211#S8)
9. [9Discussion](https://arxiv.org/html/2606.11211#S9)1. [9\.1When CDUR Should Be Expected](https://arxiv.org/html/2606.11211#S9.SS1) 2. [9\.2Implications for Inference\-Time Scaling](https://arxiv.org/html/2606.11211#S9.SS2) 3. [9\.3Implications for Model Evaluation](https://arxiv.org/html/2606.11211#S9.SS3) 4. [9\.4Theoretical Limitations](https://arxiv.org/html/2606.11211#S9.SS4) 5. [9\.5Connections to Human Reasoning](https://arxiv.org/html/2606.11211#S9.SS5)
10. [10Future Work](https://arxiv.org/html/2606.11211#S10)
11. [11Conclusion](https://arxiv.org/html/2606.11211#S11)
12. [References](https://arxiv.org/html/2606.11211#bib)
13. [AProof Details](https://arxiv.org/html/2606.11211#A1)1. [A\.1Formal Consistency Score](https://arxiv.org/html/2606.11211#A1.SS1) 2. [A\.2Discussion of Assumption Robustness](https://arxiv.org/html/2606.11211#A1.SS2)
14. [BDataset: Trap Question Examples](https://arxiv.org/html/2606.11211#A2)
15. [CExperimental Logs \(Summary\)](https://arxiv.org/html/2606.11211#A3)
## 1Introduction
A model is*well\-calibrated*if its expressed confidence in an answer reliably tracks the probability that the answer is correct\[De Groot and Fienberg,[1983](https://arxiv.org/html/2606.11211#bib.bib2), Guo et al\.,[2017](https://arxiv.org/html/2606.11211#bib.bib5)\]\. Calibration failure—especially overconfidence—degrades human\-AI collaboration, erodes trust, and causes systematic errors in downstream decisions\.
Chain\-of\-thought \(CoT\) prompting\[Wei et al\.,[2022](https://arxiv.org/html/2606.11211#bib.bib17)\]has become a standard technique for improving LLM performance on multi\-step reasoning tasks\. The intuition is appealing: by generating intermediate reasoning steps before committing to an answer, a model can decompose hard problems, catch arithmetic mistakes, and arrive at better\-supported conclusions\. Large\-scale results confirm that longer reasoning chains often raise accuracy\[Kojima et al\.,[2022](https://arxiv.org/html/2606.11211#bib.bib8), Lightman et al\.,[2023](https://arxiv.org/html/2606.11211#bib.bib9)\]\.
Our central claim\.Accuracy improvement does not imply calibration improvement\. More precisely, we observe that at least in the small\-model regime and on structured reasoning\-trap tasks, increasing reasoning budget can*inflate confidence without proportionally inflating correctness*—a regime we callCalibration Drift Under Reasoning\(cdur\)\. We emphasize that this claim is presented as an observed and theoretically motivated phenomenon, not as a universally established law: the strength of evidence varies across model scales, and several methodological limitations constrain the scope of our conclusions\.
### 1\.1The Core Phenomenon
Consider a model that initially responds to a question with low confidence and imperfect accuracy\. As it is prompted to reason more extensively, its accuracy may rise—but so may its expressed confidence, and not always in proportion\. At some budget levelB∗B^\{\*\}, confidence may begin to outpace accuracy\. PastB∗B^\{\*\}, the model is not merely wrong; it is*confidently wrong*\.
This pattern is illustrated schematically in Figure[2](https://arxiv.org/html/2606.11211#S6.F2)and explored empirically in Section[6](https://arxiv.org/html/2606.11211#S6), where we find behavior consistent with this description for Llama\-3\.1\-8B\. Evidence for the larger Llama\-3\.3\-70B model is incomplete \(Section[6\.3](https://arxiv.org/html/2606.11211#S6.SS3)\)\.
### 1\.2Why This Matters
Safety\.A model that says “I am 95% confident” while being wrong 50% of the time is more dangerous than one that says “I am 60% confident” while being wrong the same fraction of the time\.
Inference\-time scaling\.Recent work proposes allocating more compute at inference time to difficult problems\[Snell et al\.,[2024](https://arxiv.org/html/2606.11211#bib.bib13)\]\. Without calibration awareness, such scaling may backfire: the model becomes more expensive*and*, in certain regimes, more overconfident\.
Human oversight\.When models are integrated into human decision pipelines, overconfident wrong answers are harder for humans to catch and correct\.
### 1\.3Contributions
This paper makes the following contributions, stated precisely to reflect the scope of our evidence:
1. \(1\)Formal framework\.We provide a rigorous definition ofcduras a property of the calibration functionECE\(B\)\\mathrm\{ECE\}\(B\), and state three formal propositions characterizing its behavior under a probabilistic reasoning model \(Section[3](https://arxiv.org/html/2606.11211#S3)\)\. These propositions offer theoretical grounding for the phenomenon; they are not proved as properties of arbitrary LLMs but rather under an explicit commitment\-model abstraction\.
2. \(2\)Hypothesis Lock\-In Model\.We introduce and analyze a mechanistic model of autoregressive reasoning that explains why calibration drift may occur, and characterize the conditions under which it is most severe \(Section[4](https://arxiv.org/html/2606.11211#S4)\)\.
3. \(3\)Empirical observation\.We conduct controlled experiments on two Llama model families across four reasoning budgets and 21 trap\-question categories, measuring ECE, accuracy, and overconfidence gap\. For Llama\-3\.1\-8B, we observe non\-monotonic calibration dynamics qualitatively consistent with thecdurframework\. For Llama\-3\.3\-70B, results are limited to the no\-reasoning condition and are thus suggestive but inconclusive \(Section[6](https://arxiv.org/html/2606.11211#S6)\)\.
4. \(4\)CABStopalgorithm\.We propose a principled stopping rule for inference\-time reasoning that halts before the overconfidence regime, and analyze it as an instance of optimal stopping \(Section[7](https://arxiv.org/html/2606.11211#S7)\)\.
## 2Background and Related Work
### 2\.1Calibration in Machine Learning
The formal study of calibration originates in the forecasting literature\[De Groot and Fienberg,[1983](https://arxiv.org/html/2606.11211#bib.bib2), Murphy,[1977](https://arxiv.org/html/2606.11211#bib.bib11)\]\.Guo et al\. \[[2017](https://arxiv.org/html/2606.11211#bib.bib5)\]showed that modern deep neural networks are poorly calibrated and that temperature scaling provides a simple post\-hoc fix\.Desai and Durrett \[[2020](https://arxiv.org/html/2606.11211#bib.bib3)\]extended this analysis to text classification with pre\-trained language models\. For generative models, calibration must be measured via verbalized probabilities\[Kadavath et al\.,[2022](https://arxiv.org/html/2606.11211#bib.bib7), Xiong et al\.,[2024](https://arxiv.org/html/2606.11211#bib.bib18)\]since token\-level likelihoods are not directly accessible in deployment\.
### 2\.2Chain\-of\-Thought Reasoning
Wei et al\. \[[2022](https://arxiv.org/html/2606.11211#bib.bib17)\]demonstrated that few\-shot CoT prompting substantially improves performance on arithmetic and commonsense reasoning\.Kojima et al\. \[[2022](https://arxiv.org/html/2606.11211#bib.bib8)\]showed that zero\-shot CoT \(“let’s think step by step”\) achieves comparable gains\.Lightman et al\. \[[2023](https://arxiv.org/html/2606.11211#bib.bib9)\]studied process\-level supervision, showing that rewarding correct intermediate steps further improves accuracy\.
Budget\-constrained reasoning has been explored in the context of inference\-time scaling\[Snell et al\.,[2024](https://arxiv.org/html/2606.11211#bib.bib13), Muennighoff et al\.,[2025](https://arxiv.org/html/2606.11211#bib.bib10)\], where additional compute is allocated proportionally to estimated problem difficulty\.
### 2\.3Overconfidence in LLMs
Xiong et al\. \[[2024](https://arxiv.org/html/2606.11211#bib.bib18)\]showed that LLMs are systematically overconfident when asked to verbalize their confidence, especially on harder questions\.Zhou et al\. \[[2023](https://arxiv.org/html/2606.11211#bib.bib19)\]documented that chain\-of\-thought can increase hallucination in certain conditions\.Turpin et al\. \[[2023](https://arxiv.org/html/2606.11211#bib.bib14)\]showed that spurious reasoning patterns are common even in high\-accuracy responses\. Our work is distinguished by the explicit focus on the*interaction*between reasoning budget and calibration, characterized formally as a functionECE\(B\)\\mathrm\{ECE\}\(B\)\.
### 2\.4Stopping Rules and Optimal Inference
The optimal stopping literature\[Wald,[1947](https://arxiv.org/html/2606.11211#bib.bib15), Chow and Robbins,[1961](https://arxiv.org/html/2606.11211#bib.bib1)\]provides a natural framework for deciding when to halt a sequential computation\.Graves \[[2016](https://arxiv.org/html/2606.11211#bib.bib4)\]applied adaptive computation to recurrent networks\. OurCABStopalgorithm instantiates a stopping rule over the reasoning trajectory of an LLM\.
## 3Formalizing Calibration Drift Under Reasoning
We now provide a formal account of thecdurphenomenon\. Let𝒬\\mathcal\{Q\}be a set of questions and𝒜\\mathcal\{A\}a label space\. A modelℳ\\mathcal\{M\}takes as input a questionq∈𝒬q\\in\\mathcal\{Q\}, a reasoning budgetB∈ℕ∪\{0\}B\\in\\mathbb\{N\}\\cup\\\{0\\\}, and produces an answera^∈𝒜\\hat\{a\}\\in\\mathcal\{A\}together with a verbalized confidencep^∈\[0,1\]\\hat\{p\}\\in\[0,1\]\.
### 3\.1Reasoning Budget
Definition 3\.1\(Reasoning Budget\)A*reasoning budget*BBis an upper bound on the number of tokens allocated to intermediate reasoning steps prior to answer generation\. We sayB=0B=0corresponds to direct \(no\-reasoning\) inference, andB=∞B=\\inftydenotes unbounded reasoning\.In practice, we discretize:B∈\{none,light,medium,heavy\}B\\in\\\{\\text\{none\},\\text\{light\},\\text\{medium\},\\text\{heavy\}\\\}, corresponding to approximately0,128128,512512, and20482048tokens respectively\.
### 3\.2Calibration as a Function of Reasoning Budget
Definition 3\.2\(Budget\-Conditional Calibration\)For a reasoning budgetBB, define:Acc\(B\)\\displaystyle\\mathrm\{Acc\}\(B\)=ℙq\[a^\(q,B\)=a∗\(q\)\],\\displaystyle=\\mathbb\{P\}\_\{q\}\[\\hat\{a\}\(q,B\)=a^\{\*\}\(q\)\],\(1\)Conf\(B\)\\displaystyle\\mathrm\{Conf\}\(B\)=𝔼q\[p^\(q,B\)\],\\displaystyle=\\mathbb\{E\}\_\{q\}\[\\hat\{p\}\(q,B\)\],\(2\)OG\(B\)\\displaystyle\\mathrm\{OG\}\(B\)=Conf\(B\)−Acc\(B\)\(overconfidence gap\),\\displaystyle=\\mathrm\{Conf\}\(B\)\-\\mathrm\{Acc\}\(B\)\\quad\\text\{\(overconfidence gap\)\},\(3\)wherea∗\(q\)a^\{\*\}\(q\)is the ground\-truth answer and expectations are over the question distribution\.The*Expected Calibration Error*at budgetBBisECE\(B\)=𝔼\[\|p^\(q,B\)−ℙ\[a^\(q,B\)=a∗\(q\)\|p^\(q,B\)\]\|\],\\mathrm\{ECE\}\(B\)=\\mathbb\{E\}\\bigl\[\|\\hat\{p\}\(q,B\)\-\\mathbb\{P\}\[\\hat\{a\}\(q,B\)=a^\{\*\}\(q\)\\,\|\\,\\hat\{p\}\(q,B\)\]\|\\bigr\],\(4\)where the expectation is over questions and the inner probability is over randomness in generation\.
### 3\.3Definition of CDUR
Definition 3\.3\(Calibration Drift Under Reasoning,cdur\)A modelℳ\\mathcal\{M\}exhibits*Calibration Drift Under Reasoning*if there exists a*critical budget*B∗∈\(0,∞\)B^\{\*\}\\in\(0,\\infty\)such that:dECE\(B\)dB<0forB<B∗,dECE\(B\)dB\>0forB\>B∗\.\\frac\{d\\,\\mathrm\{ECE\}\(B\)\}\{dB\}<0\\quad\\text\{for \}B<B^\{\*\},\\qquad\\frac\{d\\,\\mathrm\{ECE\}\(B\)\}\{dB\}\>0\\quad\\text\{for \}B\>B^\{\*\}\.\(5\)That is,B↦ECE\(B\)B\\mapsto\\mathrm\{ECE\}\(B\)is U\-shaped with a minimum atB∗B^\{\*\}\.
Intuitively, initial reasoning \(B<B∗B<B^\{\*\}\) resolves surface ambiguity and improves calibration\. Extended reasoning \(B\>B∗B\>B^\{\*\}\) locks in an incorrect hypothesis and accumulates spurious internal evidence, raising confidence without improving accuracy\. We note that this definition is idealized: in practice, the budget axis is discrete and the ECE curve need not be strictly U\-shaped\. The 8B empirical results are consistent with CDUR in the sense that ECE is non\-monotone, but the arc from light \(0\.1040\.104\) to medium \(0\.0500\.050\) to heavy \(0\.0150\.015\) does not match a simple U\-shape\. We discuss this discrepancy in Section[6\.2](https://arxiv.org/html/2606.11211#S6.SS2)\.
### 3\.4Propositions
Proposition 3\.4\(Confidence Inflation Under Commitment\)Under a commitment model \(Definition[4\.1](https://arxiv.org/html/2606.11211#S4.SS1)\), extended reasoning increases conditional confidencep^\(q,B\)\\hat\{p\}\(q,B\)monotonically inBB, even when the correctness probabilityℙ\[a^=a∗\]\\mathbb\{P\}\[\\hat\{a\}=a^\{\*\}\]is unchanged\.
###### Proof sketch\.
Leth0h\_\{0\}denote the initial hypothesis sampled at the start of the reasoning chain \(Definition[4\.1](https://arxiv.org/html/2606.11211#S4.SS1)\)\. Subsequent tokens are generated conditioned onh0h\_\{0\}\. The verbalized confidencep^\\hat\{p\}is a function of the*internal consistency*of the chain: more tokens consistent withh0h\_\{0\}yield higherp^\\hat\{p\}\.
Formally, letRtR\_\{t\}be thett\-th reasoning token and let𝒞\(R1,…,Rt;h0\)\\mathcal\{C\}\(R\_\{1\},\\ldots,R\_\{t\};h\_\{0\}\)be a consistency score \(e\.g\., fraction of tokens that reinforceh0h\_\{0\}\)\. Under an autoregressive model conditioned onh0h\_\{0\},𝔼\[𝒞\(R1,…,Rt;h0\)\]\\mathbb\{E\}\[\\mathcal\{C\}\(R\_\{1\},\\ldots,R\_\{t\};h\_\{0\}\)\]is non\-decreasing inttbecause each subsequent token is drawn from a distribution already conditioned onh0h\_\{0\}, making tokens supportingh0h\_\{0\}more likely\.
Sincep^=g\(𝒞\)\\hat\{p\}=g\(\\mathcal\{C\}\)for some non\-decreasinggg, it follows that𝔼\[p^\(q,B\)\]\\mathbb\{E\}\[\\hat\{p\}\(q,B\)\]is non\-decreasing inBB\. Meanwhile, the correctness event depends only on whetherh0h\_\{0\}is the correct hypothesis, which is fixed at sampling time and unchanged by subsequent reasoning\. ∎
Proposition 3\.5\(Error Amplification\)If the initial hypothesis selection has error probabilityε\\varepsilon, then the expected overconfidence gap𝔼\[OG\(B\)\]\\mathbb\{E\}\[\\mathrm\{OG\}\(B\)\]is monotonically non\-decreasing inBBforB\>B∗B\>B^\{\*\}\.
###### Proof sketch\.
Partition questions into two sets:𝒬\+\\mathcal\{Q\}^\{\+\}whereh0=a∗h\_\{0\}=a^\{\*\}\(correct initialization\) and𝒬−\\mathcal\{Q\}^\{\-\}whereh0≠a∗h\_\{0\}\\neq a^\{\*\}\(incorrect\)\. Forq∈𝒬\+q\\in\\mathcal\{Q\}^\{\+\}, extended reasoning increasesp^\\hat\{p\}while maintaining correctness, leavingOG\\mathrm\{OG\}roughly constant\. Forq∈𝒬−q\\in\\mathcal\{Q\}^\{\-\}, extended reasoning increasesp^\\hat\{p\}while accuracy remains 0, increasingOG\\mathrm\{OG\}by Proposition 3\.4\. The overall gap satisfies
𝔼\[OG\(B\)\]=\(1−ε\)⋅OG\+\(B\)\+ε⋅OG−\(B\),\\mathbb\{E\}\[\\mathrm\{OG\}\(B\)\]=\(1\{\-\}\\varepsilon\)\\cdot\\mathrm\{OG\}^\{\+\}\(B\)\+\\varepsilon\\cdot\\mathrm\{OG\}^\{\-\}\(B\),whereOG\+\(B\)≈0\\mathrm\{OG\}^\{\+\}\(B\)\\approx 0andOG−\(B\)=p^−\(B\)\\mathrm\{OG\}^\{\-\}\(B\)=\\hat\{p\}^\{\-\}\(B\)is non\-decreasing\. Sinceε\>0\\varepsilon\>0,𝔼\[OG\(B\)\]\\mathbb\{E\}\[\\mathrm\{OG\}\(B\)\]is non\-decreasing inBB\. ∎
Proposition 3\.6\(Accuracy Plateau with Confidence Growth\)There exists a regime\[B∗,B∗∗\]\[B^\{\*\},B^\{\*\*\}\]in whichAcc\(B\)\\mathrm\{Acc\}\(B\)is approximately constant whileConf\(B\)\\mathrm\{Conf\}\(B\)continues to increase, resulting in a widening overconfidence gap\.
###### Proof sketch\.
AccuracyAcc\(B\)\\mathrm\{Acc\}\(B\)depends on whether reasoning successfully corrects errors inh0h\_\{0\}\. Correction requires the model to generate a token sequence that*contradicts*h0h\_\{0\}and substitutes an alternative—this is a low\-probability event under the commitment model \(the model is explicitly conditioned onh0h\_\{0\}\)\. For largeBB, the probability of such a correction decreases further since more of the context supportsh0h\_\{0\}\. HenceAcc\(B\)\\mathrm\{Acc\}\(B\)is approximately constant pastB∗B^\{\*\}\. In contrast,Conf\(B\)\\mathrm\{Conf\}\(B\)continues to grow by Proposition 3\.4\. The differenceConf\(B\)−Acc\(B\)=OG\(B\)\\mathrm\{Conf\}\(B\)\-\\mathrm\{Acc\}\(B\)=\\mathrm\{OG\}\(B\)is therefore increasing on\[B∗,B∗∗\]\[B^\{\*\},B^\{\*\*\}\]for anyB∗∗\>B∗B^\{\*\*\}\>B^\{\*\}\. ∎
## 4Mechanistic Model: Hypothesis Lock\-In
### 4\.1The Commitment Model
Definition 4\.1\(Commitment Model\)Given a questionqq, reasoning under budgetBBproceeds in three stages:1\.Hypothesis sampling\.The model draws an initial hypothesish0∼Pθ\(⋅\|q\)h\_\{0\}\\sim P\_\{\\theta\}\(\\cdot\\,\|\\,q\)from the prior induced by its parametersθ\\theta\.2\.Reasoning generation\.The model generates a reasoning chainR=\(R1,…,RB\)R=\(R\_\{1\},\\ldots,R\_\{B\}\)token by token:Rt∼Pθ\(⋅\|q,h0,R1,…,Rt−1\)R\_\{t\}\\sim P\_\{\\theta\}\(\\cdot\\,\|\\,q,h\_\{0\},R\_\{1\},\\ldots,R\_\{t\-1\}\)\.3\.Confidence elicitation\.The model produces an answera^\\hat\{a\}and confidencep^\\hat\{p\}based on the full context\(q,R\)\(q,R\)\.
This model is an idealization of autoregressive generation\. The key structural feature is Step 2: the reasoning chain is conditioned onh0h\_\{0\}, making it a constrained trajectory rather than a free search\. An important consequence is thatPθ\(Rt\|q,h0,R<t\)P\_\{\\theta\}\(R\_\{t\}\\,\|\\,q,h\_\{0\},R\_\{<t\}\)assigns higher mass to tokens that are consistent withh0h\_\{0\}, and the probability of generating a token that contradictsh0h\_\{0\}is low\.
### 4\.2Connection to Autoregressive Generation
In an autoregressive LLM,h0h\_\{0\}is not sampled explicitly\. Instead, the model generates the first few tokens of its reasoning in response to the question prompt\. These early tokens function ash0h\_\{0\}in our model: they determine the trajectory of subsequent generation via attention\. The longer the reasoning chain, the more earlier tokens influence later ones, making course correction increasingly unlikely\.
This is consistent with empirical findings on self\-consistency\[Wang et al\.,[2022](https://arxiv.org/html/2606.11211#bib.bib16)\]: longer reasoning chains are*more*self\-consistent, but self\-consistency does not imply correctness\.
### 4\.3Connection to RLHF Reward Shaping
Modern LLMs are fine\-tuned with Reinforcement Learning from Human Feedback \(RLHF\)\[Ouyang et al\.,[2022](https://arxiv.org/html/2606.11211#bib.bib12)\]\. Human raters tend to prefer responses that are confident and internally coherent\[Turpin et al\.,[2023](https://arxiv.org/html/2606.11211#bib.bib14)\]\. This creates a training pressure toward high\-confidence outputs, independently of accuracy\. Combined with the commitment model, RLHF provides a mechanistic pathway through which training may amplify thecdurphenomenon: models are rewarded for*appearing*confident, especially when their reasoning is fluent and internally consistent\.
### 4\.4Empirical Signatures of Hypothesis Lock\-In
The Hypothesis Lock\-In Model makes several testable predictions\. We examine each against our empirical results\.
#### Persistence of errors across budget levels\.
If lock\-in occurs, we expect specific incorrect answers to persist as the budget increases: once committed to a wrongh0h\_\{0\}, the model should continue predicting the same wrong answer even when given additional reasoning tokens\. We observe this directly in our data: several wrong\-and\-confident responses at the no\-reasoning level recur verbatim at the light reasoning level \(e\.g\., the syllogism case, expected: “no”, predicted: “yes” with confidence 1\.0, appearing at both*none*and*light*budgets\)\. This is qualitative evidence consistent with Proposition 3\.4\.
#### Stability of incorrect predictions\.
Under the commitment model, incorrect answers should be stable \(repeated across seeds\) rather than randomly distributed across incorrect options\. While we do not report per\-item cross\-seed analysis due to sample size constraints, the dominance of a small number of trap categories in the wrong\-and\-confident distribution \(counting, set\_theory, spatial\) suggests structured rather than random error—these categories systematically elicit wrong confident answers, consistent with stableh0h\_\{0\}sampling for these problem types\.
#### Failure of extended reasoning to correct\.
Proposition 3\.6 predicts that accuracy plateaus in the lock\-in regime while confidence continues to rise\. For Llama\-3\.1\-8B, accuracy at medium reasoning \(0\.6530\.653\) is actually lower than at light \(0\.7320\.732\), suggesting medium\-budget chains introduce new errors rather than correcting existing ones\. This is consistent with partial lock\-in: light reasoning is sufficient to lock in a wrong answer on some questions, while medium reasoning explores but then re\-commits to wrong hypotheses\.
#### Connection to formal propositions\.
Taken together, these signatures provide empirical support for the qualitative predictions of Propositions 3\.4 and 3\.6: confidence rises faster than accuracy \(OG\>\>0\.25 at all budgets\), and wrong answers show cross\-budget stability\. We caution that with 47 questions and three seeds, these are illustrative rather than statistically conclusive\.
### 4\.5Hypothesis Lock\-In Diagram
Figure[1](https://arxiv.org/html/2606.11211#S4.F1)provides a schematic of the lock\-in process\.
Figure 1:Hypothesis Lock\-In\. The initial hypothesish0h\_\{0\}commits the reasoning trajectory\. Correct initialization leads to self\-reinforcing correct chains; incorrect initialization leads to self\-reinforcing incorrect chains\. In both cases, confidencep^\\hat\{p\}tends to rise with chain length\.
## 5Experimental Setup
### 5\.1Models
We evaluate two models from the Llama family:
- •Llama\-3\.1\-8B:A compact instruction\-tuned model\. We use it to studycdurin the small\-model regime\. This model is the primary source of multi\-budget evidence in this paper\.
- •Llama\-3\.3\-70B:A high\-capacity model\. Due to resource constraints, we report only the no\-reasoning condition for this model; conclusions about its calibration dynamics across budgets are therefore not possible from current data \(see Section[6\.3](https://arxiv.org/html/2606.11211#S6.SS3)\)\.
Both models are accessed via their instruction\-following variants\. Inference temperature is set to0\.70\.7to allow for variability across seeds\.
### 5\.2Dataset Construction
We construct a dataset ofreasoning\-trap questions: questions specifically designed to elicit common cognitive failure modes\. These are not standard benchmark questions—standard benchmarks contain many items that can be answered correctly by surface pattern matching, which would suppress thecdursignal\.
We identify 21 trap categories, listed in Table[1](https://arxiv.org/html/2606.11211#S5.T1)\. Each category is designed to exploit a specific failure mode that is well\-documented in the human cognition and LLM error analysis literatures\. We acknowledge that the dataset is small \(47 questions\), and results should be interpreted accordingly\.
Table 1:Reasoning\-trap categories used in the evaluation dataset\. Categories are ordered by observed wrong \+ confident frequency \(see Figure[3](https://arxiv.org/html/2606.11211#S6.F3)\)\.The dataset contains 47 distinct trap questions\. Each question is evaluated over 3 random seeds and 4 reasoning budgets, yielding47×3×4×2=1,12847\\times 3\\times 4\\times 2=1\{,\}128intended evaluations per model family \(plus non\-trap items\), for a total of 1,368 API calls\.
### 5\.3Validity Filtering and Potential Bias
Not all model responses can be scored\. A response is consideredvalidif:
1. \(1\)It contains an extractable answer in the expected format \(numeric, boolean, or short\-form text\)\.
2. \(2\)It contains an extractable verbalized confidence in\[0,1\]\[0,1\]or as a percentage\.
3. \(3\)Neither the answer nor the confidence field is empty or a refusal\.
Of 1,368 total responses,574574were valid trap\-question responses \(42% overall validity rate\)\. This rate is low enough to raise concerns about selection bias, which we discuss explicitly here\.
#### Causes of invalidity\.
Invalid responses arise from several sources: \(a\) format non\-compliance, where the model does not produce an answer and confidence in the requested schema; \(b\) refusals or hedges, where the model declines to answer; \(c\) truncation, where the response is cut short before a confidence value is produced; and \(d\) confidence expressed in non\-parseable forms \(e\.g\., “moderately confident”\)\.
#### Selection bias risk\.
Critically, invalidity may be*correlated with uncertainty*\. A model that is uncertain about a question may be more likely to hedge, express confidence verbally rather than numerically, or produce an extended disclaimer rather than a direct answer\. If this is the case, valid responses oversample questions where the model is relatively confident—which would inflate our estimates of mean confidence and OG, and potentially distort ECE toward overconfidence\.
#### Budget\-specific effects\.
Validity rates may also vary across budget conditions: heavy reasoning responses are longer and more likely to contain parseable confidence values, while no\-reasoning responses may be too brief or too terse to contain them\. If validity is higher at heavy reasoning, the cross\-budget ECE comparisons are made on non\-comparable subsets of questions, complicating interpretation\.
#### Mitigation and transparency\.
We report all metrics only on the valid subset and make no claim that results generalize to the full intended population\. Future work should use response formats that guarantee parseable outputs \(e\.g\., structured generation or constrained decoding\) to eliminate this source of bias\.
### 5\.4Reasoning Budgets
We implement reasoning budgets via prompt engineering\. Each budget condition uses a specific system\-level instruction:
NoneAnswer directly\. Do not show intermediate reasoning\.
LightShow a brief 2–3 sentence chain of thought before answering\.
MediumShow a structured, step\-by\-step solution before answering\.
HeavyWork through the problem completely, exploring multiple approaches and checking your work, before giving a final answer\.
### 5\.5Confidence Elicitation and Its Limitations
After each response, we append a standardized follow\-up prompt:“On a scale from 0 to 1, how confident are you in your answer? Give only a number\.”
#### Verbalized confidence vs\. epistemic uncertainty\.
Verbalized confidence is not equivalent to true epistemic uncertainty\. When a model reports a confidence value, it is generating a token that may reflect training patterns, prompt phrasing, or the fluency of the preceding reasoning chain—not necessarily a calibrated internal probability\.Xiong et al\. \[[2024](https://arxiv.org/html/2606.11211#bib.bib18)\]have shown that verbalized confidence in LLMs is systematically miscalibrated, particularly on difficult questions\.
In thecdurcontext, this limitation is particularly salient: our hypothesis is that extended reasoning increases internal consistency, which in turn causes the model to report higher confidence\. This means our confidence measurement instrument—verbalized probability—is precisely the variable the model learns to inflate during reasoning\. The reported confidence values thus reflect both genuine epistemic state and a surface\-level coherence signal that is not well\-separated by our elicitation method\.
#### Why results remain meaningful\.
Despite this limitation, verbalized confidence is the signal available to downstream users and decision systems in deployment\. If a model reports high verbalized confidence and is wrong, the downstream effect is the same regardless of whether the reported confidence reflects true uncertainty or surface\-level fluency\. Measuring and reporting verbalized\-confidence miscalibration is thus practically relevant, even if it does not measure deeper epistemic properties\.
#### Alternative confidence signals\.
Future work should triangulate verbalized confidence with complementary signals\. Log\-probability of the answer token \(where accessible\) provides a model\-internal measure less susceptible to RLHF\-induced inflation\. Self\-consistency acrosskkindependent completions\[Wang et al\.,[2022](https://arxiv.org/html/2606.11211#bib.bib16)\]provides a behavioral estimate of answer stability\. Ensemble agreement across models provides an orthogonal check\. Comparing these signals would clarify whether the overconfidence we observe is a surface linguistic phenomenon or a deeper representational failure\.
### 5\.6Metrics
#### Expected Calibration Error \(ECE\)\.
We use equal\-width binning with 10 bins over\[0,1\]\[0,1\]:
ECE=∑m=1M\|Bm\|N\|Acc¯\(Bm\)−Conf¯\(Bm\)\|,\\mathrm\{ECE\}=\\sum\_\{m=1\}^\{M\}\\frac\{\|B\_\{m\}\|\}\{N\}\\bigl\|\\overline\{\\mathrm\{Acc\}\}\(B\_\{m\}\)\-\\overline\{\\mathrm\{Conf\}\}\(B\_\{m\}\)\\bigr\|,\(6\)whereBmB\_\{m\}is themm\-th confidence bin,NNis the total number of samples, andAcc¯\\overline\{\\mathrm\{Acc\}\},Conf¯\\overline\{\\mathrm\{Conf\}\}are the mean accuracy and mean confidence within each bin\.
#### Overconfidence gap \(OG\)\.
OG=Conf¯−Acc¯\\mathrm\{OG\}=\\overline\{\\mathrm\{Conf\}\}\-\\overline\{\\mathrm\{Acc\}\}, measured globally\. Positive values indicate overconfidence; negative values indicate underconfidence\.
#### Wrong \+ Confident\.
The count of responses that are simultaneously incorrect \(a^≠a∗\\hat\{a\}\\neq a^\{\*\}\) and highly confident \(p^≥0\.90\\hat\{p\}\\geq 0\.90\)\. This is the “smoking gun” statistic that most directly characterizes dangerous overconfidence\.
## 6Results
### 6\.1Main Result: Non\-Monotonic Calibration Dynamics
Figure[2](https://arxiv.org/html/2606.11211#S6.F2)shows the theoreticalcdurcurve, and Table[3](https://arxiv.org/html/2606.11211#A3.T3)\(Appendix[C](https://arxiv.org/html/2606.11211#A3)\) reports full numerical results\. We summarize the key observations here\.
Figure 2:Schematic of thecdurphenomenon\.ECE\(B\)\\mathrm\{ECE\}\(B\)is U\-shaped in the reasoning budgetBBunder the theoretical model\. The green region \(B<B∗B<B^\{\*\}\) is the beneficial reasoning regime; the orange region \(B\>B∗B\>B^\{\*\}\) is the overconfidence regime\. Empirical results for 8B are qualitatively consistent with non\-monotone dynamics, but do not perfectly match this schematic \(see Section[6\.2](https://arxiv.org/html/2606.11211#S6.SS2)\)\.#### Llama\-3\.1\-8B\.
The 8B model exhibits non\-monotonic calibration dynamics: ECE is0\.044±0\.0150\.044\\pm 0\.015at no\-reasoning, rises to0\.104±0\.0340\.104\\pm 0\.034at light reasoning, then falls to0\.050±0\.0490\.050\\pm 0\.049at medium and0\.015±0\.0050\.015\\pm 0\.005at heavy reasoning\. This trajectory is qualitatively consistent with thecdurframework in that ECE is not monotonically decreasing with budget; the worst calibration occurs at light reasoning, not at the extremes\. We interpret this in detail in Section[6\.2](https://arxiv.org/html/2606.11211#S6.SS2)\.
#### Llama\-3\.3\-70B\.
The 70B model results are limited to the no\-reasoning condition: ECE=0\.035±0\.026=0\.035\\pm 0\.026, OG=\+0\.155=\+0\.155\. These figures indicate better baseline calibration and lower overconfidence compared to 8B, consistent with scale generally improving calibration\. However, we*cannot*confirm or deny CDUR dynamics for this model, as no multi\-budget data are available\. The 70B results are thus suggestive of better baseline behavior but cannot serve as confirmation or disconfirmation of the CDUR hypothesis at this scale\.
#### Overconfidence Gap\.
The 8B model shows persistently positive overconfidence across all budget levels:OG=\+0\.49\\mathrm\{OG\}=\+0\.49at no\-reasoning, falling to\+0\.25\+0\.25–\+0\.34\+0\.34at other budgets\. Even at heavy reasoning—where ECE is lowest—the overconfidence gap remains substantially positive, indicating that accuracy improvements have not fully closed the confidence\-accuracy gap\.
### 6\.2Interpreting the 8B Calibration Arc
The observed 8B trajectory \(ECE:0\.044→0\.104→0\.050→0\.0150\.044\\to 0\.104\\to 0\.050\\to 0\.015\) does not match a clean U\-shape as depicted in the schematic\. The following interpretation is more accurate\.
At*no reasoning*, the model produces direct, often confident answers\. Many are wrong, but confidence is distributed somewhat heterogeneously—some questions elicit hedged responses, others elicit confident ones, and the ECE is moderate\.
At*light reasoning*, the model generates brief chains that are often insufficient to fully analyze the trap question\. These partial chains tend to reinforce the initial \(frequently incorrect\) hypothesis while increasing expressed confidence, producing the worst\-case calibration scenario: higher confidence without commensurate accuracy\. This is the regime most clearly predicted by the commitment model\.
At*medium reasoning*, the model generates longer chains that sometimes begin to uncover the trap, but also sometimes compound errors by adding reasoning steps that re\-justify a wrong initial direction\. The high standard deviation \(±0\.049\\pm 0\.049\) at medium budget reflects this instability\.
At*heavy reasoning*, extended chains allow the model to explore multiple solution paths and occasionally self\-correct\. The combination of higher accuracy \(0\.7390\.739\) and still\-high confidence \(≈0\.984\\approx 0\.984\) results in a lower ECE\. However, persistent overconfidence \(OG=\+0\.245=\+0\.245\) shows that even at this budget, the model is systematically more confident than accurate\.
We characterize the 8B results as exhibiting*partial CDUR behavior*: the ECE trajectory is non\-monotone, the worst calibration occurs at intermediate reasoning budgets, and overconfidence persists throughout\. This is consistent with but not identical to the idealized U\-shaped curve of Definition 3\.3\.
### 6\.3The 70B Results Gap
Resource constraints limited the 70B evaluation to the no\-reasoning condition\. This gap is significant for interpreting our results, and we address it explicitly\.
#### What the 70B results do and do not show\.
The no\-reasoning ECE of0\.0350\.035and OG of\+0\.155\+0\.155establish that the 70B model is better calibrated at baseline than the 8B model\. The lower OG \(\+0\.155\+0\.155vs\.\+0\.493\+0\.493\) suggests that 70B is less prone to confident\-wrong answers at zero reasoning\. These findings are consistent with scale improving calibration—a pattern noted in prior work\[Kadavath et al\.,[2022](https://arxiv.org/html/2606.11211#bib.bib7)\]\.
However, we cannot infer anything about how the 70B model’s calibration*changes*with reasoning budget\. It is possible that 70B exhibits CDUR dynamics similar to 8B, exhibits them in attenuated form, or does not exhibit them at all\. All three outcomes are compatible with the current data\.
#### Appropriate framing\.
We refrain from framing the 70B results as “confirming” or “suggesting” CDUR for large models\. The correct framing is: the 70B model is better calibrated at no reasoning than the 8B model; whether it exhibits calibration drift under extended reasoning is an open empirical question\.
### 6\.4Wrong \+ Confident Analysis \(Smoking Gun\)
Table[2](https://arxiv.org/html/2606.11211#S6.T2)lists model responses where the model was simultaneously wrong and expressed confidence≥0\.90\\geq 0\.90\.
Table 2:Selected “smoking gun” examples: incorrect responses with verbalized confidence≥0\.90\\geq 0\.90\. All are from Llama\-3\.1\-8B, no\-reasoning condition\.These examples share a common structure: the model produces a superficially plausible but incorrect answer and expresses maximum confidence\. Notably, the counting and spatial categories dominate the wrong\+confident list \(Figure[3](https://arxiv.org/html/2606.11211#S6.F3)\), consistent with prior findings that LLMs struggle with precise discrete enumeration and 3D spatial reasoning\.
### 6\.5Confidence–Accuracy Scatter: Limitations of Aggregate ECE
ECE aggregates calibration error into a scalar, which obscures the structure of per\-sample confidence\-accuracy relationships\. A per\-sample scatter plot of\(p^i,𝟏\[a^i=ai∗\]\)\(\\hat\{p\}\_\{i\},\\mathbf\{1\}\[\\hat\{a\}\_\{i\}=a^\{\*\}\_\{i\}\]\)would reveal features that aggregate ECE does not capture\.
Under ideal calibration, such a scatter would align with the diagonal: responses with confidencep^\\hat\{p\}would be correct with probabilityp^\\hat\{p\}\. What we expect to observe in our data, based on the aggregate OG values, is a systematic upward displacement from the diagonal: a cluster of high\-confidence \(p^≈1\.0\\hat\{p\}\\approx 1\.0\) incorrect answers that should fall at zero accuracy on the scatter\.
This structure has practical implications beyond what ECE reports\. A model with ECE=0\.05=0\.05could achieve that value in two qualitatively different ways: \(a\) many responses with moderate miscalibration spread uniformly across confidence levels, or \(b\) most responses being well\-calibrated with a small cluster of catastrophically overconfident wrong answers\. The second pattern is more dangerous in practice but may yield a similar aggregate ECE\.
Given our wrong\-and\-confident count \(Table[2](https://arxiv.org/html/2606.11211#S6.T2)\), we expect our data to exhibit pattern \(b\)\. Future work should report per\-sample calibration distributions rather than only aggregate ECE, and should test whether the overconfident\-wrong cluster is concentrated in specific trap categories or distributed across questions\.
### 6\.6Statistical Uncertainty and Variance
#### Impact of small sample size\.
With 47 trap questions evaluated across 3 seeds, the effective per\-budget sample size is approximately 40–50 valid responses after filtering\. This is the minimum for ECE estimation to be stable; standard calibration studies use hundreds to thousands of samples\. Our ECE estimates should be treated as indicative rather than precise\.
#### Interpreting standard deviations\.
The reported standard deviations across seeds \(e\.g\., ECE=0\.050±0\.049=0\.050\\pm 0\.049at medium budget\) are large relative to the mean in some conditions\. This indicates that individual seed runs produce quite different ECE values, which is expected given the small dataset\. Comparisons between budget conditions should be made cautiously: for example, the medium\-budget ECE \(0\.0500\.050\) and no\-reasoning ECE \(0\.0440\.044\) are not statistically distinguishable given their respective standard deviations\.
#### Claims calibrated to evidence\.
Given these limitations, we state the following carefully: we*observe*that light\-reasoning ECE is substantially higher than both no\-reasoning and heavy\-reasoning ECE in the 8B model, and that this difference is directionally consistent across all three seeds\. We do*not*claim that this pattern is statistically significant by conventional tests, nor that it would replicate exactly on a different dataset\.
### 6\.7Confidence–Accuracy Decoupling
A central prediction of thecdurframework is thatConf\(B\)\\mathrm\{Conf\}\(B\)andAcc\(B\)\\mathrm\{Acc\}\(B\)can diverge: asBBincreases, confidence may rise faster \(or fall slower\) than accuracy\.
For Llama\-3\.1\-8B, accuracy increases from0\.4610\.461\(none\) to0\.7390\.739\(heavy\)\. However, the overconfidence gap remains above0\.250\.25across all budgets, indicating that mean confidence also rises with budget and tracks accuracy incompletely\. The gap is highest at no\-reasoning \(\+0\.493\+0\.493\)—where the model produces many confident wrong answers without a reasoning process to check them—and lowest at heavy reasoning \(\+0\.245\+0\.245\)\.
This decoupling is captured by Proposition 3\.6: there exists a range ofBBwhere accuracy plateaus but confidence continues to grow\. Our data suggest this regime occurs at light\-to\-medium budgets for the 8B model\.
### 6\.8Error Persistence Under Extended Reasoning
One might expect that heavy reasoning would always correct errors present at lower budgets\. Our data partially contradict this\. The syllogism example \(expected: no, predicted: yes, confidence 1\.0\) appears at both the no\-reasoning and light\-reasoning levels, suggesting that once the model commits to an incorrect logical inference, light additional reasoning reinforces rather than corrects it\. This is consistent with the lock\-in mechanism of Section[4](https://arxiv.org/html/2606.11211#S4)\.
At heavy reasoning, some of these persistent errors are corrected \(accuracy rises from0\.4610\.461to0\.7390\.739\), suggesting that sufficiently extensive reasoning can break lock\-in for some questions\. However, the non\-zero overconfidence gap at heavy reasoning \(\+0\.245\+0\.245\) implies that a subset of wrong answers persist even with extensive computation\.
### 6\.9When More Reasoning Helps vs\. Hurts
Based on our results, we characterize three regimes:
Regime I: Surface correction \(B≪B∗B\\ll B^\{\*\}\)\.Short reasoning chains catch arithmetic slips and minor ambiguities\. Accuracy rises; calibration improves\. This is the conventional wisdom about CoT\.
Regime II: Lock\-in \(B≈B∗B\\approx B^\{\*\}\)\.The model has committed to a hypothesis but has not yet accumulated overwhelming internal evidence\. Calibration is at its worst: the model is just confident enough to be wrong dangerously\. This is the critical budgetB∗B^\{\*\}\.
Regime III: Heavy reasoning \(B≫B∗B\\gg B^\{\*\}\)\.Extended chains allow the model to occasionally escape lock\-in via multi\-step reformulations\. ECE falls again as accuracy rises\. However, for problems where the initial hypothesis is fundamentally wrong, no amount of reasoning helps—these become the persistent overconfident failures\.
The practical implication is that*moderate*reasoning budgets may produce worse calibration than either no reasoning or heavy reasoning for calibration\-sensitive tasks\. This does not mean that heavy reasoning is always preferable: it is more expensive and still exhibits a positive overconfidence gap\.
### 6\.10Trap Category Analysis
Figure[3](https://arxiv.org/html/2606.11211#S6.F3)reports the distribution of wrong\+confident responses by trap category\. Counting, set\_theory, spatial, and semantic errors dominate\. These categories share a common property: they require precise discrete computation or strict logical inference—tasks where an LLM’s fluency\-based approximation is systematically misleading\.
Figure 3:Distribution of wrong\-and\-confident responses \(p^≥0\.90\\hat\{p\}\\geq 0\.90\) by trap category, aggregated across both models, all budgets, and all seeds\. Categories requiring precise discrete computation or strict logical reasoning dominate\.
## 7The CABStop Algorithm
### 7\.1Motivation
Thecduranalysis identifies a critical budgetB∗B^\{\*\}beyond which calibration may improve or worsen depending on the problem\. In practice,B∗B^\{\*\}is unknown and problem\-dependent\. A practical algorithm must decide*on the fly*when to stop allocating reasoning tokens\.
### 7\.2Formulation as an Optimal Stopping Problem
Let\(a^t,p^t\)\(\\hat\{a\}\_\{t\},\\hat\{p\}\_\{t\}\)denote the model’s answer and confidence afterttreasoning tokens\. Letα^t\\hat\{\\alpha\}\_\{t\}be an auxiliary estimate of the correctness probability at steptt\(e\.g\., from a lightweight verifier or from self\-consistency across multiple samples\)\. Define the*calibration gap*at steptt:
Δt=p^t−α^t\.\\Delta\_\{t\}=\\hat\{p\}\_\{t\}\-\\hat\{\\alpha\}\_\{t\}\.\(7\)
CABStopis a stopping ruleτ∗\\tau^\{\*\}defined as:
τ∗=min\{t:Δt\>δ\},\\tau^\{\*\}=\\min\\\{t:\\Delta\_\{t\}\>\\delta\\\},\(8\)whereδ\>0\\delta\>0is a calibration tolerance threshold\.
This is a*first\-passage stopping rule*over the stochastic process\(Δt\)t≥0\(\\Delta\_\{t\}\)\_\{t\\geq 0\}\. The rule halts reasoning when the model’s expressed confidence exceeds its estimated accuracy by more thanδ\\delta\.
Proposition 6\.1\(Optimality ofCABStopunder Monotone Confidence\)Ifp^t\\hat\{p\}\_\{t\}is non\-decreasing andα^t\\hat\{\\alpha\}\_\{t\}is approximately constant pastB∗B^\{\*\}, thenτ∗\\tau^\{\*\}minimizes expected ECE among all stopping rules of the formτ=inf\{t:f\(Δt\)\>c\}\\tau=\\inf\\\{t:f\(\\Delta\_\{t\}\)\>c\\\}for any non\-decreasingff\.
###### Proof sketch\.
The ECE at stopping timeτ\\tausatisfies
ECE\(τ\)≈\|p^τ−α^τ\|=\|Δτ\|\.\\mathrm\{ECE\}\(\\tau\)\\approx\|\\hat\{p\}\_\{\\tau\}\-\\hat\{\\alpha\}\_\{\\tau\}\|=\|\\Delta\_\{\\tau\}\|\.Under the stated monotonicity assumptions,Δt\\Delta\_\{t\}is non\-decreasing, so the first timeΔt\\Delta\_\{t\}crossesδ\\deltais also the time that minimizes the future expected ECE \(since pastτ∗\\tau^\{\*\},Δt≥δ\\Delta\_\{t\}\\geq\\deltaand ECE only worsens\)\. Among rules of the given form, the threshold rule with thresholdδ\\deltais optimal by the structure of first\-passage times\. ∎
### 7\.3Algorithm
Algorithm 1CABStop: Confidence\-Accuracy Budget Stopping1:Question
qq, calibration threshold
δ\\delta, check interval
ΔB\\Delta B, max budget
BmaxB\_\{\\max\}
2:Answer
a^\\hat\{a\}, confidence
p^\\hat\{p\}, stopping budget
τ∗\\tau^\{\*\}
3:
t←0t\\leftarrow 0,
R←ϵR\\leftarrow\\epsilon\(empty reasoning chain\)
4:while
t<Bmaxt<B\_\{\\max\}do
5:Generate
ΔB\\Delta Bmore reasoning tokens:
R←R∪Rt:t\+ΔBR\\leftarrow R\\cup R\_\{t:t\+\\Delta B\}
6:
t←t\+ΔBt\\leftarrow t\+\\Delta B
7:Extract candidate answer
a^t\\hat\{a\}\_\{t\}and confidence
p^t\\hat\{p\}\_\{t\}from
\(q,R\)\(q,R\)
8:Compute auxiliary accuracy estimate
α^t\\hat\{\\alpha\}\_\{t\}⊳\\trianglerighte\.g\., self\-consistency overkksamples
9:if
p^t−α^t\>δ\\hat\{p\}\_\{t\}\-\\hat\{\\alpha\}\_\{t\}\>\\deltathen
10:return
a^t\\hat\{a\}\_\{t\},
p^t\\hat\{p\}\_\{t\},
τ∗=t\\tau^\{\*\}=t
11:endif
12:endwhile
13:return
a^Bmax\\hat\{a\}\_\{B\_\{\\max\}\},
p^Bmax\\hat\{p\}\_\{B\_\{\\max\}\},
τ∗=Bmax\\tau^\{\*\}=B\_\{\\max\}
### 7\.4TikZ: CABStop Mechanism
Inputqqstart reasoningGenerateΔB\\Delta BtokensElicita^t\\hat\{a\}\_\{t\},p^t\\hat\{p\}\_\{t\}Estimateα^t\\hat\{\\alpha\}\_\{t\}p^t−α^t\>δ\\hat\{p\}\_\{t\}\-\\hat\{\\alpha\}\_\{t\}\>\\delta?Stop\.Returna^t\\hat\{a\}\_\{t\},p^t\\hat\{p\}\_\{t\}Continuereasoningt≥Bmaxt\\geq B\_\{\\max\}?Force stop\.Return answer\.yesnoyesnoFigure 4:CABStopcontrol flow\. At each reasoning checkpoint, the algorithm compares the model’s expressed confidencep^t\\hat\{p\}\_\{t\}with an auxiliary accuracy estimateα^t\\hat\{\\alpha\}\_\{t\}\. Reasoning halts when their gap exceeds the calibration thresholdδ\\delta\.
### 7\.5Discussion
#### Choice ofδ\\delta\.
In our experiments, OG\>0\.10\>0\.10consistently corresponds to what practitioners would consider problematic overconfidence\. We recommendδ=0\.10\\delta=0\.10as a starting point, which aligns with the conventional ECE threshold for “well\-calibrated” systems\. The appropriateδ\\deltais task\- and deployment\-dependent and should be tuned accordingly\.
#### Auxiliary accuracy estimation\.
The most straightforward implementation uses self\-consistency\[Wang et al\.,[2022](https://arxiv.org/html/2606.11211#bib.bib16)\]: generatekkindependent continuations from the current reasoning state and use the fraction agreeing witha^t\\hat\{a\}\_\{t\}asα^t\\hat\{\\alpha\}\_\{t\}\. Withk=5k=5, this adds modest compute overhead relative to the inference budget\.
#### Failure cases of CABStop\.
CABStopassumesα^t\\hat\{\\alpha\}\_\{t\}provides a meaningful signal about correctness\. This assumption fails in at least two important scenarios\.
First, when the model is*consistently wrong across samples*: if allkkself\-consistency samples agree on an incorrect answer,α^t\\hat\{\\alpha\}\_\{t\}will be high \(reflecting consistency\) even though accuracy is zero\. In this case,Δt=p^t−α^t\\Delta\_\{t\}=\\hat\{p\}\_\{t\}\-\\hat\{\\alpha\}\_\{t\}will be small andCABStopwill not halt early—precisely when it should\. This failure mode is most likely on questions in the high\-frequency trap categories \(counting, syllogism\), where the model may have a strong prior toward a specific wrong answer\.
Second, when confidence is not monotone intt: the optimality guarantee of Proposition 6\.1 requiresp^t\\hat\{p\}\_\{t\}to be non\-decreasing\. In practice, verbalized confidence may fluctuate, especially when the model partially reconsiders an intermediate step\. In this case, the stopping rule may trigger and release prematurely\.
Both failure cases underscore thatCABStopis a heuristic whose effectiveness depends on the quality ofα^t\\hat\{\\alpha\}\_\{t\}and the behavioral properties of the model\. Empirical validation of the algorithm on a held\-out question set is a necessary step before deployment\.
#### Practical recommendations\.
Based on our empirical observations, we offer the following guidance for practitioners choosing reasoning budgets for calibration\-sensitive applications\.
For tasks requiring precise discrete computation \(counting, set theory, modular arithmetic, combinatorics\), light reasoning budgets appear particularly risky: our results suggest they increase confidence without commensurate accuracy gains\. Either no reasoning \(for speed\) or heavy reasoning \(for accuracy\) is preferable to a brief chain\-of\-thought that may lock in a wrong answer\.
For tasks requiring multi\-step logical inference \(syllogisms, conditionals\), the risk of early lock\-in is high\. Self\-consistency checking—generating multiple independent chains and comparing answers—is a low\-cost intervention that can improve confidence calibration without increasing single\-chain budget\.
Monitoring the OG metric \(Conf¯−Acc¯\\overline\{\\mathrm\{Conf\}\}\-\\overline\{\\mathrm\{Acc\}\}\) on a held\-out calibration set, separately for each reasoning budget, is recommended for any deployment where calibration matters\. A model showing OG\>0\.20\>0\.20at a given budget should be treated with caution, even if its accuracy is acceptable\.
## 8Threats to Validity
We report the following threats transparently\. Some of these are discussed in detail in earlier sections; we consolidate them here for clarity\.
#### Small sample size\.
The dataset contains 47 trap questions, yielding approximately 8–10 valid responses per trap type per model per budget\. This limits the statistical power of all analyses, and particularly per\-category analyses\. ECE estimates are sensitive to outliers at this sample size\. Confidence intervals across seeds partially mitigate this, but the standard deviations reported in Table[3](https://arxiv.org/html/2606.11211#A3.T3)are large in several conditions, indicating that results should be interpreted as directional rather than precise\.
#### Validity filtering and selection bias\.
As discussed in Section[5\.3](https://arxiv.org/html/2606.11211#S5.SS3), the 42% validity rate may introduce systematic bias\. If models are more likely to provide parseable responses when confident, our ECE estimates will be biased toward overconfidence\. The direction of this bias is consistent with our main finding, meaning we cannot rule out the possibility that the observed overconfidence is partly an artifact of selection\.
#### Verbalized confidence limitations\.
As discussed in Section[5\.5](https://arxiv.org/html/2606.11211#S5.SS5), verbalized confidence is not equivalent to model\-internal epistemic uncertainty\. Results should be interpreted as measuring a behaviorally expressed property—verbalized confidence—rather than true uncertainty\.
#### Model family restriction\.
We evaluate only models from the Llama\-3 family\. Whethercdurmanifests similarly in GPT\-4o, Claude\-3\.5, Gemini, or other families is an open empirical question\. Based on our theoretical analysis, we expect the phenomenon to be present in any model with RLHF\-style training, but its severity may vary substantially\.
#### Prompt sensitivity\.
Budget levels are implemented via prompt engineering, which may interact with model\-specific instruction\-following behavior\. The specific wording of budget prompts could affect both reasoning quality and confidence elicitation, and we have not conducted ablations over prompt formulations\.
#### Incomplete 70B data\.
As discussed in Section[6\.3](https://arxiv.org/html/2606.11211#S6.SS3), the absence of multi\-budget data for the 70B model means that scale\-related conclusions are severely limited\. We do not draw conclusions about the scaling behavior ofcdur\.
## 9Discussion
### 9\.1When CDUR Should Be Expected
Thecdurphenomenon is not expected to manifest equally across all task types\. Based on our theoretical model and empirical results, calibration drift under reasoning is most likely for tasks that share the following properties\.
Discrete, exact\-answer structure\.Problems with a unique correct answer determined by precise discrete computation \(counting, modular arithmetic, combinatorics\) leave little room for the model to “almost be right\.” A wrong initial hypothesis is simply wrong, and reasoning that elaborates on it increases confidence without improving accuracy\. Our results confirm that these categories dominate the wrong\-and\-confident distribution\.
Non\-obvious traps\.Problems where a plausible\-seeming wrong answer exists \(e\.g\., anchoring, spurious pattern completion, base\-rate neglect\) are particularly vulnerable, because the initial hypothesis sampled by the model is likely to be the trap answer\. RLHF\-trained models, which are rewarded for fluent and confident responses, may be especially prone to sampling the most plausible\-seeming hypothesis rather than the correct one\.
Short reasoning chains insufficient for correction\.For problems requiring multiple precise logical steps to reach the answer, short reasoning chains may correctly identify the problem type but fail to execute all necessary steps, producing a confident partial answer\.
Conversely, for open\-ended generation tasks, subjective evaluation tasks, and tasks with many acceptable answers, the lock\-in mechanism is less relevant: there is no unique correcth0h\_\{0\}, and confidence\-accuracy alignment is harder to measure\.
### 9\.2Implications for Inference\-Time Scaling
Recent work proposes scaling inference\-time compute as a complementary axis to training\-time compute\[Snell et al\.,[2024](https://arxiv.org/html/2606.11211#bib.bib13)\]\. Our findings suggest a nuance: inference\-time scaling improves accuracy but may worsen calibration at intermediate budget levels\. A system that scales compute without monitoring calibration may present confidently wrong answers to users at the budget levels where the scaling is cheapest\.
TheCABStopalgorithm offers a concrete mechanism for calibration\-aware scaling: allocate more compute when confidence and estimated accuracy agree, halt when they diverge\.
### 9\.3Implications for Model Evaluation
Standard benchmarks measure accuracy\. Our results suggest that calibration should be measured alongside accuracy, particularly as a function of reasoning budget\. A model that achieves high accuracy at heavy reasoning but poor ECE at light reasoning may perform poorly in deployment scenarios where the full reasoning budget is not always available\.
We recommend that future model evaluations report\(Acc\(B\),ECE\(B\)\)\(\\mathrm\{Acc\}\(B\),\\mathrm\{ECE\}\(B\)\)curves across multiple budgets, not just peak accuracy\.
### 9\.4Theoretical Limitations
The Hypothesis Lock\-In Model \(Section[4](https://arxiv.org/html/2606.11211#S4)\) is a stylized approximation\. Real autoregressive LLMs can revise their hypothesis mid\-chain, especially when prompted with explicit revision instructions\. The propositions in Section[3](https://arxiv.org/html/2606.11211#S3)should be understood as characterizing a specific mechanistic regime—where the model’s context is dominated by an early committed hypothesis—rather than as universal laws governing all LLM reasoning\.
The commitment model also does not account for the structured attention patterns of Transformer architectures, temperature effects on hypothesis sampling, or the effect of instruction\-following fine\-tuning on the probability of self\-correction\. A more precise mechanistic account would require analysis at the level of attention weights and token probabilities, which is left to future work\.
### 9\.5Connections to Human Reasoning
Thecdurphenomenon has a human cognition analog:*post\-hoc rationalization*\[Haidt,[2001](https://arxiv.org/html/2606.11211#bib.bib6)\]\. Humans often form an initial intuitive judgment and then construct reasoning that justifies it, increasing their confidence without increasing its accuracy\. The commitment model in Section[4](https://arxiv.org/html/2606.11211#S4)formalizes this structure in the LLM setting\. The parallel suggests thatcdurmay be a general feature of systems trained to produce justified conclusions rather than accurate ones, extending beyond LLMs to any architecture with a generate\-then\-justify structure\.
## 10Future Work
#### Empirical replication at scale\.
The most pressing need is a larger dataset—at least 500 trap questions—to provide robust per\-category calibration estimates and to verify the non\-monotone ECE trajectory with statistical significance\.
#### Across model families\.
Testingcduron GPT\-4o, Claude\-3\.5, and Gemini models would determine whether the phenomenon is universal or Llama\-specific\. Completing the 70B evaluation across all budget conditions is a near\-term priority\.
#### Training interventions\.
RLHF may be modified to reward calibrated confidence rather than expressed confidence\. Calibration\-aware reward models—which penalize high verbalized confidence on wrong answers—are a natural next step\.
#### Better auxiliary estimators\.
CABStopdepends on the quality ofα^t\\hat\{\\alpha\}\_\{t\}\. Lightweight verifier models, reward models, or retrieval\-augmented consistency checks could improve the accuracy estimate without substantial compute overhead, and would address the consistent\-wrong\-answer failure case\.
#### Adaptive budgeting\.
A learned policy that maps \(question, current calibration gap\) to \(continue/stop\) would be a stronger version ofCABStop\. This could be framed as a reinforcement learning problem with a calibration\-aware reward, and would avoid the need for a hand\-tuned thresholdδ\\delta\.
#### Formal lower bounds\.
Can we prove a lower bound on the overconfidence gap under the commitment model for specific problem classes? This would provide a theoretical floor on achievable calibration and would clarify when no amount of budget tuning can resolve the overconfidence problem\.
#### Per\-sample calibration analysis\.
As discussed in Section[6\.5](https://arxiv.org/html/2606.11211#S6.SS5), future work should report per\-sample confidence\-accuracy distributions in addition to aggregate ECE, to distinguish between uniform miscalibration and concentrated dangerous overconfidence\.
## 11Conclusion
We have introducedCalibration Drift Under Reasoning\(cdur\): the phenomenon whereby increasing reasoning budget may first improve and then worsen model calibration, producing non\-monotone dynamics in theECE\(B\)\\mathrm\{ECE\}\(B\)curve\. We have provided:
1. \(1\)A formal definition ofcdurand three propositions characterizing it under a probabilistic Hypothesis Lock\-In Model, establishing theoretical grounding for the phenomenon under an explicit mechanistic abstraction\.
2. \(2\)Empirical evidence from Llama\-3\.1\-8B on 47 reasoning\-trap questions spanning 21 cognitive failure modes, showing non\-monotone ECE dynamics and persistent overconfidence gaps exceeding 0\.25 across all budget levels\. We report limited Llama\-3\.3\-70B results \(no\-reasoning only\) and explicitly acknowledge thatcdurdynamics at this scale remain unconfirmed\.
3. \(3\)TheCABStopalgorithm—a calibration\-aware stopping rule grounded in optimal stopping theory—which halts reasoning when the confidence\-accuracy gap exceeds a threshold\. We also characterize its failure cases and discuss practical guidance for threshold selection\.
We have been explicit about the methodological limitations of this study: the small dataset, the 42% validity rate and its potential for selection bias, the imprecision of verbalized confidence as an uncertainty signal, and the incomplete multi\-scale evaluation\. These limitations do not negate the value of the theoretical framework or the empirical observations for the 8B model, but they do call for the results to be treated as preliminary evidence motivating further investigation rather than as settled empirical findings\.
The central message is not that more reasoning is always worse for calibration, but that the relationship between reasoning depth and calibration quality is not monotone, and that this non\-monotonicity can be dangerous in practice\. As inference\-time scaling becomes a mainstream technique, calibration monitoring—ideally combined with adaptive stopping rules—deserves attention as a first\-class concern alongside accuracy optimization\.
## References
- Chow and Robbins \[1961\]Chow, Y\. S\., and Robbins, H\. \(1961\)\. On optimal stopping rules\.*Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete*, 2\(1\):33–49\.
- De Groot and Fienberg \[1983\]De Groot, M\. H\. and Fienberg, S\. E\. \(1983\)\. The comparison and evaluation of forecasters\.*Journal of the Royal Statistical Society: Series D \(The Statistician\)*, 32\(1\-2\):12–22\.
- Desai and Durrett \[2020\]Desai, S\. and Durrett, G\. \(2020\)\. Calibration of pre\-trained transformers\.*Proceedings of EMNLP*\.
- Graves \[2016\]Graves, A\. \(2016\)\. Adaptive computation time for recurrent neural networks\.*arXiv preprint arXiv:1603\.08983*\.
- Guo et al\. \[2017\]Guo, C\., Pleiss, G\., Sun, Y\., and Weinberger, K\. Q\. \(2017\)\. On calibration of modern neural networks\.*Proceedings of ICML*\.
- Haidt \[2001\]Haidt, J\. \(2001\)\. The emotional dog and its rational tail: A social intuitionist approach to moral judgment\.*Psychological Review*, 108\(4\):814\.
- Kadavath et al\. \[2022\]Kadavath, S\., Conerly, T\., Askell, A\., Henighan, T\., Drain, D\., Perez, E\., Schiefer, N\., Hatfield\-Dodds, Z\., DasSarma, N\., Tran\-Johnson, T\., et al\. \(2022\)\. Language models \(mostly\) know what they know\.*arXiv preprint arXiv:2207\.05221*\.
- Kojima et al\. \[2022\]Kojima, T\., Gu, S\. S\., Reid, M\., Matsuo, Y\., and Iwasawa, Y\. \(2022\)\. Large language models are zero\-shot reasoners\.*Advances in Neural Information Processing Systems*, 35\.
- Lightman et al\. \[2023\]Lightman, H\., Kosaraju, V\., Burda, Y\., Edwards, H\., Baker, B\., Lee, T\., Leike, J\., Schulman, J\., Sutskever, I\., and Cobbe, K\. \(2023\)\. Let’s verify step by step\.*arXiv preprint arXiv:2305\.20050*\.
- Muennighoff et al\. \[2025\]Muennighoff, N\., et al\. \(2025\)\. Scaling LLM test\-time compute optimally can be more effective than scaling model parameters\.*arXiv preprint arXiv:2408\.03314*\.
- Murphy \[1977\]Murphy, A\. H\. \(1977\)\. The value of climatological, categorical and probabilistic forecasts in the cost\-loss ratio situation\.*Monthly Weather Review*, 105\(7\):803–816\.
- Ouyang et al\. \[2022\]Ouyang, L\., Wu, J\., Jiang, X\., Almeida, D\., Wainwright, C\., Mishkin, P\., Zhang, C\., Agarwal, S\., Slama, K\., Ray, A\., et al\. \(2022\)\. Training language models to follow instructions with human feedback\.*Advances in Neural Information Processing Systems*, 35\.
- Snell et al\. \[2024\]Snell, C\., Lee, J\., Xu, K\., and Kumar, A\. \(2024\)\. Scaling LLM test\-time compute optimally can be more effective than scaling model parameters\.*arXiv preprint arXiv:2408\.03314*\.
- Turpin et al\. \[2023\]Turpin, M\., Michael, J\., Perez, E\., and Bowman, S\. R\. \(2023\)\. Language models don’t always say what they think: Unfaithful explanations in chain\-of\-thought prompting\.*arXiv preprint arXiv:2305\.04388*\.
- Wald \[1947\]Wald, A\. \(1947\)\.*Sequential Analysis*\. Wiley, New York\.
- Wang et al\. \[2022\]Wang, X\., Wei, J\., Schuurmans, D\., Le, Q\., Chi, E\., Narang, S\., Chowdhery, A\., and Zhou, D\. \(2022\)\. Self\-consistency improves chain of thought reasoning in language models\.*arXiv preprint arXiv:2203\.11171*\.
- Wei et al\. \[2022\]Wei, J\., Wang, X\., Schuurmans, D\., Bosma, M\., Ichter, B\., Xia, F\., Chi, E\., Le, Q\., and Zhou, D\. \(2022\)\. Chain\-of\-thought prompting elicits reasoning in large language models\.*Advances in Neural Information Processing Systems*, 35\.
- Xiong et al\. \[2024\]Xiong, M\., Hu, Z\., Lu, X\., Li, Y\., Fu, J\., He, J\., and Hooi, B\. \(2024\)\. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs\.*Proceedings of ICLR*\.
- Zhou et al\. \[2023\]Zhou, W\., Sha, Z\., Zhang, Q\., Gong, W\., Shan, B\., Yang, L\., He, X\., and Liu, B\. \(2023\)\. Navigating the grey area: How expressions of uncertainty and overconfidence affect language models\.*arXiv preprint arXiv:2302\.13439*\.
## Appendix AProof Details
### A\.1Formal Consistency Score
We define the consistency score𝒞\\mathcal\{C\}used in Proposition 3\.4 as follows\. Let𝒯\\mathcal\{T\}be the vocabulary andVh0⊂𝒯V\_\{h\_\{0\}\}\\subset\\mathcal\{T\}be the set of tokens that reinforce hypothesish0h\_\{0\}\(e\.g\., tokens that share lexical content withh0h\_\{0\}or that appear in reasoning chains that conclude withh0h\_\{0\}, as measured by a reference corpus\)\. Then:
𝒞\(R1,…,Rt;h0\)=1t∑i=1t𝟏\[Ri∈Vh0\]\.\\mathcal\{C\}\(R\_\{1\},\\ldots,R\_\{t\};h\_\{0\}\)=\\frac\{1\}\{t\}\\sum\_\{i=1\}^\{t\}\\mathbf\{1\}\[R\_\{i\}\\in V\_\{h\_\{0\}\}\]\.This is a running fraction, and𝔼\[𝒞\]=ℙ\[Ri∈Vh0\|h0\]=:ρ\(h0\)∈\(0,1\)\\mathbb\{E\}\[\\mathcal\{C\}\]=\\mathbb\{P\}\[R\_\{i\}\\in V\_\{h\_\{0\}\}\\,\|\\,h\_\{0\}\]=:\\rho\(h\_\{0\}\)\\in\(0,1\)by assumption\.
Under the commitment model,ℙ\[Ri∈Vh0\|h0,R1,…,Ri−1\]\\mathbb\{P\}\[R\_\{i\}\\in V\_\{h\_\{0\}\}\\,\|\\,h\_\{0\},R\_\{1\},\\ldots,R\_\{i\-1\}\]is non\-decreasing inii\(since the context increasingly supportsh0h\_\{0\}\), so𝒞\\mathcal\{C\}is a submartingale with positive drift for allt≥1t\\geq 1\. Hence𝔼\[𝒞\(R1,…,Rt;h0\)\]\\mathbb\{E\}\[\\mathcal\{C\}\(R\_\{1\},\\ldots,R\_\{t\};h\_\{0\}\)\]is non\-decreasing intt\.
### A\.2Discussion of Assumption Robustness
The key assumption is thatPθ\(Ri∈Vh0\|h0,R<i\)P\_\{\\theta\}\(R\_\{i\}\\in V\_\{h\_\{0\}\}\\,\|\\,h\_\{0\},R\_\{<i\}\)is non\-decreasing inii\. This holds under self\-attention architectures when:
1. 1\.h0h\_\{0\}appears in the first few tokens of the reasoning chain and thus in the attention window\.
2. 2\.The model has been trained to produce coherent, self\-consistent outputs \(as encouraged by RLHF\)\.
Both conditions hold approximately for current instruction\-tuned LLMs\. We do not claim they hold universally, and the assumption may be violated when the model encounters strong contradictory evidence mid\-chain or when explicit revision instructions are provided\.
## Appendix BDataset: Trap Question Examples
We provide representative examples from four high\-frequency trap categories\.
#### Counting \(highest frequency\)\.
Question:“A frog climbs 3 meters up a 10\-meter wall each day and slides back 2 meters each night\. How many days to reach the top?”Expected:8\.Common wrong answer:10\.Trap:On day 8, the frog reaches 10m during the day before sliding back; students often compute 10/1 = 10 naively\.
#### Set theory\.
Question:“In a class of 30, 18 play football, 15 play cricket, and 5 play both\. How many play neither?”Expected:2\.Common wrong answer:3\.Trap:Inclusion\-exclusion requires\|F∪C\|=18\+15−5=28\|F\\cup C\|=18\+15\-5=28; neither =30−28=230\-28=2\.
#### Syllogism\.
Question:“All A are B\. All B are C\. Is it true that all C are A?”Expected:No\.Common wrong answer:Yes\.Trap:The converse of a universal statement does not follow\.
#### Probability\.
Question:“A box has 2 red and 2 blue balls\. You draw 2 without replacement\. What is the probability both are red?”Expected:1/61/6\.Common wrong answer:1/41/4\.Trap:The draws are dependent;P=24⋅13=16P=\\frac\{2\}\{4\}\\cdot\\frac\{1\}\{3\}=\\frac\{1\}\{6\}\.
## Appendix CExperimental Logs \(Summary\)
Table[3](https://arxiv.org/html/2606.11211#A3.T3)reproduces the full summary statistics from the experimental runs\. Standard deviations are computed across 3 seeds; the 70B model was not evaluated at multi\-budget conditions, and those entries are absent\.
Table 3:Full results: ECE \(mean±\\pmstd across 3 seeds\), overconfidence gap, and accuracy for trap questions\. The large standard deviation at 8B medium budget reflects high seed\-to\-seed variability and should be interpreted cautiously\. 70B results are available only at the no\-reasoning condition\.Similar Articles
Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
This paper presents a comprehensive empirical evaluation of how large language models handle corruptions in chain-of-thought reasoning steps, testing 13 models across 5 perturbation types (MathError, UnitConversion, Sycophancy, SkippedSteps, ExtraSteps) on mathematical reasoning tasks. The findings reveal heterogeneous vulnerability patterns with implications for deploying LLMs in multi-stage reasoning pipelines.
Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models
This paper tests whether varying inference-time reasoning effort affects the alignment between large reasoning models' chain-of-thought lengths and human reaction times. Results show alignment is invariant to effort perturbations, suggesting it is a training-time achievement.
Large Language Models Are Overconfident in Their Own Responses
This paper investigates why instruction-tuned LLMs are overconfident in their own responses, identifying an 'ownership bias' that gives higher confidence to self-generated answers. It proposes a simple inference-time strategy to reframe the model's answer as user input, improving calibration by up to 26% without retraining.
Reasoning Can Be Restored by Correcting a Few Decision Tokens
This paper shows that the reasoning gap between base LLMs and large reasoning models is concentrated on a small set of early planning tokens. It introduces disagreement-guided token intervention, where replacing only those critical tokens with a reasoning model's outputs allows a base model to nearly match the reasoning model's performance.
Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models
This paper revisits the reliability paradox in the context of machine unlearning for language models, demonstrating that models can achieve low calibration error while relying on shortcut-based decision rules, thereby extending the paradox to unlearned models.