How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

arXiv cs.CL Papers

Summary

This paper characterizes two distinct processes by which language models fail in reasoning—committed failure and persistent uncertainty—using token-level uncertainty signals, and demonstrates implications for self-consistency and failure detection strategies.

arXiv:2606.06635v1 Announce Type: new Abstract: Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two empirically distinguishable processes. The first is committed failure, in which a model locks onto an incorrect reasoning path early in its trace. A central diagnostic signature is the commitment point, beyond which considering additional tokens hurts rather than helps failure detection. In the second, persistent uncertainty, uncertainty instead accumulates throughout, and the full trace is needed to best distinguish failing from successful completions. These signatures reproduce across 23 model-dataset configurations, with the framework's falsifiable predictions holding in 20 of 23 cases, well above chance across both failure modes. Finally, we demonstrate our failure mode framework has direct implications for self-consistency, identifying when uncertainty signals complement it and when it can be selectively skipped. These results offer a foundation for understanding when LLM reasoning failures become detectable and for adapting detection strategies accordingly.

Cached at: 06/08/26, 09:19 AM

# How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures
Source: [https://arxiv.org/html/2606.06635](https://arxiv.org/html/2606.06635)
Tanvi Thoria1, Kiana Jafari2, Marc R\. Schlichting2, Mykel J\. Kochenderfer1,2

1 Department of Computer Science, Stanford University; 2 Department of Aeronautics and Astronautics, Stanford University\. Correspondence: [mykel@stanford\.edu](mailto:mykel@stanford.edu)

###### Abstract

Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace\. We characterize these failures using token\-level uncertainty signals, finding they arise through two empirically distinguishable processes\. The first is committed failure, in which a model locks onto an incorrect reasoning path early in its trace\. A central diagnostic signature is the commitment point, beyond which considering additional tokens hurts rather than helps failure detection\. In the second, persistent uncertainty, uncertainty instead accumulates throughout, and the full trace is needed to best distinguish failing from successful completions\. These signatures reproduce across 23 model\-dataset configurations, with the framework’s falsifiable predictions holding in 20 of 23 cases, well above chance across both failure modes\. Finally, we demonstrate our failure mode framework has direct implications for self\-consistency, identifying when uncertainty signals complement it and when it can be selectively skipped\. These results offer a foundation for understanding when LLM reasoning failures become detectable and for adapting detection strategies accordingly\.

## 1 Introduction

\[Figure 1 diagram\. Panel 1, Token Distribution: the distribution $P(x)$ at positions of the generated token sequence\. Panel 2, Temporal Signal: monitored signals \(Entropy, Margin, NLL, Nucleus, or Near\-Tie\) aggregated over prefix windows of size $T$\. Case A, Committed Failure: $\mathrm{PR\text{-}AUC}(T)$ against prefix window size $T$, marking the full\-trace baseline, the commitment point $T^{*}$, and an early peak\. Case B, Persistent Uncertainty: $\mathrm{PR\text{-}AUC}(T)$ against prefix window size $T$, marking the full\-trace baseline\.\]

Figure 1: Our framework computes token\-level uncertainty signals over prefixes of an LLM reasoning trace to diagnose how and when the model fails\.

Detecting when language models will fail on complex reasoning tasks is an ongoing challenge with immediate ramifications for deployment reliability\. Existing approaches to failure detection, such as self\-consistency Wang et al\. \([2023](https://arxiv.org/html/2606.06635#bib.bib3)\) and uncertainty quantification Kadavath et al\. \([2022](https://arxiv.org/html/2606.06635#bib.bib13)\); Farquhar et al\. \([2024](https://arxiv.org/html/2606.06635#bib.bib20)\), treat failure as a binary prediction task\. These methods can be effective at detecting when a model may fail, but they do not characterize the process through which failure emerges\. We propose that this process is not singular, and treating it as such limits our understanding of how models fail and our ability to respond appropriately\.

If reasoning failures develop through different processes, then a single detection strategy cannot be optimal across all cases\. Consider a model that commits to a wrong approach before its reasoning trace concludes and reproduces it consistently across completions\. In this case, self\-consistency will incorrectly confirm the wrong answer with high confidence, and additional sampling cannot recover the failure signal\. Conversely, for a model that remains genuinely uncertain throughout its reasoning, aggregating across completions would be the correct approach\. These two situations require different detection strategies, yet existing methods apply the same approach regardless\. Characterizing the process through which failures develop, instead of treating failure as a binary outcome, is a prerequisite for building detection methods that can adapt accordingly\.

Characterizing how failures manifest requires observing the model’s reasoning process, not just its outcome\. Recent work has made progress on this through mechanistic approaches such as probing internal activations to show incorrect answers are decodable before they are expressed Boppana et al\. \([2026](https://arxiv.org/html/2606.06635#bib.bib14)\), distorting reasoning steps to identify causal influence on final answers Ye et al\. \([2026](https://arxiv.org/html/2606.06635#bib.bib24)\), and intervening on model representations to show early commitment restricts the effectiveness of corrections Zur et al\. \([2025](https://arxiv.org/html/2606.06635#bib.bib16)\)\. These methods reveal structure in how models fail, but they require access to model weights and internal representations\. This rules them out for closed\-API frontier models such as GPT\-4o and Gemini, where only output tokens are accessible\. Even without access to the model internals, the same failure structure should be observable from external metrics such as token\-level signals\.

We propose a framework that characterizes reasoning failures through token\-level uncertainty signals on chain\-of\-thought traces, requiring only log probabilities from a single completion\. The framework identifies two failure modes, committed failure and persistent uncertainty, each with a distinct uncertainty trajectory across the trace\. Across 23 model\-dataset configurations spanning five model families and four reasoning domains, the framework’s falsifiable predictions hold in 20 of 23 cases\. Our framework requires moderate failure rates: extremes yield unreliable classification signals, and closed\-API constraints limit available log probabilities\.

This paper makes the following contributions\. \(1\) We propose that token\-level uncertainty signals predict two distinct failure modes, classifying how LLM reasoning failures manifest\. We empirically validate that these failure modes are falsifiable and reproducible across diverse models and tasks\. \(2\) We identify the commitment point: the position in a reasoning trace at which token\-level uncertainty is maximally predictive of failure, marking where the model locks onto a reasoning path\. \(3\) We outline practical consequences of our failure framework, showing that failure mode characterization can predict when self\-consistency is effective and when single\-completion uncertainty features provide complementary signals\.

## 2 Related Work

#### Uncertainty estimation in LLMs\.

LLMs are well\-calibrated on multiple\-choice tasks and can estimate the probability that their own answers are correct Kadavath et al\. \([2022](https://arxiv.org/html/2606.06635#bib.bib13)\)\. Semantic entropy clusters generations by meaning rather than surface form to produce an uncertainty measure for hallucination detection, at the cost of five to ten generations per query Farquhar et al\. \([2024](https://arxiv.org/html/2606.06635#bib.bib20)\)\. Alignment tuning has been shown to sharpen output distributions, with the Branching Factor reducing by a factor of two to five and up to an order of magnitude at the earliest positions Yang et al\. \([2025](https://arxiv.org/html/2606.06635#bib.bib5)\)\. These methods treat uncertainty either at the answer level or as a static property of the output distribution\. We instead study how token\-level uncertainty signals evolve along a reasoning trace and show that their predictive power is non\-uniform\.

#### Self\-consistency as a failure\-detection baseline\.

The dominant baseline for verifying LLM reasoning is self\-consistency: sampling multiple chains of thought and taking the majority\-vote answer Wang et al\. \([2023](https://arxiv.org/html/2606.06635#bib.bib3)\); semantic entropy similarly requires repeated sampling Farquhar et al\. \([2024](https://arxiv.org/html/2606.06635#bib.bib20)\)\. These multi\-completion methods are effective when model uncertainty surfaces as inter\-sample disagreement, but they are structurally blind to the committed\-failure regime we identify: when a model has committed to an incorrect reasoning path early in its trace, it produces the same wrong answer consistently across completions, and self\-consistency cannot distinguish these cases from genuinely correct ones\. Our token\-level uncertainty signals are complementary to self\-consistency and operate from a single completion\.

#### CoT faithfulness\.

Chain\-of\-thought prompting Wei et al\. \([2022](https://arxiv.org/html/2606.06635#bib.bib1)\); Kojima et al\. \([2022](https://arxiv.org/html/2606.06635#bib.bib19)\) elicits step\-by\-step reasoning traces and substantially improves multi\-step performance\. The relationship between visible CoT and the model’s internal computation is contested Lanham et al\. \([2023](https://arxiv.org/html/2606.06635#bib.bib21)\); Turpin et al\. \([2023](https://arxiv.org/html/2606.06635#bib.bib22)\); Young \([2026](https://arxiv.org/html/2606.06635#bib.bib23)\)\.

We analyze CoT traces produced under standard zero\-shot prompting; whether these traces faithfully reflect internal computation is orthogonal to our empirical claims, which concern structure observable in the visible trace\.

#### Trace\-level structure\.

Recent works have characterized failure\-relevant structure in reasoning traces\. Trace length itself is a confidence estimator whose relationship to accuracy is altered by reasoning post\-training Devic et al\. \([2025](https://arxiv.org/html/2606.06635#bib.bib2)\), and the correlation between CoT length and problem complexity is brittle, arising from approximate recall of the training distribution rather than adaptive computation Palod et al\. \([2025](https://arxiv.org/html/2606.06635#bib.bib18)\)\. At the step level, the shape of the entropy trajectory across reasoning steps has been argued to be more diagnostic than its scalar magnitude Zhao \([2026](https://arxiv.org/html/2606.06635#bib.bib17)\)\. We differ in two ways\. First, we explicitly control for the length confound through a pre\-final analysis that strips tokens after the answer marker\. Second, we operate at the token level over cumulative prefix windows rather than at the step level, finding that magnitude features carry more predictive signal than shape alone\.

#### Concurrent work on early commitment\.

A concurrent thread has established early commitment as a recognized phenomenon in LLM reasoning through several methodological lenses\. Activation probing on large reasoning models reveals that final answers are decodable from internal activations well before verbalization Boppana et al\. \([2026](https://arxiv.org/html/2606.06635#bib.bib14)\); counterfactual corruption identifies a reasoning horizon at $70-85\%$ of chain length Ye et al\. \([2026](https://arxiv.org/html/2606.06635#bib.bib24)\); resampling identifies forking tokens with non\-uniform importance Bigelow et al\. \([2025](https://arxiv.org/html/2606.06635#bib.bib15)\); activation interventions are most effective before commitment Zur et al\. \([2025](https://arxiv.org/html/2606.06635#bib.bib16)\)\. We complement this thread along three axes\. First, we operate on token\-level uncertainty signals extractable from logprobs alone, without requiring model weights, counterfactual interventions, or repeated sampling\. This approach makes the method deployable on closed\-API models within logprob constraints, as we demonstrate on GPT\-4o and Gemini\-2\.5\-Pro\. Second, we characterize two qualitatively distinct failure modes, committed and persistent, with bidirectional statistical validation across 23 \(model, dataset\) configurations\. Third, we extend the analysis to standard inference\-mode CoT models, complementing the reasoning\-model focus of prior work in this thread\.

## 3 Methods

A language model’s chain\-of\-thought reveals how it produced its final answer and, in well\-calibrated cases, should be informative of whether that answer is incorrect \(Figure [1](https://arxiv.org/html/2606.06635#S1.F1)\)\. We analyze the token\-level uncertainty signals across reasoning traces to characterize the structure of model failures\.

### 3\.1 Failure Modes in LLM Reasoning

We define model failure as any trace in which a model’s final extracted answer is incorrect\. If the structure of a model’s reasoning determines eventual failure, then the progression of token\-level signals across that trace should be characteristic of how and when failure occurs\.

We propose that this progression takes one of two qualitatively different forms\. In the first failure mode, *committed failure*, the model locks onto an incorrect reasoning path early in its trace\. Failure becomes apparent early in the model’s reasoning, and its uncertainty signals are most informative over a prefix of the trace rather than the full sequence\. In the second, *persistent uncertainty*, the model never commits to a reasoning path\. Uncertainty builds monotonically throughout the trace, and a complete reasoning path is required to distinguish failed from successful traces\. These two modes produce qualitatively different signatures in how uncertainty progresses across a reasoning trace, which we will formalize and test empirically\.

### 3\.2 Commitment Point

If a model locks onto a reasoning path early, there likely is a token position where this is observable\. We define this position as the commitment point: the point in a reasoning trace at which the uncertainty signals are most informative of model failure\.

Beyond the commitment point, the model has already selected a reasoning path, and subsequent uncertainty is downstream noise rather than signal about the eventual outcome\. In the persistent uncertainty regime, no such commitment point exists as predictive power increases monotonically, and the full trace remains more informative than any prefix\.

### 3\.3 Uncertainty Features

If a model has locked onto a reasoning path, its token distribution should reflect its diminished uncertainty as the model is no longer exploring multiple paths\. To reveal these failure patterns, we compute the following signals over prefixes of the reasoning trace, which we formalize below as early windows\.

Let $p^{(t)} = (p^{(t)}_{1}, p^{(t)}_{2}, \ldots)$ denote the token probability distribution at position $t$, and let $p^{(t)}_{(1)} \geq p^{(t)}_{(2)} \geq \cdots$ be the sorted probabilities\. For a reasoning trace of length $L$, we define the early window $\mathcal{W}_{T} = \{1, \ldots, \min(T, L)\}$ and compute the following uncertainty signals at each token position $t$\.

Entropy: spread of the top\-$K$ distribution Kadavath et al\. \([2022](https://arxiv.org/html/2606.06635#bib.bib13)\): $\mathcal{H}_{t} = -\sum_{i} p^{(t)}_{i} \log p^{(t)}_{i}$

Margin: difference between top two probabilities Scheffer et al\. \([2001](https://arxiv.org/html/2606.06635#bib.bib25)\): $\mathcal{M}_{t} = p^{(t)}_{(1)} - p^{(t)}_{(2)}$

NLL: confidence in the top token: $\mathcal{L}_{t} = -\log p^{(t)}_{(1)}$

Nucleus: tokens needed to capture a probability threshold of $0.9$ Holtzman et al\. \([2020](https://arxiv.org/html/2606.06635#bib.bib26)\): $\mathcal{N}_{t} = \min\{k : \sum_{i=1}^{k} p^{(t)}_{(i)} \geq 0.9\}$

Near\-Tie: fraction of top\-$K$ within $90\%$ of $p^{(t)}_{(1)}$: $\mathcal{T}_{t} = \frac{1}{K} \sum_{i=1}^{K} \mathbf{1}[p^{(t)}_{(i)} \geq 0.9 \cdot p^{(t)}_{(1)}]$

For each signal $s_{t} \in \{\mathcal{H}_{t}, \mathcal{M}_{t}, \mathcal{L}_{t}, \mathcal{N}_{t}, \mathcal{T}_{t}\}$ and window $\mathcal{W}_{T}$, we compute both the mean and maximum features:

$$\bar{s}_{T} = \frac{1}{|\mathcal{W}_{T}|} \sum_{t \in \mathcal{W}_{T}} s_{t} \qquad \text{and} \qquad s^{\max}_{T} = \max_{t \in \mathcal{W}_{T}} s_{t}$$

Mean features capture average uncertainty across the early window and the maximum features capture the peak uncertainty event\. These ten aggregated features make up the input to our classifier\.
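The five signals and their mean and maximum aggregates can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the input is assumed to be a list of top-K log probabilities per token position, the top-K probabilities are used without renormalization, and the nucleus count falls back to K when the available mass never reaches the threshold.

```python
import math

SIGNALS = ("entropy", "margin", "nll", "nucleus", "near_tie")

def token_signals(logprobs, nucleus_p=0.9, tie_ratio=0.9):
    """Five per-token signals from one position's top-K log probabilities."""
    # Sorted probabilities p_(1) >= p_(2) >= ...
    probs = sorted((math.exp(lp) for lp in logprobs), reverse=True)
    k = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    margin = probs[0] - (probs[1] if k > 1 else 0.0)
    nll = -math.log(probs[0])
    # Nucleus: smallest count of top tokens whose cumulative mass reaches nucleus_p.
    # Assumption: if the top-K mass never reaches the threshold, report K.
    cumulative, nucleus = 0.0, k
    for i, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= nucleus_p:
            nucleus = i
            break
    # Near-tie: fraction of the top-K within tie_ratio of the top probability.
    near_tie = sum(1 for p in probs if p >= tie_ratio * probs[0]) / k
    return {"entropy": entropy, "margin": margin, "nll": nll,
            "nucleus": float(nucleus), "near_tie": near_tie}

def window_features(trace_logprobs, T):
    """Mean and max of each signal over the early window {1, ..., min(T, L)}.

    Assumes a non-empty trace.
    """
    window = trace_logprobs[:min(T, len(trace_logprobs))]
    per_token = [token_signals(lp) for lp in window]
    features = {}
    for name in SIGNALS:
        values = [s[name] for s in per_token]
        features[f"{name}_mean"] = sum(values) / len(values)
        features[f"{name}_max"] = max(values)
    return features
```

Calling `window_features(trace, T)` for each tested `T` on the same trace yields the ten features per window that the text describes as classifier input.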

### 3\.4 Diagnosing Failure Modes with PR\-AUC

To quantify failure mode detection, we compute PR\-AUC, a measure of a classifier’s ability to identify failures, at each early window and examine the shape of the resulting curve\. We propose that each failure mode will differ along the following axes: the shape of the PR\-AUC curve across window sizes, whether a commitment point $T^{*}$ exists, and whether the 95% bootstrap confidence interval shows the early window is more informative than the full trace\.

The committed failure mode should reveal an inverted\-U shape in its PR\-AUC plot\. As the early window expands, the predictive power of the uncertainty signals should grow as well\. The PR\-AUC reaches its peak at the commitment point, $T^{*}$, and subsequent early windows decline in their predictive performance, producing an inverted\-U curve\. We further categorize committed failures into *strong committed* and *weak committed* based on the bootstrap confidence interval on $\Delta(T^{*}) = \mathrm{PR\text{-}AUC}(T^{*}) - \mathrm{PR\text{-}AUC}(\mathrm{full})$, the difference between the peak early\-window PR\-AUC and the full\-trace PR\-AUC\. Strong committed failures have a 95% confidence interval on $\Delta(T^{*})$ that excludes zero, confirming that the early window is strictly more informative than the full trace\. Weak committed failures have $\Delta(T^{*}) > 0$: the early windows still outperform the full trace in expectation, but confidence intervals span zero due to statistical uncertainty\.

In contrast, the PR\-AUC of persistent uncertainty regimes rises monotonically\. The PR\-AUC curve steadily increases and never exceeds that of the full trace because the model never selects a path and each subsequent token continues to add genuine signal\. Formally, $\Delta(T) < 0$ for all tested windows $T$, revealing that no early window can recover the full information available in the complete trace\. This shows the absence of a commitment event; there is no position along a reasoning trace that has concentrated commitment power\.
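As a rough illustration of the two predicted curve shapes, the sketch below labels a configuration from point estimates alone. The function names and the numbers in the usage note are hypothetical; the sketch ignores confidence intervals and does not verify the inverted-U or monotone shape, and it treats a maximum difference of exactly zero as persistent, which the text does not specify.

```python
def delta_curve(pr_auc_by_window, pr_auc_full):
    """Delta(T) = PR-AUC(T) - PR-AUC(full) for each tested window T."""
    return {T: value - pr_auc_full for T, value in pr_auc_by_window.items()}

def curve_shape(pr_auc_by_window, pr_auc_full):
    """Label a curve from point estimates only (no bootstrap intervals)."""
    deltas = delta_curve(pr_auc_by_window, pr_auc_full)
    # Candidate commitment point: the window with the largest Delta(T).
    t_star = max(deltas, key=deltas.get)
    if deltas[t_star] > 0:
        return "committed", t_star
    # Assumption: Delta(T) <= 0 everywhere is reported as persistent.
    return "persistent", None
```

With made-up values, `curve_shape({128: 0.40, 256: 0.55, 512: 0.62, 1024: 0.50}, 0.48)` selects window 512 as the candidate commitment point, while a curve that stays below the full-trace value at every window is labeled persistent.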

### 3\.5 Pre\-final Analysis

The length of a model’s reasoning trace often correlates with the correctness of its final answer \(Devic et al\., [2025](https://arxiv.org/html/2606.06635#bib.bib2)\)\. Failing models tend to write longer traces, inflating full\-trace uncertainty features by proxying trace length rather than capturing genuine uncertainty\. To account for this potential length confound, we strip all tokens that occur after the final answer marker in the reasoning trace before computing the uncertainty signals, an approach we denote as pre\-final analysis\. In configurations where failing models produce substantially longer traces, a length confound could inflate the full\-trace PR\-AUC curve, masking an underlying inverted\-U signature\. The pre\-final analysis is designed to prevent this by controlling for post\-answer token length within each configuration\.
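One way to picture the stripping step is the sketch below, which truncates a token list at the last occurrence of an answer marker. It rests on assumptions the text does not settle: the function name is hypothetical, the marker is taken to be the literal string `\boxed`, tokens are kept through the end of the marker (so the answer content that follows is dropped), and traces without a marker return `None` rather than being analyzed.

```python
def strip_after_final_marker(tokens, marker="\\boxed"):
    """Keep tokens up to the end of the last marker occurrence; None if absent."""
    text = "".join(tokens)
    start = text.rfind(marker)
    if start == -1:
        return None  # assumption: no marker means the trace is not pre-final valid
    end = start + len(marker)
    kept, offset = [], 0
    for tok in tokens:
        if offset >= end:
            break  # this token starts after the marker has finished
        kept.append(tok)
        offset += len(tok)
    return kept
```

Uncertainty features would then be computed on the returned tokens (and the matching log probabilities) instead of the full trace.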

### 3\.6 Connection to Self\-Consistency

The framework’s failure\-mode classification has direct implications for self\-consistency \(Wang et al\., [2023](https://arxiv.org/html/2606.06635#bib.bib3)\), which samples $k$ completions and uses majority\-vote agreement as a confidence signal\. In the committed regime, a model can reproduce the same wrong answer consistently across completions, making agreement rate an unreliable signal; single\-completion uncertainty captures within\-completion structure that agreement cannot observe\. In the persistent regime, failure signatures surface as cross\-completion disagreement, and self\-consistency aggregation is genuinely informative\. Section [5](https://arxiv.org/html/2606.06635#S5) empirically tests both regime\-conditional triage and complementarity\.
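The agreement signal used by self-consistency can be sketched as a majority vote over extracted answers. The function name is hypothetical and answers are assumed to be already extracted and normalized as strings.

```python
from collections import Counter

def self_consistency(answers):
    """Majority-vote answer and agreement rate across k sampled completions."""
    counts = Counter(answers)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(answers)
```

The committed regime corresponds to the case where all completions return the same wrong answer: the agreement rate is 1.0, which is indistinguishable from a unanimous correct answer when agreement is the only signal.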

## 4 Experiments

We test whether the two failure modes manifest empirically across a range of models, datasets and task difficulty levels\. Our framework operates entirely on the externalized chain\-of\-thought trace and requires only token\-level log probabilities; no access to internal model representations is needed\. Our code is publicly available at [https://github\.com/sisl/LMTwoFailureModeFramework](https://github.com/sisl/LMTwoFailureModeFramework)\.

### 4\.1 Models and Datasets

We evaluate models spanning a range of sizes, families and architectures: Qwen3\.5\-2B, Qwen3\.5\-9B, Qwen3\.5\-27B, Qwen3\.5\-122B\-A10B Team \([2026](https://arxiv.org/html/2606.06635#bib.bib28)\), Llama3\.1\-8B\-Instruct Grattafiori et al\. \([2024](https://arxiv.org/html/2606.06635#bib.bib29)\), GPT\-OSS\-20B Agarwal et al\. \([2025](https://arxiv.org/html/2606.06635#bib.bib30)\), Gemma4\-31B, GPT\-4o Achiam et al\. \([2023](https://arxiv.org/html/2606.06635#bib.bib27)\), and Gemini\-2\.5\-Pro Comanici et al\. \([2025](https://arxiv.org/html/2606.06635#bib.bib31)\)\. We include dense and mixture\-of\-experts models, as well as open\-source and frontier models, in order to capture a variety of patterns\. These are evaluated on five benchmarks spanning mathematical, scientific, logical and coding domains: GSM8K \(1319 test questions of grade\-school math; Cobbe et al\. [2021](https://arxiv.org/html/2606.06635#bib.bib6)\), MATH\-500 \(500 competition\-level math problems representative of the full MATH benchmark; Hendrycks et al\. [2021](https://arxiv.org/html/2606.06635#bib.bib7), Lightman et al\. [2024](https://arxiv.org/html/2606.06635#bib.bib8)\), GPQA Diamond \(198 multiple\-choice questions on graduate\-level biology, chemistry and physics; Rein et al\. [2024](https://arxiv.org/html/2606.06635#bib.bib9)\), and LiveCodeBench \(451 applicable coding challenges; Jain et al\. [2025](https://arxiv.org/html/2606.06635#bib.bib11)\)\. We additionally evaluated AR\-LSAT \(230 questions from the Law School Admission Test; Zhong et al\. [2021](https://arxiv.org/html/2606.06635#bib.bib12)\), but every configuration we ran fell outside the applicability band; these results are reported in Table [3](https://arxiv.org/html/2606.06635#Ax1.T3) in the appendix for transparency and are excluded from the pool as a scope decision\. These datasets were selected to span domains and a range of difficulty levels relative to model capability, with the intention of generalizing the failure framework\.
Failure rates below 15% or above 60% paired with an AUROC $< 0.55$ render the framework inapplicable\. We additionally exclude configurations whose prefinal\-stripped trace contains too few failures to support reliable analysis \(typically fewer than $\sim$10 failures in the prefinal\-valid subset\), where the analysis pipeline falls back to regular features\.

All experiments use a temperature of 0\.6, which is consistent with standard LLM evaluation practice \(Renze and Guven, [2024](https://arxiv.org/html/2606.06635#bib.bib4)\) and balances exploration and exploitation\. For open\-weight models, we retrieve the top 200 log probabilities per token \(capturing around 99% of all probability mass\), since almost all probability mass is concentrated on a small subset of tokens \(Yang et al\., [2025](https://arxiv.org/html/2606.06635#bib.bib5)\)\. The frontier models, GPT\-4o and Gemini\-2\.5\-Pro, expose only the top 20 log probabilities, so their experiments are correspondingly constrained\. We do not test Claude models as they did not expose any log probabilities at the time of testing\. All open\-source models are served using vLLM Kwon et al\. \([2023](https://arxiv.org/html/2606.06635#bib.bib32)\)\.

Models are prompted to format their final answer within `\boxed{}` for math datasets or `Final: Answer` for GPQA Diamond, and answers are extracted via regex\. All experiments were run on 2 × NVIDIA H100 96GB GPUs, for a total of approximately 200 GPU hours across all configurations\.
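The extraction step might look like the following sketch. The patterns are illustrative guesses at the two answer formats named above, not the authors' regexes; the boxed pattern does not handle nested braces such as a fraction inside the box, and the function and argument names are hypothetical.

```python
import re

# Assumed patterns for the two answer formats; nested braces are not supported.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")
FINAL = re.compile(r"Final:\s*(.+)")

def extract_answer(trace, dataset_kind):
    """Return the last formatted answer in the trace, or None if none is found."""
    pattern = BOXED if dataset_kind == "math" else FINAL
    matches = pattern.findall(trace)
    return matches[-1].strip() if matches else None
```

Taking the last match reflects the idea that the final stated answer is the one scored; a trace with no match yields `None` and would need separate handling.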

### 4\.2 Evaluation Protocol

For each question, we prompt the model with the instruction: “Reason through the problem step by step to arrive at an answer” \(Wei et al\., [2022](https://arxiv.org/html/2606.06635#bib.bib1)\), and compute uncertainty signals over the resulting reasoning trace\. We define a binary failure label $y = \neg\,\text{correct}$, where the correctness is determined by the automated extraction of the final answer\.

We compute the uncertainty signals over prefixes of each trace, where a prefix is the first $T$ tokens in the trace with $T \in \{128, 256, 400, 512, 1024, 2048\}$\. A single inference call is made per question and the features for each window are computed over the same trace, ensuring that comparisons across $T$ are not confounded by sampling variation\.

We use PR\-AUC as our primary evaluation metric\. Across our experiments, model failure rates range from 5% to 84%\. At these class priors, AUROC can remain high even when a classifier has poor precision on the minority class \(Davis and Goadrich, [2006](https://arxiv.org/html/2606.06635#bib.bib10)\), making PR\-AUC more informative for evaluating failure detection\.

To calculate PR\-AUC, we use out\-of\-fold \(OOF\) predictions from a 5\-fold stratified logistic regression classifier with balanced class weights and regularization strength $C = 1.0$\. The OOF predictions are concatenated across all folds to obtain a single set of predictions, $\{(\hat{p}_{i}, y_{i})\}_{i=1}^{n}$, where $y_{i} \in \{0, 1\}$ indicates failure\. The random baseline PR\-AUC is the empirical failure rate $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_{i}$\. Statistical uncertainty is quantified via paired bootstrap confidence intervals on

$$\Delta\text{PR-AUC} = \text{PR-AUC}_{\text{early}} - \text{PR-AUC}_{\text{full}} \tag{1}$$

resampling $n$ observations with replacement for 10,000 iterations\.
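The paired bootstrap on the PR-AUC difference can be sketched with the standard library alone. Several choices here are assumptions rather than details from the paper: PR-AUC is estimated as average precision without special handling of tied scores, the interval is a simple percentile interval, resamples containing no failures are skipped, and the function names and the `p_pos` output (the share of resamples with a positive difference) are illustrative. The inputs `p_early` and `p_full` stand for out-of-fold failure probabilities from the early-window and full-trace classifiers on the same examples.

```python
import random

def average_precision(y_true, scores):
    """Average precision: mean precision at the rank of each positive example."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else float("nan")

def paired_bootstrap_delta(y, p_early, p_full, n_boot=10_000, seed=0):
    """Percentile CI for PR-AUC(early) - PR-AUC(full) with shared resamples."""
    rng = random.Random(seed)
    n = len(y)
    deltas = []
    for _ in range(n_boot):
        # The same resampled indices feed both classifiers (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if sum(yb) == 0:
            continue  # no failures in this resample, so PR-AUC is undefined
        early = average_precision(yb, [p_early[i] for i in idx])
        full = average_precision(yb, [p_full[i] for i in idx])
        deltas.append(early - full)
    deltas.sort()
    lo = deltas[int(0.025 * (len(deltas) - 1))]
    hi = deltas[int(0.975 * (len(deltas) - 1))]
    p_pos = sum(d > 0 for d in deltas) / len(deltas)
    return lo, hi, p_pos
```

Pairing matters because both PR-AUC values are computed on the same questions; resampling them jointly keeps the correlation between the two estimates in the interval for their difference.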

To test whether the inverted\-U signature truly reproduces across model\-dataset configurations, we pool per\-configuration evidence using four complementary tests: a sign test on the direction of $\hat{\Delta}$ across committed configurations, Stouffer’s $Z$ combining bootstrap $p$\-values, an inverse\-variance weighted meta\-analysis estimating a pooled effect size, and a joint sign test evaluating the framework’s bidirectional prediction across all configurations simultaneously\. Classification into committed or persistent is made for each configuration independently before pooling\. In cases where within\-dataset stratification reveals that an aggregate $\Delta$PR\-AUC averages over distinct modes, difficulty strata are substituted as units\.

The four tests are complementary, as each answers a distinct question: directional consistency, cumulative significance, pooled magnitude, or bidirectional falsifiability\.
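Three of the pooling ingredients have compact textbook forms, sketched below with hypothetical function names. The choices are assumptions: the sign test is one-sided and exact under a 0.5 null, Stouffer's Z is unweighted over one-sided p-values (clipped away from 0 and 1 so the inverse normal is defined), and the meta-analysis is the fixed-effect inverse-variance mean. The paper may use different variants.

```python
from math import comb, sqrt
from statistics import NormalDist

def sign_test_p(deltas):
    """One-sided exact sign test: P(at least this many positives | p = 0.5)."""
    signs = [d for d in deltas if d != 0]  # zeros carry no direction
    n, k = len(signs), sum(d > 0 for d in signs)
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

def stouffer_z(p_values, eps=1e-12):
    """Unweighted Stouffer combination of one-sided p-values."""
    nd = NormalDist()
    clipped = [min(max(p, eps), 1 - eps) for p in p_values]
    z = sum(nd.inv_cdf(1 - p) for p in clipped) / sqrt(len(clipped))
    return z, 1 - nd.cdf(z)

def inverse_variance_mean(effects, std_errs):
    """Fixed-effect pooled estimate and its standard error."""
    weights = [1 / se ** 2 for se in std_errs]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return pooled, sqrt(1 / sum(weights))
```

As a worked example under the 0.5 null, 20 or more successes out of 23 has probability (1771 + 253 + 23 + 1) / 2^23 = 2^-12, roughly 0.00024, which is what `sign_test_p` returns for 20 positive and 3 negative differences.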

Several configurations are excluded from our primary analysis based on two methodological criteria: \(i\) the failure rate falls outside $[15\%, 60\%]$ and the AUROC falls below 0\.55, or \(ii\) the prefinal\-stripped trace contains too few failures for reliable analysis\. AR\-LSAT is additionally excluded as a scope decision\.
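The two exclusion criteria can be written as a small predicate. This is a sketch with a hypothetical function name; the threshold of 10 pre-final failures mirrors the approximate figure quoted in Section 4.1 and should be read as an assumption, not an exact cutoff.

```python
def is_applicable(failure_rate, auroc, n_prefinal_failures,
                  band=(0.15, 0.60), min_auroc=0.55, min_failures=10):
    """Return True if a configuration passes both inclusion criteria."""
    outside_band = not (band[0] <= failure_rate <= band[1])
    # Criterion (i): excluded only when BOTH conditions hold.
    if outside_band and auroc < min_auroc:
        return False
    # Criterion (ii): too few failures after pre-final stripping (approximate cutoff).
    if n_prefinal_failures < min_failures:
        return False
    return True
```

Note that criterion (i) is a conjunction: a configuration with an extreme failure rate is still retained if its AUROC reaches the threshold.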

### 4\.3 Commitment Point Identification

A commitment point is identified as the window achieving the highest PR\-AUC, provided that either the lower bound of the 95% bootstrap confidence interval on $\Delta\text{PR-AUC}$ lies above zero \(*strong committed*\), or $\Delta\text{PR-AUC} > 0$ with $p(\Delta > 0) > 0.8$ while the interval spans zero \(*weak committed*\)\.

We restrict identification of the commitment point to genuine early windows\. Windows that capture the majority of the trace are excluded, as they no longer constitute an early observation of the reasoning process\. Concretely, this results in early windows of 1024 and 2048 being omitted if the maximum length of a reasoning trace is less than either threshold\.
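The identification rule, including the restriction to genuine early windows, can be sketched as follows. The function name, the input layout (a dictionary per window holding `pr_auc`, `ci_low` and `p_pos`), and the "unclassified" label for cases the text does not cover are all assumptions made for illustration.

```python
def identify_commitment_point(windows, pr_auc_full, max_trace_len):
    """Return (T*, label) for one configuration.

    windows maps each window size T to a dict with keys
    'pr_auc', 'ci_low' (lower 95% bound on Delta) and 'p_pos' (P(Delta > 0)).
    """
    # Drop windows longer than the longest trace: they are not early windows.
    eligible = {T: w for T, w in windows.items() if T <= max_trace_len}
    if not eligible:
        return None, "unclassified"
    t_star = max(eligible, key=lambda T: eligible[T]["pr_auc"])
    best = eligible[t_star]
    delta = best["pr_auc"] - pr_auc_full
    if best["ci_low"] > 0:
        return t_star, "strong committed"
    if delta > 0 and best["p_pos"] > 0.8:
        return t_star, "weak committed"
    if delta < 0:
        return None, "persistent"  # best early window is still below the full trace
    return None, "unclassified"    # assumption: text does not specify this case
```

Because the peak window is chosen before the interval is checked, a configuration whose best early window falls below the full-trace PR-AUC is reported as persistent, matching the condition that the difference is negative at every tested window.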

### 4\.4 Evaluation Against Self\-Consistency

We compare how single\-completion uncertainty signals relate to self\-consistency\. Specifically, we identify when self\-consistency is effective and whether uncertainty signals add predictive power in those cases\. We evaluate three models over $k = 15$ completions each on GPQA Diamond: Gemma4\-31B \(weak commitment\), Qwen3\.5\-9B \(persistent uncertainty\) and Qwen3\.5\-122B \(persistent uncertainty\), spanning the two failure regimes, model families and model sizes\.

## 5 Results

Across 23 model and dataset configurations, we find clear evidence for both failure modes in our framework\. Fourteen configurations exhibit committed failure, where the early uncertainty signals reach peak predictive power before the full trace\. Nine exhibit persistent uncertainty, where the PR\-AUC accumulates monotonically and never surpasses that of the full trace\. Additionally, we verify that reasoning trace lengths do not differ systematically across the two failure regimes, ruling out trace length as a confound on the failure mode framework itself\.

Table 1: Failure\-mode classification by \(model, dataset\) configuration\. *SC*/*WC*: strong/weak committed \(95% CI on $\Delta(T^{*})$ excludes/spans zero\)\. *Persist*: persistent uncertainty\. *Stratified*: per\-difficulty\-level analysis\. $-$: excluded \(see Table [3](https://arxiv.org/html/2606.06635#Ax1.T3)\)\.

### 5\.1 Committed Failures

Committed failure is the most prevalent failure mode in our experiments, occurring in fourteen model\-dataset configurations across five model families, four datasets and a range of model scales\. All committed cases have an inverted\-U PR\-AUC curve where the predictive power increases as the early window grows to include the commitment point $T^{*}$, after which it falls as additional tokens dilute the early signal\.

The strongest committed failure signatures occur when the 95% CI on $\Delta(T^{*})$ excludes zero, indicating that the early window is strictly more informative than the full trace\. This strong committed signature is most evident for Gemma4\-31B on LiveCodeBench, where the inverted\-U shape is visually unambiguous and the CI excludes zero with high confidence \(Figure [2](https://arxiv.org/html/2606.06635#S5.F2)\)\. GPT\-OSS\-20B on the easy\-question split of LiveCodeBench independently replicates this signature with the cleanest delta confidence bands across four consecutive committed windows\. Gemini\-2\.5\-Pro on MATH\-500 further confirms this pattern, though this result uses the top\-$20$ log probabilities rather than the full output distributions\. These cases demonstrate the committed failure regime clearly: the model selects an incorrect reasoning path before the trace finishes, and subsequent tokens add noise relative to the early signal\.

![Refer to caption](https://arxiv.org/html/2606.06635v1/x1.png)

Figure 2: Strong Committed Failure: Gemma4\-31B on LiveCodeBench\. The $\Delta(T)$ confidence interval excludes $0$\.

Not all committed failures appear as strongly\. Several configurations show the inverted\-U shape and a positive $\Delta(T)$ but with confidence intervals that span zero\. We classify these failures as *weak committed*\. This continuum of signal strength arises when the failure pool is small \(Gemma4\-31B on GPQA with $43$ failures\), when the curve collapses sharply after the commitment point \(Qwen3\.5\-2B on GPQA, where $\Delta$ peaks at $T^{*}=512$ before dropping below the full\-trace baseline at $T=1024$\), or when a high failure rate compresses the signature of the uncertainty features \(GPT\-OSS\-20B on LCB hard at a $60\%$ failure rate\)\. The commitment point still exists across all these cases, and the statistical uncertainty reflects sample constraints\.
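The strong/weak distinction depends on whether a confidence interval on $\Delta(T^{*})$ excludes zero\. The following is a hedged sketch of one way to compute such an interval, using a percentile bootstrap over questions and average precision as the PR\-AUC estimate\. Both choices are assumptions, as are the function names; the estimator and bootstrap procedure actually used in the paper are not specified in this section\.

```python
import random


def average_precision(labels, scores):
    """PR-AUC estimated as average precision; label 1 marks a failed completion."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one failure to compute PR-AUC")
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at each recalled failure
    return total / n_pos


def bootstrap_delta_ci(labels, early_scores, full_scores,
                       n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Delta = PR-AUC(early) - PR-AUC(full)."""
    if sum(labels) == 0:
        raise ValueError("need at least one failure to compute PR-AUC")
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample questions
        y = [labels[i] for i in idx]
        if sum(y) == 0:
            continue  # PR-AUC is undefined without any failures
        early = average_precision(y, [early_scores[i] for i in idx])
        full = average_precision(y, [full_scores[i] for i in idx])
        deltas.append(early - full)
    deltas.sort()
    lo = deltas[int(alpha / 2 * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def committed_strength(ci_low, ci_high):
    """'strong' if the CI lies entirely above zero, otherwise 'weak'."""
    return "strong" if ci_low > 0 else "weak"
```

With a small failure pool the resampled $\Delta$ values spread widely, which is one way an interval can span zero even when the point estimate is positive\.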

These examples generalize across all three tested axes of variation \(Table[1](https://arxiv.org/html/2606.06635#S5.T1)\): four reasoning domains \(GSM8K, MATH\-500, GPQA, LiveCodeBench\), five model families \(Gemini, Gemma, GPT\-OSS, Llama, Qwen\), and a model\-scale range from Qwen3\.5\-2B through frontier\-scale systems\. No single architecture, training pipeline, or task family explains the inverted\-U pattern\.

### 5\.2 Persistent Uncertainty Failures

Persistent uncertainty appears in nine of the model\-dataset configurations, revealing a diagnostically distinct pattern from committed failure\. Instead of an inverted\-U, the PR\-AUC curve rises monotonically with window size, and the full trace is consistently higher than any early window\. There is no $T^{*}$ at which subsequent tokens become noise, since the full trace outperforms all windows\. Additional tokens always add signal because predictive power is not concentrated at any single position in the trace\. The model is genuinely uncertain throughout its reasoning, and the uncertainty is only fully captured once we observe the full trace\.

The clearest example is Llama3\.1\-8B on MATH\-500, where the PR\-AUC rises steadily across all window sizes with no peak \(Figure [3](https://arxiv.org/html/2606.06635#S5.F3)\)\. The model does not lock onto a wrong path but instead searches paths across the full trace, with uncertainty remaining elevated throughout\.

![Refer to caption](https://arxiv.org/html/2606.06635v1/x2.png)

Figure 3: Persistent Uncertainty: Llama3\.1\-8B on MATH\-500\. Early windows never beat the full trace\.

This phenomenon is not restricted to a particular model architecture or task type\. It appears in GPT\-OSS\-20B on GSM8K, a dataset on which other models exhibit committed failure\. We also observe it in Qwen3\.5\-122B, the largest model in our evaluation, on GPQA and LiveCodeBench\. In all persistent uncertainty cases the diagnostic is the same: no early window outperforms the full trace\.
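The persistent\-uncertainty diagnostic can be written as two checks on the PR\-AUC curve\. The sketch below is illustrative only: the name `looks_persistent` and the tolerance parameter `tol` are hypothetical, and the tolerance is an assumption added to accommodate near\-zero boundary cases rather than a value taken from the paper\.

```python
def looks_persistent(pr_auc_by_window, pr_auc_full, tol=0.0):
    """Check the two signatures of persistent uncertainty.

    1. PR-AUC does not decrease as the window grows.
    2. No early window exceeds the full-trace PR-AUC (within tol).
    """
    windows = sorted(pr_auc_by_window)
    aucs = [pr_auc_by_window[t] for t in windows]
    rises = all(later >= earlier - tol for earlier, later in zip(aucs, aucs[1:]))
    full_dominates = all(auc <= pr_auc_full + tol for auc in aucs)
    return rises and full_dominates


# Hypothetical curves for illustration.
rising = {128: 0.30, 256: 0.34, 512: 0.39, 1024: 0.43}
peaked = {128: 0.40, 256: 0.52, 512: 0.47}
print(looks_persistent(rising, pr_auc_full=0.47))  # True
print(looks_persistent(peaked, pr_auc_full=0.45))  # False
```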

GPT\-4o presents a nuanced case under the API’s top\-$20$ log probability constraint\. On MATH\-500, the pattern is persistent uncertainty: $\Delta$ is uniformly negative across all early windows and the full trace dominates\. On GPQA, the pattern is weak committed \($\hat{\Delta}=+0.005$ at $T^{*}=1024$\), near the persistent boundary\. The top\-$20$ constraint affects feature reliability: mean\-based features \(entropy, NLL\) depend on tail probability mass that is missing, while max\-based features \(margin, near\-tie\) remain valid since they depend only on the highest\-probability tokens, which top\-$20$ truncation preserves\. Despite this constraint, both configurations are in the pool with classifications consistent with the framework’s prediction\.
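The effect of top\-$20$ truncation on the two feature families can be illustrated with a toy distribution\. This sketch assumes entropy is computed over whatever probabilities the API returns, without renormalization, and that margin is the gap between the two most probable tokens; the paper's exact feature definitions appear in an earlier section and may differ\. The distribution itself is hypothetical\.

```python
import math


def entropy(probs):
    """Shannon entropy in nats over whatever probability mass is supplied."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def margin(probs):
    """Gap between the two most probable tokens."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def top_k(probs, k=20):
    """Simulate an API that exposes only the k largest probabilities."""
    return sorted(probs, reverse=True)[:k]


# Hypothetical next-token distribution over a 1,000-token vocabulary:
# two strong candidates plus a long, flat tail holding 45% of the mass.
full = [0.30, 0.25] + [0.45 / 998] * 998
truncated = top_k(full, k=20)

print(margin(full) == margin(truncated))   # True: the top two survive truncation
print(entropy(truncated) < entropy(full))  # True: the tail's contribution is lost
```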

### 5\.3 Reproducibility Across Configurations

Table 2: Meta\-analytic pooling of $\Delta(T^{*})$ across configurations\. Stouffer’s $Z$ combines per\-configuration evidence on the standard\-normal scale; the joint sign test pools the committed and persistent classes\.

The framework’s directional predictions are entirely consistent in the committed regime: every configuration shows $\hat{\Delta}>0$ as predicted \(Table [2](https://arxiv.org/html/2606.06635#S5.T2), Figure [5](https://arxiv.org/html/2606.06635#Ax1.F5)\)\. Persistent configurations follow the opposite prediction in six of nine cases; the three boundary cases \(Gemma4\-31B / MATH\-500 L5, Qwen3\.5\-2B / LCB, Qwen3\.5\-122B / LCB\) have $\hat{\Delta}$ within $\pm 0.003$ of zero, which makes them statistically indistinguishable from the prediction rather than violations of it\. The framework is falsifiable in both directions and survives\.
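For readers unfamiliar with the two pooling statistics, a minimal sketch follows\. It assumes the unweighted form of Stouffer's method and a one\-sided binomial sign test against a success probability of $0.5$; whether the paper uses weights or a two\-sided test is not stated in this section\. Under these assumptions, 14 successes out of 14 give $0.5^{14}\approx 6.1\times 10^{-5}$, which agrees with the value reported in the conclusion\.

```python
import math


def stouffer_z(z_scores):
    """Unweighted Stouffer combination: sum of z-scores over sqrt(k)."""
    return sum(z_scores) / math.sqrt(len(z_scores))


def sign_test_p(successes, n):
    """One-sided sign test: P(X >= successes) for X ~ Binomial(n, 0.5)."""
    tail = sum(math.comb(n, k) for k in range(successes, n + 1))
    return tail / 2 ** n


# 14 of 14 committed configurations show a positive Delta.
print(f"{sign_test_p(14, 14):.1e}")  # 6.1e-05

# Hypothetical per-configuration z-scores, for illustration only.
print(stouffer_z([1.0, 1.0, 1.0, 1.0]))  # 2.0
```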

### 5\.4 Self\-Consistency

We find that triage performance tracks failure\-mode classification directly \(Figure [4](https://arxiv.org/html/2606.06635#S5.F4); full operating curves in Appendix Table [6](https://arxiv.org/html/2606.06635#Ax1.T6)\)\. Pre\-final stripping is essential across all models, since the full\-trace signal collapses as a triage tool in every panel\. With pre\-final features, Gemma4\-31B \(committed\) achieves near\-perfect recall up to a $30\%$ skip rate using either the pre\-final early window \($T=400$\) or the full pre\-final trace\. For both Qwen models in the persistent regime, the pre\-final curves hold near $1.0$ while the top $20\%$ most\-confident questions are skipped and degrade more steeply beyond that\. For Qwen3\.5\-9B, the early\-window and full\-pre\-final curves are indistinguishable, reaffirming the absence of a commitment point\.
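The triage procedure can be sketched as follows: rank questions by single\-completion confidence, skip self\-consistency on the most confident fraction, and measure how many self\-consistency failures remain among the questions that are still checked\. This reading of "recall on SC failures" is an assumption, and the function name and data are hypothetical\.

```python
def triage_recall(confidence, sc_failed, skip_rate):
    """Recall on self-consistency failures after skipping confident inputs.

    confidence: per-question confidence from a single completion.
    sc_failed:  1 if self-consistency got the question wrong, else 0.
    skip_rate:  fraction of most-confident questions not sent to SC.
    """
    total_failures = sum(sc_failed)
    if total_failures == 0:
        raise ValueError("no self-consistency failures to recall")
    n = len(confidence)
    n_skip = int(skip_rate * n + 1e-9)  # guard against float round-down
    order = sorted(range(n), key=lambda i: confidence[i], reverse=True)
    kept = order[n_skip:]  # questions still routed to self-consistency
    return sum(sc_failed[i] for i in kept) / total_failures


# Hypothetical data: ten questions, four of which self-consistency fails on.
conf = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
fail = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]
print(triage_recall(conf, fail, skip_rate=0.3))  # 1.0
print(triage_recall(conf, fail, skip_rate=0.5))  # 0.75
```

In this toy example recall stays perfect while only confident, correct questions are skipped, and it drops once the skipped set starts to include a failure\.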

![Refer to caption](https://arxiv.org/html/2606.06635v1/x3.png)

Figure 4: Selective self\-consistency triage: recall on SC failures vs\. skip rate \(% of most\-confident inputs skipped\), for committed \(Gemma4\-31B / GPQA\) and persistent \(Qwen3\.5\-9B / GPQA\) configurations\. Pre\-final stripping is essential: the full\-trace curve collapses without it\.

We then evaluate whether uncertainty signals complement self\-consistency’s agreement rate\. Single\-completion uncertainty alone is substantially weaker than agreement rate across all three configurations \(PR\-AUC $\approx 0.42$ vs\. $\approx 0.78$\), confirming that the two signals are not interchangeable\. Combining them improves PR\-AUC over agreement alone in every configuration: $+0.026$ \(Gemma4\-31B\), $+0.035$ \(Qwen3\.5\-9B\), and $+0.045$ \(Qwen3\.5\-122B\)\. While individual\-cell lifts have CIs that span zero given GPQA’s small failure pool \($n=198$\), the consistent positive direction across both regimes suggests that uncertainty features capture within\-completion reasoning quality that agreement rate, a purely between\-completion signal, cannot observe \(Appendix Figure [6](https://arxiv.org/html/2606.06635#Ax1.F6)\)\. When self\-consistency is already deployed, adding uncertainty features incurs no additional inference cost and improved failure prediction in all three configurations tested\.
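As a reminder of what the between\-completion signal measures, the sketch below computes an agreement rate from $k$ sampled answers\. It assumes agreement rate means the share of completions that match the majority answer; the function name and the sample answers are hypothetical, and how the paper combines this signal with uncertainty features is not described in this section\.

```python
from collections import Counter


def agreement_rate(answers):
    """Return (majority answer, fraction of completions that agree with it)."""
    if not answers:
        raise ValueError("need at least one completion")
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / len(answers)


# Hypothetical final answers from k=15 completions on one multiple-choice question.
samples = ["B"] * 9 + ["C"] * 4 + ["A"] * 2
print(agreement_rate(samples))  # ('B', 0.6)
```

Because this signal only compares final answers across completions, it carries no information about how any single trace was produced, which is where token\-level uncertainty can add something\.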

## 6 Conclusion

We introduced a two\-mode framework characterizing how language\-model reasoning failures manifest in chain\-of\-thought traces, requiring only log probabilities from a single completion\. Across 23 configurations spanning five model families and four reasoning domains, the framework’s bidirectional prediction holds in 20 of 23 cases \(sign test on committed configurations: 14/14, $p=6.1\times 10^{-5}$; pooled $\hat{\Delta}=+0.013$, 95% CI $[+0.005,+0.020]$\)\.

The failure\-mode classification has direct deployment implications: in the committed regime we can skip self\-consistency on the top 30% most\-confident inputs without sacrificing failure recall, and across both regimes, combining uncertainty features with the agreement rate yields a consistent positive lift\. Failure detection strategies should be adapted to failure mode rather than applied uniformly\.

## 7 Limitations

Our framework requires failure rates within a workable applicability band; configurations at the extremes do not produce reliable PR\-AUC estimates and are excluded from the pool\. We use a single completion per question, so commitment\-point identification is sensitive to the sampled trace\. The commitment point $T^{*}$ is identified at the granularity of six fixed window sizes \($\{128,256,400,512,1024,2048\}$\) and represents a window range rather than an exact token position; a finer\-grained sweep between adjacent windows could refine its localization\. Closed\-API constraints, where GPT\-4o and Gemini\-2\.5\-Pro expose only their top\-20 log probabilities, limit the reliability of mean\-based features that depend on tail probability mass; max\-based features remain valid under truncation\. Our self\-consistency evaluation is restricted to three configurations on a single benchmark \(GPQA Diamond\), so whether the complementarity result holds in broader settings remains to be validated\. Finally, the framework operates on visible chain\-of\-thought traces and does not address whether these traces faithfully reflect internal model computation; this question is orthogonal to our empirical claims, which concern structure observable in the visible trace\.

## References

- J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat, et al\. \(2023\) Gpt\-4 technical report\. arXiv preprint arXiv:2303\.08774\.
- S\. Agarwal, L\. Ahmad, J\. Ai, S\. Altman, A\. Applebaum, E\. Arbus, R\. K\. Arora, Y\. Bai, B\. Baker, H\. Bao, et al\. \(2025\) Gpt\-oss\-120b & gpt\-oss\-20b model card\. arXiv preprint arXiv:2508\.10925\.
- Forking paths in neural text generation\. In International Conference on Learning Representations \(ICLR\)\.
- S\. Boppana, A\. Ma, M\. Loeffler, R\. Sarfati, E\. Bigelow, A\. Geiger, O\. Lewis, and J\. Merullo \(2026\) Reasoning theater: disentangling model beliefs from chain\-of\-thought\. arXiv preprint arXiv:2603\.05488\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\) Training verifiers to solve math word problems\. arXiv preprint arXiv:2110\.14168\.
- G\. Comanici, E\. Bieber, M\. Schaekermann, I\. Pasupat, N\. Sachdeva, I\. Dhillon, M\. Blistein, O\. Ram, D\. Zhang, E\. Rosen, et al\. \(2025\) Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\. arXiv preprint arXiv:2507\.06261\.
- J\. Davis and M\. Goadrich \(2006\) The relationship between precision\-recall and roc curves\. In International Conference on Machine Learning \(ICML\)\.
- S\. Devic, C\. Peale, A\. Bradley, S\. Williamson, P\. Nakkiran, and A\. Gollakota \(2025\) Trace length is a simple uncertainty signal in reasoning models\. arXiv preprint arXiv:2510\.10409\.
- S\. Farquhar, J\. Kossen, L\. Kuhn, and Y\. Gal \(2024\) Detecting hallucinations in large language models using semantic entropy\. Nature 630\(8017\), pp\. 625–630\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\) Measuring mathematical problem solving with the math dataset\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- A\. Holtzman, J\. Buys, L\. Du, M\. Forbes, and Y\. Choi \(2020\) The curious case of neural text degeneration\. In International Conference on Learning Representations \(ICLR\)\.
- N\. Jain, K\. Han, A\. Gu, W\. Li, F\. Yan, T\. Zhang, S\. Wang, A\. Solar\-Lezama, K\. Sen, and I\. Stoica \(2025\) LiveCodeBench: holistic and contamination free evaluation of large language models for code\. In International Conference on Learning Representations \(ICLR\)\.
- S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Hatfield\-Dodds, N\. DasSarma, E\. Tran\-Johnson, et al\. \(2022\) Language models \(mostly\) know what they know\. arXiv preprint arXiv:2207\.05221\.
- T\. Kojima, S\. S\. Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa \(2022\) Large language models are zero\-shot reasoners\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\) Efficient memory management for large language model serving with pagedattention\. In Symposium on Operating Systems Principles \(SOSP\)\.
- T\. Lanham, A\. Chen, A\. Radhakrishnan, B\. Steiner, C\. Denison, D\. Hernandez, D\. Li, E\. Durmus, E\. Hubinger, J\. Kernion, et al\. \(2023\) Measuring faithfulness in chain\-of\-thought reasoning\. arXiv preprint arXiv:2307\.13702\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2024\) Let’s verify step by step\. In International Conference on Learning Representations \(ICLR\)\.
- V\. Palod, K\. Valmeekam, K\. Stechly, and S\. Kambhampati \(2025\) Performative thinking? the brittle correlation between cot length and problem complexity\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2024\) GPQA: a graduate\-level google\-proof q&a benchmark\. In Conference on Language Modeling \(COLM\)\.
- M\. Renze and E\. Guven \(2024\) The effect of sampling temperature on problem solving in large language models\. In Conference on Empirical Methods in Natural Language Processing \(EMNLP\)\.
- T\. Scheffer, C\. Decomain, and S\. Wrobel \(2001\) Active hidden Markov models for information extraction\. In International Symposium on Intelligent Data Analysis\.
- Q\. Team \(2026\) Qwen3\.5\-omni technical report\. arXiv preprint arXiv:2604\.15804\.
- M\. Turpin, J\. Michael, E\. Perez, and S\. Bowman \(2023\) Language models don’t always say what they think: unfaithful explanations in chain\-of\-thought prompting\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. V\. Le, E\. H\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023\) Self\-consistency improves chain of thought reasoning in language models\. In International Conference on Learning Representations \(ICLR\)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- C\. Yang, S\. Li, and A\. Holtzman \(2025\) LLM probability concentration: how alignment shrinks the generative horizon\. arXiv preprint arXiv:2506\.17871\.
- D\. Ye, M\. Loffgren, O\. Kotadia, and L\. Wong \(2026\) Mechanistic evidence for faithfulness decay in chain\-of\-thought reasoning\. In International Conference on Learning Representations \(ICLR\)\.
- R\. J\. Young \(2026\) Why models know but don’t say: chain\-of\-thought faithfulness divergence between thinking tokens and answers in open\-weight reasoning models\. arXiv preprint arXiv:2603\.26410\.
- X\. Zhao \(2026\) Entropy trajectory shape predicts LLM reasoning reliability: a diagnostic study of uncertainty dynamics in chain\-of\-thought\. arXiv preprint arXiv:2603\.18940\.
- W\. Zhong, S\. Wang, D\. Tang, Z\. Xu, D\. Guo, J\. Wang, J\. Yin, M\. Zhou, and N\. Duan \(2021\) Ar\-lsat: investigating analytical reasoning of text\. arXiv preprint arXiv:2104\.06598\.
- A\. Zur, A\. Geiger, E\. S\. Lubana, and E\. Bigelow \(2025\) Are language models aware of the road not taken? token\-level uncertainty and hidden state dynamics\. arXiv preprint arXiv:2511\.04527\.

## Appendix

Additional results and excluded configurations are reported below\.

Table 3: The framework’s applicability band is a $[15\%, 60\%]$ failure rate paired with AUROC $>0.55$; below $15\%$ the PR\-AUC estimator is unstable due to insufficient positive\-class examples, and above $60\%$ capability\-floor noise dominates the signal\. The pre\-final\-trace confound criterion excludes configurations whose pre\-final\-valid subset contains fewer than $\sim$10 failures, where the analysis pipeline falls back to regular features and the pre\-final\-mode comparison becomes unreliable\. AR\-LSAT is excluded as a scope decision rather than treated as evidence against the framework\.

Table 4: Robustness of the pooled estimate to two analytic choices: how Gemma4\-31B / MATH\-500 is entered \(Scenario A: stratified by difficulty level, the primary choice motivated by between\-mode cancellation in the aggregate; Scenario B: aggregate\), and whether configurations failing the pre\-final\-trace confound criterion are excluded \(A, B\) or included \(Scenario C\)\. The pooled effect size and statistical evidence are stable across all three choices\.

![Refer to caption](https://arxiv.org/html/2606.06635v1/x4.png)

Figure 5: Forest plot of $\Delta$PR\-AUC at $T^{*}$ across all 23 configurations with $95\%$ bootstrap CIs\. Committed configurations \(blue, upper group\) all show positive $\Delta$, ranging from $+0.005$ to $+0.135$\. Persistent configurations \(red, lower group\) cluster near or below zero, with three boundary cases \(Gemma4\-31B / MATH\-500 L5, Qwen3\.5\-2B / LCB, Qwen3\.5\-122B / LCB\) at $\hat{\Delta}\approx+0.002$\. The green diamond is the inverse\-variance weighted pooled estimate over committed configurations \($\hat{\Delta}=+0.013$, $95\%$ CI $[+0.005,+0.020]$\)\.

Table 5: Full triage operating curves\. Each table entry reports *recall / precision* on the self\-consistency verdict for inputs flagged by ranking single\-completion confidence\. Pre\-final stripping \(removing post\-answer tokens from the trace before extracting uncertainty features\) eliminates a length confound; without it, generated\-length leakage produces overconfident triage at the cost of recall\. Gemma4\-31B \(committed\) sustains perfect triage out to a 30% skip rate; Qwen3\.5\-9B \(persistent\) sustains it only to 20%\.

| Configuration | Operating point | Compute saved | Recall on SC fails | Precision |
| --- | --- | --- | --- | --- |
| Committed: Gemma4\-31B / GPQA | Skip top 20% | 20% | 1\.00 | 1\.00 |
| Committed: Gemma4\-31B / GPQA | Skip top 30% | 30% | 1\.00 | 1\.00 |
| Committed: Gemma4\-31B / GPQA | Skip top 45% | 45% | 0\.95 | 0\.99 |
| Committed: Gemma4\-31B / GPQA | Skip top 50% | 50% | 0\.85 | 0\.97 |
| Persistent: Qwen3\.5\-9B / GPQA | Skip top 20% | 20% | 1\.00 | 1\.00 |
| Persistent: Qwen3\.5\-9B / GPQA | Skip top 30% | 30% | 0\.92 | 0\.97 |
| Persistent: Qwen3\.5\-9B / GPQA | Skip top 45% | 45% | 0\.88 | 0\.97 |
| Persistent: Qwen3\.5\-9B / GPQA | Skip top 50% | 50% | 0\.83 | 0\.96 |

Table 6: Self\-consistency triage on GPQA\. “Skip top $k\%$” ranks inputs by single\-completion confidence and skips 15\-completion SC on the $k\%$ most confident\. Recall and precision are against the SC verdict\. Full operating curves in Appendix Table [5](https://arxiv.org/html/2606.06635#Ax1.T5)\.

![Refer to caption](https://arxiv.org/html/2606.06635v1/x5.png)

![Refer to caption](https://arxiv.org/html/2606.06635v1/x6.png)

Figure 6: Comparison of self\-consistency’s agreement rate signal, single\-completion uncertainty signals, and the aggregate of the two across Gemma4\-31B, Qwen3\.5\-9B, and Qwen3\.5\-122B on GPQA Diamond\.

Table 7: PR\-AUC for different window sizes $T$\. Configurations excluded from pooling analysis due to capability floor or ceiling criteria are still included\.

Table 8: Delta PR\-AUC for different window sizes $T$\. Configurations excluded from pooling analysis due to capability floor or ceiling criteria are still included\.
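The applicability criteria stated in the Table 3 caption can be expressed as a simple predicate\. This is a sketch under stated assumptions: the function name is hypothetical, the band endpoints are treated as inclusive, and the separate pre\-final\-trace confound criterion is not modeled\.

```python
def in_applicability_band(failure_rate, auroc,
                          min_rate=0.15, max_rate=0.60, min_auroc=0.55):
    """True if a configuration falls inside the [15%, 60%] failure-rate band
    and its AUROC exceeds 0.55 (thresholds from the Table 3 caption)."""
    return min_rate <= failure_rate <= max_rate and auroc > min_auroc


# Hypothetical configurations for illustration.
print(in_applicability_band(failure_rate=0.30, auroc=0.62))  # True
print(in_applicability_band(failure_rate=0.10, auroc=0.70))  # False: too few failures
print(in_applicability_band(failure_rate=0.65, auroc=0.70))  # False: failure rate too high
```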
