Reading Calibrated Uncertainty from Language Model Trajectories

arXiv cs.LG Papers

Summary

This paper introduces a method to calibrate uncertainty in language models by extracting eleven scale-invariant geometric features from per-layer MLP update trajectories and feeding them to a sparse linear probe, outperforming MSP under selective abstention by up to 21 AURC points.

arXiv:2605.22864v1 Announce Type: new Abstract: The maximum softmax probability (MSP) represents a default approach when evaluating uncertainty quantification for language model generation with structured output. Although cheap, it is often miscalibrated. Methods that probe the model's internal activations feed raw hidden states into opaque classifiers, reading activations as static snapshots and leaving implicit the layer-wise trajectory by which a representation is formed. Yet, similar endpoints can arise from very different paths, and how evidence accumulates, reinforces, or reverses across depth might reveal uncertainty that final probabilities obscure. We extract eleven scale-invariant geometric features, tracing the cumulative path of per-layer MLP updates, and feed them to a sparse linear probe. The probe outperforms MSP under selective abstention, with gains scaling with baseline miscalibration up to 21 AURC points. Because every feature has a closed-form geometric meaning, the probe's coefficients trace how and where along depth errors take shape -- which layers commit prematurely, which contradict the running state, where trajectories drift away from their endpoint.

Cached at: 05/25/26, 08:54 AM

# Reading Calibrated Uncertainty from Language Model Trajectories
Source: [https://arxiv.org/html/2605.22864](https://arxiv.org/html/2605.22864)
Alexander Herzog, Xiaoyu Liang, Marie Vasek, Enrico Mariconti, Lorenzo Cavallaro

###### Abstract

The maximum softmax probability (MSP) is a default approach to uncertainty quantification for language model generation with structured output. Although cheap, it is often miscalibrated. Methods that probe the model's internal activations feed raw hidden states into opaque classifiers, reading activations as static snapshots and leaving implicit the layer-wise trajectory by which a representation is formed. Yet, similar endpoints can arise from very different paths, and how evidence accumulates, reinforces, or reverses across depth might reveal uncertainty that final probabilities obscure. We extract eleven scale-invariant geometric features, tracing the cumulative path of per-layer MLP updates, and feed them to a sparse linear probe. The probe outperforms MSP under selective abstention, with gains scaling with baseline miscalibration up to 21 AURC points. Because every feature has a closed-form geometric meaning, the probe's coefficients trace *how* and *where* along depth errors take shape: which layers commit prematurely, which contradict the running state, and where trajectories drift away from their endpoint.

Machine Learning, ICML

## 1 Introduction

A trustworthy language model should know when it might be wrong\. Calibrated uncertainty quantification \(UQ\) captures this requirement: a model’s confidence should reflect its likelihood of being correct\. Two equally accurate models can nevertheless differ substantially in reliability\. If a model assigns low confidence to most of its errors, uncertain predictions can be deferred, abstained from, or escalated for human review\. By contrast, when confidence is poorly aligned with correctness, high\-confidence errors become indistinguishable from reliable predictions\. In clinical triage, the first model would set its uncertain cases aside for a clinician to review; the second would queue a life\-threatening case behind a routine one, both stamped high\-risk with equally strong confidence\. Calibrated uncertainty therefore serves a triage function, enabling uncertainty to guide which predictions warrant intervention or verification\.

The default approach to UQ in discrete choice settings is the Maximum Softmax Probability (MSP) (Vashurin et al., [2025](https://arxiv.org/html/2605.22864#bib.bib15); Dakhmouche et al., [2025](https://arxiv.org/html/2605.22864#bib.bib38)), which uses the predicted token's softmax probability as a confidence score. MSP incurs no additional computational cost and is often surprisingly competitive. However, it may inherit the well-known pathology of miscalibration: confidence scores that fail to reflect the true likelihood of correctness, often remaining high even when the prediction is wrong (Guo et al., [2017](https://arxiv.org/html/2605.22864#bib.bib2)). A parallel line of work reads from the model's activations directly, on the premise that the internal computation leading to a generation carries information about its truthfulness that the output distribution alone does not. Azaria and Mitchell ([2023](https://arxiv.org/html/2605.22864#bib.bib33)) demonstrated that a simple classifier trained on hidden activations can predict whether an LLM's answer is truthful, and a growing body of work has since traced the contours of this "geometry of truth" (Li et al., [2023b](https://arxiv.org/html/2605.22864#bib.bib22); Marks and Tegmark, [2023](https://arxiv.org/html/2605.22864#bib.bib20); Dakhmouche et al., [2025](https://arxiv.org/html/2605.22864#bib.bib38); Liu et al., [2024](https://arxiv.org/html/2605.22864#bib.bib34); Beigi et al., [2024](https://arxiv.org/html/2605.22864#bib.bib18); Azizian et al., [2025](https://arxiv.org/html/2605.22864#bib.bib21)).

Yet across this literature, activations are typically read as static snapshots – a hidden state extracted from one layer, or averaged across layers, and analyzed for the information it contains\. This discards the layer\-wise trajectory by which the representation is formed\. A final hidden state is the endpoint of a path through representation space, and similar endpoints may arise from qualitatively different trajectories\. Some representations may develop steadily, others may emerge late, fluctuate, or be partially reversed before settling\. These trajectories are not incidental to the prediction as they encode how evidence is accumulated, reinforced, attenuated, or revised across depth\. We therefore read activations not only as states, but as representational trajectories; we show that their geometry reveals uncertainty that final probabilities alone may obscure\.

We calibrate by tracing the answer-position residual stream as the cumulative path induced by per-layer MLP write-vectors during the forward pass. We summarize this trajectory by computing scale-invariant geometric features and feed them to a sparse linear probe. Confidence calibration is evaluated under selective abstention via AURC (Geifman et al., [2018](https://arxiv.org/html/2605.22864#bib.bib44)). This leads to an interpretable UQ method that goes beyond single-point evaluation (MSP) while avoiding the opacity of dense probes on raw activations.

To summarize, our main contributions are as follows:

1. We propose a compact set of geometric features that describe how representations evolve across network depth, and feed them to a sparse linear probe that outperforms MSP under selective abstention, with gains scaling with the baseline's miscalibration.
2. We examine the end-to-end interpretability of the probe: each feature has a closed-form geometric meaning, and its coefficients reveal not only *whether* the model is likely to err but also *how*: which layers commit prematurely, which contradict the running state, and where trajectories drift away from their endpoint. We show that correct and wrong predictions sharing the same MSP score leave different trajectory signatures, exposing information otherwise flattened by the output distribution.
3. We perform a comprehensive set of empirical experiments across 9 instruction-tuned LLMs from three model families (Qwen, Llama, DeepSeek) spanning 3B to 72B parameters on five representative natural language processing tasks.

Table 1: Eleven per-layer trajectory features grouped by what they measure. Symbols are defined in the text. Features involving $m_{\ell-1}$ or $s_{\ell-1}$ are undefined at $\ell=1$ and assigned conventional values ($1$ for *Consecutive cosine*, $0$ for *Curvature* and *Update-state alignment*). †*Signed final support* is numerically identical to *Update to final*; both rows are retained because they correspond to distinct geometric formulations (the L1 probe is invariant to this).

## 2 Background

### 2.1 Uncertainty Quantification

Predictive uncertainty is commonly decomposed into two components: epistemic uncertainty, which stems from limited data or model misspecification and is in principle reducible, and aleatoric uncertainty, which reflects the intrinsic stochasticity of the data-generating process and is therefore irreducible (Kendall and Gal, [2017](https://arxiv.org/html/2605.22864#bib.bib35)). A related dichotomy has emerged in the LLM literature: factual uncertainty pertains to the correctness of generated content with respect to ground-truth knowledge, whereas semantic uncertainty arises from the multiplicity of valid continuations admitted by a prompt; the former is epistemic in nature, the latter aleatoric (Liu et al., [2025](https://arxiv.org/html/2605.22864#bib.bib13)). As factual uncertainty is the primary concern in automated decision-making (Dakhmouche et al., [2025](https://arxiv.org/html/2605.22864#bib.bib38)), we restrict our analysis to this component. We employ multiple-choice questions, which act as a noise-reduction mechanism that isolates factual gaps by neutralizing the semantic uncertainty found in open-ended text (Li et al., [2026](https://arxiv.org/html/2605.22864#bib.bib28)).

UQ methods aim to associate each prediction with a scalar score indicative of its correctness, enabling downstream decisions such as abstention, deferral, or selective generation. Existing approaches for LLM uncertainty estimation fall into three groups that differ primarily in computational cost. Single-sample methods derive uncertainty from a single forward pass, using signals such as maximum token log-probability (Manakul et al., [2023](https://arxiv.org/html/2605.22864#bib.bib8)), perplexity (Margatina et al., [2023](https://arxiv.org/html/2605.22864#bib.bib23)), and entropy (Kadavath et al., [2022](https://arxiv.org/html/2605.22864#bib.bib36); Kuhn et al., [2023](https://arxiv.org/html/2605.22864#bib.bib14)). Multi-sample methods aggregate signals across multiple generations, scoring uncertainty by their consistency, similarity, or variability. Representative examples include semantic (Farquhar et al., [2024](https://arxiv.org/html/2605.22864#bib.bib37)) and predictive (Kadavath et al., [2022](https://arxiv.org/html/2605.22864#bib.bib36)) entropy, conformal prediction (Kumar et al., [2023](https://arxiv.org/html/2605.22864#bib.bib12)), and pairwise similarity methods (Lin et al., [2023](https://arxiv.org/html/2605.22864#bib.bib11)). Probing-based methods instead train lightweight predictors over internal activations to infer uncertainty directly from the model's hidden representations (Azaria and Mitchell, [2023](https://arxiv.org/html/2605.22864#bib.bib33); Dakhmouche et al., [2025](https://arxiv.org/html/2605.22864#bib.bib38); Liu et al., [2024](https://arxiv.org/html/2605.22864#bib.bib34)).

### 2.2 Selective Classification

Selective classification (Chow, [1970](https://arxiv.org/html/2605.22864#bib.bib40); Geifman and El-Yaniv, [2017](https://arxiv.org/html/2605.22864#bib.bib39)) augments a predictor with the option to abstain, trading coverage for reduced error on the retained inputs. Formally, a selective classifier is a pair $(f, g)$, where $f: \mathcal{X} \to \mathcal{Y}$ is a predictor and $g: \mathcal{X} \to \{0, 1\}$ is a gating function: the prediction $f(x)$ is returned when $g(x) = 1$ and withheld otherwise. In practice, $g$ is induced by thresholding a confidence score $\kappa: \mathcal{X} \to \mathbb{R}$, so that performance reduces to the quality of $\kappa$ as a ranking of inputs by likely correctness.

Two quantities characterize such a classifier: *coverage*, the fraction of inputs on which $f$ commits to a prediction, and *selective risk*, the average loss on those inputs. Varying the threshold on $\kappa$ traces the risk-coverage curve (Geifman et al., [2019](https://arxiv.org/html/2605.22864#bib.bib32)), whose area (AURC) summarizes performance across all operating points. Hence, selective classification provides a natural testbed for uncertainty estimators: a reliable confidence signal should yield low selective risk across coverage levels (Ding et al., [2020](https://arxiv.org/html/2605.22864#bib.bib30)).

## 3 Methodology

We capture a language model's uncertainty by considering trajectory information during sequential layer-wise processing. At each layer in a transformer architecture, the MLP contributes to the residual stream. The cumulative sum of these contributions forms a trajectory through representation space. We hypothesize that the geometry of these trajectories, including the distribution of update magnitudes across depth, local changes in direction, and the efficiency with which the path approaches its endpoint, contains information about uncertainty (Figure [1](https://arxiv.org/html/2605.22864#S3.F1)). We summarize trajectory geometry using eleven scalar descriptors, which we combine with the model's MSP in a sparse linear model.

The remainder of this section describes the representational substrate (§[3.1](https://arxiv.org/html/2605.22864#S3.SS1)), the extracted trajectory features (§[3.2](https://arxiv.org/html/2605.22864#S3.SS2)), and the sparse linear probe (§[3.3](https://arxiv.org/html/2605.22864#S3.SS3)). Code to reproduce the experiments is available at [https://anonymous.4open.science/r/uq-motion-66CC/](https://anonymous.4open.science/r/uq-motion-66CC/). All experiments were conducted on a server equipped with an NVIDIA H100 NVL GPU.

### 3.1 Setup and Substrate

We consider finite-choice classification with a candidate set $\mathcal{Y}$, $|\mathcal{Y}| = K$. Given a prompt $x \in \mathcal{X}$, the model $f_{\theta}$ induces a distribution $p_{\theta}(y \mid x)$ over $\mathcal{Y}$ by applying a softmax to the next-token logits of the $K$ candidate-identifying tokens. From a single forward pass, we estimate an uncertainty score $u: \mathcal{X} \to [0, 1]$ corresponding to the probability of error. As the baseline, we use the canonical MSP-based score $1 - m(x)$, where $m(x)$ is the probability assigned to the predicted token (Hendrycks and Gimpel, [2016](https://arxiv.org/html/2605.22864#bib.bib24)).
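The baseline score can be illustrated with a minimal sketch. It assumes the $K$ candidate logits have already been gathered from the model's next-token logits; the function names and the example logits are hypothetical and do not come from the paper's code.

```python
import math


def candidate_softmax(logits):
    """Softmax restricted to the K candidate-identifying tokens."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_uncertainty(candidate_logits):
    """Return (predicted index, MSP m(x), baseline uncertainty 1 - m(x))."""
    probs = candidate_softmax(candidate_logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    return pred, probs[pred], 1.0 - probs[pred]


# Hypothetical logits for the four options A-D.
example_logits = [2.0, 0.5, 0.1, -1.0]
pred, msp, uncertainty = msp_uncertainty(example_logits)
```

Restricting the softmax to the candidate tokens means $m(x)$ is a probability over the $K$ options rather than over the full vocabulary.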

As the basis for $u$ we extract a layer-indexed sequence of MLP residual updates at the final prompt position, the readout position used to compute the next-token logits over $\mathcal{Y}$. Each transformer block $\ell$ contains a multi-layer perceptron (MLP) sub-layer that writes a contribution $m_{\ell}(x) \in \mathbb{R}^{H}$ to the residual stream, where $H$ is the model's hidden dimension. We capture $m_{\ell}(x)$ for each block $\ell = 1, \dots, L$ via forward hooks placed on the MLP sub-module, recording its output at the final token of the prompt.

We focus on MLP write-vectors because prior mechanistic interpretability work identifies transformer MLPs as important sites for factual knowledge storage and recall (Geva et al., [2021](https://arxiv.org/html/2605.22864#bib.bib25); Meng et al., [2022](https://arxiv.org/html/2605.22864#bib.bib26); Yu et al., [2024](https://arxiv.org/html/2605.22864#bib.bib27)). This motivates using MLP write-vectors as the unit of analysis for studying how predictions are assembled across layers. We also consider the partial sums

$$
s_{\ell}(x) = \sum_{k \leq \ell} m_{k}(x), \tag{1}
$$

where $m_{k}(x)$ is the layer-$k$ MLP write-vector. The sequence $\{s_{\ell}(x)\}_{\ell=1}^{L}$ traces a discrete trajectory in residual-stream space, with $s_{L}(x)$ the total MLP-driven displacement over the forward pass and $\hat{u}(x) = s_{L}(x) / \|s_{L}(x)\|$ its unit direction. We write $\bar{n} = L^{-1} \sum_{k} \|m_{k}\|$ for the mean update norm and $T = \sum_{k} \|m_{k}\|$ for the total path length.
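The quantities in Eq. (1) can be sketched in a few lines of plain Python. The sketch assumes the write-vectors have already been captured and are given as lists of floats; `trajectory_summary` and the toy two-dimensional input are illustrative names and values, not part of the released code.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def trajectory_summary(updates):
    """updates: per-layer MLP write-vectors m_1..m_L.

    Returns the partial sums s_1..s_L, the unit direction u_hat of s_L,
    the mean update norm n_bar, and the total path length T.
    Assumes s_L is non-zero.
    """
    running = [0.0] * len(updates[0])
    partial_sums = []
    for m in updates:
        running = [r + c for r, c in zip(running, m)]
        partial_sums.append(running)
    s_final = partial_sums[-1]
    s_norm = norm(s_final)
    u_hat = [c / s_norm for c in s_final]
    path_length = sum(norm(m) for m in updates)
    mean_norm = path_length / len(updates)
    return partial_sums, u_hat, mean_norm, path_length


# Toy trajectory with L = 2 layers in a 2-dimensional space.
updates = [[3.0, 0.0], [0.0, 4.0]]
partial_sums, u_hat, mean_norm, path_length = trajectory_summary(updates)
```

For this toy input the net displacement has norm 5 while the path length is 7, which is the gap between endpoint and path that the trajectory features are designed to measure.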

### 3.2 Trajectory Features

We describe each trajectory $(m_{\ell}, s_{\ell})_{\ell=1}^{L}$ via eleven layer-wise scalar features (Table [1](https://arxiv.org/html/2605.22864#S1)). The features are scale-invariant, enabling comparison across models with different hidden dimensions, and fall into four groups of geometric descriptors:

1. (G1) Depth allocation: how computational effort is distributed across layers;
2. (G2) Local shape: the local geometry of the trajectory;
3. (G3) Endpoint alignment: how each layer relates to the trajectory's endpoint; and
4. (G4) Trajectory efficiency: how directly the trajectory reaches its endpoint.

G1 captures the depth distribution of update magnitude. The *relative update magnitude* flags layers whose contribution is disproportionately large relative to the layer-wise average, while the *cumulative path fraction* reveals whether the model front- or back-loads its computation.

G2 characterizes local trajectory shape. The *consecutive cosine* measures the directional consistency of successive updates, with high values indicating a smooth path through representation space, while the *curvature* highlights abrupt changes of direction. The *update–state alignment* distinguishes layers that reinforce prior computation from those that contradict it.

G3 relates each layer to the trajectory's endpoint. Two cosines track convergence: the *direction to final* measures whether the cumulative state aligns with the endpoint, and the *update to final* measures whether individual updates point toward it. The *signed final support* quantifies an update's alignment with the final direction; its negative part, the *contradictory support*, isolates layers that actively oppose the endpoint, while the *orthogonal mass fraction* captures the share of an update's magnitude orthogonal to the final direction.

Finally, G4 measures efficiency via the *cumulative coherence*, the ratio of net displacement to total path length: it equals one for a perfectly straight trajectory and approaches zero when updates cancel rather than accumulate.
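A rough sketch of seven of the eleven features follows. The exact formulas belong to Table 1, which is not reproduced here, so every expression below is an assumption reconstructed from the prose descriptions (for example, relative update magnitude as $\|m_{\ell}\| / \bar{n}$ and cumulative coherence as $\|s_{\ell}\| / \sum_{k \leq \ell} \|m_k\|$). Function and key names are hypothetical, and updates are assumed to be non-zero.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b, default=0.0):
    """Cosine similarity; returns `default` if either vector is zero."""
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        return default
    return dot(a, b) / (na * nb)


def trajectory_features(updates):
    """Per-layer features from write-vectors m_1..m_L (assumed formulas)."""
    norms = [norm(m) for m in updates]
    total = sum(norms)                 # total path length T
    mean_norm = total / len(updates)   # mean update norm n_bar
    sums, running = [], [0.0] * len(updates[0])
    for m in updates:
        running = [r + c for r, c in zip(running, m)]
        sums.append(running)
    s_final = sums[-1]
    rows, walked = [], 0.0
    for i, (m, s) in enumerate(zip(updates, sums)):
        walked += norms[i]             # path length up to layer i
        rows.append({
            "relative_update_magnitude": norms[i] / mean_norm,
            "cumulative_path_fraction": walked / total,
            # Conventional values at the first layer, as stated for Table 1.
            "consecutive_cosine": 1.0 if i == 0 else cosine(m, updates[i - 1]),
            "update_state_alignment": 0.0 if i == 0 else cosine(m, sums[i - 1]),
            "direction_to_final": cosine(s, s_final),
            "update_to_final": cosine(m, s_final),
            "cumulative_coherence": norm(s) / walked,
        })
    return rows


straight = trajectory_features([[1.0, 0.0], [2.0, 0.0], [1.0, 0.0]])
zigzag = trajectory_features([[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]])
```

Under these assumed definitions every feature is a ratio of norms or a cosine, so rescaling all updates by a positive constant leaves the rows unchanged, which is the scale invariance the section relies on.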

### 3.3 Sparse Linear Probe

We model the uncertainty score via sparse logistic regression:

$$
s(x) = \sigma\!\left(w^{\top} z(x) + b\right), \tag{2}
$$

$$
z(x) = \bigl[\, \varphi(x),\; p_{\mathrm{msp}}(x)\, \varphi(x) \,\bigr], \tag{3}
$$

where $\varphi(x) \in \mathbb{R}^{L \times 11}$ stacks the layer-wise features of Section [3.2](https://arxiv.org/html/2605.22864#S3.SS2) and $p_{\mathrm{msp}}(x)$ is the MSP of the model's prediction. The interaction term lets the probe weight each feature by confidence: curvature under a high-confidence prediction is a stronger error signal than the same curvature when the model is already uncertain. We minimize class-balanced binary cross-entropy against the error indicator of Section [3.1](https://arxiv.org/html/2605.22864#S3.SS1) under an elastic-net penalty (Zou and Hastie, [2005](https://arxiv.org/html/2605.22864#bib.bib31)); sparsity in the fitted $w$ then identifies the predictive features directly, with no separate selection stage.
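Equations (2) and (3) can be sketched as follows, assuming $\varphi(x)$ has been flattened to a plain list and that the weight vector stores the main-effect block followed by the interaction block. The weights, bias, and feature values are made up for illustration; fitting with the elastic-net penalty is not shown.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def design_vector(phi, p_msp):
    """z(x) = [phi(x), p_msp(x) * phi(x)] with phi flattened to a list."""
    return list(phi) + [p_msp * f for f in phi]


def probe_score(phi, p_msp, w, b):
    """Predicted error probability s(x) = sigmoid(w^T z(x) + b)."""
    z = design_vector(phi, p_msp)
    return sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)


def effective_coefficients(w, p_msp):
    """Per-feature weight at a fixed confidence: w_main + p_msp * w_inter."""
    d = len(w) // 2
    return [w[j] + p_msp * w[d + j] for j in range(d)]


# Hypothetical two-feature example (a real phi has 11 * L entries).
phi = [0.5, -1.0]
p_msp = 0.9
w = [0.2, 0.0, 1.0, -0.5]   # [main effects | MSP interactions]
b = -0.1
score = probe_score(phi, p_msp, w, b)
```

Because $z(x)$ is linear in $\varphi(x)$ once $p_{\mathrm{msp}}(x)$ is fixed, the main and interaction weights collapse into one effective coefficient per feature, which gives a confidence-dependent reading of each weight.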

Each dataset is split 65/15/20 into train, validation, and test folds, stratified by the error indicator so that the error rate is matched across folds. We sweep a 64-point grid over the regularization strength $C$ and $\ell_{1}$ mixing ratio $\rho$ (Appendix [B](https://arxiv.org/html/2605.22864#A2)), selecting the setting with lowest validation AURC. The selected hyperparameters are refit on train+validation before evaluation on the held-out test fold.
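One way to realize the stratified split is sketched below, under the assumption that each class of the error indicator is shuffled and partitioned independently. The function name, the seed, and the toy labels are hypothetical, and the hyperparameter sweep itself is not shown.

```python
import random


def stratified_split(labels, fractions=(0.65, 0.15, 0.20), seed=0):
    """Split indices into train/validation/test folds, stratified by label.

    Each label value is shuffled and partitioned separately so that the
    label rate is approximately matched across folds.
    """
    rng = random.Random(seed)
    train, val, test = [], [], []
    for value in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == value]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])  # remainder goes to the test fold
    return train, val, test


# Toy error indicator with a 20% error rate (1 = wrong, 0 = correct).
labels = [1] * 200 + [0] * 800
train, val, test = stratified_split(labels)
```

Stratifying on the error indicator rather than on the answer label keeps the positive-class rate of the probe's target stable, which matters when errors are rare.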

![Refer to caption](https://arxiv.org/html/2605.22864v1/x1.png)

Figure 1: Cumulative MLP write-vectors traced layer-wise. The trajectory geometry separates the two populations. Correct trajectories converge to $\hat{u}$; errored trajectories curve, double back, and drift.

## 4 Experimental Setup

#### Models\.

We conduct the experiments using 9 instruction-tuned LLMs from three distinct model families, spanning scales from 3B to 72B parameters: Qwen (Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct, Qwen2-72B-Instruct, Qwen2.5-72B-Instruct) (Yang et al., [2024](https://arxiv.org/html/2605.22864#bib.bib1)), Llama (Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, Llama-3.3-70B-Instruct) (Grattafiori et al., [2024](https://arxiv.org/html/2605.22864#bib.bib3)), and DeepSeek (deepseek-llm-7b-chat, deepseek-llm-67b-chat) (Bi et al., [2024](https://arxiv.org/html/2605.22864#bib.bib4)).

#### Datasets\.

We use five benchmark datasets adapted from Ye et al. ([2024](https://arxiv.org/html/2605.22864#bib.bib6)) spanning question answering (MMLU; Hendrycks et al., [2020](https://arxiv.org/html/2605.22864#bib.bib7)), reading comprehension (CosmosQA; Huang et al., [2019](https://arxiv.org/html/2605.22864#bib.bib41)), commonsense inference (HellaSwag; Zellers et al., [2019](https://arxiv.org/html/2605.22864#bib.bib42)), dialogue response selection (HaluDial; Li et al., [2023a](https://arxiv.org/html/2605.22864#bib.bib43)), and document summarization (HaluSum; Li et al., [2023a](https://arxiv.org/html/2605.22864#bib.bib43)). Each dataset contains 10,000 instances formatted as four-option (A–D) multiple-choice questions in a zero-shot setting, using a base prompt that presents the question and options directly with the prefix `Answer:`. The predicted answer is the option with maximum softmax probability (MSP), which also serves as our primary baseline for uncertainty estimation.

#### Metrics\.

We evaluate the quality of our probe's uncertainty profile using AURC (Geifman et al., [2019](https://arxiv.org/html/2605.22864#bib.bib32)). In essence, a reliable uncertainty profile should allow the model to abstain on observations it is likely to get wrong, so that errors decrease as we restrict predictions to the most confident observations.

Let $f$ be a predictor with a confidence function $\kappa$, evaluated on $\{(x_{i}, y_{i})\}_{i=1}^{n}$ under the $0/1$ loss $\ell$, and re-index the samples so that $\kappa(x_{1}) \geq \dots \geq \kappa(x_{n})$. The selective risk at coverage $c \in (0, 1]$ is the average loss on the top $\lceil cn \rceil$ samples, defined as

$$
R(c) = \frac{1}{\lceil cn \rceil} \sum_{i=1}^{\lceil cn \rceil} \ell\bigl(f(x_{i}),\, y_{i}\bigr). \tag{4}
$$

AURC averages $R(c)$ over all coverage levels:

$$
\mathrm{AURC}(f, \kappa) = \frac{1}{n} \sum_{k=1}^{n} R\!\left(\tfrac{k}{n}\right). \tag{5}
$$

Lower AURC indicates that the confidence function more effectively separates correct from incorrect predictions, assigning higher confidence to the former.
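Equations (4) and (5) translate directly into a running-mean computation. The sketch below assumes per-sample $0/1$ losses are already available and breaks confidence ties by input order, a choice the definitions above leave open; `aurc` is an illustrative name.

```python
def aurc(confidences, errors):
    """Area under the risk-coverage curve, following Eqs. (4) and (5).

    confidences: kappa(x_i), higher means more confident.
    errors: 0/1 losses, 1 if the prediction is wrong.
    Ties in confidence keep their input order (an assumption).
    """
    n = len(errors)
    order = sorted(range(n), key=lambda i: -confidences[i])
    running_errors = 0
    total_risk = 0.0
    for k, i in enumerate(order, start=1):
        running_errors += errors[i]
        total_risk += running_errors / k   # R(k / n): mean loss on the top k
    return total_risk / n


confidences = [0.9, 0.8, 0.7, 0.6]
well_ranked = aurc(confidences, [0, 0, 1, 1])    # errors get low confidence
badly_ranked = aurc(confidences, [1, 1, 0, 0])   # errors get high confidence
```

Both toy cases have the same 50% error rate, so the difference in AURC (5/24 against 19/24) comes entirely from the ordering induced by the confidence score. The tables report this quantity multiplied by 100.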

#### Relation to calibration\.

Expected Calibration Error (ECE) (Naeini et al., [2015](https://arxiv.org/html/2605.22864#bib.bib10)) assesses whether confidence values match empirical accuracy, but calibration and selective performance are distinct (Ding et al., [2020](https://arxiv.org/html/2605.22864#bib.bib30)). A constant score equal to the empirical accuracy can achieve ECE $= 0$ while providing no useful ranking for abstention. Conversely, a score can induce a near-optimal ranking for selective prediction while being numerically miscalibrated. Since AURC directly evaluates the risk–coverage trade-off determined by the ordering induced by $\kappa$, we use it as our primary metric.
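The distinction can be checked on a toy example. The sketch assumes an equal-width binned ECE and the tie-breaking-by-input-order AURC used for illustration; the four-sample data are invented, and with a constant score the ordering is arbitrary, so its AURC here simply reflects that the single error happens to come first.

```python
def ece(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    total = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(y for _, y in members) / len(members)
            total += len(members) / len(confidences) * abs(avg_conf - accuracy)
    return total


def aurc(confidences, errors):
    """Mean selective risk over all coverage levels (ties keep input order)."""
    order = sorted(range(len(errors)), key=lambda i: -confidences[i])
    running, total = 0, 0.0
    for k, i in enumerate(order, start=1):
        running += errors[i]
        total += running / k
    return total / len(errors)


correct = [0, 1, 1, 1]                    # 75% accuracy, first sample is wrong
errors = [1 - y for y in correct]
constant = [0.75, 0.75, 0.75, 0.75]       # calibrated but uninformative
ranked = [0.51, 0.52, 0.52, 0.52]         # miscalibrated but well ordered

ece_constant, aurc_constant = ece(constant, correct), aurc(constant, errors)
ece_ranked, aurc_ranked = ece(ranked, correct), aurc(ranked, errors)
```

The constant score has zero ECE yet cannot separate the error from the correct predictions, while the second score is off by more than 0.2 in ECE but places the error last.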

#### Baselines\.

We compare our probe with two baselines and one ablation. The first baseline is maximum softmax probability (MSP), the probability $m(x)$ assigned to the predicted token, which we convert to the uncertainty score $1 - m(x)$. MSP requires no auxiliary computation beyond the model forward pass and is a strong standard baseline for confidence estimation (Vashurin et al., [2025](https://arxiv.org/html/2605.22864#bib.bib15); Dakhmouche et al., [2025](https://arxiv.org/html/2605.22864#bib.bib38)). The second baseline is a raw-activation probe in the spirit of Azaria and Mitchell ([2023](https://arxiv.org/html/2605.22864#bib.bib33)), extended to all layers: we replace our eleven trajectory features with the per-layer hidden states and train a layer-weighted MLP on binary error labels. This provides a high-capacity reference for the error signal recoverable from activations, without the interpretability constraints of our feature-based probe. The *trajectory-only* ablation restricts the input to $\varphi(x)$, removing MSP and its interaction terms; its gap to the full probe quantifies the complementarity between trajectory geometry and MSP.

## 5 Results

Table 2: We report AURC ($\times 100$; lower is better) for our method, together with the absolute reduction $\Delta = \mathrm{MSP} - \mathrm{Ours}$ relative to the MSP baseline. Positive $\Delta$ ($\blacktriangle$) denotes improvement; negative $\Delta$ ($\blacktriangledown$) denotes regression. Full comparisons against the Ceiling and Ablation experiments are reported in Appendix [A](https://arxiv.org/html/2605.22864#A1).

![Refer to caption](https://arxiv.org/html/2605.22864v1/pics/risk_coverage_qwen2.5_72b.png)

Figure 2: Risk-coverage curves for Qwen2.5-72B across the five evaluation datasets. An ideal curve is monotonically non-decreasing in coverage: when the least confident predictions are rejected first, risk approaches zero at low coverage and rises to the base error rate at full coverage. The MSP baseline is shown in dashed blue and our probe in solid green; the shaded region between them indicates the probe's gain.

![Refer to caption](https://arxiv.org/html/2605.22864v1/pics/probe_auroc_by_conf_bin.png)

Figure 3: Probe and MSP AUROC across confidence bins, aggregated over models. Lines show mean AUROC, bands the interquartile range, and dots per-model values. Green shading indicates bins where the probe outperforms MSP.

We compare the probe with the baselines and ablation from Section [4](https://arxiv.org/html/2605.22864#S4), reporting headline results in Table [2](https://arxiv.org/html/2605.22864#S5.T2) and full results in Appendix Table [3](https://arxiv.org/html/2605.22864#A1.T3). We then assess its uncertainty profile and the geometric signals it uses across depth.

![Refer to caption](https://arxiv.org/html/2605.22864v1/pics/zfeat_pair.png)

Figure 4: Two Qwen2.5-14B predictions on MMLU at MSP $\approx 0.97$, one correct and one incorrect. Layer-wise z-scores of the eleven trajectory features across normalized depth for the correct prediction (top) and the incorrect prediction (bottom).

#### Probe performance.

Our probe improves over MSP on 41 of 45 model–dataset pairs, with gains largest where MSP performs worst (Spearman $\rho = 0.78$). For the five configurations with MSP AURC above 40, it reduces AURC by 11.45–21.83 points, with the largest reductions on Llama-3.2-3B MMLU, Llama-3.1-8B HaluSum, and Llama-3.2-3B HaluSum. The probe also matches or outperforms the raw-activation reference on 24 configurations and is within 2 AURC points on another 6, despite using only eleven scalar features per layer instead of full hidden states. Finally, the *trajectory-only* ablation outperforms MSP on 35 configurations, showing that trajectory geometry is independently informative, while the full probe further benefits from MSP as a complementary signal.

Figure [2](https://arxiv.org/html/2605.22864#S5.F2) dissects these AURC gains into risk-coverage curves for Qwen2.5-72B. The probe (solid green) lies at or below MSP (dashed blue) throughout the coverage range, with the largest gap on HaluDial and HaluSum. MSP exhibits an unstable low-coverage spike on every dataset, whereas the probe rises smoothly from zero and sustains a low-risk regime up to roughly 50–60% coverage.

Figure [3](https://arxiv.org/html/2605.22864#S5.F3) stratifies AUROC by MSP confidence. Across all five datasets, MSP performs close to random-chance levels (AUROC $\approx 0.5$–$0.6$) outside its highest-confidence bin. In contrast, the probe consistently outperforms MSP across nearly all bins, with the largest improvements emerging in the mid- and high-confidence regimes for HaluSum, HaluDial, HellaSwag, and MMLU.

To see whether the probe finds geometric signal at fixed MSP, Figure [4](https://arxiv.org/html/2605.22864#S5.F4) compares two Qwen2.5-14B predictions on MMLU with MSP $\approx 0.97$, one correct and one incorrect, via the eleven trajectory features z-scored against the per-layer population. In early layers (depth $0$–$0.3$), the incorrect prediction has already aligned with its final direction (*Direction to final*, *Update to final*, and *Signed final support* saturated at $z \approx +3$), while the correct prediction shows none of this early alignment. The correct prediction also front-loads its computation (*Cumulative path fraction* above the population mean at $0$–$0.5$), whereas the incorrect prediction back-loads its path length and traces an unusually straight cumulative path in the second half (*Cumulative coherence* positive at depths $0.55$–$0.85$).

#### Probe interpretability\.

We examine what signal the probe uses by decomposing its total coefficient mass across the four feature groups in Table[1](https://arxiv.org/html/2605.22864#S1), separating mass assigned to raw trajectory featuresφ​\(x\)\\varphi\(x\)from that assigned to their MSP interactions,pmsp⋅φ​\(x\)p\_\{\\mathrm\{msp\}\}\\cdot\\varphi\(x\)\. Figure[5](https://arxiv.org/html/2605.22864#S5.F5)reports this decomposition by model, aggregated across datasets since the pattern is stable; solid bars indicate trajectory features and hatched bars indicate interactions\. The two blocks contribute roughly equally overall\. Trajectory features carry a larger share for Llama\-3\.3\-70B, Qwen2\.5\-14B, Qwen2\-72B, and Qwen2\.5\-72B, whereas DeepSeek\-7B and Llama\-3\.1\-8B assign relatively less mass to depth\-distribution features\. Depth distribution is also the only feature group whose mass lies mainly on the raw features rather than on MSP interactions; the remaining groups split their mass more evenly\.

Figure[6](https://arxiv.org/html/2605.22864#S5.F6)maps the median probe coefficient at each \(feature, depth\) cell for Qwen2\.5\-14B on HaluSum and CosmosQA, with direct effectsφ​\(x\)\\varphi\(x\)on the left and MSP interactions on the right\. The two datasets show distinct signatures\. On HaluSum, signal concentrates in the final quarter \(depth≳0\.75\\gtrsim 0\.75\): errors carry a large, smooth late update\-positive weight on*Relative update magnitude*,*Consecutive cosine*, and*Cumulative path fraction*near the output\-counterbalanced by negative*Update\-state alignment*and*Cumulative coherence*, indicating that updates aligned with the running state and efficient cumulative paths predict correctness rather than error\.*Direction to final*is also negative at depth0\.50\.5–0\.60\.6, so trajectories that align with their endpoint earlier are less error\-prone\. HaluSum errors reflect premature commitment followed by a late, state\-breaking correction\. CosmosQA shows no such localization: coefficients are smaller and spread across depth, the strongest cell at mid\-depth \(*Orthogonal mass fraction*at0\.50\.5–0\.60\.6\), so error trajectories drift sideways relative to the endpoint\. Endpoint\-alignment features \(*Update to final*,*Signed final support*,*Direction to final*\) light up in the MSP\-interaction panel near the output, predicting error only when weighted by confidence\.

![Refer to caption](https://arxiv.org/html/2605.22864v1/x2.png)
Figure 5: Probe coefficient composition across models, showing coefficient mass by geometric feature family \(color\), split into main effects \(solid\) and MSP $\times$ trajectory interactions \(hatched\)\.

![Refer to caption](https://arxiv.org/html/2605.22864v1/pics/fig_depth_profile_qwen14b_two_datasets.png)
Figure 6: Median probe coefficients per \(feature, depth\-bin\) cell for Qwen2\.5\-14B on HaluSum \(top\) and CosmosQA \(bottom\)\. Left: direct effects; right: MSP interactions\. Pink increases predicted error probability, green decreases it\. Values are normalized by the dataset\-specific 95th\-percentile magnitude\.

The coefficient maps in Figure [6](https://arxiv.org/html/2605.22864#S5.F6) describe the probe’s learned structure; to see it deployed, we turn to the pairs on which MSP confidence is silent\. Within the top 30% most\-confident Llama\-3\.2\-3B\-Instruct predictions, we match each probe\-flagged error to a probe\-cleared non\-error with MSP agreeing within 0\.02\. For each \(feature, depth\) cell, we multiply the feature\-value difference between the paired examples by the probe’s effective coefficient at the shared confidence \(so both trajectory and MSP\-interaction blocks contribute\), then average absolute values across pairs\. A cell is bright only when trajectories diverge there *and* the probe weights it, localizing signal MSP alone misses\. On both datasets the signal concentrates in the second half of the network, with *Relative update magnitude* dominant – engaging from depth $\sim 0.5$ on HaluSum but only near depth $1.0$ on HaluDial\. HaluSum additionally leans on *Update\-state alignment* near the output, while HaluDial shifts to endpoint\-alignment channels \(*Update to final*, *Signed final support*\) at depth $0.7$–$0.85$\.
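The paired attribution just described can be sketched as follows. This is an illustrative reading under stated assumptions rather than the authors' implementation: the greedy nearest-MSP matching, the use of the pair's mean MSP as the "shared confidence", and the effective coefficient $\beta_j + p\,\gamma_j$ (main-effect weight plus confidence times interaction weight) are all assumptions, and the record layout (`msp`, `phi`) is hypothetical.

```python
def match_pairs(errors, non_errors, tol=0.02):
    """Greedily pair each flagged error with the closest unused cleared non-error.

    A pair is formed only if the two MSP values agree within `tol`.
    Greedy matching is an assumption; the text only states the tolerance.
    """
    used = set()
    pairs = []
    for err in errors:
        best, best_gap = None, tol
        for idx, ok in enumerate(non_errors):
            gap = abs(err["msp"] - ok["msp"])
            if idx not in used and gap <= best_gap:
                best, best_gap = idx, gap
        if best is not None:
            used.add(best)
            pairs.append((err, non_errors[best]))
    return pairs


def attribution_map(pairs, beta, gamma):
    """Mean |feature difference * effective coefficient| per cell over pairs.

    `beta` holds main-effect coefficients and `gamma` the MSP-interaction
    coefficients, one entry per (feature, depth) cell, flattened.
    """
    totals = [0.0] * len(beta)
    for err, ok in pairs:
        p = 0.5 * (err["msp"] + ok["msp"])  # shared confidence (assumed: pair mean)
        for j in range(len(beta)):
            effective = beta[j] + p * gamma[j]
            totals[j] += abs((err["phi"][j] - ok["phi"][j]) * effective)
    return [t / len(pairs) for t in totals] if pairs else totals


# Hypothetical toy data with two cells.
errors = [{"msp": 0.90, "phi": [1.0, 0.2]}]
non_errors = [{"msp": 0.91, "phi": [0.5, 0.2]}, {"msp": 0.99, "phi": [0.0, 0.0]}]
pairs = match_pairs(errors, non_errors)
amap = attribution_map(pairs, beta=[0.4, 1.0], gamma=[1.0, -0.5])
```

A cell's attribution is zero whenever the paired trajectories agree there or the effective coefficient vanishes, which is why only cells that both diverge and carry probe weight stand out.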

![Refer to caption](https://arxiv.org/html/2605.22864v1/pics/fig_attribution_map.png)
Figure 7: Aggregate attribution maps for Llama\-3\.2\-3B\-Instruct on HaluSum \(left\) and HaluDial \(right\), averaged over the 100 MSP\-matched probe\-correct pairs with the largest error\-probability gaps\. Maps share a normalized\-depth axis and color scale\.

## 6 Discussion

We find that the trajectory of a model’s layer\-wise computation reveals when its answer should be trusted\. Correct predictions and confident errors can be indistinguishable in the output distribution yet diverge in the geometry that produced them, and a few interpretable descriptors of that geometry suffice to recover the distinction\. A sparse linear probe trained on them improves over MSP across nearly all configurations, with gains largest where MSP is most miscalibrated; on more than half of those configurations, the eleven scalar features match a probe given the full hidden states\. The resulting selective prediction is well\-behaved: the probe traces a smooth, monotone risk\-coverage curve, sustaining low risk where MSP tends to spike at low coverage\. Because the features are interpretable by construction, the probe’s coefficients localize *how* a model fails, not just *whether* it does\. On HaluSum, errors take the form of premature endpoint commitment in the first half of the network, followed by a late, oversized MLP write that breaks the running state\. On CosmosQA, errors instead manifest as mid\-depth drift orthogonal to the eventual answer direction\. These mechanisms remain legible even when MSP fails to discriminate: among predictions matched in confidence, correct and erroneous trajectories diverge at specific \(feature, depth\) cells the probe weights, localizing discriminative structure that the output distribution has flattened away\. The signatures, however, are task\-specific, in line with the recent finding that “geometries of truth” are largely orthogonal across tasks \(Azizian et al\., [2025](https://arxiv.org/html/2605.22864#bib.bib21)\)\. Together, these results point to a practical path to trustworthy selective prediction: the features add negligible cost at inference, and their interpretability makes failures auditable rather than opaque\.
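For readers unfamiliar with the risk-coverage view used throughout, the sketch below computes a risk-coverage curve and its area (AURC) in the standard way: examples are ranked by confidence, and the risk at coverage $k/n$ is the error rate among the $k$ most confident predictions. It is a generic illustration, not the paper's evaluation code; for a probe that outputs an error probability, the confidence would be one minus that probability (an assumption about how the scores are plugged in).

```python
def risk_coverage_curve(confidences, errors):
    """Return [(coverage, risk)] after ranking examples by descending confidence.

    `errors[i]` is 1 if prediction i is wrong and 0 otherwise. Ties keep
    their input order because Python's sort is stable.
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    curve = []
    cumulative_errors = 0
    for k, i in enumerate(order, start=1):
        cumulative_errors += errors[i]
        curve.append((k / len(order), cumulative_errors / k))
    return curve


def aurc(confidences, errors):
    """Area under the risk-coverage curve as the mean risk over coverage levels.

    Lower is better. Inputs must be non-empty.
    """
    curve = risk_coverage_curve(confidences, errors)
    return sum(risk for _, risk in curve) / len(curve)


# Toy example: the same errors ranked well versus badly by confidence.
well_ranked = aurc([0.9, 0.8, 0.7, 0.6], [0, 0, 1, 1])
badly_ranked = aurc([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
```

Both rankings end at the same full-coverage risk of 0.5, but the well-ranked scores keep risk at zero at low coverage, which is the behavior a useful abstention signal should show. Multiplying the result by 100 gives values on the scale used in the AURC tables.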

#### Limitations

Our analysis is restricted to discrete\-choice settings, where the prediction is localized to a single answer position\. The probe is also fit per \(model, task\), and preliminary leave\-one\-dataset\-out experiments show gains over MSP on only a subset of configurations\. Identifying trajectory features that transfer across models and tasks remains open\.

## 7 Related Work

Probing assesses what information is linearly encoded in neural network representations \(Alain and Bengio, [2016](https://arxiv.org/html/2605.22864#bib.bib17)\) and language models \(Petroni et al\., [2019](https://arxiv.org/html/2605.22864#bib.bib16); Azaria and Mitchell, [2023](https://arxiv.org/html/2605.22864#bib.bib33)\): a classifier is trained on activations extracted from a frozen model to predict a target property\. Azaria and Mitchell \([2023](https://arxiv.org/html/2605.22864#bib.bib33)\) introduced this approach for UQ in LLMs, showing that internal representations contain reliable signals of output correctness\. Subsequent work on the “geometry of truth” suggests a direction in model internals along which truthful and erroneous outputs are approximately linearly separable \(Azizian et al\., [2025](https://arxiv.org/html/2605.22864#bib.bib21); Li et al\., [2023b](https://arxiv.org/html/2605.22864#bib.bib22); Marks and Tegmark, [2023](https://arxiv.org/html/2605.22864#bib.bib20); Kossen et al\., [2024](https://arxiv.org/html/2605.22864#bib.bib19); Dakhmouche et al\., [2025](https://arxiv.org/html/2605.22864#bib.bib38); Liu et al\., [2024](https://arxiv.org/html/2605.22864#bib.bib34); Beigi et al\., [2024](https://arxiv.org/html/2605.22864#bib.bib18); Burns et al\., [2022](https://arxiv.org/html/2605.22864#bib.bib29)\), differing mainly in how this direction is recovered\. Li et al\. \([2023b](https://arxiv.org/html/2605.22864#bib.bib22)\) train linear probes on individual attention heads for detection and inference\-time steering\. Burns et al\. \([2022](https://arxiv.org/html/2605.22864#bib.bib29)\) propose Contrast Consistent Search \(CCS\), recovering the direction without labels by enforcing consistency between a statement and its negation\. Dakhmouche et al\. \([2025](https://arxiv.org/html/2605.22864#bib.bib38)\) take a Bayesian route, fitting layer\-to\-layer linear maps and combining their posterior log\-likelihoods via sparse regression\.
In contrast, we extract trajectory geometry rather than classifying individual hidden states: eleven interpretable scalar descriptors of how MLP contributions accumulate across depth, fed to a sparse linear probe whose inputs and decision rule are interpretable end to end\.

## 8 Conclusion

We introduced a method for reading uncertainty from the motion of language model computation\. By tracing the answer\-position residual stream as a cumulative path of per\-layer MLP write\-vectors and condensing it through eleven scale\-invariant geometric descriptors, we obtain an uncertainty signal whose inputs and decision rule are inspectable end to end\. Each feature has a direct geometric reading, computed in closed form from the activations with no auxiliary model, and the sparse probe trained on them localizes not only *whether* a model is likely to err but *how* – which layers commit prematurely, which contradict the running state, where trajectories drift away from their endpoint\. Treating activations as motion rather than as snapshots offers a tractable middle path between the opacity of dense probes and the single scalar of MSP, recovering signal from how a prediction is assembled, not only from where it lands\. We hope this work encourages the development of simpler, more interpretable representations for uncertainty quantification in LLMs\.

## Impact Statement

UQ is a core component of trustworthy AI, enabling LLMs to flag predictions that should be deferred, reviewed, or abstained from in high\-stakes settings\. Our method reads uncertainty from the geometry of MLP write\-vectors to the residual stream within a single forward pass, offering an efficient and interpretable alternative to costly sampling\-based or ensemble approaches\. Because each feature has a closed\-form geometric meaning, the resulting confidence signal is auditable end\-to\-end, allowing practitioners to inspect not only whether a model is uncertain but where in its computation that uncertainty arises\.

## References

- G\. Alain and Y\. Bengio \(2016\) Understanding intermediate layers using linear classifier probes\. arXiv preprint arXiv:1610\.01644\. Cited by: [§7](https://arxiv.org/html/2605.22864#S7.p1.1)\.
- A\. Azaria and T\. Mitchell \(2023\) The internal state of an llm knows when it’s lying\. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp\. 967–976\. Cited by: [§1](https://arxiv.org/html/2605.22864#S1.p2.1), [§2\.1](https://arxiv.org/html/2605.22864#S2.SS1.p2.1), [§4](https://arxiv.org/html/2605.22864#S4.SS0.SSS0.Px5.p1.3), [§7](https://arxiv.org/html/2605.22864#S7.p1.1)\.
- W\. Azizian, M\. Kirchhof, E\. Ndiaye, L\. Bethune, M\. Klein, P\. Ablin, and M\. Cuturi \(2025\) The geometries of truth are orthogonal across tasks\. arXiv preprint arXiv:2506\.08572\. Cited by: [§1](https://arxiv.org/html/2605.22864#S1.p2.1), [§6](https://arxiv.org/html/2605.22864#S6.p1.1), [§7](https://arxiv.org/html/2605.22864#S7.p1.1)\.
- M\. Beigi, Y\. Shen, R\. Yang, Z\. Lin, Q\. Wang, A\. Mohan, J\. He, M\. Jin, C\. Lu, and L\. Huang \(2024\) Internalinspector i2: robust confidence estimation in llms through internal states\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 12847–12865\. Cited by: [§1](https://arxiv.org/html/2605.22864#S1.p2.1), [§7](https://arxiv.org/html/2605.22864#S7.p1.1)\.
- X\. Bi, D\. Chen, G\. Chen, S\. Chen, D\. Dai, C\. Deng, H\. Ding, K\. Dong, Q\. Du, Z\. Fu, et al\. \(2024\) Deepseek llm: scaling open\-source language models with longtermism\. arXiv preprint arXiv:2401\.02954\. Cited by: [§4](https://arxiv.org/html/2605.22864#S4.SS0.SSS0.Px1.p1.1)\.
- C\. Burns, H\. Ye, D\. Klein, and J\. Steinhardt \(2022\) Discovering latent knowledge in language models without supervision\. arXiv preprint arXiv:2212\.03827\. Cited by: [§7](https://arxiv.org/html/2605.22864#S7.p1.1)\.
- C\. Chow \(1970\) On optimum recognition error and reject tradeoff\. IEEE Transactions on information theory 16\(1\), pp\. 41–46\. Cited by: [§2\.2](https://arxiv.org/html/2605.22864#S2.SS2.p1.8)\.
- R\. Dakhmouche, A\. Letellier, and H\. Gorji \(2025\) Can linear probes measure llm uncertainty?\. arXiv preprint arXiv:2510\.04108\. Cited by: [§1](https://arxiv.org/html/2605.22864#S1.p2.1), [§2\.1](https://arxiv.org/html/2605.22864#S2.SS1.p1.1), [§2\.1](https://arxiv.org/html/2605.22864#S2.SS1.p2.1), [§4](https://arxiv.org/html/2605.22864#S4.SS0.SSS0.Px5.p1.3), [§7](https://arxiv.org/html/2605.22864#S7.p1.1)\.
- Y\. Ding, J\. Liu, J\. Xiong, and Y\. Shi \(2020\) Revisiting the evaluation of uncertainty estimation and its application to explore model complexity\-uncertainty trade\-off\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp\. 4–5\. Cited by: [§2\.2](https://arxiv.org/html/2605.22864#S2.SS2.p2.2), [§4](https://arxiv.org/html/2605.22864#S4.SS0.SSS0.Px4.p1.2)\.
- S\. Farquhar, J\. Kossen, L\. Kuhn, and Y\. Gal \(2024\) Detecting hallucinations in large language models using semantic entropy\. Nature 630\(8017\), pp\. 625–630\. Cited by: [§2\.1](https://arxiv.org/html/2605.22864#S2.SS1.p2.1)\.
- Y\. Geifman and R\. El\-Yaniv \(2017\) Selective classification for deep neural networks\. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, pp\. 4885–4894\. External Links: ISBN 9781510860964\. Cited by: [§2\.2](https://arxiv.org/html/2605.22864#S2.SS2.p1.8)\.
- Y\. Geifman, G\. Uziel, and R\. El\-Yaniv \(2018\) Bias\-reduced uncertainty estimation for deep neural classifiers\. arXiv preprint arXiv:1805\.08206\. Cited by: [§1](https://arxiv.org/html/2605.22864#S1.p4.1)\.
- Y\. Geifman, G\. Uziel, and R\. El\-Yaniv \(2019\) Bias\-reduced uncertainty estimation for deep neural classifiers\. In International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=SJfb5jCqKm)\. Cited by: [§2\.2](https://arxiv.org/html/2605.22864#S2.SS2.p2.2), [§4](https://arxiv.org/html/2605.22864#S4.SS0.SSS0.Px3.p1.1)\.
- M\. Geva, R\. Schuster, J\. Berant, and O\. Levy \(2021\) Transformer feed\-forward layers are key\-value memories\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp\. 5484–5495\. Cited by: [§3\.1](https://arxiv.org/html/2605.22864#S3.SS1.p3.8)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\. Cited by: [§4](https://arxiv.org/html/2605.22864#S4.SS0.SSS0.Px1.p1.1)\.
- C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\) On calibration of modern neural networks\. In Proceedings of the 34th International Conference on Machine Learning \- Volume 70, ICML’17, pp\. 1321–1330\. Cited by: [§1](https://arxiv.org/html/2605.22864#S1.p2.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2020\) Measuring massive multitask language understanding\. arXiv preprint arXiv:2009\.03300\. Cited by: [§4](https://arxiv.org/html/2605.22864#S4.SS0.SSS0.Px2.p1.1)\.
- D\. Hendrycks and K\. Gimpel \(2016\) A baseline for detecting misclassified and out\-of\-distribution examples in neural networks\. arXiv preprint arXiv:1610\.02136\. Cited by: [§3\.1](https://arxiv.org/html/2605.22864#S3.SS1.p1.9)\.
- L\. Huang, R\. Le Bras, C\. Bhagavatula, and Y\. Choi \(2019\) Cosmos qa: machine reading comprehension with contextual commonsense reasoning\. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing \(EMNLP\-IJCNLP\), pp\. 2391–2401\. Cited by: [§4](https://arxiv.org/html/2605.22864#S4.SS0.SSS0.Px2.p1.1)\.
- S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Hatfield\-Dodds, N\. DasSarma, E\. Tran\-Johnson, et al\. \(2022\) Language models \(mostly\) know what they know\. arXiv preprint arXiv:2207\.05221\. Cited by: [§2\.1](https://arxiv.org/html/2605.22864#S2.SS1.p2.1)\.
- A\. Kendall and Y\. Gal \(2017\) What uncertainties do we need in bayesian deep learning for computer vision?\. Advances in neural information processing systems 30\. Cited by: [§2\.1](https://arxiv.org/html/2605.22864#S2.SS1.p1.1)\.
- J\. Kossen, J\. Han, M\. Razzak, L\. Schut, S\. Malik, and Y\. Gal \(2024\) Semantic entropy probes: robust and cheap hallucination detection in llms\. arXiv preprint arXiv:2406\.15927\. Cited by: [§7](https://arxiv.org/html/2605.22864#S7.p1.1)\.
- L\. Kuhn, Y\. Gal, and S\. Farquhar \(2023\) Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation\. arXiv preprint arXiv:2302\.09664\. Cited by: [§2\.1](https://arxiv.org/html/2605.22864#S2.SS1.p2.1)\.
- B\. Kumar, C\. Lu, G\. Gupta, A\. Palepu, D\. Bellamy, R\. Raskar, and A\. Beam \(2023\) Conformal prediction with large language models for multi\-choice question answering\. arXiv preprint arXiv:2305\.18404\. Cited by: [§2\.1](https://arxiv.org/html/2605.22864#S2.SS1.p2.1)\.
- J\. Li, X\. Cheng, W\. X\. Zhao, J\. Nie, and J\. Wen \(2023a\) Halueval: a large\-scale hallucination evaluation benchmark for large language models\. In Proceedings of the 2023 conference on empirical methods in natural language processing, pp\. 6449–6464\. Cited by: [§4](https://arxiv.org/html/2605.22864#S4.SS0.SSS0.Px2.p1.1)\.
- K\. Li, O\. Patel, F\. Viégas, H\. Pfister, and M\. Wattenberg \(2023b\) Inference\-time intervention: eliciting truthful answers from a language model\. Advances in Neural Information Processing Systems 36, pp\. 41451–41530\. Cited by: [§1](https://arxiv.org/html/2605.22864#S1.p2.1), [§7](https://arxiv.org/html/2605.22864#S7.p1.1)\.
- X\. Li, Z\. Yu, Z\. Zhang, Y\. Zhuang, S\. Shah, N\. Sadagopan, and A\. Beniwal \(2026\) Semantic volume: quantifying and detecting both external and internal uncertainty in llms\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 40, pp\. 31751–31759\. Cited by: [§2\.1](https://arxiv.org/html/2605.22864#S2.SS1.p1.1)\.
- Z\. Lin, S\. Trivedi, and J\. Sun \(2023\) Generating with confidence: uncertainty quantification for black\-box large language models\. arXiv preprint arXiv:2305\.19187\. Cited by: [§2\.1](https://arxiv.org/html/2605.22864#S2.SS1.p2.1)\.
- L\. Liu, Y\. Pan, X\. Li, and G\. Chen \(2024\) Uncertainty estimation and quantification for llms: a simple supervised approach\. arXiv preprint arXiv:2404\.15993\. Cited by: [§1](https://arxiv.org/html/2605.22864#S1.p2.1), [§2\.1](https://arxiv.org/html/2605.22864#S2.SS1.p2.1), [§7](https://arxiv.org/html/2605.22864#S7.p1.1)\.
- X\. Liu, T\. Chen, L\. Da, C\. Chen, Z\. Lin, and H\. Wei \(2025\) Uncertainty quantification and confidence calibration in large language models: a survey\. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\. 2, pp\. 6107–6117\. Cited by: [§2\.1](https://arxiv.org/html/2605.22864#S2.SS1.p1.1)\.
- P\. Manakul, A\. Liusie, and M\. Gales \(2023\) Selfcheckgpt: zero\-resource black\-box hallucination detection for generative large language models\. In Proceedings of the 2023 conference on empirical methods in natural language processing, pp\. 9004–9017\. Cited by: [§2\.1](https://arxiv.org/html/2605.22864#S2.SS1.p2.1)\.
- K\. Margatina, T\. Schick, N\. Aletras, and J\. Dwivedi\-Yu \(2023\) Active learning principles for in\-context learning with large language models\. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp\. 5011–5034\. Cited by: [§2\.1](https://arxiv.org/html/2605.22864#S2.SS1.p2.1)\.
- S\. Marks and M\. Tegmark \(2023\) The geometry of truth: emergent linear structure in large language model representations of true/false datasets\. arXiv preprint arXiv:2310\.06824\. Cited by: [§1](https://arxiv.org/html/2605.22864#S1.p2.1), [§7](https://arxiv.org/html/2605.22864#S7.p1.1)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\) Locating and editing factual associations in gpt\. Advances in neural information processing systems 35, pp\. 17359–17372\. Cited by: [§3\.1](https://arxiv.org/html/2605.22864#S3.SS1.p3.8)\.
- M\. P\. Naeini, G\. Cooper, and M\. Hauskrecht \(2015\) Obtaining well calibrated probabilities using bayesian binning\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 29\. Cited by: [§4](https://arxiv.org/html/2605.22864#S4.SS0.SSS0.Px4.p1.2)\.
- F\. Petroni, T\. Rocktäschel, S\. Riedel, P\. Lewis, A\. Bakhtin, Y\. Wu, and A\. Miller \(2019\) Language models as knowledge bases?\. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing \(EMNLP\-IJCNLP\), pp\. 2463–2473\. Cited by: [§7](https://arxiv.org/html/2605.22864#S7.p1.1)\.
- R\. Vashurin, E\. Fadeeva, A\. Vazhentsev, L\. Rvanova, D\. Vasilev, A\. Tsvigun, S\. Petrakov, R\. Xing, A\. Sadallah, K\. Grishchenkov, et al\. \(2025\) Benchmarking uncertainty quantification methods for large language models with lm\-polygraph\. Transactions of the Association for Computational Linguistics 13, pp\. 220–248\. Cited by: [§1](https://arxiv.org/html/2605.22864#S1.p2.1), [§4](https://arxiv.org/html/2605.22864#S4.SS0.SSS0.Px5.p1.3)\.
- Q\. A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, G\. Dong, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, Z\. Qiu, S\. Quan, and Z\. Wang \(2024\) Qwen2\.5 technical report\. ArXiv abs/2412\.15115\. External Links: [Link](https://api.semanticscholar.org/CorpusID:274859421)\. Cited by: [§4](https://arxiv.org/html/2605.22864#S4.SS0.SSS0.Px1.p1.1)\.
- F\. Ye, M\. Yang, J\. Pang, L\. Wang, D\. F\. Wong, E\. Yilmaz, S\. Shi, and Z\. Tu \(2024\) Benchmarking llms via uncertainty quantification\. Advances in Neural Information Processing Systems 37, pp\. 15356–15385\. Cited by: [§4](https://arxiv.org/html/2605.22864#S4.SS0.SSS0.Px2.p1.1)\.
- L\. Yu, M\. Cao, J\. C\. Cheung, and Y\. Dong \(2024\) Mechanistic understanding and mitigation of language model non\-factual hallucinations\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 7943–7956\. Cited by: [§3\.1](https://arxiv.org/html/2605.22864#S3.SS1.p3.8)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\) Hellaswag: can a machine really finish your sentence?\. In Proceedings of the 57th annual meeting of the association for computational linguistics, pp\. 4791–4800\. Cited by: [§4](https://arxiv.org/html/2605.22864#S4.SS0.SSS0.Px2.p1.1)\.
- H\. Zou and T\. Hastie \(2005\) Regularization and variable selection via the elastic net\. Journal of the Royal Statistical Society Series B: Statistical Methodology 67\(2\), pp\. 301–320\. Cited by: [§3\.3](https://arxiv.org/html/2605.22864#S3.SS3.p3.3)\.

## Appendix A Full AURC Results

Table [2](https://arxiv.org/html/2605.22864#S5.T2) in the main paper reports our method’s AURC alongside the absolute improvement over the MSP baseline\. For completeness, Table [3](https://arxiv.org/html/2605.22864#A1.T3) below provides the full breakdown across all four methods we consider\.

Table 3: Full AURC results \($\times 100$; lower is better\)\. We compare our full probe \(ours\) against the MSP baseline, a raw\-activations *ceiling* probe that operates on per\-layer hidden states without capacity constraints, and a trajectory\-only *ablation* that removes MSP and its interactions from the probe input\.

## Appendix B Probe Hyperparameters

We fit the sparse linear probe of Section [3\.3](https://arxiv.org/html/2605.22864#S3.SS3) using scikit\-learn’s `LogisticRegression` with the `saga` solver, class\-balanced sample weighting, a convergence tolerance of $10^{-3}$, and a maximum of $8000$ iterations\. The $64$ configurations swept during model selection comprise $10$ pure\-$\ell_1$ candidates with $C \in \{3\times10^{-4},\, 10^{-3},\, 3\times10^{-3},\, 10^{-2},\, 3\times10^{-2},\, 10^{-1},\, 3\times10^{-1},\, 1,\, 3,\, 10\}$, and $54$ elastic\-net candidates over $C \in \{10^{-3},\, 3\times10^{-3},\, 10^{-2},\, 3\times10^{-2},\, 10^{-1},\, 3\times10^{-1},\, 1,\, 3,\, 10\}$ and $\ell_1$ mixing ratio $\rho \in \{0.10,\, 0.25,\, 0.50,\, 0.75,\, 0.90,\, 0.95\}$\. All randomness in fitting and splitting is controlled by a fixed seed of $42$\.
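The sweep can be enumerated as keyword-argument dictionaries, which makes the count of 64 easy to check (10 pure-$\ell_1$ settings plus 9 values of $C$ times 6 mixing ratios). This is a sketch under assumptions: mapping "class-balanced sample weighting" to `class_weight="balanced"` and the fixed seed to `random_state=42` is our reading, and the argument names follow scikit-learn's long-standing `penalty`/`l1_ratio` interface, which should be checked against the installed version.

```python
from itertools import product

# Inverse regularization strengths C for the pure-L1 candidates.
L1_C = [3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1, 3, 10]
# Elastic-net candidates use the same values without the smallest one.
ENET_C = [1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1, 3, 10]
# L1 mixing ratios for elastic net.
ENET_RHO = [0.10, 0.25, 0.50, 0.75, 0.90, 0.95]

# Settings shared by every candidate (assumed mapping to scikit-learn kwargs).
SHARED = dict(
    solver="saga",
    class_weight="balanced",
    tol=1e-3,
    max_iter=8000,
    random_state=42,
)


def probe_grid():
    """Return the 64 candidate configurations as keyword-argument dicts."""
    grid = [dict(SHARED, penalty="l1", C=c) for c in L1_C]
    grid += [
        dict(SHARED, penalty="elasticnet", C=c, l1_ratio=rho)
        for c, rho in product(ENET_C, ENET_RHO)
    ]
    return grid


grid = probe_grid()
```

With scikit-learn installed, each entry could be passed as `LogisticRegression(**cfg)` and scored on a validation split; the selection criterion itself is not specified in this appendix, so it is left out of the sketch.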
