Calibrating Overconfidence Without Sacrificing Confidence: Probe-Conditioned Head Intervention for LLMs

arXiv cs.LG 06/10/26, 04:00 AM Papers
calibration overconfidence probe-conditioned head-intervention llm inference-time uncertainty
Summary
The paper introduces Probe-Conditioned Head Intervention (PCHI), an inference-time method for LLMs that selectively reduces overconfidence on wrong answers without significantly reducing confidence on correct ones, by conditionally rescaling attention head outputs when the model is likely wrong but confident.
arXiv:2606.09876v1 Announce Type: new Abstract: Large language models often express high confidence in answers that are wrong. Standard calibration remedies typically act globally or at the score level, reducing unwarranted confidence but also risking erosion of warranted confidence on correct answers. We introduce Probe-Conditioned Head Intervention (PCHI), an inference-time method that uses a frozen probe to detect likely wrong-but-confident responses and conditionally rescales downstream attention-head outputs during confidence generation. On Qwen3-4B-Instruct solving OpenMathInstruct problems with a structured binary confidence field, readout-token PCHI converts 82.2% of originally wrong-yes confidence readouts to $\texttt{no}$, while a joint intervention across upstream confidence-template tokens reduces ECE from 21.9% to 9.2% and damages only 5.1% of originally correct-yes readouts. The readout-token effect also appears on Gemma3-4B, though upstream interventions are weaker and more mask-dependent. These results show that verbalized overconfidence can be selectively reduced through conditionally applied internal intervention, partially decoupling the suppression of unwarranted confidence from the loss of warranted confidence.
# Calibrating Overconfidence Without Sacrificing Confidence: Probe-Conditioned Head Intervention for LLMs
Source: [https://arxiv.org/html/2606.09876](https://arxiv.org/html/2606.09876)
Ke Li (1,2,\*), Chongzhe Zhang (1,3,\*), Zifan Zeng (1,4), Feng Liu (1), Qunli Zhang (1), Zheng Hu (1)

(1) Huawei Heisenberg Research Center, (2) EPFL, (3) TU Berlin, (4) TUM

###### Abstract

Large language models often express high confidence in answers that are wrong. Standard calibration remedies typically act globally or at the score level, reducing unwarranted confidence but also risking erosion of warranted confidence on correct answers. We introduce *Probe-Conditioned Head Intervention* (PCHI), an inference-time method that uses a frozen probe to detect likely wrong-but-confident responses and conditionally rescales downstream attention-head outputs during confidence generation. On Qwen3-4B-Instruct solving OpenMathInstruct problems with a structured binary confidence field, readout-token PCHI converts 82.2% of originally wrong-yes confidence readouts to $\texttt{no}$, while a joint intervention across upstream confidence-template tokens reduces ECE from 21.9% to 9.2% and damages only 5.1% of originally correct-yes readouts. The readout-token effect also appears on Gemma3-4B, though upstream interventions are weaker and more mask-dependent. These results show that verbalized overconfidence can be selectively reduced through conditionally applied internal intervention, partially decoupling the suppression of unwarranted confidence from the loss of warranted confidence.

\* These authors contributed equally.

## 1 Introduction

For a language model to be useful in decision-making, it is not enough that its answers are often correct; a system or person acting on those answers must also be able to tell when to trust them (Guo et al., [2017](https://arxiv.org/html/2606.09876#bib.bib2)). This makes the reliability of a model’s expressed confidence an important concern. Yet large language models often express high confidence even when their answers are wrong, making verbalized confidence a poor guide to correctness (Xiong et al., [2024](https://arxiv.org/html/2606.09876#bib.bib1); Geng et al., [2024](https://arxiv.org/html/2606.09876#bib.bib6)).

A natural response is to recalibrate. Standard post-hoc methods such as temperature scaling (Guo et al., [2017](https://arxiv.org/html/2606.09876#bib.bib2)) apply a single adjustment to every response, and instruction- or RLHF-tuned models may inherit reward signals that favor high confidence regardless of correctness (Leng et al., [2025](https://arxiv.org/html/2606.09876#bib.bib7)). Such methods can improve aggregate calibration, but they do not explicitly target the wrong-but-confident subset. Pushing confidence down far enough to suppress wrong-but-confident responses can also reduce warranted confidence on correct answers, a tension documented both for classical networks (Joy et al., [2022](https://arxiv.org/html/2606.09876#bib.bib3)) and for language models (Xie et al., [2024](https://arxiv.org/html/2606.09876#bib.bib8)).

This motivates a more selective form of calibration: one that acts only when a response is likely to be wrong-but-confident. Conditional and adaptive interventions already provide evidence that model behavior can be controlled non-uniformly, for example in refusal (Lee et al., [2025](https://arxiv.org/html/2606.09876#bib.bib9)), cross-lingual transfer (Maraia et al., [2026](https://arxiv.org/html/2606.09876#bib.bib10)), and two-stage confidence steering (Miao and Ungar, [2026](https://arxiv.org/html/2606.09876#bib.bib11)). However, these methods do not directly isolate wrong-but-confident responses through an intervention on the internal computation that produces the final confidence readout.

Recent mechanistic work suggests that such an internal intervention is plausible. Verbalized confidence appears to be formed around answer-adjacent or template positions and read out later (Kumaran et al., [2026](https://arxiv.org/html/2606.09876#bib.bib16)), while inflated confidence has been linked to specific late-position components (Zhao et al., [2026](https://arxiv.org/html/2606.09876#bib.bib17)). If evidence about whether a confident answer is actually correct is represented before the final confidence value is emitted, then an intervention conditioned on that evidence may reduce unwarranted confidence without globally suppressing all confidence.

We instantiate this idea as *probe-conditioned head intervention* (PCHI). A frozen probe reads a hidden state on the confidence template and estimates whether the response is likely to be wrong-but-confident. The probe score then gates a learned rescaling of downstream attention-head outputs, so the intervention is applied in proportion to wrong-confident evidence rather than uniformly across all responses.

## 2 Related Work

#### Calibrating language model confidence\.

A long line of work seeks to align a model’s confidence with its accuracy. One strand elicits confidence directly from the model in natural language, and finds that such verbalized confidence can be a meaningful but imperfect signal of correctness (Lin et al., [2022](https://arxiv.org/html/2606.09876#bib.bib20)); relatedly, models retain some ability to self-evaluate whether their own answers are correct (Kadavath et al., [2022](https://arxiv.org/html/2606.09876#bib.bib21)). A second strand adjusts confidence post hoc. Methods such as temperature scaling (Guo et al., [2017](https://arxiv.org/html/2606.09876#bib.bib2)) fit a single parameter on a held-out set and rescale all predictions uniformly; in language models, miscalibration is exacerbated by RLHF, whose reward models favor confident responses (Leng et al., [2025](https://arxiv.org/html/2606.09876#bib.bib7)). Because a uniform rescaling moves every prediction in the same direction, it cannot simultaneously raise confidence where it is too low and lower it where it is too high, a limitation noted for classical networks (Joy et al., [2022](https://arxiv.org/html/2606.09876#bib.bib3)) and addressed in language models by predicting an input-dependent temperature (Xie et al., [2024](https://arxiv.org/html/2606.09876#bib.bib8)). These methods operate on the output confidence distribution. Closer to our setting, recent work probes internal representations to *estimate* confidence, for example from the stability of perturbed representations (Khanmohammadi et al., [2025](https://arxiv.org/html/2606.09876#bib.bib22)). Our intervention differs in kind: rather than estimating a confidence score, it acts inside the model to *modify* the confidence readout, and is applied only to the responses a probe judges wrong-but-confident, so that warranted confidence on the remaining responses is left intact.

#### Activation interventions and conditional control\.

A growing body of work steers model behavior by modifying internal representations, typically by adding a fixed direction to the residual stream (Turner et al., [2024](https://arxiv.org/html/2606.09876#bib.bib12); Li et al., [2024](https://arxiv.org/html/2606.09876#bib.bib13); Zou et al., [2025](https://arxiv.org/html/2606.09876#bib.bib14)). Such unconditional steering is known to lack selectivity, affecting related behaviors together (Wehner et al., [2025](https://arxiv.org/html/2606.09876#bib.bib15)). Conditional variants make the intervention input-dependent: conditional activation steering gates whether to steer in order to program refusal (Lee et al., [2025](https://arxiv.org/html/2606.09876#bib.bib9)), and probe-gated steering selects a class-specific transformation after a probe predicts the class (Maraia et al., [2026](https://arxiv.org/html/2606.09876#bib.bib10)). Causal head gating, in contrast, learns a fixed, input-independent scalar gate per attention head from a next-token objective, in order to assign heads a causal role for interpretability (Nam et al., [2025](https://arxiv.org/html/2606.09876#bib.bib4)). Our method shares the mechanism of per-head gating, but differs in two ways: the gate is modulated per response by a frozen probe’s continuous estimate of wrong-confidence, and it is deployed to selectively correct verbalized overconfidence rather than to characterize head roles.

#### The mechanism of verbalized confidence\.

Recent work asks where and how verbalized confidence is computed. Confidence appears to be formed at answer-adjacent positions and read out later (Kumaran et al., [2026](https://arxiv.org/html/2606.09876#bib.bib16)); inflated confidence has been attributed to specific components writing at late token positions (Zhao et al., [2026](https://arxiv.org/html/2606.09876#bib.bib17)); and the direction governing calibration has been found nearly orthogonal to the one governing verbalized confidence (Miao and Ungar, [2026](https://arxiv.org/html/2606.09876#bib.bib11)). A complementary line shows that a model’s internal states carry decodable evidence of whether its own answer is correct, recoverable by simple probes (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.09876#bib.bib23); MacDiarmid et al., [2024](https://arxiv.org/html/2606.09876#bib.bib24)). These analyses are largely descriptive, and where they intervene, they do so to characterize the mechanism rather than to selectively correct errors. They nonetheless motivate our approach: if correctness-relevant information is present internally and partly separable from the expressed-confidence signal, then a conditional internal intervention should be able to act on the former without uniformly disturbing the latter.

## 3 Method

![Refer to caption](https://arxiv.org/html/2606.09876v1/figures/pipeline.png)

Figure 1: Overview of probe-conditioned head intervention (PCHI). (1) The model produces a structured self-evaluation output. (2) At a confidence-template token $t$, a frozen linear probe reads the hidden state at layer $\ell_p$ and predicts whether the response is wrong-but-confident (WY) rather than correct-confident (CY), yielding a score $p_t$. (3) An optional attention mask, applied only in layers after $\ell_p$, restricts which context regions (prompt, reasoning, answer) the template query at $t$ may attend to; layers up to and including $\ell_p$ retain the original attention, so the probe score is unaffected. (4) In downstream layers, each attention head is rescaled by a learned coefficient $g_h$, most of which are kept near $1$ by a sparsity penalty, with strength conditioned on the probe score: $z'_{t,h} = (1 + (g_h - 1)\,s_t)\,z_{t,h}$, where $s_t = p_t$ if $p_t \geq \tau$ and $0$ otherwise. The intervention thus fires only on responses the probe flags as wrong-confident.

Probe-conditioned head intervention corrects verbalized overconfidence at the moment the model prepares to emit its confidence value. The method has three conceptual components: a structured self-evaluation format that exposes a binary confidence readout, a confidence-template probe that detects latent evidence of wrong-confident responses, and a probe-conditioned head-level intervention that is applied only when this evidence is present. This design treats overconfidence correction as a selective control problem: the model should revise confidence on wrong answers while leaving warranted confidence largely unchanged.

The central design principle is to separate *detection* from *intervention*. The probe estimates whether a confident response is likely to be wrong, while the learned head parameters specify how downstream attention heads should be rescaled when such evidence is detected. Separating these roles gives the method a direct selectivity target: convert wrong-confident responses from $\texttt{yes}$ to $\texttt{no}$ without globally suppressing correct-confident responses.

We intervene only after the model has produced its reasoning and answer\. Intervening during the reasoning or answer span can alter the hidden\-state trajectory that supports problem solving, potentially moving the model into regions where its subsequent reasoning and final answer are unreliable\. In contrast, the confidence template is a fixed span between the answer and the confidence value\. At inference time, these template tokens can be forced rather than sampled, so the intervention changes the computation that determines the final self\-evaluation while leaving the generated answer and the template token sequence fixed\.

### 3.1 Structured Self-Evaluation and Readout

We study verbalized confidence under a fixed JSON output schema. For each input problem $x_i$, the model is instructed to generate a response with three fields (Appendix [A.1](https://arxiv.org/html/2606.09876#A1.SS1)):

$$\{\texttt{reasoning},\ \texttt{answer},\ \texttt{is\_confident}\}.$$

The field $\texttt{answer}$ contains the final answer, and $\texttt{is\_confident}$ contains a binary self-evaluation, either $\texttt{yes}$ or $\texttt{no}$. Let $a_i$ denote the parsed answer and let $c_i \in \{0,1\}$ indicate whether $a_i$ is correct under the task evaluator. Let $r_i \in \{\texttt{yes}, \texttt{no}\}$ be the parsed confidence value. Each response therefore belongs to one of four groups: correct-yes (CY), correct-no (CN), wrong-yes (WY), and wrong-no (WN).

Our intervention targets WY responses\. A WY response is problematic because the model gives an incorrect answer while explicitly reporting high confidence\. A useful intervention should therefore reduce confidence on WY examples without broadly suppressing confidence on CY examples\. We measure the confidence readout with the yes\-no logit gap

$$\Delta_i = \operatorname{LSE}_{v \in V_{\texttt{yes}}} o_{i,v} - \operatorname{LSE}_{v \in V_{\texttt{no}}} o_{i,v}, \tag{1}$$

where $\operatorname{LSE}$ denotes log-sum-exp, $o_i$ is the next-token logit vector at the confidence-value prediction position, and $V_{\texttt{yes}}$ and $V_{\texttt{no}}$ are the candidate tokens for the two confidence values. Positive $\Delta_i$ favors $\texttt{yes}$; negative $\Delta_i$ favors $\texttt{no}$.
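As a minimal sketch of Eq. (1), the yes-no logit gap can be computed from a next-token logit vector and two sets of candidate token ids. The function names, the toy logits, and the token ids below are illustrative assumptions, not values from the paper.

```python
import math


def logsumexp(values):
    """Numerically stable log-sum-exp over a non-empty list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def yes_no_gap(logits, yes_ids, no_ids):
    """Eq. (1): LSE over the yes-token logits minus LSE over the no-token logits."""
    yes_score = logsumexp([logits[v] for v in yes_ids])
    no_score = logsumexp([logits[v] for v in no_ids])
    return yes_score - no_score


# Hypothetical six-token vocabulary: ids 1 and 2 stand in for "yes" variants,
# ids 4 and 5 for "no" variants. Real candidate sets depend on the tokenizer.
toy_logits = [0.1, 3.0, 2.0, -1.0, 0.5, 0.0]
gap = yes_no_gap(toy_logits, yes_ids=[1, 2], no_ids=[4, 5])
readout = "yes" if gap > 0 else "no"
```

Using log-sum-exp rather than a single token logit lets several surface forms of each answer contribute to the same score.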

### 3.2 Confidence-Template Evidence Probes

The confidence template is the token span after the $\texttt{answer}$ value and before the $\texttt{is\_confident}$ value. Let $\mathcal{T}_i$ denote the aligned template coordinates for example $i$. For a template coordinate $t \in \mathcal{T}_i$ and transformer layer $\ell$, let $h_{i,t}^{(\ell)} \in \mathbb{R}^d$ be the hidden state at that coordinate after layer $\ell$.

We train position- and layer-specific linear probes to detect wrong-confident evidence. For each template coordinate $t$ and layer $\ell$, the probe predicts whether a confident response is wrong:

$$p_{i,t}^{(\ell)} = \sigma\!\left( (w_{t,\ell})^{\top} h_{i,t}^{(\ell)} + b_{t,\ell} \right), \tag{2}$$

where positive examples are WY responses and negative examples are CY responses. We use calibrated linear probes so that the resulting score can be interpreted as a graded estimate of wrong-confident evidence while keeping the detector simple enough to run online.
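The probe readout in Eq. (2) is a single dot product followed by a sigmoid, which is why it is cheap enough to run inside a forward pass. The sketch below only evaluates an already trained probe; the weights, bias, and hidden state are made-up toy values, and probe fitting and calibration are not shown.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def probe_score(hidden, weight, bias):
    """Eq. (2): probability that a confident response is wrong (WY rather than CY)."""
    logit = sum(w * h for w, h in zip(weight, hidden)) + bias
    return sigmoid(logit)


# Hypothetical 3-dimensional hidden state and probe parameters.
hidden_state = [0.5, -1.0, 2.0]
probe_weight = [1.0, 0.5, 0.25]
probe_bias = -0.5
p = probe_score(hidden_state, probe_weight, probe_bias)  # logit is exactly 0 here
```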

The probe layer $\ell_p$ must satisfy two requirements: the probe should separate WY from CY reliably, and there must be downstream layers available for intervention in the same forward pass. We restrict the intervention to layers after $\ell_p$ so that the hidden states consumed by the probe remain on the same distribution as those used during probe training. This is important because the probe begins to separate WY from CY in middle layers: modifying earlier layers would change the probe input itself and could make the probe score unreliable. The resulting intervention therefore uses the evidence detected at $\ell_p$ to modulate only later head outputs at the same template coordinate.

### 3.3 Probe-Conditioned Head Intervention

Probe-conditioned head intervention rescales attention-head outputs only when the probe indicates likely wrong-confident evidence. Let $\mathcal{H}$ be the set of intervention heads, where each head $h = (\ell, a)$ consists of layer $\ell$ and attention-head index $a$, and all $\ell$ are downstream of $\ell_p$. For each $h \in \mathcal{H}$, we learn a scalar intervention parameter $g_h$, initialized at the identity value $1$. Unlike a global ablation, this parameter is not applied uniformly to every response; its strength is controlled by the probe score at the current confidence-template coordinate.

Let $z_{i,t,h}$ be the pre-output-projection attention output of head $h$ at template coordinate $t$. We replace this head output with

$$z'_{i,t,h} = \left(1 + (g_h - 1)\,s_{i,t}\right) z_{i,t,h}, \tag{3}$$

where $s_{i,t} \in [0,1]$ is the probe-conditioned intervention strength. If $s_{i,t} = 0$, then $z'_{i,t,h} = z_{i,t,h}$ and the model is unchanged. If $s_{i,t} = 1$, then the head is fully scaled by its learned coefficient $g_h$. Intermediate values interpolate between the original computation and the computation with learned scaling. This formulation allows the coefficient to suppress or amplify a head, depending on the learned value of $g_h$, while preserving exact identity behavior when the probe does not activate.

During intervention-parameter training, we use soft probe conditioning, $s_{i,t} = p_{i,t}^{(\ell_p)}$. Soft conditioning makes the learning objective differentiable with respect to the learned coefficients and uses the probe score as a graded measure of wrong-confident evidence. During inference, we use a hard selection rule with threshold $\tau = 0.5$:

$$s_{i,t} = \begin{cases} p_{i,t}^{(\ell_p)}, & p_{i,t}^{(\ell_p)} \geq \tau, \\ 0, & p_{i,t}^{(\ell_p)} < \tau. \end{cases} \tag{4}$$

Thus, the probe makes a binary decision about whether to intervene, but the magnitude of the intervention above threshold remains proportional to the probe probability. This inference rule uses only the probe score and does not require correctness labels.
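Eqs. (3) and (4) can be sketched together as two small functions: one turns a probe score into an intervention strength, the other rescales a single head's output vector. The function names and toy numbers are illustrative assumptions; in a real model the rescaling would be applied to the per-head slice of the attention output before the output projection.

```python
def intervention_strength(p, tau=0.5, hard=True):
    """Soft conditioning returns p itself; the hard rule (Eq. 4) zeroes it below tau."""
    if hard and p < tau:
        return 0.0
    return p


def rescale_head(z, g, s):
    """Eq. (3): multiply one head's output vector z by 1 + (g - 1) * s."""
    factor = 1.0 + (g - 1.0) * s
    return [factor * x for x in z]


# Hypothetical head output and learned coefficient g = 0.2 (a suppressing head).
head_output = [1.0, -2.0]
untouched = rescale_head(head_output, g=0.2, s=intervention_strength(0.4))
damped = rescale_head(head_output, g=0.2, s=intervention_strength(0.8))
```

With a probe score of 0.4 the hard rule yields strength 0 and the head output is returned unchanged; with a score of 0.8 the scaling factor is 1 + (0.2 - 1) * 0.8 = 0.36.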

The method can be applied at one or more confidence-template coordinates. Let $\mathcal{S} \subseteq \mathcal{T}_i$ be the set of intervention coordinates. In the single-coordinate setting, the method learns intervention parameters for one selected coordinate at a time. In the multi-coordinate setting, the same rule is applied independently as generation passes through each coordinate in $\mathcal{S}$.

### 3.4 Learning Selective Interventions

The base language model and the confidence-template probes are frozen while learning the scalar intervention parameters. Intervention learning uses examples from the WY and CY groups. For a training example $i$, let $\Delta_i$ be the original yes-no logit gap and let $\Delta'_i$ be the gap after applying probe-conditioned head intervention. Since positive gaps favor $\texttt{yes}$, the objective should push WY examples below a target margin while keeping CY examples close to their original confident state.

We optimize the following hinge-style objective:

$$\begin{aligned}
\mathcal{L} &= \mathcal{L}_{\mathrm{WY}} + \mathcal{L}_{\mathrm{CY}} + \lambda \sum_{h \in \mathcal{H}} |g_h - 1|, \\
\mathcal{L}_{\mathrm{WY}} &= \mathbb{E}_{i \in \mathrm{WY}} \left[ \max(0, \Delta'_i - m) \right], \\
\mathcal{L}_{\mathrm{CY}} &= \mathbb{E}_{i \in \mathrm{CY}} \left[ \max(0, \rho \Delta_i - \Delta'_i) \right].
\end{aligned} \tag{5}$$

The first term penalizes WY examples whose post-intervention gap remains above the target $m$. The second term protects CY examples by requiring the post-intervention gap to remain at least a fraction $\rho$ of the original gap. The final term is an identity regularizer: head parameters that are unnecessary for the selective correction objective are encouraged to stay near $1$, limiting unnecessary changes to the model and encouraging a sparse set of influential heads without treating sparsity itself as a head-selection claim.
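To make the three terms of Eq. (5) concrete, the sketch below evaluates the objective on a handful of precomputed logit gaps. It computes the loss value only, with no gradient step, and every number in the toy batch (including the margin $m = 0$) is an assumption chosen for illustration.

```python
def mean(values):
    """Average of a list, defined as 0.0 for an empty list."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def pchi_loss(wy_after, cy_before, cy_after, gates, m, rho, lam):
    """Eq. (5): WY hinge + CY hinge + L1 pull of head coefficients toward 1."""
    # WY hinge: penalize wrong-yes gaps that stay above the target margin m.
    wy_term = mean(max(0.0, d_new - m) for d_new in wy_after)
    # CY hinge: penalize correct-yes gaps that fall below rho times the original gap.
    cy_term = mean(
        max(0.0, rho * d_old - d_new) for d_old, d_new in zip(cy_before, cy_after)
    )
    # Identity regularizer on the per-head coefficients.
    reg_term = lam * sum(abs(g - 1.0) for g in gates)
    return wy_term + cy_term + reg_term, wy_term, cy_term, reg_term


# Hypothetical batch: two WY examples, two CY examples, three head coefficients.
total, wy_term, cy_term, reg_term = pchi_loss(
    wy_after=[2.0, -1.0],
    cy_before=[4.0, 2.0],
    cy_after=[3.0, 1.0],
    gates=[1.0, 0.5, 1.2],
    m=0.0,
    rho=0.7,
    lam=0.05,
)
```

In this toy batch only the first WY example and the second CY example are active in their hinges, so the loss leaves already corrected WY examples and sufficiently confident CY examples alone.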

Intervention learning is performed on fixed replays of the confidence-template span, which lets us compare $\Delta_i$ and $\Delta'_i$ at the same confidence-value prediction point. At runtime, the probe score at layer $\ell_p$ controls only downstream heads at the same coordinate in the current forward pass.

The intervention can also include an optional post-probe attention mask. We partition the pre-template context into the problem prompt $\mathcal{P}_i$, generated reasoning field $\mathcal{R}_i$, and generated answer field $\mathcal{A}_i$, and choose a visible context $C_i$ such as $\mathcal{P}_i$, $\mathcal{R}_i$, $\mathcal{A}_i$, or $\mathcal{R}_i \cup \mathcal{A}_i$. Full context corresponds to no restriction. When the mask is enabled, layers up to and including $\ell_p$ keep the original attention pattern, while template queries in layers $\ell > \ell_p$ attend only to $C_i$ among the pre-template tokens. Thus the probe input distribution is unchanged, and the learned head intervention is applied only after the probe score has been produced.
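One way to picture the post-probe mask is as a function that lists which key positions a template query may attend to at a given layer. The sketch below is a reading of the description above under stated assumptions: regions are half-open token spans, earlier template positions stay visible to later ones, and structural tokens outside the three named regions are hidden when the mask is on. The function name and the toy spans are hypothetical.

```python
def visible_positions(layer, probe_layer, query_pos, regions, visible, template_start):
    """Key positions a template query at query_pos may attend to at this layer.

    regions maps a region name to a half-open (start, end) span of pre-template tokens;
    visible names the regions that make up the visible context C_i.
    """
    causal = range(query_pos + 1)
    # Layers up to and including the probe layer keep the original causal attention.
    if layer <= probe_layer:
        return list(causal)
    allowed = set()
    for name in visible:
        start, end = regions[name]
        allowed.update(range(start, end))
    # Assumption: template tokens remain visible; only pre-template tokens are filtered.
    return [k for k in causal if k >= template_start or k in allowed]


# Hypothetical layout: prompt 0-2, reasoning 3-5, answer 6-7, template starts at 8.
toy_regions = {"prompt": (0, 3), "reasoning": (3, 6), "answer": (6, 8)}
at_probe_layer = visible_positions(18, 18, 9, toy_regions, ("prompt",), 8)
after_probe_layer = visible_positions(19, 18, 9, toy_regions, ("prompt",), 8)
```

Because the filter only applies for `layer > probe_layer`, the hidden state the probe reads is computed with the same attention pattern it saw during probe training.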

## 4 Experiments

#### Experimental Setup\.

We evaluate on OpenMathInstruct (Toshniwal et al., [2024](https://arxiv.org/html/2606.09876#bib.bib5)), following prior preprocessing (Nam et al., [2025](https://arxiv.org/html/2606.09876#bib.bib4)) by filtering invalid examples before selecting 5,000 problems from the original training split and 5,000 problems from the original validation split. For each problem, the model produces one structured JSON response with greedy decoding (temperature 0). We use the training split to collect hidden states on the six-token confidence template and train calibrated diagonal-LDA probes to distinguish wrong-yes from correct-yes examples.

Table [1](https://arxiv.org/html/2606.09876#S4.T1) reports the baseline group composition on the validation split for each model. Both models exhibit a high $\texttt{yes}$ confidence rate, so the evaluation is dominated by the correct-yes and wrong-yes groups targeted by our intervention.

Table 1: Baseline validation-set group composition. CY, WY, CN, and WN denote correct-yes, wrong-yes, correct-no, and wrong-no responses, respectively.

We evaluate two instruction-tuned models: Qwen3-4B-Instruct and Gemma3-4b-it. We select the probe layer using training-set AUROC heatmaps over confidence-template positions and transformer layers (Appendix [A.2](https://arxiv.org/html/2606.09876#A1.SS2)). For Qwen, layer 18 is used as the probe layer and layers 19–22 as intervention layers, giving 128 learned head coefficients. For Gemma, layer 16 is used as the probe layer and layers 17–20 as intervention layers, giving 32 learned head coefficients. These probe layers lie in the middle part of each model, where WY/CY separability is already strong (see Appendix [A.2](https://arxiv.org/html/2606.09876#A1.SS2) for details), while leaving downstream layers available for intervention. This choice is also consistent with prior findings that confidence-relevant answer-evaluative information is represented in middle-to-late transformer layers (Kumaran et al., [2026](https://arxiv.org/html/2606.09876#bib.bib16)). Intervention parameters are trained only on wrong-yes and correct-yes examples. For both Qwen and Gemma, we train with batch size 8, learning rate 0.04, 200 training steps, L1 weight 0.05, hinge ratio 0.7, and random seed 42.

#### Evaluation Metrics\.

We evaluate whether an intervention reduces unwarranted confidence while preserving warranted confidence. Let $g_i \in \{\mathrm{CY}, \mathrm{CN}, \mathrm{WY}, \mathrm{WN}\}$ be the baseline group of example $i$, and let $\Delta'_i$ be the post-intervention yes-no logit gap at the confidence-value position. We classify the post-intervention confidence as $\texttt{yes}$ when $\Delta'_i > 0$ and $\texttt{no}$ otherwise. The wrong-yes correction rate measures how often originally wrong-yes confidence readouts are converted to $\texttt{no}$:

$$\mathrm{WY\ Corr.} = \frac{\sum_i \mathbf{1}[g_i = \mathrm{WY}]\,\mathbf{1}[\Delta'_i \leq 0]}{\sum_i \mathbf{1}[g_i = \mathrm{WY}]}.$$

The correct-yes damage rate measures how often originally correct-yes confidence readouts are converted to $\texttt{no}$:

$$\mathrm{CY\ Dmg.} = \frac{\sum_i \mathbf{1}[g_i = \mathrm{CY}]\,\mathbf{1}[\Delta'_i \leq 0]}{\sum_i \mathbf{1}[g_i = \mathrm{CY}]}.$$

We report calibration with expected calibration error (ECE). We define the model’s verbal confidence as the yes/no-restricted probability of $\texttt{yes}$ at the confidence-value position. Let $q_i$ denote this probability:

$$s_i^{\texttt{yes}} = \operatorname{LSE}_{v \in V_{\texttt{yes}}} o_{i,v}, \qquad s_i^{\texttt{no}} = \operatorname{LSE}_{v \in V_{\texttt{no}}} o_{i,v},$$

$$q_i = \frac{\exp s_i^{\texttt{yes}}}{\exp s_i^{\texttt{yes}} + \exp s_i^{\texttt{no}}}.$$

Given $M = 10$ equal-width confidence bins $\{B_m\}_{m=1}^{M}$ over $[0,1]$ and correctness labels $y_i \in \{0,1\}$, ECE is

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \frac{1}{|B_m|} \sum_{i \in B_m} y_i - \frac{1}{|B_m|} \sum_{i \in B_m} q_i \right|.$$

Finally, we report AUROC using answer correctness as the positive label and $q_i$ as the score. Because Qwen3-4B reports $\texttt{yes}$ on 97.5% of validation examples, the wrong-yes and correct-yes groups account for nearly all of the data, so this AUROC primarily reflects discrimination within the subset that the intervention targets.
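The metrics above can be sketched in a few lines. Note that the yes/no-restricted probability simplifies to a sigmoid of the logit gap, since $q_i = 1 / (1 + \exp(-(s_i^{\texttt{yes}} - s_i^{\texttt{no}})))$. The bin-edge convention (each bin closed on the left, with $q = 1$ placed in the last bin) is an assumption, as are the function names and toy data.

```python
import math


def yes_probability(gap):
    """q_i as a sigmoid of the yes-no logit gap, written to avoid overflow."""
    if gap >= 0:
        return 1.0 / (1.0 + math.exp(-gap))
    e = math.exp(gap)
    return e / (1.0 + e)


def conversion_rate(groups, gaps_after, target):
    """Share of examples in baseline group `target` whose post-intervention gap is <= 0.

    target="WY" gives the wrong-yes correction rate, target="CY" the correct-yes damage.
    """
    members = [d for g, d in zip(groups, gaps_after) if g == target]
    if not members:
        return float("nan")
    return sum(1 for d in members if d <= 0) / len(members)


def expected_calibration_error(confidences, labels, num_bins=10):
    """ECE with equal-width bins over [0, 1]; empty bins contribute nothing."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for q, y in zip(confidences, labels):
        index = min(int(q * num_bins), num_bins - 1)  # q == 1.0 goes in the last bin
        bins[index].append((q, y))
    ece = 0.0
    for members in bins:
        if members:
            accuracy = sum(y for _, y in members) / len(members)
            confidence = sum(q for q, _ in members) / len(members)
            ece += len(members) / n * abs(accuracy - confidence)
    return ece


# Hypothetical baseline groups and post-intervention gaps for six examples.
toy_groups = ["WY", "WY", "CY", "CY", "CY", "WN"]
toy_gaps_after = [-1.0, 0.5, 2.0, -0.2, 1.0, -3.0]
wy_corr = conversion_rate(toy_groups, toy_gaps_after, "WY")
cy_dmg = conversion_rate(toy_groups, toy_gaps_after, "CY")
toy_ece = expected_calibration_error([0.95, 0.95, 0.15, 0.15], [1, 0, 0, 0])
```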

## 5 Results

### 5.1 Selective Calibration Is Achievable at the Readout

We first evaluate the most direct intervention point, the final confidence-template token whose forward pass produces the yes/no confidence logits. As shown in Table [2](https://arxiv.org/html/2606.09876#S5.SS1), PCHI substantially improves both calibration and discrimination at this readout position. On Qwen3-4B, ECE drops from 21.9 to 11.7 and AUROC rises from 66.5 to 90.3, while 82.2% of wrong-yes readouts are converted to $\texttt{no}$. On Gemma3-4B, ECE drops from 26.3 to 16.9 and AUROC rises from 76.8 to 81.7, with 52.5% of wrong-yes readouts converted. This correction comes with measurable correct-yes damage: 10.8% on Qwen3-4B and 11.6% on Gemma3-4B.

The activation-steering baseline uses the same probe layer as PCHI, but replaces learned head coefficients with a single hidden-state shift. For the selected confidence-template token, we compute a normalized mean-difference direction on the training set from wrong-yes activations to wrong-no activations at that layer, and add a scaled version of this direction to the current-token hidden state during inference. We sweep the scaling coefficient over $[1, 8]$ and report the best result. At the readout token, this baseline leaves ECE and AUROC close to the no-intervention baseline and converts almost no wrong-yes readouts. This contrast suggests that the improvement is not explained by a generic mean-difference shift at the same layer.
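The mean-difference baseline described above can be sketched as follows: take the difference between the two class means, normalize it to unit length, and add a scaled copy to the hidden state. The helper names, the 2-dimensional toy activations, and the scaling value are illustrative assumptions.

```python
import math


def mean_vector(rows):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def steering_direction(source_acts, target_acts):
    """Unit-norm difference of class means, pointing from source to target."""
    source_mean = mean_vector(source_acts)
    target_mean = mean_vector(target_acts)
    diff = [t - s for s, t in zip(source_mean, target_mean)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to steer along")
    return [d / norm for d in diff]


def steer(hidden, direction, alpha):
    """Add alpha times the steering direction to one hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical activations: wrong-yes as the source class, wrong-no as the target class.
wrong_yes_acts = [[0.0, 0.0], [2.0, 0.0]]
wrong_no_acts = [[1.0, 3.0], [1.0, 5.0]]
direction = steering_direction(wrong_yes_acts, wrong_no_acts)
shifted = steer([1.0, 1.0], direction, alpha=2.0)
```

Unlike PCHI, this shift is the same vector for every response at the chosen token, which is the sense in which the baseline is a generic, unconditioned intervention.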

Table 2: Readout-token intervention results. All values are percentages.

The training dynamics provide a direct diagnostic of this selectivity. Figure [2](https://arxiv.org/html/2606.09876#S5.F2) plots the mean post-intervention yes-no logit gap during training for the same readout-position PCHI runs as Table [2](https://arxiv.org/html/2606.09876#S5.SS1). In both models, the wrong-yes gap decreases sharply over training, while the correct-yes gap changes more modestly and remains on the confident side of the yes/no boundary. This behavior matches the intended objective: the learned intervention mainly reduces the margin supporting unwarranted confidence, rather than uniformly suppressing all responses that initially read out $\texttt{yes}$.

![Refer to caption](https://arxiv.org/html/2606.09876v1/x1.png)

Figure 2: Training dynamics of the yes-no logit gap for the readout-position PCHI runs in Table [2](https://arxiv.org/html/2606.09876#S5.SS1). Each panel shows the mean post-intervention gap on wrong-yes and correct-yes training examples. Faint curves are raw per-step means, and bold curves are exponential moving averages. The dashed horizontal line marks the yes/no decision boundary.

### 5.2 Selectivity Along the Answer-to-Readout Pathway

This experiment tests whether PCHI is limited to the readout token or can also act at upstream confidence\-template tokens\. This distinction matters because upstream confidence\-template tokens are generated after the answer but before the final yes/no confidence readout, so successful intervention there would show that selective calibration is not confined to the last confidence\-token computation\. We therefore apply the same probe\-conditioned head intervention at individual upstream confidence\-template tokens and at a joint set of upstream tokens, while leaving the generated answer unchanged\.

Table [3](https://arxiv.org/html/2606.09876#S5.SS2) reports PCHI at individual upstream confidence\-template tokens and at the joint token set 1–5\. For Qwen3\-4B, tokens 1 and 2 have little effect, token 3 worsens discrimination, and tokens 4 and 5 become substantially stronger\. The joint intervention over tokens 1–5 gives the best overall balance, reducing ECE to 9\.2 and raising AUROC to 91\.1 while correcting 63\.3% of wrong\-confident examples with 5\.1% correct\-yes damage\. For Gemma3\-4B, single upstream\-token interventions are much weaker: tokens 1–4 correct less than 1% of wrong\-confident examples, and token 5 corrects only 4\.5%\. The joint tokens 1–5 intervention is more useful than any single upstream token, improving AUROC from 76\.8 to 82\.1 and lowering ECE from 26\.3 to 23\.6, but it still corrects only 9\.5% of wrong\-confident examples\.
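The two selectivity quantities quoted above can be computed from paired readouts before and after intervention. The sketch below assumes the definitions implied by the text: wrong-yes correction is the share of originally wrong-yes readouts that become no, and correct-yes damage is the share of originally correct-yes readouts that become no. The function name and the toy data are hypothetical.

```python
def selectivity_rates(is_correct, before, after):
    """Return (wrong_yes_correction, correct_yes_damage) as percentages.

    is_correct: whether each generated answer is correct.
    before/after: "yes" or "no" confidence readouts without and with intervention.
    """
    wrong_yes = [a for c, b, a in zip(is_correct, before, after) if b == "yes" and not c]
    correct_yes = [a for c, b, a in zip(is_correct, before, after) if b == "yes" and c]

    def flipped_pct(readouts):
        # Share of the group whose readout changed from yes to no.
        if not readouts:
            return float("nan")
        return 100.0 * sum(r == "no" for r in readouts) / len(readouts)

    return flipped_pct(wrong_yes), flipped_pct(correct_yes)


# Toy example: three wrong-yes, four correct-yes, one wrong-no (ignored by both rates).
is_correct = [False, False, False, True, True, True, True, False]
before = ["yes"] * 7 + ["no"]
after = ["no", "no", "yes", "yes", "yes", "yes", "no", "no"]
correction, damage = selectivity_rates(is_correct, before, after)
```

Both rates are conditioned on the original readout being yes, so examples that already read out no contribute to neither number.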

Table 3: Pathway intervention results on confidence\-template tokens before the readout token\. Single\-position rows apply PCHI at one template token, while the joint row applies PCHI across tokens 1–5\. All values are percentages, and no attention mask is used\.

The weak upstream\-token results under full context suggest possible attenuation along the template\-to\-readout path\. Unlike the readout token, an upstream\-token intervention must affect later computations through attention, and its contribution can be reduced when subsequent template positions attend over the full context\. To test this possibility without changing the probe input, we apply the attention mask only after the probe layer and restrict post\-probe template queries to a specified visible context\. Table [4](https://arxiv.org/html/2606.09876#S5.SS2) evaluates this control on Qwen3\-4B for tokens 1–3\. With prompt\-only visibility, tokens 1 and 2 become strong intervention points, reaching 70\.2% and 80\.6% wrong\-yes correction with large AUROC gains\. In contrast, token 3 remains ineffective even under the prompt\-only mask, correcting only 0\.9% of wrong\-confident examples\. This outcome aligns with the probe diagnostics in Appendix [A\.2](https://arxiv.org/html/2606.09876#A1.SS2), where token 3 shows weaker WY/CY separability than neighboring template positions\.
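One way to picture the post-probe visibility restriction is as a layer-dependent boolean attention mask. The sketch below is an illustration under stated assumptions, not the paper's masking code: it assumes causal attention, leaves layers up to and including the probe layer untouched, and lets restricted template queries see the chosen visible context plus the template span itself. That last choice is an assumption made so that an upstream edit can still reach the readout; this section does not spell it out.

```python
def may_attend(layer, q, k, probe_layer, template, visible):
    """True if query position q may attend to key position k at this layer."""
    if k > q:
        return False  # causal: never look ahead
    if layer <= probe_layer or q not in template:
        return True  # probe input and non-template queries stay unrestricted
    # Post-probe template query: visible context plus the template span (assumed).
    return k in visible or k in template


def build_mask(seq_len, layer, probe_layer, template, visible):
    """Row q, column k of the result says whether q may attend to k."""
    return [
        [may_attend(layer, q, k, probe_layer, template, visible) for k in range(seq_len)]
        for q in range(seq_len)
    ]


# Toy layout: prompt at 0-1, answer at 2-3, confidence template at 4-5.
prompt, answer, template = {0, 1}, {2, 3}, {4, 5}
at_probe = build_mask(6, layer=18, probe_layer=18, template=template, visible=prompt)
after_probe = build_mask(6, layer=19, probe_layer=18, template=template, visible=prompt)
```

With prompt-only visibility, the template rows of `after_probe` no longer attend to the answer positions, while the mask at the probe layer is still plain causal attention, so the probe sees the same hidden states it was trained on.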

Table 4: Attention\-mask analysis for Qwen3\-4B on early confidence\-template positions\. Each row applies PCHI at the indicated token while restricting the visible context of post\-probe template queries\. All values are percentages\.

The learned coefficients provide a complementary diagnostic for why the same post\-probe visibility restriction helps some template positions but not others\. Figure [3](https://arxiv.org/html/2606.09876#S5.F3) visualizes representative Qwen3\-4B settings\. Token 3 with prompt\-only visibility remains essentially at the identity, whereas token 2 under the same visibility learns sparse non\-identity coefficients\. The readout token and the joint tokens 1–5 setting also learn sparse non\-identity patterns\. This contrast supports the interpretation that attention masking does not by itself create correction; it can expose an effective intervention path only when the selected template coordinate provides usable wrong\-confident evidence\.
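The identity versus non-identity distinction can be made concrete by counting heads whose learned coefficient departs from 1 by more than a threshold; Figure 3 uses $|g_h - 1| > 0.5$. The sketch below applies that rule to a hypothetical dictionary of per-layer coefficients. The data layout and the numbers are illustrative assumptions.

```python
def active_heads(gains_by_layer, threshold=0.5):
    """Map each layer to the indices of heads with |g_h - 1| above the threshold."""
    return {
        layer: [h for h, g in enumerate(gains) if abs(g - 1.0) > threshold]
        for layer, gains in gains_by_layer.items()
    }


def count_active(gains_by_layer, threshold=0.5):
    """Total number of active heads across all intervention layers."""
    return sum(len(heads) for heads in active_heads(gains_by_layer, threshold).values())


# Made-up coefficients for two layers with three heads each.
gains = {19: [1.0, 0.2, 1.1], 20: [1.6, 1.0, 0.5]}
per_layer = active_heads(gains)
total = count_active(gains)
```

A setting that stays at the identity, such as token 3 under the prompt-only mask, would yield a count of zero under this rule.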

![Refer to caption](https://arxiv.org/html/2606.09876v1/x2.png)

Figure 3: Learned head coefficients for representative Qwen3\-4B intervention settings\. Rows are settings and columns are attention heads grouped by layers 19–22; each cell shows $g_h - 1$, with white indicating identity behavior\. T3 Prompt and T2 Prompt apply PCHI under prompt\-only post\-probe visibility at tokens 3 and 2, respectively\. T6 Full is the readout\-token setting, and J1–5 Full denotes joint intervention over tokens 1–5\. Active\-head counts with $|g_h - 1| > 0.5$ are 0, 32, 35, and 48, respectively\.

Gemma3\-4B shows that context restriction can substantially strengthen joint upstream\-token correction while preserving most correct\-yes cases\. Table [5](https://arxiv.org/html/2606.09876#S5.T5) evaluates the joint tokens 1–5 intervention while varying the visible context\. The full\-context setting is conservative, with only 9\.5% wrong\-yes correction and 0\.7% correct\-yes damage\. Restricting visibility to the answer field yields the strongest correction and best ECE, raising wrong\-yes correction to 40\.4% and lowering ECE to 17\.3\. This setting lowers AUROC relative to full context and increases correct\-yes damage to 6\.1%, but the damage increase is limited compared with the 30\.9\-point gain in wrong\-yes correction, and AUROC remains above the no\-intervention baseline in Table [2](https://arxiv.org/html/2606.09876#S5.SS1)\.
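The ECE figures quoted here follow the usual binned definition of expected calibration error. The sketch below is a generic version of that metric, offered under assumptions: it uses equal-width bins, treats each example's confidence as a number in $[0,1]$ (for instance the probability placed on yes), and reports a percentage. The binning actually used in the experiments is defined outside this section.

```python
def expected_calibration_error(confidences, is_correct, num_bins=10):
    """Binned ECE in percent: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, is_correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return 100.0 * ece


# Toy example: two confident answers (one wrong) and two unconfident wrong answers.
ece = expected_calibration_error([0.95, 0.95, 0.15, 0.15], [True, False, False, False])
```

In the toy example most of the error comes from the confident bin, where accuracy is 0.5 against a mean confidence of 0.95. Lowering confidence on the wrong answer in that bin is the kind of change that reduces ECE without touching the correct one.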

Table 5: Gemma3\-4B attention\-mask analysis for joint early\-token intervention\. PCHI is applied across confidence\-template tokens 1–5 while varying the visible context of post\-probe template queries\. All values are percentages\.

### 5\.3 Ablation Study

Finally, we test whether the attention mask alone explains the upstream\-token gains\. Table [6](https://arxiv.org/html/2606.09876#S5.T6) compares a prompt\-only mask without head intervention to PCHI under the same visible\-context restriction on Qwen3\-4B\. Masking alone slightly worsens ECE and AUROC relative to no intervention and corrects only 1\.0% of wrong\-confident examples\. Under the same prompt\-only context, PCHI at token 1 corrects 70\.2% of wrong\-confident examples and PCHI at token 2 corrects 80\.6%\. Token 3, however, remains close to the mask\-only behavior\. These controls indicate that the gains require both the learned head intervention and a template coordinate with usable wrong\-confident evidence\. The attention mask by itself is insufficient\.

Table 6: Mask\-only ablation on Qwen3\-4B\. The prompt\-only mask is compared against PCHI under the same visible\-context restriction\. Masking alone does not explain the large correction gains at tokens 1 and 2, while token 3 remains close to the mask\-only behavior\. All values are percentages\.

## 6 Conclusion

This paper treats verbalized confidence not only as an output to be recalibrated, but as an internal computation that can be selectively controlled\. We introduced probe\-conditioned head intervention: a frozen probe estimates, at a confidence\-template position, whether a confident response is likely wrong, and a learned head\-level intervention rescales downstream attention\-head outputs only in proportion to that evidence\. On structured mathematical self\-evaluation, this lets the model suppress many unwarranted yes confidence readouts while preserving most warranted ones\.
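To make the idea of rescaling "in proportion to that evidence" concrete, the sketch below interpolates each head's gain between the identity and a learned coefficient $g_h$ using the probe output. This is one plausible form, assumed for illustration: the logistic linear probe, the interpolation $1 + p\,(g_h - 1)$, and all names and numbers are hypothetical, and the paper's exact parameterization is given in its method section rather than here.

```python
import math


def probe_probability(hidden, weights, bias):
    """Frozen linear probe with a logistic output (assumed form)."""
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def conditional_gains(p_wrong, learned_gains):
    """Interpolate from identity (p_wrong = 0) to the learned gains (p_wrong = 1)."""
    return [1.0 + p_wrong * (g - 1.0) for g in learned_gains]


def rescale_heads(head_outputs, gains):
    """Multiply each attention head's output vector by its scalar gain."""
    return [[gain * x for x in head] for head, gain in zip(head_outputs, gains)]


# Made-up coefficients and head outputs for three heads.
learned = [1.0, 0.0, 2.0]
heads = [[1.0, 1.0], [2.0, -2.0], [0.5, 0.5]]
p = probe_probability([0.0, 0.0], weights=[1.0, -1.0], bias=0.0)
rescaled = rescale_heads(heads, conditional_gains(p, learned))
```

Under this form, an input the probe scores near zero passes through unchanged, which is how a conditional intervention can leave most correct-yes readouts intact while still acting on wrong-yes ones.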

The results suggest a more fine\-grained view of overconfidence correction\. At the readout token, wrong\-yes confidence can be directly suppressed, most strongly on Qwen3\-4B\. At upstream confidence\-template tokens, the effect depends on position, model, and visible context, showing that confidence\-relevant evidence is not uniformly usable along the answer\-to\-readout pathway\. Together with the mask\-only ablation, these findings indicate that the gains are not simply a byproduct of restricting attention or globally lowering confidence, but arise from evidence\-conditioned changes to downstream head computation\.

More broadly, PCHI shows that calibration need not be treated only as a post\-hoc mapping from scores to probabilities\. When confidence\-relevant evidence is available inside the model, selective internal interventions can reduce unwarranted confidence while preserving much of the confidence assigned to correct answers\. Extending this idea beyond structured binary confidence fields is an important next step toward models whose expressed uncertainty is both useful and trustworthy\.

## Limitations

Our experiments are limited to mathematical question answering with a structured JSON output format and a binary yes/no confidence field\. This setting makes the confidence template explicit and allows the template tokens to be fixed during inference\. The method has not yet been tested on free\-form confidence expressions, multi\-level confidence scales, or open\-ended tasks where the boundary between answer generation and confidence assessment is less controlled\.

The intervention also depends on model\- and data\-specific probe training\. Probe layers, intervention layers, template positions, and attention masks are selected from diagnostics on the training distribution, and different models show different effective positions and correction\-damage trade\-offs\. Extending the approach to broader model families and task distributions will require testing how stable these choices are and whether they can be selected automatically\.

Finally, PCHI provides evidence of selective control, but it is not a complete mechanistic explanation of verbalized confidence\. Probe separability shows that wrong\-confident evidence is decodable at confidence\-template positions, and the intervention results show that downstream head outputs can alter the final confidence readout\. The learned scalar coefficients do not by themselves explain how individual heads encode or combine correctness\-related information, leaving circuit\-level analysis as an important direction for future work\.

## References

- A\. Azaria and T\. Mitchell \(2023\)The internal state of an llm knows when it’s lying\.External Links:2304\.13734,[Link](https://arxiv.org/abs/2304.13734)Cited by:[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Geng, F\. Cai, Y\. Wang, H\. Koeppl, P\. Nakov, and I\. Gurevych \(2024\)A survey of confidence estimation and calibration in large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 6577–6595\.External Links:[Link](https://aclanthology.org/2024.naacl-long.366/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.366)Cited by:[§1](https://arxiv.org/html/2606.09876#S1.p1.1)\.
- C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\)On calibration of modern neural networks\.External Links:1706\.04599,[Link](https://arxiv.org/abs/1706.04599)Cited by:[§1](https://arxiv.org/html/2606.09876#S1.p1.1),[§1](https://arxiv.org/html/2606.09876#S1.p2.1),[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Joy, F\. Pinto, S\. Lim, P\. H\. S\. Torr, and P\. K\. Dokania \(2022\)Sample\-dependent adaptive temperature scaling for improved calibration\.External Links:2207\.06211,[Link](https://arxiv.org/abs/2207.06211)Cited by:[§1](https://arxiv.org/html/2606.09876#S1.p2.1),[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Hatfield\-Dodds, N\. DasSarma, E\. Tran\-Johnson, S\. Johnston, S\. El\-Showk, A\. Jones, N\. Elhage, T\. Hume, A\. Chen, Y\. Bai, S\. Bowman, S\. Fort, D\. Ganguli, D\. Hernandez, J\. Jacobson, J\. Kernion, S\. Kravec, L\. Lovitt, K\. Ndousse, C\. Olsson, S\. Ringer, D\. Amodei, T\. Brown, J\. Clark, N\. Joseph, B\. Mann, S\. McCandlish, C\. Olah, and J\. Kaplan \(2022\)Language models \(mostly\) know what they know\.External Links:2207\.05221,[Link](https://arxiv.org/abs/2207.05221)Cited by:[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Khanmohammadi, E\. Miahi, M\. Mardikoraem, S\. Kaur, I\. Brugere, C\. Smiley, K\. S\. Thind, and M\. M\. Ghassemi \(2025\)Calibrating LLM confidence by probing perturbed representation stability\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 10448–10514\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.530/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.530),ISBN 979\-8\-89176\-332\-6Cited by:[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px1.p1.1)\.
- D\. Kumaran, A\. Conmy, F\. Barbero, S\. Osindero, V\. Patraucean, and P\. Veličković \(2026\)How do llms compute verbal confidence\.External Links:2603\.17839,[Link](https://arxiv.org/abs/2603.17839)Cited by:[§1](https://arxiv.org/html/2606.09876#S1.p4.1),[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px3.p1.1),[§4](https://arxiv.org/html/2606.09876#S4.SS0.SSS0.Px1.p3.1)\.
- B\. W\. Lee, I\. Padhi, K\. N\. Ramamurthy, E\. Miehling, P\. Dognin, M\. Nagireddy, and A\. Dhurandhar \(2025\)Programming refusal with conditional activation steering\.External Links:2409\.05907,[Link](https://arxiv.org/abs/2409.05907)Cited by:[§1](https://arxiv.org/html/2606.09876#S1.p3.1),[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Leng, C\. Huang, B\. Zhu, and J\. Huang \(2025\)Taming overconfidence in llms: reward calibration in rlhf\.External Links:2410\.09724,[Link](https://arxiv.org/abs/2410.09724)Cited by:[§1](https://arxiv.org/html/2606.09876#S1.p2.1),[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Li, O\. Patel, F\. Viégas, H\. Pfister, and M\. Wattenberg \(2024\)Inference\-time intervention: eliciting truthful answers from a language model\.External Links:2306\.03341,[Link](https://arxiv.org/abs/2306.03341)Cited by:[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\)Teaching models to express their uncertainty in words\.External Links:2205\.14334,[Link](https://arxiv.org/abs/2205.14334)Cited by:[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px1.p1.1)\.
- M\. MacDiarmid, T\. Maxwell, N\. Schiefer, J\. Mu, J\. Kaplan, D\. Duvenaud, S\. Bowman, A\. Tamkin, E\. Perez, M\. Sharma, C\. Denison, and E\. Hubinger \(2024\)External Links:[Link](https://www.anthropic.com/news/probes-catch-sleeper-agents)Cited by:[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px3.p1.1)\.
- G\. Maraia, L\. Ranaldi, M\. Valentino, and F\. M\. Zanzotto \(2026\)Can activation steering generalize across languages? a study on syllogistic reasoning in language models\.InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),V\. Demberg, K\. Inui, and L\. Marquez \(Eds\.\),Rabat, Morocco,pp\. 2739–2753\.External Links:[Link](https://aclanthology.org/2026.eacl-long.125/),[Document](https://dx.doi.org/10.18653/v1/2026.eacl-long.125),ISBN 979\-8\-89176\-380\-7Cited by:[§1](https://arxiv.org/html/2606.09876#S1.p3.1),[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px2.p1.1)\.
- M\. M\. Miao and L\. Ungar \(2026\)Closing the confidence\-faithfulness gap in large language models\.External Links:2603\.25052,[Link](https://arxiv.org/abs/2603.25052)Cited by:[§1](https://arxiv.org/html/2606.09876#S1.p3.1),[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Nam, H\. Conklin, Y\. Yang, T\. Griffiths, J\. Cohen, and S\. Leslie \(2025\)Causal head gating: a framework for interpreting roles of attention heads in transformers\.External Links:2505\.13737,[Link](https://arxiv.org/abs/2505.13737)Cited by:[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2606.09876#S4.SS0.SSS0.Px1.p1.1)\.
- S\. Toshniwal, W\. Du, I\. Moshkov, B\. Kisacanin, A\. Ayrapetyan, and I\. Gitman \(2024\)OpenMathInstruct\-2: accelerating ai for math with massive open\-source instruction data\.arXiv preprint arXiv:2410\.01560\.Cited by:[§4](https://arxiv.org/html/2606.09876#S4.SS0.SSS0.Px1.p1.1)\.
- A\. M\. Turner, L\. Thiergart, G\. Leech, D\. Udell, J\. J\. Vazquez, U\. Mini, and M\. MacDiarmid \(2024\)Steering language models with activation engineering\.External Links:2308\.10248,[Link](https://arxiv.org/abs/2308.10248)Cited by:[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Wehner, S\. Abdelnabi, D\. Tan, D\. Krueger, and M\. Fritz \(2025\)Taxonomy, opportunities, and challenges of representation engineering for large language models\.External Links:2502\.19649,[Link](https://arxiv.org/abs/2502.19649)Cited by:[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Xie, A\. S\. Chen, Y\. Lee, E\. Mitchell, and C\. Finn \(2024\)Calibrating language models with adaptive temperature scaling\.External Links:2409\.19817,[Link](https://arxiv.org/abs/2409.19817)Cited by:[§1](https://arxiv.org/html/2606.09876#S1.p2.1),[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Xiong, Z\. Hu, X\. Lu, Y\. Li, J\. Fu, J\. He, and B\. Hooi \(2024\)Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms\.External Links:2306\.13063,[Link](https://arxiv.org/abs/2306.13063)Cited by:[§1](https://arxiv.org/html/2606.09876#S1.p1.1)\.
- T\. Zhao, Y\. He, W\. Zheng, Y\. Zhang, and C\. Chen \(2026\)Wired for overconfidence: a mechanistic perspective on inflated verbalized confidence in llms\.External Links:2604\.01457,[Link](https://arxiv.org/abs/2604.01457)Cited by:[§1](https://arxiv.org/html/2606.09876#S1.p4.1),[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Zou, L\. Phan, S\. Chen, J\. Campbell, P\. Guo, R\. Ren, A\. Pan, X\. Yin, M\. Mazeika, A\. Dombrowski, S\. Goel, N\. Li, M\. J\. Byun, Z\. Wang, A\. Mallen, S\. Basart, S\. Koyejo, D\. Song, M\. Fredrikson, J\. Z\. Kolter, and D\. Hendrycks \(2025\)Representation engineering: a top\-down approach to ai transparency\.External Links:2310\.01405,[Link](https://arxiv.org/abs/2310.01405)Cited by:[§2](https://arxiv.org/html/2606.09876#S2.SS0.SSS0.Px2.p1.1)\.

## Appendix A Implementation Details

### A\.1 Generation Prompt

``

### A\.2 Probe Diagnostics and Layer Selection

Figure 4 reports the training\-set AUROC of the WY\-vs\-CY probes across confidence\-template positions and transformer layers\. We use training\-set diagnostics for layer selection so that the validation split remains reserved for final intervention evaluation\. The heatmaps make the layer selection criterion explicit: the intervention can only act on wrong\-vs\-correct\-confident evidence that is already linearly decodable at the probe layer, so we look for a layer where WY/CY separability is high while several downstream layers remain available for intervention in the same forward pass\.

Two features of the heatmaps drive the choice\. First, separability is not present from the start: in both models the early layers are close to chance, and the AUROC rises into the middle of the network before saturating\. This is why the probe layer is placed in the middle rather than early: earlier layers do not yet carry a reliably decodable wrong\-confident signal, and intervening before that signal exists would have nothing to condition on\. Second, because we restrict the intervention to layers strictly after the probe layer \(so that the hidden states the probe consumes stay on their training distribution\), the probe layer must be early enough to leave a usable band of downstream layers\. The selected layers satisfy both constraints: for Qwen3\-4B, layer 18 combines strong separability across the template with four clean downstream layers, so we probe at layer 18 and intervene at layers 19–22; for Gemma3\-4B, the analogous point is layer 16, with intervention at layers 17–20\.

The heatmaps also explain two results reported in the main text\. The Qwen3\-4B token\-3 row is visibly weaker than its neighbors \(tokens 2 and 4\) across most layers, indicating that this template coordinate carries a weaker wrong\-confident signal; this lower separability is consistent with token 3 being an ineffective intervention point in Section 5\.2, even under attention masking\. The token corresponds to a low\-information position in the confidence\-template span \("\_conf"\), which plausibly accounts for its weak separability relative to adjacent positions\. More broadly, Gemma3\-4B is systematically less separable than Qwen3\-4B: its AUROC values are lower across nearly all positions and layers\. This weaker and less peaked separability is consistent with the smaller and more context\-dependent intervention effect we observe for Gemma throughout Section 5\.2: when the wrong\-confident signal available to the probe is weaker, the conditional head intervention has correspondingly less evidence to act on\.

Figure 4: Training\-set AUROC of WY\-vs\-CY confidence\-template probes across template positions and transformer layers\. Dashed outlines mark the selected probe layers, and solid outlines mark the downstream intervention layers\.
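The two constraints on the probe layer can be expressed as a small selection routine. The sketch below is a hypothetical rule, not the procedure used in the paper, which selects layers by inspecting the heatmaps: it computes AUROC as a pairwise ranking statistic and then picks the most separable layer among those that leave a fixed number of downstream layers free. The function names and toy values are assumptions.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_probe_layer(auroc_by_layer, num_downstream):
    """Most separable layer that still leaves num_downstream later layers (earliest on ties)."""
    eligible = range(len(auroc_by_layer) - num_downstream)
    probe_layer = max(eligible, key=lambda layer: auroc_by_layer[layer])
    intervention_layers = list(range(probe_layer + 1, probe_layer + 1 + num_downstream))
    return probe_layer, intervention_layers


# Toy probe scores (label 1 = wrong-yes, 0 = correct-yes) and a made-up per-layer profile.
example_auroc = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
profile = [0.50, 0.55, 0.70, 0.85, 0.84, 0.86, 0.86]
probe_layer, intervention_layers = select_probe_layer(profile, num_downstream=2)
```

In the toy profile the last two layers are slightly more separable, but they are excluded because no downstream layers would remain. This mirrors the trade-off described above between probing late enough for the signal to exist and early enough to leave room for intervention.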