# Where Reliability Lives in Vision–Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits
Source: [https://arxiv.org/html/2605.08200](https://arxiv.org/html/2605.08200)
Logan Mann¹·∗, Ajit Saravanan¹, Ishan Dave², Shikhar Shiromani³, Saadullah Ismail⁴, Yi Xia⁴, Emily Huang⁵ — ¹UC Santa Barbara, ²UC Berkeley, ³NVIDIA, ⁴Algoverse AI Research, ⁵Brown University. ∗Correspondence: loganmann@ucsb.edu
###### Abstract
A pervasive intuition holds that vision–language models (VLMs) are most trustworthy when their attention maps look sharp: concentrated attention on the queried region should imply a confident, calibrated answer. We test this *Attention–Confidence Assumption* directly. We instrument three open-weight VLM families (LLaVA-1.5, PaliGemma, Qwen2-VL; 3–7B parameters) with a unified mechanistic pipeline—the *VLM Reliability Probe* (Vrp)—that compares attention structure, generation dynamics, and hidden-state geometry against a single correctness label. Three results emerge. (i) Attention structure is a near-zero predictor of correctness ($R_{\mathrm{pb}}(C_k, y) = 0.001$, 95% CI $[-0.034, 0.036]$; $R_{\mathrm{pb}}(H_{\mathrm{s}}, y) = -0.012$, $[-0.047, 0.024]$ on a pooled $n = 3{,}090$ split), even though attention remains *causally* necessary for feature extraction (top-30% patch masking drops accuracy by 8.2–11.3 pp, $p < 0.001$). (ii) Reliability becomes legible later in the computation: a single hidden-state linear probe reaches $\mathrm{AUROC} > 0.95$ on POPE for two of three families, and self-consistency at $K = 10$ is the strongest behavioral predictor we measure, at $10\times$ inference cost ($R_{\mathrm{pb}} = 0.43$). (iii) Causal neuron-level ablations expose a sharp architectural split with direct monitor-design implications: late-fusion LLaVA concentrates reliability in a fragile late bottleneck ($-8.3$ pp object-identification accuracy after top-5 probe-neuron ablation), whereas early-fusion PaliGemma and Qwen2-VL distribute it widely and absorb destruction of $\sim$50% of their peak-layer hidden dimension with $\leq 1$ pp degradation. The takeaway is narrow but consequential: in 3–7B VLMs, reliability is read more reliably off hidden-state geometry, layer-wise margin formation, and sparse late-layer circuits than off attention-map sharpness.
†† Accepted at the *ICLR 2026 Workshop on Multimodal Reasoning*.

## 1 Introduction
Vision–language models can answer richly compositional questions about images, yet routinely produce *fluent* mistakes: confident, well-formed answers that are not supported by the pixels they purport to describe [[18](https://arxiv.org/html/2605.08200#bib.bib1), [3](https://arxiv.org/html/2605.08200#bib.bib3), [27](https://arxiv.org/html/2605.08200#bib.bib4)]. For deployment in settings where errors carry cost (scientific image analysis, medical triage, robotic perception), we need reliability signals that are simultaneously *predictive of correctness* and *mechanistically interpretable*. This raises a sharp interpretability question: where, inside a VLM, is the information that distinguishes a correct answer from an incorrect one?
A natural and visually intuitive hypothesis is that reliability lives in attention. Cross-attention maps are easy to extract, easy to visualize, and are frequently treated as a window onto what the model "used" to produce its answer [[12](https://arxiv.org/html/2605.08200#bib.bib16), [29](https://arxiv.org/html/2605.08200#bib.bib17)]. We refer to the operationalization of this intuition as the *Attention–Confidence Assumption*: *if a VLM concentrates its visual attention on the relevant region, the resulting answer should be more trustworthy; diffuse attention should signal lower reliability*. The Attention–Confidence Assumption is strictly stronger than the (well-supported) claim that attention is causally involved in computation. It additionally requires that the *structure* of attention (its sharpness, fragmentation, or entropy) be calibrated to the model's probability of being right.
We test this assumption head-on. We introduce the *VLM Reliability Probe* (Vrp), a unified mechanistic pipeline that instruments three open VLM families (LLaVA-1.5-7B, PaliGemma-3B, Qwen2-VL-7B) and compares attention structure against generation dynamics and hidden-state readouts on the same inputs and the same correctness labels. Vrp extracts cross-attention tensors, hidden states, and per-token confidences via forward hooks; reduces attention to per-layer spatial vectors and structural summaries (entropy $H_{\mathrm{s}}$, secondary-component count $C_k$); applies the logit lens [[22](https://arxiv.org/html/2605.08200#bib.bib22)] to track when the correct token first separates from competitors in the residual stream; trains $L_1$-regularized linear probes to localize sparse reliability circuits; and validates findings with targeted neuron ablation and patch masking.
#### Findings.
Three results emerge across families. (i) Attention *structure* is a near-zero predictor of correctness, even though attention remains causally necessary for feature extraction; a supervised non-linear ensemble over 32 attention layers tops out at $\mathrm{AUROC} = 0.725$. (ii) Reliability becomes legible only later: the logit-lens truth margin peaks deep in the stack and is dominated by MLP residual contributions ($\sim$70–82%), and single hidden-state probes reach $\mathrm{AUROC} > 0.95$ on POPE for LLaVA and Qwen2-VL. (iii) Architectures organize this signal differently—LLaVA concentrates it in a fragile late bottleneck, whereas PaliGemma and Qwen2-VL distribute it across a wide manifold robust to massive ablation.
#### Contributions.
We (i) pose and falsify the Attention–Confidence Assumption under a uniform protocol across three VLM families and four benchmarks; (ii) map *when and where* reliability becomes linearly decodable using logit-lens trajectories, $L_1$-regularized neuron probes, and residual-update analysis; (iii) provide causal evidence—negative (top-$k$ and random ablation, MLP bypass) and positive (top-30% patch masking)—that the located circuit is not merely correlational, and document a sharp robustness asymmetry across families; and (iv) extend a probing literature [[4](https://arxiv.org/html/2605.08200#bib.bib24), [21](https://arxiv.org/html/2605.08200#bib.bib25), [10](https://arxiv.org/html/2605.08200#bib.bib26)] so far applied mostly to text-only models, arguing that VLM monitor design should prefer hidden-state and consistency-based signals over attention-map heuristics.
## 2 Related Work
#### Vision–language models and hallucination benchmarks.
Large VLMs build on contrastive and encoder–decoder vision–language pretraining combined with strong language backbones, enabling instruction following and open-ended multimodal generation [[23](https://arxiv.org/html/2605.08200#bib.bib7), [16](https://arxiv.org/html/2605.08200#bib.bib5), [1](https://arxiv.org/html/2605.08200#bib.bib6), [18](https://arxiv.org/html/2605.08200#bib.bib1), [6](https://arxiv.org/html/2605.08200#bib.bib2), [3](https://arxiv.org/html/2605.08200#bib.bib3), [27](https://arxiv.org/html/2605.08200#bib.bib4)]. Their fluency makes reliability difficult to judge: models produce confident answers that are weakly grounded in the image. This concern has motivated benchmark-driven work on object hallucination and multimodal evaluation, including POPE, LLaVA-Bench, MME, SEED-Bench, MM-Vet, and the CHAIR family [[17](https://arxiv.org/html/2605.08200#bib.bib8), [31](https://arxiv.org/html/2605.08200#bib.bib9), [7](https://arxiv.org/html/2605.08200#bib.bib10), [15](https://arxiv.org/html/2605.08200#bib.bib11), [30](https://arxiv.org/html/2605.08200#bib.bib12), [24](https://arxiv.org/html/2605.08200#bib.bib13)]. These benchmarks establish *where* models fail; they do not, by themselves, locate *where* the failure-relevant computation lives.
#### Attention as explanation.
Whether attention is a faithful explanation of model behavior has been debated in NLP [[12](https://arxiv.org/html/2605.08200#bib.bib16), [29](https://arxiv.org/html/2605.08200#bib.bib17), [25](https://arxiv.org/html/2605.08200#bib.bib18)]. For VLMs, recent evidence shows that correct localization and correct answering can come apart: models often attend to the right region while reasoning incorrectly about it [[19](https://arxiv.org/html/2605.08200#bib.bib20)]. Saliency- and attribution-based interpretability [[5](https://arxiv.org/html/2605.08200#bib.bib19)] provides finer spatial maps, but the question of whether *any* spatial summary of attention predicts correctness has not been answered cleanly across families. We target precisely that question.
#### Mechanistic interpretability and probing for truthfulness.
A growing literature reads model state for evidence of correctness or truthfulness. Burns et al. [[4](https://arxiv.org/html/2605.08200#bib.bib24)] discover linear directions associated with truthful belief in language models without supervision; Marks and Tegmark [[21](https://arxiv.org/html/2605.08200#bib.bib25)] show that truthful and false statements separate along a low-dimensional geometry in the residual stream; and Geva et al. [[10](https://arxiv.org/html/2605.08200#bib.bib26), [9](https://arxiv.org/html/2605.08200#bib.bib27)] characterize the role of MLP layers as key–value memories that promote tokens in the vocabulary space. The logit lens [[22](https://arxiv.org/html/2605.08200#bib.bib22)] and tuned-lens variants [[2](https://arxiv.org/html/2605.08200#bib.bib23)] provide layer-wise readouts of the residual stream. To date, these tools have been applied mostly to text-only models. Long et al. [[20](https://arxiv.org/html/2605.08200#bib.bib21)] introduce a hidden-state perspective on VLMs via the Visual Integration Point. Our work combines these perspectives in an explicitly mechanistic pipeline that compares attention structure, layer-wise hidden-state readouts, sparse unit-level probes, and causal interventions within a single cross-family analysis of VLM reliability.
#### Behavioral reliability.
Self-consistency [[28](https://arxiv.org/html/2605.08200#bib.bib29)] aggregates agreement across sampled reasoning paths; semantic entropy [[14](https://arxiv.org/html/2605.08200#bib.bib30)] and p(True) self-evaluation [[13](https://arxiv.org/html/2605.08200#bib.bib31)] extend this to free-form output. We include self-consistency as a strong behavioral baseline and compare it directly against single-pass internal readouts.
## 3 The VLM Reliability Probe
We instrument each model with forward hooks that record (i) cross-attention tensors $A^{(l,h)} \in \mathbb{R}^{T \times S}$ at every decoder layer $l$ and head $h$ (where $T$ is the number of generated answer tokens and $S$ is the number of image patches), (ii) residual hidden states $h^{(\ell)} \in \mathbb{R}^{d}$ at every layer, and (iii) per-token output probabilities. From these signals we derive three families of metrics; see Figure [1](https://arxiv.org/html/2605.08200#S3.F1). The pipeline is designed to disentangle two competing hypotheses:
**H1: Structural Hypothesis.** Reliability is grounded in the spatial coherence of the visual encoder's attention, namely *how the model looks*.
**H2: Mechanistic–Consistency Hypothesis.** Reliability emerges from generation dynamics and the geometry of late-layer hidden states, namely *what the model is converging toward*.
### 3.1 Stage 1: Structural Metrics from Attention
For each layer $l$, we average $A^{(l,h)}$ over heads and over answer-token positions to obtain a single spatial vector $m^{(l)} \in \mathbb{R}^{S}$ over image patches, then normalize to a probability distribution $\tilde{m}^{(l)}$. We summarize this distribution with two structural quantities:

$$H_{\mathrm{s}}^{(l)} = -\sum_{s=1}^{S} \tilde{m}^{(l)}_{s} \log \tilde{m}^{(l)}_{s} \qquad \text{(spatial entropy)} \tag{1}$$

$$C_{k}^{(l)} = K_{\mathrm{tot}}^{(l)} - 1 \qquad \text{(secondary-component count)} \tag{2}$$

To compute $K_{\mathrm{tot}}^{(l)}$, we threshold $\tilde{m}^{(l)}$ at the top 30% of attention mass, binarize on the patch grid, and count connected components under 4-neighbor adjacency, mirroring the saliency-thresholding convention used in attention-based interpretability [[5](https://arxiv.org/html/2605.08200#bib.bib19)]. $K_{\mathrm{tot}}^{(l)} = 1$ corresponds to a single contiguous focus, hence $C_{k}^{(l)} = 0$. Throughout the paper we report $C_k$ rather than $K_{\mathrm{tot}}$ unless explicitly noted, so that "zero" corresponds to the maximally focused case. We also track layer-wise attention-evolution deltas $\Delta H_{\mathrm{s}}^{(l)} = H_{\mathrm{s}}^{(l)} - H_{\mathrm{s}}^{(l-1)}$ to characterize how attention sharpens or diffuses through the stack. As a robustness check, we re-run all attention analyses with a DBSCAN variant ($\varepsilon = 1.5$, $\mathrm{min\_samples} = 3$); results agree to within $\pm 0.01$ in $R_{\mathrm{pb}}$.
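For concreteness, Eqs. (1)–(2) reduce to a short sketch (illustrative, not the released Vrp code; the `grid` argument and the reading of the 30% threshold as the smallest patch set holding that attention mass are our assumptions):

```python
import numpy as np
from scipy.ndimage import label

def spatial_entropy(m):
    """H_s (Eq. 1): Shannon entropy of a patch-attention distribution m."""
    m = m / m.sum()
    nz = m[m > 0]
    return float(-(nz * np.log(nz)).sum())

def secondary_components(m, grid):
    """C_k (Eq. 2): connected components minus one, after keeping the
    smallest patch set holding 30% of attention mass and binarizing on a
    grid x grid layout (scipy's default structure is 4-neighbor adjacency)."""
    m = m / m.sum()
    order = np.argsort(m)[::-1]                      # patches by mass, descending
    cutoff = np.searchsorted(np.cumsum(m[order]), 0.30) + 1
    mask = np.zeros(m.size, dtype=bool)
    mask[order[:cutoff]] = True
    _, k_tot = label(mask.reshape(grid, grid))       # count connected components
    return max(k_tot - 1, 0)
```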
### 3.2 Stage 2: Mechanistic Readouts via the Logit Lens and Probes
Let $W_U \in \mathbb{R}^{|V| \times d}$ denote the unembedding matrix and let $z_{\ell} = W_U\,\mathrm{LN}(h^{(\ell)}) \in \mathbb{R}^{|V|}$ be the layer-$\ell$ logit-lens projection [[22](https://arxiv.org/html/2605.08200#bib.bib22)], where $\mathrm{LN}$ is the model's final layer norm applied to the residual stream. We define the *truth margin*

$$\Delta M_{\ell} = z_{\ell}(y^{\star}) - \max_{y \neq y^{\star}} z_{\ell}(y), \tag{3}$$

where $y^{\star}$ is the reference answer token under our evaluation protocol (§[4](https://arxiv.org/html/2605.08200#S4)). For closed-form benchmarks (POPE, yes/no) $y^{\star}$ is unambiguous; for open-ended benchmarks we follow the protocol in §[4](https://arxiv.org/html/2605.08200#S4) and use the first content token of the canonicalized ground-truth answer string, mirroring the convention adopted in recent logit-lens analyses of multimodal models [[20](https://arxiv.org/html/2605.08200#bib.bib21)].
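A minimal sketch of the margin computation, assuming `final_norm` and `unembed` stand in for the model's own final norm and `lm_head` (names illustrative):

```python
import torch

@torch.no_grad()
def truth_margins(hidden_per_layer, final_norm, unembed, gold_id):
    """ΔM_ℓ (Eq. 3) per layer: gold-token logit minus the best competitor
    under the logit-lens projection z_ℓ = W_U LN(h^(ℓ)).

    hidden_per_layer: list of [d] residual states at the answer position.
    """
    margins = []
    for h in hidden_per_layer:
        z = unembed(final_norm(h))
        gold = z[gold_id].item()
        z = z.clone()
        z[gold_id] = float("-inf")   # exclude y* before taking the max
        margins.append(gold - z.max().item())
    return margins
```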
At every layer we additionally train a learned probe $f_{\ell} : \mathbb{R}^{d} \to [0,1]$ predicting binary correctness from $h^{(\ell)}$ alone. We report two variants: (a) a logistic probe with $L_2$ regularization (dense), and (b) a logistic probe with $L_1$ regularization at $\lambda = 0.1$ (sparse). The sparse probe selects compact units that we use for the neuron-level and causal ablation analyses in §[5.3](https://arxiv.org/html/2605.08200#S5.SS3). To attribute the layer-wise growth of $\Delta M_{\ell}$, we decompose the residual update at layer $\ell$ into its MLP and attention contributions and report their relative magnitudes, following Geva et al. [[9](https://arxiv.org/html/2605.08200#bib.bib27)].
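The two probe variants admit a compact sketch with off-the-shelf logistic regression (the released probes are trained with Adam, §4; sklearn's convex solvers are a stand-in here, and the $\lambda \mapsto C$ mapping is approximate):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_correctness_probes(H, y, lam=0.1):
    """Dense (L2) and sparse (L1) correctness probes on layer-ℓ states.

    H: [n, d] hidden states; y: [n] binary correctness labels.
    C = 1/lam maps λ onto sklearn's inverse-regularization convention,
    ignoring loss-scaling differences with Adam-trained probes.
    """
    dense = LogisticRegression(penalty="l2", C=1.0 / lam, max_iter=2000).fit(H, y)
    sparse = LogisticRegression(penalty="l1", C=1.0 / lam,
                                solver="liblinear", max_iter=2000).fit(H, y)
    circuit = np.flatnonzero(sparse.coef_[0])  # surviving units: candidates
    return dense, sparse, circuit              # for the ablations in §5.3
```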
### 3.3 Stage 3: Behavioral Metrics from Generation Dynamics
For each example we draw $K = 10$ samples $\{y_1, \dots, y_K\}$ under nucleus sampling ($p = 0.9$, $T = 0.7$). We compute self-consistency as the support of the majority answer:

$$\mathrm{SC} = \max_{a} \frac{1}{K} \sum_{k=1}^{K} \mathbf{1}[\,\Phi(y_k) = a\,], \tag{4}$$

where $\Phi$ is a canonicalization function that lower-cases, strips punctuation, and applies benchmark-specific normalization (e.g., yes/no collapsing on POPE, integer extraction on counting). We additionally record the single-pass token confidence $P_{\mathrm{tok}}$ assigned to the emitted answer token, and, for free-form benchmarks, the geometric mean of token probabilities up to the first newline. All structural, mechanistic, and behavioral signals are evaluated against the same binary correctness labels using $R_{\mathrm{pb}}$ and $\mathrm{AUROC}$.
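Eq. (4) reduces to a few lines; the POPE-style canonicalizer below illustrates $\Phi$ rather than reproducing the documented one:

```python
from collections import Counter

def self_consistency(samples, phi):
    """SC (Eq. 4): fraction of K samples agreeing with the majority answer."""
    counts = Counter(phi(s) for s in samples)
    return counts.most_common(1)[0][1] / len(samples)

def phi_pope(text):
    """Illustrative POPE canonicalizer: lower-case, strip punctuation,
    collapse to yes/no where possible."""
    t = text.lower().strip().strip(".!,?")
    if t.startswith("yes"):
        return "yes"
    if t.startswith("no"):
        return "no"
    return t
```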
Figure 1: The VLM Reliability Probe (Vrp). A unified pipeline that extracts three classes of evidence on a common footing. Stage 1 reduces cross-attention to per-layer spatial vectors and structural summaries ($H_{\mathrm{s}}$, $C_k$). Stage 2 reads the residual stream via the logit lens and $L_1$-sparse probes. Stage 3 samples $K = 10$ outputs to compute self-consistency. Dashed orange edges denote causal interventions: top-30% patch masking on attention and top-$k$ neuron ablation on the residual stream. Headline numbers below each metric family preview the central finding of §5.
## 4 Experimental Protocol
Table 1: Models evaluated. Three open-weight VLMs spanning late-fusion (LLaVA), early-fusion (PaliGemma), and dynamic-resolution early-fusion (Qwen2-VL) designs.

Table [1](https://arxiv.org/html/2605.08200#S4.T1) summarizes the three open-weight VLMs we evaluate [[18](https://arxiv.org/html/2605.08200#bib.bib1), [3](https://arxiv.org/html/2605.08200#bib.bib3), [27](https://arxiv.org/html/2605.08200#bib.bib4)], spanning late-fusion, early-fusion, and dynamic-resolution early-fusion designs. All experiments use HuggingFace implementations on NVIDIA A100-80GB GPUs.
#### Benchmarks.
We evaluate on: (i) POPE-Adversarial [[17](https://arxiv.org/html/2605.08200#bib.bib8)], $n = 1{,}000$ binary yes/no object-existence queries that stress object hallucination; (ii) LLaVA-Bench [[31](https://arxiv.org/html/2605.08200#bib.bib9)], $n = 90$ open-ended reasoning prompts; (iii) a custom counting + spatial suite of $n = 2{,}000$ items (1,000 counting, 1,000 spatial relations) constructed from COCO-style images with manually verified integer/relation labels; (iv) VQAv2-val [[11](https://arxiv.org/html/2605.08200#bib.bib14)] for general scene understanding; and (v) TextVQA [[26](https://arxiv.org/html/2605.08200#bib.bib15)] for OCR-heavy questions. We report $R_{\mathrm{pb}}$ against binary correctness for primary claims and $\mathrm{AUROC}$ for reliability prediction. Sample accounting and 95% bootstrap confidence intervals (10,000 resamples) for all headline numbers are summarized in Table [8](https://arxiv.org/html/2605.08200#S5.T8).
#### Reference-token protocol.
For closed-form benchmarks, $y^{\star}$ in Eq. ([3](https://arxiv.org/html/2605.08200#S3.E3)) is the canonical answer token (e.g., Yes or No on POPE; the integer on counting). For open-ended benchmarks, we tokenize the canonicalized ground-truth string with the model's tokenizer and use the *first content token* (skipping leading whitespace and BOS) as $y^{\star}$. When the ground truth admits multiple gold answers (e.g., VQAv2's ten-annotator setup), we evaluate $\Delta M_{\ell}$ separately against each and report the maximum over golds, consistent with the official VQAv2 scoring rule.
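A sketch of the token-selection and multi-gold rules (function names hypothetical; `add_special_tokens=False` drops BOS and `strip()` drops leading whitespace):

```python
def first_content_token(tokenizer, gold):
    """y*: first content token of the canonicalized gold answer string."""
    ids = tokenizer(gold.strip(), add_special_tokens=False)["input_ids"]
    return ids[0]

def margin_over_golds(margins_per_gold):
    """Multi-gold rule: per-layer max of ΔM_ℓ over all gold answers.
    margins_per_gold: one per-layer margin list per gold answer."""
    return [max(layer_vals) for layer_vals in zip(*margins_per_gold)]
```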
#### Probe training.
Hidden-state probes use a stratified 60/20/20 train/validation/test split, with Adam (lr $10^{-4}$, batch 64, 50 epochs, early stopping on validation loss). The sparse $L_1$ probe uses $\lambda = 0.1$. *All hyperparameters, including the per-architecture probe layer, are selected on the validation split alone*; the test split is queried only once for the headline numbers, so reported AUROCs are not inflated by data-adaptive layer choice.
#### Self-consistency.
$K = 10$ samples with nucleus sampling ($p = 0.9$, $T = 0.7$). $K$ is chosen to balance variance and inference cost: larger $K$ would only sharpen the behavioral predictor and would not affect the cheap single-pass methods we are comparing against, while making the comparison less practically relevant for low-latency deployment. The canonicalization $\Phi$ in Eq. ([4](https://arxiv.org/html/2605.08200#S3.E4)) is benchmark-specific and is documented in the released code.
#### Reproducibility.
All prompts, split definitions, hook code, probe weights, and evaluation pipelines are released. Random seeds are fixed at 42 for probe training and $\{1, \dots, 10\}$ for self-consistency sampling.
## 5 Results
We present the results as a six-step mechanistic argument. We first show that attention structure fails as a reliability surface (§[5.1](https://arxiv.org/html/2605.08200#S5.SS1)); trace the emergence of reliability in the residual stream (§[5.2](https://arxiv.org/html/2605.08200#S5.SS2)); localize it in sparse late-layer circuits (§[5.3](https://arxiv.org/html/2605.08200#S5.SS3)); characterize the causal-robustness asymmetry across architectures (§[5.4](https://arxiv.org/html/2605.08200#S5.SS4)); compare reliability predictors head-to-head (§[5.5](https://arxiv.org/html/2605.08200#S5.SS5)); and close by tying these results to a single mechanism, *symbolic detachment*, that explains why attention structure fails (§[5.6](https://arxiv.org/html/2605.08200#S5.SS6)).
### 5.1 Visual Attention Does Not Predict Reliability
#### Spatial attention metrics are statistically uninformative.
On the pooled $n = 3{,}090$ structural-analysis split (Table [8](https://arxiv.org/html/2605.08200#S5.T8)), the secondary-component count $C_k$ achieves $R_{\mathrm{pb}}(C_k, y) = 0.001$ (95% CI $[-0.034, 0.036]$) and spatial entropy achieves $R_{\mathrm{pb}}(H_{\mathrm{s}}, y) = -0.012$ (95% CI $[-0.047, 0.024]$); both are statistically indistinguishable from zero ($p > 0.05$ under a two-sided permutation test with $10^4$ permutations). The conclusion survives Bonferroni correction across the six (model × metric) comparisons in Table [2](https://arxiv.org/html/2605.08200#S5.T2) ($\alpha = 0.05/6$) as well as Benjamini–Hochberg control at $q = 0.05$. The result is robust to attention-head selection: even when filtering to the top-$k$ heads ranked by direct logit contribution [[9](https://arxiv.org/html/2605.08200#bib.bib27)], the best $R^2$ over a non-linear ensemble of attention features remains $\leq 0.08$ (Table [2](https://arxiv.org/html/2605.08200#S5.T2)).
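The headline statistic is easy to reproduce; a sketch of the point-biserial correlation and its permutation test (seed and vectorization choices are ours):

```python
import numpy as np

def r_pb(x, y):
    """Point-biserial correlation R_pb: Pearson r between a continuous
    metric x and binary correctness labels y (0/1)."""
    return float(np.corrcoef(x, y)[0, 1])

def permutation_pvalue(x, y, n_perm=10_000, seed=0):
    """Two-sided permutation test: shuffle labels, compare |r| to observed."""
    rng = np.random.default_rng(seed)
    observed = abs(r_pb(x, y))
    null = np.array([abs(r_pb(x, rng.permutation(y))) for _ in range(n_perm)])
    return float((null >= observed).mean())
```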
#### Supervised stress test.
To close the loophole that simple structural metrics may discard signal that a learned classifier could exploit, we train an XGBoost–random-forest ensemble on 11 attention-derived features (per-layer entropy, fragmentation, peakiness, polynomial interactions) with direct access to ground-truth labels. On the pooled cross-family split this classifier reaches 52–55% accuracy, near chance for balanced binary labels. A deeper architecture-specific probe over all 32 layers of attention (Appendix [B](https://arxiv.org/html/2605.08200#A2), Table [9](https://arxiv.org/html/2605.08200#A2.T9)) lifts performance to $\mathrm{AUROC} = 0.725$, confirming that attention does carry *some* non-linear, supervised signal about correctness—but with a $\sim$0.23 AUROC gap below what a single hidden state delivers ($\mathrm{AUROC} = 0.956$). The gap is itself the finding: attention's information about correctness is high-order and distributed, not the kind of spatially compact signal that user-facing heatmaps suggest (§[5.5](https://arxiv.org/html/2605.08200#S5.SS5)).
#### Attention is causally necessary, not informationally sufficient.
The near-zero structural correlation does *not* imply that attention is dispensable. Masking the top-30% attended patches reduces accuracy by 8.2 pp on LLaVA and 11.3 pp on PaliGemma ($p < 0.001$, paired bootstrap). The conclusion is therefore narrow but precise: attention enables feature extraction but does not encode *calibrated* uncertainty about those features (see §[5.6](https://arxiv.org/html/2605.08200#S5.SS6) for the mechanistic account). The structure of attention (its sharpness or fragmentation) is essentially uncorrelated with whether the resulting computation will be correct.
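One way to realize the masking intervention on the projected patch embeddings (zeroing is our illustrative choice; the paper describes the operator only at this level of detail):

```python
import torch

def mask_top_attended_patches(image_embeds, attn_map, frac=0.30):
    """Zero out the top-`frac` most-attended image-patch embeddings.

    image_embeds: [S, d] projected patch embeddings fed to the LM;
    attn_map: [S] aggregate attention mass per patch.
    """
    k = max(1, int(frac * attn_map.numel()))
    top = torch.topk(attn_map, k).indices
    masked = image_embeds.clone()
    masked[top] = 0.0          # ablate the patches the model relied on most
    return masked
```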
Table 2: Attention structure as a reliability signal is near-random across families. Top-$k$ attention $R^2$ is the best $R^2$ over an unsupervised ensemble of attention features for each model. The supervised classifier is an XGBoost–RF ensemble trained on 11 per-layer attention features with full access to labels; it remains within $\pm 3$ pp of chance. †On the counting subset, where Qwen2-VL exhibits the calibration anomaly described in Appendix [C](https://arxiv.org/html/2605.08200#A3); its POPE accuracy is 87.4%.
### 5.2 Logit Lens: Where Reliability Emerges
We project each layer's residual stream through the unembedding to obtain a layer-wise truth margin (Eq. [3](https://arxiv.org/html/2605.08200#S3.E3)). Three patterns emerge (Figure [2](https://arxiv.org/html/2605.08200#S5.F2), Table [3](https://arxiv.org/html/2605.08200#S5.T3)). First, families differ sharply in *when* the correct token starts to dominate competitors. LLaVA-1.5 exhibits a long "silent phase" (layers 0–16) followed by emergence beginning around layer 21 and a peak at layer 24 ($l^{\star}_{\mathrm{vis}} = 24$); the maximum absolute final-layer margin occurs at $l^{\star}_{\mathrm{final}} = 31$ with $\Delta M = +9.20$. PaliGemma integrates earlier ($l^{\star}_{\mathrm{vis}} = 14$, peak $\Delta M = +10.85$); Qwen2-VL exhibits cyclical re-separation ($l^{\star}_{\mathrm{vis}} = 27$, peak $\Delta M = +8.40$).
Second, the margin is built primarily by MLP writes rather than attention writes: across families, MLP contributions account for 47.6–82.1% of the margin growth at the integration peak (Table [3](https://arxiv.org/html/2605.08200#S5.T3)). This is consistent with the mechanistic finding that transformer MLP layers act as content-addressable memories that promote latent concepts in vocabulary space [[10](https://arxiv.org/html/2605.08200#bib.bib26), [9](https://arxiv.org/html/2605.08200#bib.bib27)], and suggests that VLM reliability—unlike early visual feature selection—depends on vocabulary-space promotion rather than spatial coherence in the attention map. Third, and crucially, this peak is strongly predictive of correctness: the per-layer truth margin separates correct from incorrect trajectories with $\mathrm{AUROC} = 0.72$ (LLaVA), 0.70 (PaliGemma), and 0.63 (Qwen2-VL) using the margin alone (Table [6](https://arxiv.org/html/2605.08200#S5.T6)).
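Following the residual-decomposition convention of Geva et al. [9], the sublayer shares can be approximated by projecting each sublayer's write through the logit lens. Applying the final norm to an isolated update vector is itself an approximation (the norm is nonlinear), so the sketch below is indicative rather than exact:

```python
import torch

@torch.no_grad()
def sublayer_margin_shares(attn_write, mlp_write, final_norm, unembed, gold_id):
    """Approximate share of the gold-token logit growth contributed by the
    attention vs. MLP residual writes at one layer and answer position.

    attn_write, mlp_write: [d] update vectors added to the residual stream.
    """
    def gold_logit(update):
        return unembed(final_norm(update))[gold_id].item()

    a, m = gold_logit(attn_write), gold_logit(mlp_write)
    total = abs(a) + abs(m) + 1e-9
    return {"attn_share": abs(a) / total, "mlp_share": abs(m) / total}
```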
Table 3: Logit-lens dynamics across families. Visual-integration peak location $l^{\star}_{\mathrm{vis}}$, peak final-margin layer $l^{\star}_{\mathrm{final}}$, and the share of the residual update attributable to MLP layers at the integration peak.

Figure 2: Truth margin across depth. Each curve plots $\Delta M_{\ell}$ averaged over the POPE-Adversarial split, with depth normalized to $\ell/L$ for cross-architecture comparison. Shaded bands report 95% bootstrap intervals over 1,000 resamples ($n = 2{,}500$ items per family). LLaVA exhibits a $\sim$60%-of-depth silent phase before late emergence; PaliGemma integrates early with a peak at layer 14 of 18 and partial decay; Qwen2-VL displays cyclical re-separation. Markers denote $\ell^{\star}_{\mathrm{final}}$ per family (Table [3](https://arxiv.org/html/2605.08200#S5.T3)).
### 5.3 Sparse Reliability Circuits
If reliability is built into hidden states, is it distributed holistically or concentrated in a small set of units? We train an $L_1$-regularized logistic probe ($\lambda = 0.1$) on per-layer hidden states and inspect the selected features. On LLaVA-1.5 layer 31, the probe selects roughly 5–6% of units as active and identifies a small set of consistently large-coefficient neurons. The activation distribution (Figure [3](https://arxiv.org/html/2605.08200#S5.F3)) is heavy-tailed: most units carry near-zero discriminative weight, while a handful (e.g., N1512, N1360, N3839, N2660) account for the bulk of the probe's decision boundary, with mean activation shifts between correct and incorrect trajectories of $\Delta_{\mathrm{act}} \in \{+27.2, -3.1, -3.1, -3.0\}$ respectively (Appendix [G](https://arxiv.org/html/2605.08200#A7), Table [10](https://arxiv.org/html/2605.08200#A7.T10)).
#### Layer specificity.
To rule out that the choice of layer drives the probe's strength, we replicate the analysis at layers $\{10, 17, 21, 27, 29, 31\}$. Single-neuron ablation of any of the top-5 selected neurons at any of these layers produces $\leq 0.5$ pp accuracy change, even under extreme activation clamping at $\pm 100$ ($p = 1.00$ under a paired-bootstrap test on $n = 200$). *Joint* ablation of the top-5 produces a measurable effect ($-2.0$ pp overall, $-8.3$ pp on object-identification questions; Table [4](https://arxiv.org/html/2605.08200#S5.T4)), while ablating five randomly chosen neurons produces no effect. Reliability in LLaVA is therefore not a single "truth neuron" but a small-circuit structure distributed across a handful of units.
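Both the zero- and clamp-ablation conditions reduce to one forward hook; the `model.model.layers` path below assumes a HuggingFace LLaMA-style decoder and is illustrative rather than the released intervention code:

```python
import torch

def ablate_neurons(model, layer_idx, neuron_ids, value=0.0):
    """Clamp selected hidden units at `layer_idx` to `value` during decoding
    (zero-ablation at value=0; clamp-ablation at e.g. ±100 otherwise)."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[..., neuron_ids] = value          # overwrite the circuit units
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)

# Usage sketch: jointly ablate the top-5 probe-selected neurons, then
# remove the hook after evaluation.
# handle = ablate_neurons(model, 31, [1512, 1360, 3839, 2660, ...])
# ... run the evaluation split ...
# handle.remove()
```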
Figure 3: Sparse reliability circuit (LLaVA-1.5, layer 31). *Top*: distribution of probe-coefficient magnitudes $\beta_i$ across all 4,096 hidden units, separated into bulk neurons (gray, $|\beta| < 0.15$; 4,087 units), task-positive outliers (orange; 37 units), and task-negative outliers (navy; 24 units). The distribution is heavy-tailed, and only a small fraction of units carry non-zero discriminative weight. *Bottom*: single-neuron causal ablation accuracy drop on POPE-Adversarial, bars aligned with $\beta_i$ above; nine units account for 61.4% of decision capacity, with mean $\Delta$Acc $= 30.1$% (Table [4](https://arxiv.org/html/2605.08200#S5.T4)).
### 5.4 Architectural Robustness: Late Bottlenecks vs. Distributed Circuits
The LLaVA result above shows that small probe-selected sets are causally active, but raises an obvious question: is fragility a property of the finding or of the architecture? We replicate the ablation setup on PaliGemma (layer 15, $d = 2{,}048$) and Qwen2-VL (layer 25, $d = 3{,}584$).
The contrast is stark (Table [5](https://arxiv.org/html/2605.08200#S5.T5)). Ablating the top-10 probe-selected neurons in PaliGemma changes accuracy by $-0.7$ pp; the same intervention in Qwen2-VL produces a 0.0 pp change. We then escalate to aggressive random ablation, zeroing 500, 1,000, and 2,000 randomly selected neurons in the peak layer. PaliGemma loses 1.0 pp at 1,000 neurons ($\sim$49% of layer dimension); Qwen2-VL is essentially flat (and even mildly improves) at up to 2,000 neurons ($\sim$56% of dimension). Finally, completely bypassing the MLP at layer 25 of Qwen2-VL leaves accuracy fully intact and, on this validation split, marginally improves it. We confirm via paired bootstrap that all $\Delta$ bounds for PaliGemma and Qwen2-VL fall within $\pm 2$ pp.
#### Interpretation.
The two early-fusion / cyclically refining architectures distribute reliability across a wide manifold; the residual stream patches around missing dimensions effortlessly. LLaVA, in contrast, stores its decisive representation in a fragile late bottleneck where small circuits matter. This is consistent with the divergent logit-lens profiles (Figure [2](https://arxiv.org/html/2605.08200#S5.F2)): LLaVA's late, sharp emergence concentrates risk in a narrow temporal window, while PaliGemma's earlier integration and Qwen2-VL's cyclical refinement hedge across many layers.
Table 4: LLaVA-1.5 causal ablation (layer 31, $n = 200$). Joint ablation of probe-selected neurons produces a measurable drop concentrated on object-identification questions; single-neuron and matched-size random ablations do not.

Table 5: Cross-family causal robustness ($n = 100$ validation split). Unlike LLaVA's localized fragility, PaliGemma and Qwen2-VL absorb destruction of $\sim$50% of their peak-layer hidden dimension with $\leq 1$ pp degradation. $\Delta$ is reported relative to the architecture-specific baseline.

| Model (peak layer) | Condition | Acc. | $\Delta$ (pp) |
|---|---|---|---|
| PaliGemma-3B (L15, $d = 2{,}048$) | Baseline | 97.0% | n/a |
| | Top-10 probe neurons | 96.3% | $-0.7$ |
| | 500 random (24%) | 97.0% | 0.0 |
| | 1,000 random (49%) | 96.0% | $-1.0$ |
| Qwen2-VL-7B (L25, $d = 3{,}584$) | Baseline | 55.0% | n/a |
| | 500 random (14%) | 58.0% | $+3.0$ |
| | 1,000 random (28%) | 56.0% | $+1.0$ |
| | 2,000 random (56%) | 57.0% | $+2.0$ |
| | MLP bypass (all tokens) | 60.0% | $+5.0$ |
### 5.5 Reliability Prediction: Probes vs. Attention vs. Consistency
The ultimate test of an internal signal is whether it predicts correctness at inference time. We compare a battery of reliability predictors on POPE-Adversarial (Table [6](https://arxiv.org/html/2605.08200#S5.T6)): logit entropy and output confidence (cheap baselines); spatial-attention summaries; the truth margin $\Delta M_{\ell}$ alone; the hidden-state probe (best layer); a multi-layer stacked probe combining the last 5 layers; and self-consistency at $K = 10$ (behavioral, $10\times$ inference cost).
Two conclusions stand out. First, standard uncertainty baselines fail decisively: logit entropy remains at chance ($\mathrm{AUROC} \approx 0.50$), and spatial attention is likewise near chance. Output confidence improves only marginally, to 0.53–0.55. Second, hidden-state probes dominate single-pass methods. On POPE they reach $\mathrm{AUROC} > 0.95$ for LLaVA and Qwen2-VL, but only 0.738 for PaliGemma. That drop is consistent with PaliGemma's earlier visual integration (Table [3](https://arxiv.org/html/2605.08200#S5.T3)): the late-layer separation between correct and hallucinated trajectories that LLaVA and Qwen2-VL exploit is partly compressed in PaliGemma's shallower decoder, leaving less linear separability at any single layer. Self-consistency at $K = 10$ still yields a strong $\mathrm{AUROC} = 0.78$–$0.81$, but at $10\times$ inference cost.
Generalization across benchmarks is more nuanced. Table [7](https://arxiv.org/html/2605.08200#S5.T7) reports hidden-state probe AUROCs on LLaVA-Bench, VQAv2, and TextVQA in addition to POPE. The probe outperforms output confidence in 7 of 12 model × task comparisons, with the largest gains on LLaVA across all four benchmarks. On PaliGemma, output confidence is competitive with or stronger than the probe on VQAv2 and TextVQA, again consistent with its more diffuse representation of truth. The pattern indicates that hidden-state probes are a strong but not universal reliability readout, and that probe layer selection should be architecturally informed.

Table 6: Reliability prediction on POPE-Adversarial (AUROC). Hidden-state probes dominate single-pass methods on LLaVA and Qwen2-VL; self-consistency is competitive at $10\times$ inference cost. Spatial attention is at chance.

Table 7: Hidden-state probe vs. output confidence across benchmarks (AUROC). Probe layer is selected per architecture on a held-out validation slice. Bold indicates the higher of the two within a model–task pair.
### 5.6 Symbolic Detachment: Why Attention Structure Fails
We define *symbolic detachment* operationally: a layer-wise sequence in which (a) cross-attention entropy collapses early ($\Delta H_{\mathrm{s}}(\ell^{\star}_{\mathrm{lock}}) \leq -2$), (b) the residual visual stream then stagnates ($\|h^{(\ell)}_{\mathrm{vis}} - h^{(\ell-1)}_{\mathrm{vis}}\|_2$ near zero) for $\geq 50\%$ of model depth, and (c) linguistic prediction commits before attention re-engages. Layer-wise attention evolution exposes the mechanism behind the structural failure (Figure [4](https://arxiv.org/html/2605.08200#S5.F4)). LLaVA exhibits *early locking*: a dramatic sharpening of visual attention at layer 2 ($\Delta H_{\mathrm{s}} \approx -2.5$), followed by $\sim$28 layers of stagnation, and a late diffusion at the final layer ($\Delta H_{\mathrm{s}} \approx +1.0$). By the time linguistic prediction occurs, attention has effectively decoupled from the visual features it once selected. PaliGemma exhibits a steady decay; Qwen2-VL re-sharpens cyclically at layers $\{17, 25\}$, consistent with its strong late-layer probe AUROC.
We corroborate this account with a residual-update analysis. The layer-wise $L_2$ norm of visual-token residual updates, $\|h^{(l)}_{\mathrm{vis}} - h^{(l-1)}_{\mathrm{vis}}\|_2$, remains low across LLaVA's middle layers and surges only in the final few layers (Appendix [D](https://arxiv.org/html/2605.08200#A4), Figure [5](https://arxiv.org/html/2605.08200#A4.F5)). The visual stream is effectively dormant during the silent phase, so the attention map at layer $\ell$ is a stale record of perception that occurred many layers earlier. *Symbolic detachment* is therefore an architectural property of late visual-linguistic translation in late-fusion stacks, rather than a universal law: the early-fusion PaliGemma does not exhibit it.
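The operational definition above translates into a simple per-model detector; the stagnation cutoff `eps` is a free parameter we set heuristically here, not a value from the paper:

```python
import numpy as np

def detects_symbolic_detachment(H_s, vis_update_norms,
                                lock_drop=-2.0, stagnant_frac=0.5, eps=None):
    """Check the symbolic-detachment signature (§5.6): (a) an early entropy
    collapse ΔH_s ≤ lock_drop at some layer, then (b) near-zero visual
    residual updates for ≥ stagnant_frac of model depth.

    H_s: per-layer attention entropies; vis_update_norms: per-layer
    ||h_vis^(ℓ) - h_vis^(ℓ-1)||_2 values.
    """
    dH = np.diff(H_s)
    locks = np.flatnonzero(dH <= lock_drop)
    if locks.size == 0:
        return False
    lock = int(locks[0])
    # Heuristic stagnation cutoff: a small fraction of the typical update.
    eps = eps if eps is not None else 0.1 * float(np.median(vis_update_norms))
    stagnant = (np.asarray(vis_update_norms[lock + 1:]) < eps).sum()
    return stagnant >= stagnant_frac * len(H_s)
```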
Figure 4: Vision-attention entropy across depth. Mean Shannon entropy $H_{\ell}^{(\mathrm{vis})}$ over image-token attention at the answer position, averaged over POPE-Adversarial; bands are 95% bootstrap CIs ($n = 2{,}500$ per family). LLaVA collapses to a low-entropy regime by $\sim$30% depth; PaliGemma stays broad; Qwen2-VL re-broadens non-monotonically. The entropy axis does not predict reliability ($\rho < 0.10$ across families; §[5.1](https://arxiv.org/html/2605.08200#S5.SS1)).

Table 8: Sample accounting and uncertainty for headline reliability claims. Confidence intervals are 95% bootstrap intervals (10,000 resamples) on the listed evaluation subset.
## 6 Discussion
#### The illusion of grounding.
A model can exhibit textbook-perfect attention—low entropy, single dominant component, on the right object—and still hallucinate; conversely, it can answer correctly with diffuse attention by leveraging global scene statistics. Using attention sharpness as a trust proxy, whether in user-facing visualizations or automated monitors, is therefore epistemically misleading: attention answers a different question than reliability, namely *which features were retrieved*, not *whether the retrieved features will be interpreted correctly*.
#### Reliability as a late, MLP-driven phenomenon.
Our logit-lens, sparse-probe, and residual-update analyses converge: the computation distinguishing correct from incorrect answers happens late in the residual stream and is dominated by MLP writes, not attention writes. This aligns with the key–value-memory view of MLPs [[10](https://arxiv.org/html/2605.08200#bib.bib26)] and with linear-probe results in text-only models [[4](https://arxiv.org/html/2605.08200#bib.bib24), [21](https://arxiv.org/html/2605.08200#bib.bib25)], and we show the picture is even more pronounced in multimodal models, where one might expect grounding to live in attention.
#### A spectrum of architectural fragility.
LLaVA's causal-robustness gap is our most consequential monitor-design finding. Late-fusion stacks concentrate reliability in a small late-stage circuit whose failures propagate; early-fusion and cyclically refining stacks distribute the same signal widely and tolerate substantial damage. Distributional robustness must be evaluated architecturally, not assumed.
#### Brief case study.
PaliGemma on *"Is the dog wearing a collar?"* (VQAv2, ground truth Yes) shows highly concentrated attention ($H_s = 0.321$, $C_k = 0$)—textbook trustworthy by attention heuristics—yet answers No. The logit lens reveals the correct token climbing through layers 0–10 before being suppressed at the layer-14 visual-integration peak ($\Delta M = +9.57$ for the wrong token); the hidden-state probe correctly flags this as unreliable. Full panel in Appendix F.
#### Practical recommendations.
Three concrete design rules follow for safety-sensitive deployment. (1) *Replace attention heatmaps with hidden-state probes as the trust signal.* A single-layer residual-stream probe reaches $\mathrm{AUROC} > 0.95$ on POPE for LLaVA and Qwen2-VL at single-pass cost; no spatial-attention summary we tested rises above chance ($R_{\mathrm{pb}} \approx 0$, 95% CI straddles 0). For object-existence monitoring we recommend hidden-state probes when validation $\mathrm{AUROC} \geq 0.90$ on a held-out development slice, with a fallback to self-consistency below that threshold. (2) *Treat self-consistency as a budget–reliability dial.* At $K = 10$ it is our strongest behavioral predictor ($R_{\mathrm{pb}} = 0.43$) but costs $10\times$ inference; the natural follow-up is to distill consistency into a single-pass value head. (3) *Architect the monitor to the model.* Late-fusion stacks (LLaVA-1.5) concentrate reliability in a sparse late-layer circuit ($\sim$5 neurons drive $\sim$8 pp), so compact unit-level monitors suffice. Early-fusion and cyclically refining stacks (PaliGemma, Qwen2-VL) distribute reliability across $\geq$50% of the peak-layer hidden dimension and require dense distributional readouts; they tolerate substantial single-unit damage but are correspondingly opaque to neuron-level interpretation. Pre-registered starting layers from our experiments—LLaVA $\ell = 31$, PaliGemma $\ell = 15$, Qwen2-VL $\ell = 25$—are a reasonable default before per-deployment validation tuning.
## 7 Limitations
Six scope limitations frame downstream extensions of this protocol.
1. **Model scale and post-training.** We evaluate three open VLMs in the 3–7B parameter range; larger or RLHF-tuned closed models (e.g., GPT-4V, Gemini-Pro-Vision) may couple attention more tightly to truthfulness, but are not testable without access to internals.
2. **Causal toolkit.** Our interventions are zero-ablation and clamp-ablation; activation patching and exchange interventions [[8](https://arxiv.org/html/2605.08200#bib.bib28)] would tighten the circuit-level account.
3. **Cost of the strongest signal.** Self-consistency at $K = 10$ pays a $10\times$ inference cost, which is prohibitive for low-latency deployment; distilling self-consistency into a single-pass value head is the natural follow-up.
4. **Reference-token convention.** For free-form benchmarks $y^{\star}$ uses the first content token of the canonicalized gold answer, inheriting multi-token ambiguities; we report conservatively rather than searching over canonicalizations.
5. **Architectural scope.** All three evaluated VLMs are open-weight, late- or early-fusion stacks in the 3–7B regime. Our claims about *where* reliability lives are scoped to this regime; closed-weight models, $\geq$13B late-fusion stacks (e.g., LLaVA-NeXT, InternVL-2), and tightly coupled architectures (e.g., Idefics-3, Llama-3.2-Vision, Molmo) may exhibit qualitatively different geometries and are an immediate target for follow-up work.
6. **Layer-selection effects on probes.** Although the probe layer is chosen on a held-out validation slice and frozen before test evaluation, the data-adaptive choice could in principle inflate AUROC relative to a pre-registered layer; a fully pre-registered evaluation would tighten the bound on hidden-state predictiveness.
## 8 Conclusion
We tested a simple, falsifiable claim—that visual-attention structure is a reliable readout of VLM correctness—and falsified it. Across three architecturally diverse 3–7B families and four benchmarks, attention sharpness, entropy, and fragmentation are statistically indistinguishable from noise as predictors of correctness, even where attention is *causally* necessary for upstream feature extraction. Reliability surfaces later in the computation: in MLP-dominated truth-margin formation, in $L_1$-sparse late-layer circuits, and, behaviorally, in the consistency of sampled outputs. The architectural organization of this signal diverges sharply between late-fusion and early-fusion / cyclical stacks, with direct consequences for both interpretability and monitor design. The principled implication is concrete: build hidden-state and consistency-based reliability monitors, and retire the comfortable but empirically falsified metaphor of attention-as-trust ($R_{\mathrm{pb}} \approx 0$ across three families on $n = 3{,}090$ items).
## Ethics Statement
Our findings carry direct implications for VLM deployment in high-stakes settings. The primary methodological consequence is cautionary: because attention-map sharpness is statistically uninformative about correctness, attention-based heuristics should not be used as user-facing trust signals or as automated abstention triggers in medical, scientific, or safety-critical pipelines. Hidden-state probes and self-consistency offer better-calibrated alternatives, and we release the corresponding training scripts. A secondary risk is that improved reliability monitors could be misused to launder model outputs, presenting probe-confirmed responses as ground truth. We emphasize that AUROC values, even at 0.95, leave substantial residual error and should never be interpreted as verifiable correctness; our probes are correlational mechanisms, not truth oracles. We use only publicly released models and benchmarks; no human subjects, private data, or scraped facial imagery were used.
## References
- [1] J. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. (2022). Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS).
- [2] N. Belrose, Z. Furman, L. Smith, D. Halawi, I. Ostrovsky, L. McKinney, S. Biderman, and J. Steinhardt (2023). Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112.
- [3] L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, et al. (2024). PaliGemma: a versatile 3B vision–language model for transfer. arXiv preprint arXiv:2407.07726.
- [4] C. Burns, H. Ye, D. Klein, and J. Steinhardt (2023). Discovering latent knowledge in language models without supervision. In International Conference on Learning Representations (ICLR).
- [5] H. Chefer, S. Gur, and L. Wolf (2021). Generic attention-model explainability for interpreting bi-modal and encoder–decoder transformers. In IEEE/CVF International Conference on Computer Vision (ICCV).
- [6] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. C. H. Hoi (2023). InstructBLIP: towards general-purpose vision–language models with instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS).
- [7] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, Y. Wu, and R. Ji (2023). MME: a comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394.
- [8] A. Geiger, H. Lu, T. Icard, and C. Potts (2021). Causal abstractions of neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
- [9] M. Geva, A. Caciularu, K. R. Wang, and Y. Goldberg (2022). Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- [10] M. Geva, R. Schuster, J. Berant, and O. Levy (2021). Transformer feed-forward layers are key-value memories. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- [11] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017). Making the V in VQA matter: elevating the role of image understanding in visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [12] S. Jain and B. C. Wallace (2019). Attention is not explanation. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
- [13] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, et al. (2022). Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- [14] L. Kuhn, Y. Gal, and S. Farquhar (2023). Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations (ICLR).
- [15] B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan (2023). SEED-Bench: benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125.
- [16] J. Li, D. Li, C. Xiong, and S. C. H. Hoi (2022). BLIP: bootstrapping language-image pre-training for unified vision–language understanding and generation. In International Conference on Machine Learning (ICML).
- [17] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J. Wen (2023). Evaluating object hallucination in large vision–language models. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- [18] H. Liu, C. Li, Q. Wu, and Y. J. Lee (2023). Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS).
- [19] Y. Liu, Z. Chen, R. Wang, and W. X. Zhao (2025). Seeing but not believing: vision–language models can attend correctly yet reason incorrectly. arXiv preprint arXiv:2510.17771.
- [20] L. Long, C. Oh, S. Park, and S. Li (2025). Understanding the language prior of LVLMs by contrasting chain-of-embedding. arXiv preprint arXiv:2509.23050.
- [21] S. Marks and M. Tegmark (2024). The geometry of truth: emergent linear structure in large language model representations of true/false datasets. In Conference on Language Modeling (COLM).
- [22] Nostalgebraist (2020). Interpreting GPT: the logit lens. LessWrong post.
- [23] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML).
- [24] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko (2018). Object hallucination in image captioning. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- [25] S. Serrano and N. A. Smith (2019). Is attention interpretable? In Annual Meeting of the Association for Computational Linguistics (ACL).
- [26] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach (2019). Towards VQA models that can read. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [27] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. (2024). Qwen2-VL: enhancing vision–language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191.
- [28] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023). Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations (ICLR).
- [29] S. Wiegreffe and Y. Pinter (2019). Attention is not not explanation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- [30] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang (2023). MM-Vet: evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490.
- [31] L. Zhou, W. Fu, Y. Chen, W. Liu, Z. Lin, S. Yan, and W. Chen (2023). LLaVA-Bench: a benchmark for visual instruction following. arXiv preprint arXiv:2308.13692.
## Appendix A Detailed Experimental Setup
#### Models and hooks.
We instrument LLaVA-1.5-7B (32 layers, CLIP ViT-L/14, Vicuna-7B), PaliGemma-3B (18 layers, SigLIP, Gemma-2B), and Qwen2-VL-7B-Instruct (28 layers with grouped-query attention and native multimodal tokenization) using Hugging Face transformers. Cross-attention tensors are extracted via PyTorch forward hooks (register_forward_hook) attached to the multi-head attention modules in each decoder block. Hidden states are read from the output of each decoder block, and per-token logits are computed by tying to the model's own last-layer norm and unembedding (i.e., the logit lens).
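A minimal sketch of this instrumentation is given below. The module paths (e.g., `model.language_model.model.layers`) are assumptions that vary across transformers versions; the released Vrp pipeline may organize the hooks differently.

```python
# Minimal hook-based instrumentation sketch (module paths are assumptions).
import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto"
)

capture = {"attn": {}, "hidden": {}}

def attn_hook(layer_idx):
    def hook(module, args, output):
        # With output_attentions=True the attention module returns
        # (attn_output, attn_weights, ...); weights may be None otherwise.
        if isinstance(output, tuple) and len(output) > 1 and output[1] is not None:
            capture["attn"][layer_idx] = output[1].detach().float().cpu()
    return hook

def hidden_hook(layer_idx):
    def hook(module, args, output):
        h = output[0] if isinstance(output, tuple) else output
        capture["hidden"][layer_idx] = h.detach().float().cpu()
    return hook

handles = []
for i, block in enumerate(model.language_model.model.layers):
    handles.append(block.self_attn.register_forward_hook(attn_hook(i)))
    handles.append(block.register_forward_hook(hidden_hook(i)))

# ... run model(**inputs, output_attentions=True), read `capture`, then:
for h in handles:
    h.remove()
```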
#### Hardware\.
A100\-80GB GPUs \(RunPod, Lambda Labs\); AMD EPYC 7742 64\-core CPU; 512 GB system memory\. PyTorch 2\.1\.0, CUDA 12\.1, official HF checkpoints for all three models\.
#### Datasets\.
POPE-Adversarial [[17](https://arxiv.org/html/2605.08200#bib.bib8)]; LLaVA-Bench [[31](https://arxiv.org/html/2605.08200#bib.bib9)]; a custom counting + spatial suite of 2,000 items built from COCO-style images with manually verified labels; VQAv2-val [[11](https://arxiv.org/html/2605.08200#bib.bib14)]; TextVQA [[26](https://arxiv.org/html/2605.08200#bib.bib15)].
#### Probe training details\.
Adam, lr $10^{-4}$, batch 64, 50 epochs, early stopping on a held-out 10% of train. $L_2$ weight $10^{-4}$ for the dense probe; $L_1$ weight $\lambda=0.1$ for the sparse probe. All AUROC numbers are computed on held-out 20% test splits; standard errors over five seeds do not exceed $\pm 0.012$.
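The recipe maps onto a short training loop; the sketch below is one reading of it, with placeholder tensors `X_*` (hidden states) and `y_*` (float correctness labels in {0, 1}) and an assumed early-stopping patience, which the paper does not specify.

```python
# Sketch of the dense probe recipe: Adam, lr 1e-4, batch 64, up to 50
# epochs, L2 weight 1e-4, early stopping on a held-out validation split.
import torch
import torch.nn as nn

def train_dense_probe(X_train, y_train, X_val, y_val, patience=5):
    probe = nn.Linear(X_train.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-4, weight_decay=1e-4)
    loss_fn = nn.BCEWithLogitsLoss()
    best_val, best_state, bad = float("inf"), None, 0
    for _ in range(50):
        for idx in torch.randperm(len(X_train)).split(64):
            opt.zero_grad()
            loss = loss_fn(probe(X_train[idx]).squeeze(-1), y_train[idx])
            # The sparse probe instead adds an L1 penalty (lambda = 0.1):
            # loss = loss + 0.1 * probe.weight.abs().sum()
            loss.backward()
            opt.step()
        with torch.no_grad():
            val = loss_fn(probe(X_val).squeeze(-1), y_val).item()
        if val < best_val:
            best_val, best_state, bad = val, probe.state_dict(), 0
        else:
            bad += 1
            if bad >= patience:
                break
    probe.load_state_dict(best_state)
    return probe
```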
#### Robustness checks.
All structural metrics were recomputed under a DBSCAN clustering variant ($\varepsilon=1.5$, minimum samples $=3$); $R_{\mathrm{pb}}$ changes by at most $0.011$. Causal ablation was repeated under zero-ablation and under large-magnitude clamp-ablation ($\pm 100$); the results agree.
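As a concrete reading of the DBSCAN variant, the sketch below recounts attention foci by clustering the top-attended patch coordinates. The 10% thresholding rule used to select those patches is our assumption and is not stated above.

```python
# Recount attention foci with DBSCAN (eps=1.5, min_samples=3).
import numpy as np
from sklearn.cluster import DBSCAN

def count_attention_foci(attn_map, top_frac=0.1):
    """attn_map: (H, W) spatial attention over image patches."""
    thresh = np.quantile(attn_map, 1.0 - top_frac)
    ys, xs = np.nonzero(attn_map >= thresh)   # top-attended patch coordinates
    coords = np.stack([ys, xs], axis=1)
    if len(coords) == 0:
        return 0
    labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(coords)
    return len(set(labels) - {-1})            # clusters, excluding noise points
```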
## Appendix B Extended Analysis: Ensemble Attention Probe
The failure of unsupervised attention metrics could in principle reflect a failure of the metric rather than a failure of attention. To rule this out, we trained an "Ensemble Attention Probe" that concatenates per-layer spatial vectors $m^{(l)}\in\mathbb{R}^{S}$ over all $L=32$ layers of LLaVA and passes the result through a 3-layer MLP with ReLU and dropout ($p=0.1$):

$$x=\mathrm{Concat}\bigl(m^{(1)},\dots,m^{(32)}\bigr)\in\mathbb{R}^{18432},\qquad d_{\mathrm{in}}\to 1024\to 512\to 1.$$

This probe has direct access to ground-truth correctness during training. As shown in Table [9](https://arxiv.org/html/2605.08200#A2.T9), it extracts non-trivial signal ($\mathrm{AUROC}=0.725$) but remains well below hidden-state probes ($0.956$) and self-consistency ($0.784$) under the same labels. We interpret this as direct evidence that attention carries *some* reliability signal, but that this signal is dominated by what the residual stream encodes.
Table 9: Probe comparison on POPE-Adversarial (LLaVA-1.5). Supervised attention probes extract some signal, but consistency and hidden-state probes remain superior at any fixed inference cost.
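For concreteness, the probe architecture above (for LLaVA, $S=576$ patches over 32 layers gives $d_{\mathrm{in}}=18432$) can be written as a small PyTorch module; this is a sketch of the stated design, not the released implementation.

```python
# 3-layer MLP ensemble attention probe: 18432 -> 1024 -> 512 -> 1,
# with ReLU and dropout p=0.1, trained on correctness labels.
import torch.nn as nn

class EnsembleAttentionProbe(nn.Module):
    def __init__(self, d_in=18432, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 1024), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(512, 1),  # logit for P(answer is correct)
        )

    def forward(self, x):  # x: (batch, 18432) concatenated spatial maps
        return self.net(x).squeeze(-1)
```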
## Appendix C The Counting Anomaly
On quantitative reasoning ("How many [X] are in the image?"), all three models exhibit severe miscalibration. Token confidence on the emitted integer frequently exceeds 90% even when the answer is off by one. A representative case: an image with 3 baseball players elicits "Four" from LLaVA at $P_{\mathrm{tok}}=0.92$, while the visual encoder's attention forms three distinct foci ($K_{\mathrm{tot}}=3$, hence $C_{k}=2$).
This dissociation is a clean instance of *symbolic detachment*: the encoder correctly identifies three regions, but the projection into the language space maps them to the wrong integer token, and the autoregressive coherence of the language model then assigns high probability to that token. Token probability measures fluency, not grounding. Self-consistency partially recovers calibration on these items: under sampling, the model frequently oscillates between "Four" and "Three", lowering $\mathrm{SC}$ and flagging the prediction as unreliable.
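The self-consistency flag amounts to a short sampling loop. The sketch below assumes $\mathrm{SC}$ is the fraction of $K$ sampled answers agreeing with the greedy answer, and `generate_answer` is a placeholder for the model's decoding call; the exact aggregation rule is our assumption.

```python
# Self-consistency at K samples: agreement between sampled answers
# and the greedy (temperature-0) prediction.
from collections import Counter

def self_consistency(generate_answer, image, question, k=10, temperature=0.7):
    greedy = generate_answer(image, question, temperature=0.0)
    samples = [generate_answer(image, question, temperature=temperature)
               for _ in range(k)]
    # Low SC (e.g., oscillation between "Four" and "Three") flags the
    # prediction as unreliable despite high token confidence.
    return Counter(samples)[greedy] / k
```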
## Appendix D Residual-Update Analysis
Figure[5](https://arxiv.org/html/2605.08200#A4.F5)reports the layer\-wise L2 norm of visual\-token residual updates in LLaVA\-1\.5\. Visual representations remain effectively dormant across the middle of the stack and undergo a sharp transformation only in the final three layers, corroborating the symbolic\-detachment account in §[5\.6](https://arxiv.org/html/2605.08200#S5.SS6)and the late truth\-margin emergence in Figure[2](https://arxiv.org/html/2605.08200#S5.F2)\.
Figure 5: Visual-token residual updates in LLaVA-1.5. Layer-wise $L_2$ norm of the change in the visual-token residual stream, $\|h_{\mathrm{vis}}^{(\ell)}-h_{\mathrm{vis}}^{(\ell-1)}\|_{2}$, plotted against layer index $\ell$. Visual representations remain effectively dormant across layers 5–28 and undergo a sharp non-linear transformation only at the end of the stack ($\ell=29$–$31$), mechanically explaining the early-locking phenomenon in Figure [4](https://arxiv.org/html/2605.08200#S5.F4).
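A sketch of the quantity plotted in Figure 5, assuming per-token norms averaged over the visual-token span (the exact pooling is not specified here):

```python
# Layer-wise L2 norm of visual-token residual updates (cf. Figure 5).
# hidden_states: tuple from forward(..., output_hidden_states=True),
# one (batch, seq, d) tensor per layer (index 0 = input embeddings).
# vis_slice: positions of the image tokens; depends on the prompt layout.
import torch

def visual_residual_updates(hidden_states, vis_slice):
    updates = []
    for l in range(1, len(hidden_states)):
        diff = hidden_states[l][:, vis_slice] - hidden_states[l - 1][:, vis_slice]
        updates.append(diff.norm(dim=-1).mean().item())  # mean over tokens, batch
    return updates  # one value per layer transition
```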
## Appendix E Qualitative Failure Analysis
We examine 100 sampled failure cases for LLaVA\-1\.5 on POPE\-Adversarial and classify them by the joint behavior of attention structure and answer correctness\.
#### False negatives \(good attention, bad answer\)\.
In $\sim 15\%$ of failure cases, attention is textbook-perfect (low entropy, a single tight component on the relevant object). For object-existence queries, the model attends solely to the queried object (e.g., the chair) yet answers "No" to "Is there a chair?" This is consistent with the symbolic-detachment account: attention retrieves the right feature; the late stack mis-translates it.
#### False positives \(bad attention, good answer\)\.
In $\sim 22\%$ of the correct cases, attention is scattered ($H_{\mathrm{s}}>4.5$). These are overwhelmingly background-scene questions ("Is this a rainy day?"), for which global texture statistics suffice. An attention-based heuristic would incorrectly penalize these as low-confidence.
Taken together, these two patterns explain mechanically why $R_{\mathrm{pb}}(H_{\mathrm{s}},y)\approx 0$: the same attention-structure signal is mis-aligned with truth in opposite directions for different question types.
## Appendix F Extended Case Study
Figure[6](https://arxiv.org/html/2605.08200#A6.F6)reproduces the case study referenced in §[6](https://arxiv.org/html/2605.08200#S6)\. The model attends sharply to the dog, withHs=0\.321H\_\{\\mathrm\{s\}\}\{=\}0\.321in the bottom15%15\\%of the dataset and a single dominant focus \(Ck=0C\_\{k\}\{=\}0\)\. Attention\-based heuristics would classify the prediction as trustworthy\. The model nonetheless answers “No” to “Is the dog wearing a collar?”\. The hidden\-state probe correctly flags the prediction as unreliable; the logit lens reveals that the correct token “Yes” is suppressed at layer 14 \(the visual\-integration peak\)\. Looking well is not the same as knowing well\.
Figure 6: Case study (PaliGemma, VQAv2 #31). Sharp attention on the dog ($H_{\mathrm{s}}=0.321$, $C_{k}=0$; bottom 15% of the spread distribution) would lead any attention-based heuristic to classify the answer as trustworthy. The model nevertheless answers "No" to "Is the dog wearing a collar?" (ground truth: "Yes"); the hidden-state probe correctly flags the prediction as unreliable, and the logit lens reveals that "Yes" is suppressed at the layer-14 visual-integration peak. *Looking well is not the same as knowing well.*
## Appendix G LLaVA Deep Dive
Table[10](https://arxiv.org/html/2605.08200#A7.T10)summarizes the layer\-wise computational pipeline and sparse\-circuit findings for LLaVA\-1\.5\-7B\. Margin trajectories diverge around layer2121and peak at the visual\-integration layerlvis⋆=24l^\{\\star\}\_\{\\mathrm\{vis\}\}\{=\}24, before final answer commitment atlfinal⋆=31l^\{\\star\}\_\{\\mathrm\{final\}\}\{=\}31, where MLP writes account for∼72%\\sim 72\\%of the residual update\.
Table 10: Layer-wise computational pipeline (LLaVA-1.5-7B). Decomposition of the 32-layer stack into functional roles, with the per-layer change in truth-margin $\Delta M$ and the dominant component (attention vs. MLP) responsible for that change. Three regimes emerge: feature extraction (0–16), reliability emergence and consolidation (17–19), and an attention-dominated suppression band (21–28) that ultimately decides correctness.

| Layers | Role | $\Delta M$ | Dominant component |
| --- | --- | --- | --- |
| 0–16 | Feature extraction | low variance | n/a |
| 17 | Early prediction onset | n/a | probe acc. 82.3% |
| 19 | Margin boost | +0.53 | MLP |
| 21–28 | Suppression / re-balance | −0.85 → −2.27 | attention (72%) |
| 24 | Maximum separation (vis. peak) | n/a | largest correct/incorrect gap |
| 29 | Neuron commitment | n/a | probe acc. 86.3%, sparse 5.7% |
| 30 | Margin boost | +2.61 | MLP |
| 31 | Final decision | +9.20 | MLP (72%) |

*Key neurons (layer 31):*

| Neuron | Association | Weight | Role |
| --- | --- | --- | --- |
| N1512 | success-associated | +27.23 | answer confidence |
| N1360 | failure-associated | −3.11 | failure detection |
| N3839 | failure-associated | −3.08 | failure detection |
| N2660 | failure-associated | −2.95 | failure detection |
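To make the margin-trajectory measurement concrete, here is a minimal logit-lens sketch, assuming LLaVA-style module paths (`lm.model.norm`, `lm.lm_head`; these vary by transformers version) and with `pos` the answer-token position; it is an illustration of the technique, not the paper's pipeline.

```python
# Layer-wise truth margin via the logit lens: project each layer's
# residual stream through the final norm + unembedding and track
# logit(correct answer token) - logit(incorrect answer token).
import torch

@torch.no_grad()
def truth_margin_trajectory(model, hidden_states, pos, tok_correct, tok_wrong):
    lm = model.language_model  # module path varies by transformers version
    margins = []
    for h in hidden_states:    # one (batch, seq, d) tensor per layer
        logits = lm.lm_head(lm.model.norm(h[:, pos]))
        margins.append((logits[:, tok_correct] - logits[:, tok_wrong]).mean().item())
    return margins  # divergence around layer 21, peak near the visual peak (24)
```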