Dismantling Pathological Shortcuts: A Causal Framework for Faithful LVLM Decoding

arXiv cs.AI Papers

Summary

This paper reveals that hallucination in large vision-language models is caused by a dynamic structural misalignment where certain attention heads act as risky mediators, decoupling from visual evidence to lock onto language priors. The authors propose Fox, a training-free causal intervention framework that diagnoses and physically severs these pathological shortcuts, achieving state-of-the-art performance in faithful decoding.

arXiv:2606.27596v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) exhibit sophisticated reasoning but remain susceptible to object hallucination. Deviating from the prevailing attention intensity assumption, we reveal a deeper dynamic structural misalignment: hallucination is triggered at decision-critical steps where specific attention heads, acting as risky mediators, decouple from visual evidence to lock onto language priors. This establishes a pathological shortcut that bypasses visual grounding. To dismantle this, we propose Fox (Faithfulness and Observational-flow via eXpression-rectification), a training-free inference-time framework. Fox diagnoses structural misalignment using a visual attention entropy probe to localize risky mediators unsupervisedly. We then execute a targeted causal intervention via numerical logit saturation to physically sever the shortcut path. Finally, a conflict-gated cooperative decoding strategy reconciles interventional faithfulness with observational fluency. Extensive experiments demonstrate that Fox achieves SOTA performance, outperforming SID by 29.1% while preserving linguistic richness. Code is available at https://github.com/Cc2021start/Fox.


# Dismantling Pathological Shortcuts: A Causal Framework for Faithful LVLM Decoding
Source: [https://arxiv.org/html/2606.27596](https://arxiv.org/html/2606.27596)
###### Abstract

Large Vision-Language Models (LVLMs) exhibit sophisticated reasoning but remain susceptible to object hallucination. Deviating from the prevailing attention intensity assumption, we reveal a deeper dynamic structural misalignment: hallucination is triggered at decision-critical steps where specific attention heads, acting as risky mediators, decouple from visual evidence to lock onto language priors. This establishes a pathological shortcut that bypasses visual grounding. To dismantle this, we propose Fox (Faithfulness and Observational-flow via eXpression-rectification), a training-free inference-time framework. Fox diagnoses structural misalignment using a visual attention entropy probe to localize risky mediators unsupervisedly. We then execute a targeted causal intervention via numerical logit saturation to physically sever the shortcut path. Finally, a conflict-gated cooperative decoding strategy reconciles interventional faithfulness with observational fluency. Extensive experiments demonstrate that Fox achieves SOTA performance, outperforming SID by 29.1% while preserving linguistic richness. Code is available at [https://github.com/Cc2021start/Fox](https://github.com/Cc2021start/Fox).

Large Vision\-Language Models, Hallucination Mitigation, Causal Intervention, Faithful Decoding

![Refer to caption](https://arxiv.org/html/2606.27596v1/x1.png) Figure 1: Motivation of our work. (a) Global visual attention magnitude $m_{V,tail}$ and distribution lack discriminative power to identify hallucination. (b) While global magnitude boosting (Green) fails to suppress the pathological peak on system instructions at decision-critical steps, our structural intervention (Blue) on risky mediators eliminates this shortcut, restoring visual grounding. (c) Unlike coarse-grained enhancement across all layers (Left), Fox performs a sparse, surgical intervention on diagnosed risky mediators (Right), physically severing the prior-driven shortcut.

## 1 Introduction

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal reasoning (Liu et al., [2023](https://arxiv.org/html/2606.27596#bib.bib18); Wan et al., [2025](https://arxiv.org/html/2606.27596#bib.bib27)). Despite these advancements, they frequently suffer from object hallucination, i.e., generating content that contradicts visual evidence (Leng et al., [2024a](https://arxiv.org/html/2606.27596#bib.bib13); Nie et al., [2025](https://arxiv.org/html/2606.27596#bib.bib21)). This poses severe risks in safety-critical domains, such as medical imaging or embodied AI (Wang et al., [2023](https://arxiv.org/html/2606.27596#bib.bib28); Tian et al., [2024](https://arxiv.org/html/2606.27596#bib.bib25)), where a single hallucinated token can trigger catastrophic reasoning failures.

Current mitigation strategies generally fall into training-time alignment (Bai et al., [2025](https://arxiv.org/html/2606.27596#bib.bib2); Liu et al., [2024a](https://arxiv.org/html/2606.27596#bib.bib17)) or inference-time intervention (Leng et al., [2024b](https://arxiv.org/html/2606.27596#bib.bib14); An et al., [2025](https://arxiv.org/html/2606.27596#bib.bib1); Li et al., [2025](https://arxiv.org/html/2606.27596#bib.bib15); Fazli et al., [2025](https://arxiv.org/html/2606.27596#bib.bib8)). While training-based methods incur substantial computational costs, inference-time interventions have gained traction for their model-agnostic efficiency (Zhang et al., [2025](https://arxiv.org/html/2606.27596#bib.bib37); Chen et al., [2024](https://arxiv.org/html/2606.27596#bib.bib6); Che et al., [2025](https://arxiv.org/html/2606.27596#bib.bib3)). Despite technical variations, most existing approaches share a common premise, which we term the attention intensity assumption: hallucination is primarily attributed to a quantitative deficit in visual attention (Chen et al., [2025](https://arxiv.org/html/2606.27596#bib.bib5)). Consequently, these methods seek to rectify failures by mechanically amplifying visual signals (e.g., PAI (Liu et al., [2024b](https://arxiv.org/html/2606.27596#bib.bib19))) or suppressing language priors (Leng et al., [2024b](https://arxiv.org/html/2606.27596#bib.bib14)). However, this intuition proves empirically incomplete, particularly for strategies predicated on global magnitude enhancement. In Fig. [1](https://arxiv.org/html/2606.27596#S0.F1)(a), $m_{V,tail}$ denotes the visual attention magnitude, i.e., the total attention weight allocated to image tokens. A controlled analysis reveals no statistically significant reduction in global visual attention magnitude for hallucinated outputs: the distributions for hallucinated and faithful outputs largely overlap. This lack of discriminative power suggests that focusing solely on intensity overlooks the underlying structural misalignment of hallucination. The decisive failure is therefore not only how much visual mass is assigned, but where the final prediction is routed at the moment of content generation. For more details, cf. Appendix [A.1](https://arxiv.org/html/2606.27596#A1.SS1) and [A.2](https://arxiv.org/html/2606.27596#A1.SS2).

Motivated by this, we shift our focus from global magnitude to the transient pathology triggered at decision-critical steps. We observe that hallucination is driven by specific attention heads, i.e., risky mediators, that functionally decouple from visual evidence precisely when the model commits to content-bearing generation. As depicted in Fig. [1](https://arxiv.org/html/2606.27596#S0.F1)(b), a naive global boosting strategy (e.g., PAI (Liu et al., [2024b](https://arxiv.org/html/2606.27596#bib.bib19))) succeeds in increasing the total attention volume but fails to dismantle the localized, pathological peak on system instructions. This persistent structural bias establishes a shortcut where latent language priors bypass visual grounding to dominate the output. From a causal perspective, these heads act as unreliable mediators that reroute influence via spurious dependencies. As shown in Fig. [1](https://arxiv.org/html/2606.27596#S0.F1)(c), addressing this requires a shift from uniform, token-level adjustments toward sparse, head-level causal interventions.

To dismantle this pathological structure, we propose Fox (Faithfulness and Observational-flow via eXpression-rectification), a training-free framework grounded in a Structural Causal Model (SCM). We reformulate decoding as a causal process where attention heads at specific decision-critical steps serve as mediators. Specifically, we introduce visual attention entropy as an unsupervised probe to pinpoint risky mediators exhibiting high visual uncertainty. Upon detection, we execute a targeted intervention via the $\mathbf{do}$-operator (implemented as numerical logit saturation) to physically sever the shortcut path, forcing the model to rely on direct visual evidence. Finally, to reconcile interventional faithfulness with linguistic fluency, we implement a conflict-gated cooperative decoding strategy that dynamically fuses observational and interventional distributions. Our main contributions are summarized as follows:

- We challenge the prevailing attention intensity assumption by revealing that hallucination stems from dynamic structural misalignment. We identify risky mediators, sparse heads that structurally disconnect from visual inputs at decision-critical steps, offering a novel mechanistic perspective on LVLM failures.
- We propose Fox, a principled inference-time framework rooted in SCM. By intersecting decision-critical steps with visual attention entropy probes, we achieve precise, unsupervised localization and $\mathbf{do}$-driven suppression of pathological shortcuts.
- Extensive experiments demonstrate that Fox significantly outperforms existing baselines, achieving a 22.9% improvement on CHAIR and mitigating hallucination while preserving descriptive richness.

![Refer to caption](https://arxiv.org/html/2606.27596v1/x2.png) Figure 2: Structural Causal Model (SCM) of the LVLM Decoding Path. (a) Observational SCM: The latent mediators $H$ are localized at decision-critical steps. While stable mediators $H_S$ maintain visual grounding, risky mediators $H_R$ trigger a pathological shortcut (red arrow) from language priors $\mathbf{X}_{sys}$ to output $Y_t$. (b) Interventional SCM: By applying $\mathbf{do}(H_R)$, we sever the shortcut. The final output is dynamically reconciled from observational ($P_{obs}$) and interventional ($P_{do}$) distributions.

## 2 Related Work

Hallucination Mitigation in LVLMs. Existing strategies generally fall into training-time alignment (Sun et al., [2023](https://arxiv.org/html/2606.27596#bib.bib24); Zhou et al., [2024](https://arxiv.org/html/2606.27596#bib.bib41)) and inference-time intervention (Zhu et al., [2026](https://arxiv.org/html/2606.27596#bib.bib42); Tong et al., [2025](https://arxiv.org/html/2606.27596#bib.bib26); Yu et al., [2026](https://arxiv.org/html/2606.27596#bib.bib36)). Given the substantial cost of retraining, recent research has favored inference-time methods: contrastive decoding approaches such as VCD (Leng et al., [2024b](https://arxiv.org/html/2606.27596#bib.bib14)) and ICD (Wang et al., [2024](https://arxiv.org/html/2606.27596#bib.bib29)) penalize language priors via negative constraints, while reweighting methods (Liu et al., [2024b](https://arxiv.org/html/2606.27596#bib.bib19); Zou et al., [2024](https://arxiv.org/html/2606.27596#bib.bib43)) mechanically amplify visual signals. More granular studies, such as OPERA (Huang et al., [2024b](https://arxiv.org/html/2606.27596#bib.bib11)) and SEVI (Zhao et al., [2025](https://arxiv.org/html/2606.27596#bib.bib38)), attempt to regulate generation by penalizing over-trust or emphasizing specific semantic layers. However, they predominantly hinge on the attention intensity assumption, treating the magnitude of attention as the primary proxy for faithfulness. Unlike these intensity-based heuristics, we argue that hallucination stems from a dynamic structural misalignment. By shifting the focus from global magnitude to the visual attention entropy at decision-critical steps, we distinguish valid reasoning from confident but misaligned hallucinations, offering a more precise diagnostic granularity.

Causal Inference in Multimodal Reasoning. Structural Causal Models (SCMs) provide a rigorous framework for debiasing and interpretability in vision-language tasks (Pearl, [2009](https://arxiv.org/html/2606.27596#bib.bib22)). Related efforts use invariant learning, mixup, generated sentences, and graph contrastive pre-training to mitigate bias or enrich pretrained models (Zhou et al., [2023](https://arxiv.org/html/2606.27596#bib.bib39); Mao et al., [2023](https://arxiv.org/html/2606.27596#bib.bib20); Yu et al., [2023](https://arxiv.org/html/2606.27596#bib.bib31), [2024](https://arxiv.org/html/2606.27596#bib.bib32), [2025a](https://arxiv.org/html/2606.27596#bib.bib33), [2025c](https://arxiv.org/html/2606.27596#bib.bib35)), while bimodal debiasing extends this principle to text-to-image generation (Yu et al., [2025b](https://arxiv.org/html/2606.27596#bib.bib34)). Recent studies like CausalMM (Zhou et al., [2025](https://arxiv.org/html/2606.27596#bib.bib40)) and Huang et al. ([2024a](https://arxiv.org/html/2606.27596#bib.bib10)) employ SCMs to analyze hallucinations, typically utilizing input-level counterfactuals (such as masking image regions or tokens) to estimate causal effects. While effective for post-hoc diagnosis, these input-level perturbations are often too coarse to rectify the model's internal reasoning dynamics. In contrast, we reformulate internal attention heads as dynamic mediators. This allows us to perform surgical interventions via the $\mathbf{do}$-operator directly on the latent information flow when the model reaches decision-critical queries. By physically severing pathological shortcut paths within the network rather than altering external inputs, Fox achieves a principled, training-free restoration of visual grounding.

## 3 Preliminary

Problem Formulation. We consider an LVLM $\mathcal{F}_{\theta}$ that processes a multimodal input $\mathbf{X}$ partitioned into three semantic subspaces: visual $\mathbf{X}_{vis}$ (indices $\mathcal{I}_{vis}$), system instructions $\mathbf{X}_{sys}$ ($\mathcal{I}_{sys}$), and textual history $\mathbf{X}_{txt}$ ($\mathcal{I}_{txt}$). The model generates $Y=\{y_1,\dots,y_L\}$ autoregressively, where the next-token probability is $P(y_t\mid\mathbf{X},y_{<t})=\text{Softmax}(\mathcal{F}_{\theta}(\mathbf{X},y_{<t}))$. The internal routing mechanism is driven by multi-head attention. For head $h$ at layer $l$, the attention logits $\mathbf{L}^{(l,h)}$ and weights $\mathbf{A}^{(l,h)}$ are computed as:

$$\mathbf{L}^{(l,h)}=\frac{\mathbf{Q}^{(l,h)}(\mathbf{K}^{(l,h)})^{\top}}{\sqrt{d_k}};\quad\mathbf{A}^{(l,h)}=\text{Softmax}(\mathbf{L}^{(l,h)}).$$

Our causal intervention (§[4](https://arxiv.org/html/2606.27596#S4)) directly modulates $\mathbf{L}^{(l,h)}$ to rectify the information flow before normalization.
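To make the two quantities concrete, the following is a minimal pure-Python sketch of the logits and weights for a single head. The function names and the toy `Q`/`K` values are illustrative assumptions, not taken from the Fox codebase, and causal masking is omitted for brevity.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def head_attention(Q, K):
    """Return (L, A) for one head: L = Q K^T / sqrt(d_k), A = row-wise Softmax(L)."""
    scale = math.sqrt(len(K[0]))  # sqrt(d_k)
    logits = [
        [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        for q in Q
    ]
    weights = [softmax(row) for row in logits]
    return logits, weights

# Toy inputs (hypothetical): 2 query positions, 3 key positions, d_k = 4.
Q = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
K = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
logits, weights = head_attention(Q, K)
```

The intervention described later acts on `logits`, i.e., before the softmax is applied.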

Causal Formulation. We formalize the decoding process as an SCM in Fig. [2](https://arxiv.org/html/2606.27596#S1.F2). We identify the attention heads acting at decision-critical steps $\mathcal{Q}$ as the dynamic mediators $H$, which transmit causal influence from inputs to output. We posit that causal mediation is temporally sparse, where information is aggregated at specific nodes $(q\in\mathcal{Q},h\in\mathcal{H})$ rather than uniformly across tokens. Ideally, faithful generation requires $H$ to reliably transmit evidence via the visual path: $\mathbf{X}_{vis}\to H\to Y_t$.

However, we observe a structural misalignment of mediation in Fig. [2](https://arxiv.org/html/2606.27596#S1.F2)(a), where mediators bifurcate into: (1) Stable mediators $H_S$, maintaining grounded visual attention; (2) Risky mediators $H_R$, which functionally decouple from visual evidence to lock onto language priors. Specifically, $H_R$ establishes a pathological shortcut: $\mathbf{X}_{sys}\to H_R\to Y_t$. While any text carries priors, $\mathbf{X}_{sys}$ (e.g., "You are a helpful assistant") serves as the primary anchor for latent priors when visual grounding fails, leading to object hallucination.

Causal Intervention. To block this shortcut without retraining, we apply the intervention $\mathbf{do}(H_R:=\text{noise})$. This operation suppresses $H_R$, forcing the model to rely on the faithful visual path. This process yields two distributions: the observational anchor $P_{obs}$ and the interventional candidate $P_{do}$. As shown in Fig. [2](https://arxiv.org/html/2606.27596#S1.F2)(b), the final output $Y_{\text{final}}$ is derived via a conflict-gated causal fusion that dynamically reconciles these branches:

$$Y_{\text{final}}\sim\text{Softmax}(f_{\text{gate}}(\mathbf{z}_{obs},\mathbf{z}_{do})).\tag{1}$$

The overall framework is guided by three research questions (RQs) targeting diagnosis, intervention, and cooperation. For the algorithm and detailed methodology, cf. Appendix [B](https://arxiv.org/html/2606.27596#A2).

## 4 Method

### 4.1 Causal Diagnosis of Risky Mediators

RQ1. Given the dense multi-head attention mechanism in LVLMs, how can we unsupervisedly identify the specific attention heads that facilitate the pathological shortcut $\mathbf{X}_{sys}\to H_R\to Y_t$?

Insight I: Hallucination is a dynamic structural misalignment rather than a uniform signal deficit. From a causal perspective, the breakdown of visual grounding is concentrated at specific nodes where the model aggregates multimodal context to update its internal states. Diagnosing hallucination thus requires dual-axis localization: identifying the intersection of decision-critical steps (temporal axis) and risky mediators (spatial axis).

Temporal Axis: Identifying Decision-Critical Steps. In autoregressive transformers, information flow is anisotropic and temporally sparse. We posit that the shortcut mechanism is most detectable at decision-critical steps $\mathcal{Q}$, where the model transitions from prompt encoding to token synthesis. Unlike prior methods that average signals across all tokens, we pinpoint two pivotal temporal anchors derived from the input $\mathbf{X}$ and history $y_{<t}$:

- The Multimodal Handshake ($\mathbf{x}_{\text{last}}\in\mathbf{X}$): The terminal token of the prefix sequence $\mathbf{X}$. As the final node of multimodal integration, $\mathbf{x}_{\text{last}}$ acts as the aggregator that compresses the visual-linguistic context ($\mathbf{X}_{vis},\mathbf{X}_{sys}$) into the initial hidden state. A failure to ground on $\mathbf{X}_{vis}$ here leads to a corrupted trajectory initialization.
- The Autoregressive Anchor ($y_{t-1}$): The immediate predecessor of the current prediction $y_t$. As the proximal causal parent in the SCM, $y_{t-1}$ serves as the most sensitive probe for immediate prior dominance by language-biased heads during the generative phase.
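As a rough illustration of the two anchors, the sketch below maps them to query positions in the concatenated sequence (prefix followed by generated tokens). The 0-based indexing convention and the helper name `decision_critical_steps` are assumptions made for this example; the paper's own implementation may index positions differently.

```python
def decision_critical_steps(prefix_len, num_generated):
    """Return the query positions forming Q for the next prediction y_t.

    prefix_len:    number of tokens in the prefix X (positions 0..prefix_len-1).
    num_generated: number of tokens y_1..y_{t-1} already produced.
    """
    if prefix_len < 1:
        raise ValueError("the prefix must contain at least one token")
    steps = [prefix_len - 1]  # x_last: terminal token of the prefix
    if num_generated > 0:
        # y_{t-1}: the most recently generated token
        steps.append(prefix_len + num_generated - 1)
    return steps

# Before any token is generated, only the multimodal handshake is available.
first_step = decision_critical_steps(prefix_len=10, num_generated=0)
# After three generated tokens, both anchors are present.
later_step = decision_critical_steps(prefix_len=10, num_generated=3)
```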

![Refer to caption](https://arxiv.org/html/2606.27596v1/x3.png) Figure 3: Empirical validation of the joint risk score ($S$). (a) Diagnostic fidelity: The ROC curve demonstrates that the aggregated joint risk score reliably distinguishes hallucinated trajectories from faithful ones (AUC = 0.818). (b) Structural decoupling: The distribution shift confirms that hallucination (orange) is characterized by higher joint risk, signifying the concurrent collapse of visual reliability and activation of language priors. Score construction: The sample-level score is an aggregation of head-level joint risk $S^{(l,h)}$ weighted by their respective causal contributions (Eq. ([4](https://arxiv.org/html/2606.27596#S4.E4))).

Spatial Axis: Quantifying Structural Shortcuts. At the identified steps $\mathcal{Q}$, we inspect the latent mediators (attention heads) to detect structural decoupling. Following the subspaces defined in §[3](https://arxiv.org/html/2606.27596#S3), we partition the key space into system ($\mathcal{I}_{sys}$), visual ($\mathcal{I}_{vis}$), and text ($\mathcal{I}_{txt}$) indices. We quantify the prior-path activation for head $h$ at layer $l$ by computing the system attention magnitude at $\mathcal{Q}$:

$$m_{\mathrm{sys}}^{(l,h)}=\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\sum_{k\in\mathcal{I}_{sys}}\mathbf{A}^{(l,h)}_{q,k},\tag{2}$$

where $m_{\mathrm{sys}}$ measures the reliance on $\mathbf{X}_{sys}$ relative to the current multimodal context. For more details on $m_{\mathrm{sys}}$, cf. Appendix [A.3](https://arxiv.org/html/2606.27596#A1.SS3).
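Eq. (2) can be sketched in a few lines for one head. Here `attn` is assumed to be a row-stochastic attention matrix indexed as `attn[query][key]`; the function name and the toy numbers are hypothetical.

```python
def system_attention_magnitude(attn, query_steps, sys_indices):
    """Eq. (2): average over q in Q of the attention mass placed on system-token keys."""
    total = 0.0
    for q in query_steps:
        total += sum(attn[q][k] for k in sys_indices)
    return total / len(query_steps)

# Toy attention rows for two decision-critical queries over four keys,
# where key 0 is assumed to be the only system-instruction token.
attn = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.30, 0.30, 0.20],
]
m_sys = system_attention_magnitude(attn, query_steps=[0, 1], sys_indices=[0])
```

In this toy case the head spends 0.70 and 0.20 of its mass on the system token at the two steps, so the averaged magnitude is 0.45.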

Unsupervised Diagnosis via Visual Attention Entropy. A shortcut is only verified when the reliance on priors occurs alongside the collapse of visual reliability. To measure the causal uncertainty of the visual pathway $\mathbf{X}_{vis}\to H$, we re-normalize the attention weights strictly over the visual indices $\mathcal{I}_{vis}$ to obtain the local distribution $\hat{\mathbf{A}}^{(l,h)}_{q}$:

$$\begin{aligned}\hat{\mathbf{A}}_{q,j}^{(l,h)}&=\frac{\mathbf{A}_{q,j}^{(l,h)}}{\sum_{k\in\mathcal{I}_{vis}}\mathbf{A}_{q,k}^{(l,h)}},\quad j\in\mathcal{I}_{vis};\\H_{\mathrm{vis}}^{(l,h)}&=\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\text{Entropy}(\hat{\mathbf{A}}_{q}^{(l,h)}).\end{aligned}\tag{3}$$

High $H_{\mathrm{vis}}$ signifies attentional dispersion, where the mediator fails to extract grounded evidence (further visualization in Appendix [A.4](https://arxiv.org/html/2606.27596#A1.SS4)).
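A minimal sketch of Eq. (3) for one head follows. It assumes Shannon entropy in nats and adds a small `eps` to the normalizer to avoid division by zero; both choices, along with the function name, are assumptions of this example rather than details stated in the text.

```python
import math

def visual_attention_entropy(attn, query_steps, vis_indices, eps=1e-12):
    """Eq. (3): mean entropy of attention re-normalized over visual keys only."""
    total = 0.0
    for q in query_steps:
        mass = sum(attn[q][k] for k in vis_indices)
        probs = [attn[q][k] / (mass + eps) for k in vis_indices]
        # Terms with p == 0 contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(query_steps)

# Keys 1 and 2 are assumed to be the visual tokens.
attn = [
    [0.50, 0.25, 0.25],  # visual mass spread evenly -> entropy ln 2
    [0.10, 0.90, 0.00],  # visual mass on a single patch -> entropy ~ 0
]
h_vis = visual_attention_entropy(attn, query_steps=[0, 1], vis_indices=[1, 2])
```

The first row is maximally dispersed over the two visual keys and the second is fully concentrated, so the average lands halfway between $\ln 2$ and zero.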

Joint Risk Scoring and Mediator Selection. We define the risky mediator $H_R$ as a node that facilitates the pathological shortcut through two concurrent conditions: (1) high causal uncertainty $H_{\mathrm{vis}}$ and (2) active prior-path transmission $m_{\mathrm{sys}}$. The joint risk score is:

$$S^{(l,h)}=m_{\mathrm{sys}}^{(l,h)}\cdot H_{\mathrm{vis}}^{(l,h)}.\tag{4}$$

This multiplicative form functions as a conjunctive filter: it selectively pinpoints mediators where the breakdown of visual grounding directly correlates with the dominance of language priors. This avoids over-penalizing heads that remain visually grounded, making the score more transferable across LVLM backbones with different attention scales. As evidenced by Fig. [3](https://arxiv.org/html/2606.27596#S4.F3), $S$ serves as a robust diagnostic probe to localize $H_R$.
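The conjunctive behaviour of Eq. (4) is easy to see on toy numbers. The sketch below scores three hypothetical heads keyed by `(layer, head)` and ranks them; the ranking step is only an illustration, since the exact selection rule for $H_R$ is not spelled out in this section.

```python
def joint_risk_scores(m_sys, h_vis):
    """Eq. (4): S = m_sys * H_vis for every (layer, head) key."""
    return {head: m_sys[head] * h_vis[head] for head in m_sys}

def rank_heads(scores):
    """Order heads from highest to lowest joint risk (illustrative selection aid)."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-head statistics.
m_sys = {(0, 0): 0.8, (0, 1): 0.8, (1, 0): 0.1}  # prior-path activation
h_vis = {(0, 0): 2.0, (0, 1): 0.1, (1, 0): 2.0}  # visual attention entropy
scores = joint_risk_scores(m_sys, h_vis)
ranking = rank_heads(scores)
```

Only head `(0, 0)` is high on both axes, so it ranks first; a head with strong system attention but sharp visual grounding, such as `(0, 1)`, receives the lowest score.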

![Refer to caption](https://arxiv.org/html/2606.27596v1/x4.png) Figure 4: Overview of Fox. (1) Causal Diagnosis: Identifying risky mediators $H_R$ by intersecting decision-critical queries with the conjunctive measurement of prior-path activation $m_{\mathrm{sys}}$ and visual uncertainty $H_{\mathrm{vis}}$. (2) Causal Intervention: Executing the $\mathbf{do}$-operator via numerical logit saturation to physically sever pathological shortcuts. (3) Adaptive Decoding: Reconciling observational and interventional distributions via a conflict-gated causal fusion to ensure both faithfulness and linguistic fluency.

### 4.2 Reliability-Aware Causal Intervention

RQ2. How can we physically block the pathological shortcut path $\mathbf{X}_{sys}\to H_R\to Y_t$ without altering fixed model parameters?

Insight II: Effective mitigation requires a surgical intervention that severs the prior-driven shortcut while maintaining the model's structural reasoning capacity. We implement the $\mathbf{do}$-operator via numerical logit saturation. By projecting the pre-softmax activations of risky mediators into a low-precision regime, we reset the pathological path to a baseline noise state without inducing the distributional shifts associated with coarse pruning.

Causal Intervention via Numerical Saturation. As established in §[3](https://arxiv.org/html/2606.27596#S3), the causal influence of a mediator is physically realized through its attention weights $\mathbf{A}=\text{Softmax}(\mathbf{L})$. To execute the intervention $\mathbf{do}(H_R:=\text{noise})$, we modulate the logit-level parents $\mathbf{L}$ directly to achieve finer control over information flow. For any diagnosed risky mediator $(l,h)\in H_R$ at the decision-critical steps $q\in\mathcal{Q}$, we apply a substantial negative bias $\gamma$:

$$\tilde{\mathbf{L}}^{(l,h)}=\begin{cases}\Pi_{\text{dtype}}\left(\mathbf{L}^{(l,h)}-\gamma\right),&\text{if }(l,h)\in H_R,\\\mathbf{L}^{(l,h)},&\text{otherwise},\end{cases}\tag{5}$$

where $\gamma$ is a large intervention constant and $\Pi_{\text{dtype}}(\cdot)$ denotes the projection onto the model's numerical precision.

Causal Rationale of the Intervention. While Softmax is shift-invariant in exact arithmetic, its finite-precision implementation provides a unique mechanism for path modulation. By forcing logits into a saturation regime, the intervention achieves two primary objectives:

1. Suppression of Pathological Shortcuts: For typical logit ranges, the exponential term $e^{x-\gamma}$ rapidly approaches the machine epsilon $\epsilon_{\text{mach}}$. This precision loss smooths out minor variances encoding spurious language priors, effectively decoupling the mediator from $\mathbf{X}_{sys}$.
2. Preservation of Structural Anchors: Unlike binary masking, this numerical shift allows exceptionally strong structural signals to remain distinct. If an attention peak is sufficiently robust, it can survive the precision collapse, ensuring the intervention does not destroy the essential connectivity of the model's reasoning trajectory.

The resulting distribution $\tilde{\mathbf{A}}^{(l,h)}=\text{Softmax}(\tilde{\mathbf{L}}^{(l,h)})$ represents the post-interventional state where the pathological prior-dependency is neutralized, forcing the model to rely on grounded visual trajectories.
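The effect of Eq. (5) can be reproduced on a toy example under stated assumptions: $\Pi_{\text{dtype}}$ is taken to be rounding to IEEE 754 half precision (via the standard-library `struct` format `e`), and $\gamma = 8192$ is an arbitrary illustrative constant. Neither value is specified in this section, and the softmax here subtracts the row maximum, so the only effect shown is the rounding of the shifted logits.

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def saturate(logits, gamma):
    """Eq. (5) for one risky head: project (L - gamma) onto fp16 (assumed dtype)."""
    return [to_fp16(v - gamma) for v in logits]

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Three weak logits with small differences and one strong peak (toy values).
logits = [0.3, 1.1, 1.6, 29.0]
shifted = saturate(logits, gamma=8192.0)
weights = softmax(shifted)
```

Near magnitude 8192 the spacing between adjacent half-precision values is 4, so the three weak logits round to the same number and receive identical attention, while the strong peak stays distinct. This mirrors the two objectives listed above, although the real behaviour depends on the backbone's dtype and attention kernel.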

### 4.3 Conflict-Gated Cooperative Decoding

Following the intervention in §[4.2](https://arxiv.org/html/2606.27596#S4.SS2), the model generates two concurrent distributions at each step $t$: the observational anchor $P_{obs}$ derived from the full causal graph, and the interventional candidate $P_{do}$ derived from $\mathbf{do}(H_R)$. While $P_{do}$ enforces visual grounding, relying on it in isolation may degrade linguistic stability. The final challenge is to adaptively reconcile these distributions to maximize faithfulness.

RQ3. How can we dynamically inject interventional evidence to correct hallucinations without disrupting the model's global linguistic stability?

Insight III: Interventional signals should be calibrated by their alignment with the observational manifold. We propose conflict-gated fusion: when the two branches reach a consensus, we amplify the interventional signal to solidify evidence; when they diverge, we apply a conservative adjustment to preserve linguistic fluency.

Quantifying Causal Conflict via Truncated JSD. We measure the disagreement between the biased observation and the debiased intervention via Jensen-Shannon Divergence (JSD). To minimize tail noise, we truncate the vocabulary $\mathcal{V}$ to a candidate set $\mathcal{V}_t=\{y\in\mathcal{V}\mid P_{obs}(y)>\beta\cdot\max_{w}P_{obs}(w)\}$. The conflict signal $d_t$ is:

$$d_t=\mathrm{JSD}\big(P_{obs}(\cdot\mid\mathcal{V}_t)\parallel P_{do}(\cdot\mid\mathcal{V}_t)\big).\tag{6}$$

Functionally, $d_t$ quantifies the sensitivity of the prediction to the pathological shortcut. Truncation reduces noise from low-probability tail tokens that fluctuate across branches without changing the selected word.
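A small sketch of the conflict signal in Eq. (6) is given below. It interprets $P(\cdot\mid\mathcal{V}_t)$ as the distribution restricted to the candidate set and re-normalized, uses natural logarithms, and assumes both branches assign positive mass to the candidates (true for softmax outputs). These are assumptions of the example, as are the function names.

```python
import math

def candidate_set(p_obs, beta):
    """Indices y with P_obs(y) > beta * max_w P_obs(w)."""
    top = max(p_obs)
    return [i for i, p in enumerate(p_obs) if p > beta * top]

def truncated_jsd(p_obs, p_do, beta):
    """Eq. (6): JSD between both branches, restricted to the candidate set."""
    keep = candidate_set(p_obs, beta)

    def renorm(p):
        mass = sum(p[i] for i in keep)
        return [p[i] / mass for i in keep]

    def kl(u, v):
        return sum(x * math.log(x / y) for x, y in zip(u, v) if x > 0.0)

    a, b = renorm(p_obs), renorm(p_do)
    mix = [(x + y) / 2 for x, y in zip(a, b)]
    return 0.5 * kl(a, mix) + 0.5 * kl(b, mix)

# Toy 4-token vocabulary; beta = 0.1 matches the value reported in the experiments.
p_obs = [0.60, 0.30, 0.08, 0.02]
p_do = [0.10, 0.80, 0.08, 0.02]
d_t = truncated_jsd(p_obs, p_do, beta=0.1)
```

With these numbers the last token falls below the threshold and is dropped, and the divergence is computed over the remaining three candidates. In nats, the value is bounded above by $\ln 2$.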

Conflict-Gated Fusion Strategy. Let $\mathbf{z}_{obs}$ and $\mathbf{z}_{do}$ denote the logits from the two branches. We fuse them via a dynamic weight $\lambda_t$ to obtain the final logits $\mathbf{z}_{final}=\mathbf{z}_{obs}+\lambda_t\cdot\mathbf{z}_{do}$. The weight $\lambda_t$ is governed by the conflict magnitude $d_t$ relative to a threshold $\tau_{\mathrm{JS}}$:

$$\lambda_t=\begin{cases}\alpha,&d_t<\tau_{\mathrm{JS}}\\d_t,&d_t\geq\tau_{\mathrm{JS}}\end{cases}\tag{7}$$

This strategy calibrates the interventional influence by its alignment with the observational manifold. In the consensus regime ($d_t<\tau_{\mathrm{JS}}$), we apply a fixed gain $\alpha$ to solidify the grounded evidence. In the conflict regime ($d_t\geq\tau_{\mathrm{JS}}$), the interventional signal acts as a calibrated correction where $\lambda_t=d_t$ ensures the shift toward faithfulness remains anchored to the structural stability of $P_{obs}$. The final token $y_t\sim\text{Softmax}(\mathbf{z}_{final})$ thus achieves a principled balance between interventional fidelity and linguistic fluency.
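The gating rule in Eq. (7) and the logit fusion can be sketched directly. The defaults $\alpha = 2$ and $\tau_{\mathrm{JS}} = 0.2$ follow the values reported in the implementation details; the function names and toy logits are illustrative assumptions.

```python
def fusion_weight(d_t, alpha, tau_js):
    """Eq. (7): fixed gain alpha under consensus, d_t itself under conflict."""
    return alpha if d_t < tau_js else d_t

def fuse_logits(z_obs, z_do, d_t, alpha=2.0, tau_js=0.2):
    """z_final = z_obs + lambda_t * z_do, applied element-wise over the vocabulary."""
    lam = fusion_weight(d_t, alpha, tau_js)
    return [o + lam * d for o, d in zip(z_obs, z_do)]

# Toy logits over a 2-token vocabulary.
z_obs = [1.0, 0.0]
z_do = [0.5, 1.0]
consensus = fuse_logits(z_obs, z_do, d_t=0.05)  # lambda_t = alpha = 2
conflict = fuse_logits(z_obs, z_do, d_t=0.50)   # lambda_t = d_t = 0.5
```

Because the JSD is bounded (by $\ln 2$ in nats), the conflict branch always uses a weight below one, whereas the consensus branch applies the larger fixed gain, which matches the amplify-on-consensus, conservative-on-conflict behaviour described above.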

## 5Experiments

Models and Baselines\.We conduct experiments on three representative LVLMs:LLaVA\-1\.5\(Liu et al\.,[2023](https://arxiv.org/html/2606.27596#bib.bib18)\),Shikra\(Chen et al\.,[2023](https://arxiv.org/html/2606.27596#bib.bib4)\), andInstructBLIP\(Dai et al\.,[2023](https://arxiv.org/html/2606.27596#bib.bib7)\)\. We compare Fox against five inference\-time methods:ICD\(Wang et al\.,[2024](https://arxiv.org/html/2606.27596#bib.bib29)\),VCD\(Leng et al\.,[2024b](https://arxiv.org/html/2606.27596#bib.bib14)\),OPERA\(Huang et al\.,[2024b](https://arxiv.org/html/2606.27596#bib.bib11)\),SID\(Huo et al\.,[2025](https://arxiv.org/html/2606.27596#bib.bib12)\), andCausalMM\(Zhou et al\.,[2025](https://arxiv.org/html/2606.27596#bib.bib40)\)\.

Benchmarks\.We assess performance on three standard benchmarks:POPE\(Li et al\.,[2023](https://arxiv.org/html/2606.27596#bib.bib16)\)for object existence verification;CHAIR\(Rohrbach et al\.,[2018](https://arxiv.org/html/2606.27596#bib.bib23)\)for hallucination rates in captioning \(reporting both CHAIRSand CHAIRI\); andMME\(Fu et al\.,[2025](https://arxiv.org/html/2606.27596#bib.bib9)\)for comprehensive perception evaluation \(reporting Accuracy and Accuracy\+\)\. Additionally, we employGPT\-4Vas a holistic judge to assess open\-ended generation quality\(Huang et al\.,[2024b](https://arxiv.org/html/2606.27596#bib.bib11)\)\.

Implementation Details. We adopt a unified *Nucleus Sampling* strategy ($p=0.9$, $T=1.0$) for all methods except OPERA (which requires beam search). For Fox, we fix $\alpha=2$ and $\beta=0.1$, and set $(k, \tau_{\mathrm{JS}})$ to $(0.45, 0.2)$ for LLaVA-1.5, $(0.4, 0.2)$ for InstructBLIP, and $(0.4, 0.2)$ for Shikra. All experiments are performed on NVIDIA A100 GPUs. Detailed configurations are given in Appendix [C.2](https://arxiv.org/html/2606.27596#A3.SS2).
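
For readers unfamiliar with the decoding setup, the following is a minimal sketch of nucleus (top-$p$) sampling with temperature, using the $p=0.9$, $T=1.0$ values stated above as defaults. It is a generic illustration over a plain list of logits, not the evaluation code used in the paper; the function name `nucleus_sample` and the toy logits are hypothetical.

```python
import math
import random


def nucleus_sample(logits, p=0.9, temperature=1.0, rng=random):
    """Sample a token index from the smallest set of tokens whose mass reaches p."""
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least probable until cumulative mass >= p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break

    # random.choices renormalizes the weights within the nucleus.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


# A sharply peaked distribution: the top token alone exceeds p = 0.9.
peaked = nucleus_sample([5.0, 1.0, 0.0, -3.0], rng=random.Random(0))
```

Because sampling is stochastic, comparisons between methods under this setting depend on the random seed, which is one reason the ablation also reports a deterministic greedy-decoding setting.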

### 5.1 Main Results

##### Results on POPE\.

Table [1](https://arxiv.org/html/2606.27596#S5.T1) reports the performance across three LVLM backbones. Fox achieves consistent improvements, notably reaching an Accuracy of 81.93% on LLaVA-1.5 in the Adversarial setting. This gain under high-bias conditions validates that while intensity-based methods (e.g., VCD) fail to decouple visual evidence from linguistic traps, Fox physically severs the pathological shortcut $\mathbf{X}_{sys} \to H_R \to Y_t$ via surgical intervention on risky mediators. This targeted suppression forces the model to rely on stable visual paths rather than learned co-occurrence priors. Unlike OPERA, Fox attains superior defense under standard sampling without search-based overhead, confirming that rectifying structural misalignment is more robust than heuristic-driven amplification. CausalMM achieves competitive POPE results by globally adjusting attention via backdoor-based counterfactual reasoning, but its reliance on holistic causal correction, without explicitly targeting a sparse set of high-risk attention heads, limits its effectiveness when hallucinations are driven by localized structural shortcuts.

Table 1: Results on POPE. Ran, Pop, and Adv stand for the Random, Popular, and Adversarial settings, respectively. Higher scores indicate better performance.
##### Results on CHAIR\.

Table [2](https://arxiv.org/html/2606.27596#S5.T2) reports sentence-level ($C_S$) and instance-level ($C_I$) hallucination rates for long-form captioning. Fox consistently achieves the lowest scores across all backbones, notably reducing $C_I$ to 12.90 on LLaVA-1.5 and a record 11.98 on InstructBLIP. These results correspond to relative $C_I$ reductions of 16.2% and 29.1% over SID, respectively, significantly outperforming strong baselines like OPERA and CausalMM. These results validate that instance-level hallucinations in descriptive tasks often stem from the pathological propagation of co-occurrence priors. By intervening on risky mediators at decision-critical steps, Fox effectively dismantles these shortcuts, forcing the model to re-verify each entity against visual evidence. The simultaneous reduction in $C_S$ and $C_I$ confirms that our conflict-gated strategy successfully rectifies structural misalignment without sacrificing descriptive richness or linguistic fluency.
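
As a reminder of what the two CHAIR scores measure, the sketch below follows the standard definitions from Rohrbach et al. (2018): CHAIR$_I$ is the fraction of mentioned objects that are not in the image, and CHAIR$_S$ is the fraction of captions containing at least one such object. It is a simplified illustration: object extraction and synonym mapping to the annotated categories are omitted, and the function names and toy captions are hypothetical rather than taken from the paper. Tables typically report these fractions multiplied by 100.

```python
def chair_scores(captions):
    """captions: list of (mentioned_objects, ground_truth_objects) pairs.

    Returns (CHAIR_S, CHAIR_I) as fractions in [0, 1].
    """
    n_mentions = n_hallucinated = n_bad_captions = 0
    for mentioned, truth in captions:
        hallucinated = [obj for obj in mentioned if obj not in truth]
        n_mentions += len(mentioned)
        n_hallucinated += len(hallucinated)
        n_bad_captions += bool(hallucinated)  # caption counts once if any object is wrong
    chair_i = n_hallucinated / n_mentions if n_mentions else 0.0
    chair_s = n_bad_captions / len(captions) if captions else 0.0
    return chair_s, chair_i


def relative_reduction(baseline, ours):
    """Relative reduction of a lower-is-better score with respect to a baseline."""
    return (baseline - ours) / baseline


# Toy data (illustrative only): two of four captions mention an absent object.
toy = [
    (["dog", "frisbee"], {"dog"}),
    (["cat", "sofa"], {"cat", "sofa"}),
    (["person", "car", "bench"], {"person", "car"}),
    (["bird"], {"bird"}),
]
toy_chair_s, toy_chair_i = chair_scores(toy)
```

The relative reductions quoted above are of the form computed by `relative_reduction`, with the SID score as the baseline.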

Table 2: Results on CHAIR. $C_S$ and $C_I$ denote $\text{CHAIR}_S$ and $\text{CHAIR}_I$ (smaller scores indicate fewer hallucinations).

![Refer to caption](https://arxiv.org/html/2606.27596v1/x5.png)

Figure 5: Performance on the MME benchmark. Higher scores indicate better effectiveness. Fox achieves the highest total scores across all evaluated backbones, particularly excelling in the evidence-driven subsets Position and Color.

![Refer to caption](https://arxiv.org/html/2606.27596v1/x6.png)

Figure 6: GPT-4V-assisted hallucination evaluation. Left: Correctness (higher = less hallucination); Right: Detailedness. Lines highlight within-backbone trends.

![Refer to caption](https://arxiv.org/html/2606.27596v1/x7.png)

Figure 7: Layer-wise diagnostic strength of visual uncertainty. The $|\mathrm{AUC}-0.5|$ metric (Y-axis) quantifies the separability between faithful and hallucinated samples based on entropy probes. While deeper layers exhibit higher peak discriminative strength, the red box indicates the optimal intervention window (early-to-mid layers) where structural rectification yields the most significant grounding improvements.

##### Results on MME Benchmark\.

Fig. [5](https://arxiv.org/html/2606.27596#S5.F5) illustrates the performance across fine-grained subsets. Fox consistently improves the total score across all backbones (results for Shikra are in Appendix [C.6](https://arxiv.org/html/2606.27596#A3.SS6)), notably increasing LLaVA-1.5's score to 613.33, surpassing the strong baseline SID (600.00). The most significant gains occur in evidence-dependent dimensions, such as Position ($93.33 \to 131.37$) and Color ($150.00 \to 165.00$). These results provide empirical evidence that attribute-level assertions are highly susceptible to prior-path dominance. By disrupting the pathological shortcut at decision-critical steps, Fox forces the model to re-verify fine-grained visual evidence, explaining the marked improvements in the Position and Color subsets. The consistent enhancement across backbones confirms that rectifying structural misalignment ensures model assertions remain anchored in visual reality rather than linguistic plausibility.

##### Results on GPT\-4V Evaluation\.

Fig. [6](https://arxiv.org/html/2606.27596#S5.F6) summarizes the qualitative evaluation using GPT-4V as a holistic judge, following the standardized evaluation protocol in Appendix [C.3](https://arxiv.org/html/2606.27596#A3.SS3.SSS0.Px4). Across all backbones, Fox consistently improves Correctness while simultaneously enhancing Detailedness, indicating a superior fidelity–detail trade-off. Specifically, on LLaVA-1.5, Fox raises Correctness from 5.89 to 7.04 and Detailedness from 5.82 to 6.19. These upward trends confirm that our gains are not achieved via overly conservative or "evasive" descriptions, but via authentic visual grounding. These results further validate our causal intervention mechanism. Unlike global re-weighting strategies that often lead to distribution shifts or information loss, Fox targets risky mediators only at decision-critical steps. By surgically severing pathological shortcuts while preserving the structural integrity of the generative manifold, our method effectively suppresses prior-driven behaviors without incurring a loss in descriptive richness. Consequently, Fox rectifies the dynamic structural misalignment of baseline decoding, ensuring responses are both faithful and expressive.

### 5.2 Ablation Study

![Refer to caption](https://arxiv.org/html/2606.27596v1/x8.png)

Figure 8: Effectiveness of diagnosis-driven head selection. We report the performance gains of Fox over a random-intervention baseline across different intervention ratios $K$ on the POPE benchmark. Improvements in Mean Accuracy ($\Delta$ Mean Acc) and Mean F1 ($\Delta$ Mean F1) consistently validate that targeting specific risky mediators is superior to stochastic head suppression.

![Refer to caption](https://arxiv.org/html/2606.27596v1/x9.png)

Figure 9: Impact of $k$, $\tau_{\mathrm{JS}}$, and $\beta$ on hallucination and informativeness in LLaVA-1.5, evaluated on 500 COCO samples.

![Refer to caption](https://arxiv.org/html/2606.27596v1/x10.png)

Figure 10: Open-ended captioning comparison. Hallucinations are marked in red. Fox effectively mitigates hallucination while preserving descriptive richness.

![Refer to caption](https://arxiv.org/html/2606.27596v1/x11.png)

Figure 11: Qualitative examples of VLMs on POPE for object existence prediction. Red: hallucination; Green: correct predictions.

##### Validation of Diagnosis and Structural Intervention.

As shown in Fig. [7](https://arxiv.org/html/2606.27596#S5.F7), visual uncertainty signals exhibit a non-uniform distribution across layers. Although peak discriminative patterns (Max $|\mathrm{AUC}-0.5|$) are prominent in deeper layers, we empirically find that structural intervention at these late stages yields diminishing returns for hallucination mitigation. In contrast, targeting early-to-mid layers (Layers 2–13) more effectively regulates the evidence aggregation process, achieving a superior trade-off between faithfulness and fluency. Consequently, we define this critical window as our default intervention range.

To verify the precision of our head localization, we compare diagnosis-driven selection against a random selection baseline under a constant budget $K$. As illustrated in Fig. [8](https://arxiv.org/html/2606.27596#S5.F8), Fox outperforms random perturbations across $K$ values, confirming that the observed gains stem from accurately isolating a sparse set of risky mediators rather than stochastic noise. To further eliminate potential confounders from sampling stochasticity, we evaluate Fox in a deterministic setting (i.e., greedy decoding). In this regime, Fox maintains clear advantages over strong baselines such as VCD and SID (see Appendix [C.7](https://arxiv.org/html/2606.27596#A3.SS7) for details). These results reinforce that our intervention effectively addresses the dynamic structural misalignment by surgically severing pathological shortcuts.

##### Hyperparameter Sensitivity\.

Fig. [9](https://arxiv.org/html/2606.27596#S5.F9) investigates the sensitivity of Fox using 500 COCO samples with LLaVA-1.5, varying the per-layer head suppression ratio $k$, the conflict threshold $\tau_{\mathrm{JS}}$, and the truncation ratio $\beta$. Overall, Fox remains stable across a broad range of values, with $C_S$, $C_I$, and F1 changing smoothly, suggesting that the framework does not rely on fragile tuning. Among these, $\tau_{\mathrm{JS}}$ most strongly governs the trade-off between faithfulness and informativeness. A larger $\tau_{\mathrm{JS}}$ makes the gate more selective, suppressing prior-driven shortcut propagation more aggressively and typically reducing hallucination rates; however, excessive thresholds may over-constrain the generation and slightly decrease F1. In contrast, $k$ exhibits milder effects: increasing $k$ consistently lowers hallucination with minimal F1 fluctuations. Finally, $\beta$ controls candidate truncation in conflict estimation, where moderate values yield the most robust performance. Similar trends are observed on InstructBLIP and Shikra (see Appendix [C.4](https://arxiv.org/html/2606.27596#A3.SS4) for details).

### 5.3 Case Study

Qualitative analysis confirms Fox's precision. Fig. [10](https://arxiv.org/html/2606.27596#S5.F10) shows that in open-ended captioning, Fox restores visual grounding by suppressing prior-driven behaviors. The resulting descriptions remain strictly faithful to visual evidence without incurring a loss in descriptive richness, effectively addressing the dynamic structural misalignment inherent in standard autoregressive decoding. These examples also show that the intervention is token-selective: it does not erase the full descriptive context, but suppresses unsupported entities that emerge from prior-dominated mediators. Similarly, as illustrated in Fig. [11](https://arxiv.org/html/2606.27596#S5.F11), while baselines succumb to existence biases by fabricating non-existent objects, Fox rectifies these errors by intervening on risky mediators to dismantle pathological shortcuts. This behavior suggests localized causal repair rather than generic conservativeness.

## 6 Conclusion & Future Work

In this paper, we challenge the prevailing attention intensity assumption and propose Fox, a training\-free causal framework for mitigating hallucination by rectifying dynamic structural misalignment\. By treating attention heads as dynamic mediators, Fox uses visual attention entropy to identify and suppress risky mediators that form pathological shortcuts at decision\-critical steps\. A conflict\-gated cooperative decoding strategy further balances visual faithfulness and linguistic fluency\. Extensive experiments show that Fox achieves superior hallucination reduction while preserving generation quality and maintaining the latency regime of contrastive decoding baselines\. Future work will extend this causal diagnosis to compositional failures, including relation, attribute, and counting errors, and to video\-language and tool\-augmented LVLMs, where temporal or external evidence may introduce new shortcuts\. We will also study adaptive mediator discovery under distribution shift\.

## Acknowledgements

This work was supported by the Sichuan Provincial Natural Science Foundation Project \(Grant No\. 2026NSFSC0427\), the Chengdu Technological Innovation and R&D Project \(Grant No\. 2026YF0800348GX\), and the China Scholarship Council \(CSC, Grant No\. 202506070076\)\.

## Impact Statement

This paper aims to improve the faithfulness of large vision\-language models by reducing object hallucinations at inference time\. More reliable multimodal generation can benefit applications such as assistive technologies, educational tools, content understanding, and information retrieval, where users may depend on model outputs to interpret visual evidence\. By framing hallucination as a causal shortcut induced by dynamic mediator heads, the proposed method also provides a more interpretable perspective on when and how visual grounding fails\.

At the same time, our method should not be viewed as a guarantee of factual correctness\. LVLMs may still inherit dataset biases, fail under distribution shift, or produce unsupported claims in complex scenes beyond the object\-centric settings studied here\. Reducing hallucination may also increase users’ trust in model outputs, so deployment in safety\-critical or high\-stakes domains should retain human oversight, uncertainty communication, and task\-specific evaluation\. We do not anticipate specific negative societal impacts beyond those generally associated with multimodal foundation models, and we hope this work encourages more transparent and accountable LVLM deployment\.

## References

- An et al. (2025) An, W., Tian, F., Leng, S., Nie, J., Lin, H., Wang, Q., Chen, P., Zhang, X., and Lu, S. Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In *Proceedings of the Computer Vision and Pattern Recognition Conference*, pp. 29915–29926, 2025.
- Bai et al. (2025) Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., and Shou, M. Z. Hallucination of multimodal large language models: A survey, 2025. URL [https://arxiv.org/abs/2404.18930](https://arxiv.org/abs/2404.18930).
- Che et al. (2025) Che, L., Liu, T. Q., Jia, J., Qin, W., Tang, R., and Pavlovic, V. Hallucinatory image tokens: A training-free eazy approach on detecting and mitigating object hallucinations in lvlms, 2025. URL [https://arxiv.org/abs/2503.07772](https://arxiv.org/abs/2503.07772).
- Chen et al. (2023) Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., and Zhao, R. Shikra: Unleashing multimodal llm's referential dialogue magic. *arXiv preprint arXiv:2306.15195*, 2023.
- Chen et al. (2025) Chen, X., Zhang, Y., Liu, Q., Wu, J., Zhang, F., and Tan, T. Mixture of decoding: An attention-inspired adaptive decoding strategy to mitigate hallucinations in large vision-language models, 2025. URL [https://arxiv.org/abs/2505.17061](https://arxiv.org/abs/2505.17061).
- Chen et al. (2024) Chen, Z., Zhao, Z., Luo, H., Yao, H., Li, B., and Zhou, J. Halc: Object hallucination reduction via adaptive focal-contrast decoding, 2024. URL [https://arxiv.org/abs/2403.00425](https://arxiv.org/abs/2403.00425).
- Dai et al. (2023) Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P. N., and Hoi, S. Instructblip: Towards general-purpose vision-language models with instruction tuning. *Advances in neural information processing systems*, 36:49250–49267, 2023.
- Fazli et al. (2025) Fazli, M., Wei, B., Sari, A., and Zhu, Z. Mitigating hallucination in large vision-language models via adaptive attention calibration, 2025. URL [https://arxiv.org/abs/2505.21472](https://arxiv.org/abs/2505.21472).
- Fu et al. (2025) Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. In *The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track*, 2025.
- Huang et al. (2024a) Huang, P.-H., Li, J.-L., Chen, C.-P., Chang, M.-C., and Chen, W.-C. Who brings the frisbee: Probing hidden hallucination factors in large vision-language model via causality analysis. *arXiv preprint arXiv:2412.02946*, 2024a.
- Huang et al. (2024b) Huang, Q., Dong, X., Zhang, P., Wang, B., He, C., Wang, J., Lin, D., Zhang, W., and Yu, N. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 13418–13427, 2024b.
- Huo et al. (2025) Huo, F., Xu, W., Zhang, Z., Wang, H., Chen, Z., and Zhao, P. Self-introspective decoding: Alleviating hallucinations for large vision-language models, 2025. URL [https://arxiv.org/abs/2408.02032](https://arxiv.org/abs/2408.02032).
- Leng et al. (2024a) Leng, S., Xing, Y., Cheng, Z., Zhou, Y., Zhang, H., Li, X., Zhao, D., Lu, S., Miao, C., and Bing, L. The curse of multi-modalities: Evaluating hallucinations of large multimodal models across language, visual, and audio, 2024a. URL [https://arxiv.org/abs/2410.12787](https://arxiv.org/abs/2410.12787).
- Leng et al. (2024b) Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., and Bing, L. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 13872–13882, 2024b.
- Li et al. (2025) Li, J., Zhang, J., Jie, Z., Ma, L., and Li, G. Mitigating hallucination for large vision language model by inter-modality correlation calibration decoding, 2025. URL [https://arxiv.org/abs/2501.01926](https://arxiv.org/abs/2501.01926).
- Li et al. (2023) Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. *arXiv preprint arXiv:2305.10355*, 2023.
- Liu et al. (2024a) Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., and Wang, L. Mitigating hallucination in large multi-modal models via robust instruction tuning, 2024a. URL [https://arxiv.org/abs/2306.14565](https://arxiv.org/abs/2306.14565).
- Liu et al. (2023) Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. *Advances in neural information processing systems*, 36:34892–34916, 2023.
- Liu et al. (2024b) Liu, S., Zheng, K., and Chen, W. Paying more attention to image: A training-free method for alleviating hallucination in lvlms. In *European Conference on Computer Vision*, pp. 125–140. Springer, 2024b.
- Mao et al. (2023) Mao, Y., Yu, L., Yang, Y., Zhou, F., and Zhong, T. Debiasing intrinsic bias and application bias jointly via invariant risk minimization (student abstract). In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 37, pp. 16280–16281, 2023.
- Nie et al. (2025) Nie, J., Zhang, G., An, W., Xing, Y., Tan, Y.-P., Kot, A. C., and Lu, S. Mmrel: Benchmarking relation understanding in multi-modal large language models, 2025. URL [https://arxiv.org/abs/2406.09121](https://arxiv.org/abs/2406.09121).
- Pearl (2009) Pearl, J. *Causality*. Cambridge university press, 2009.
- Rohrbach et al. (2018) Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T., and Saenko, K. Object hallucination in image captioning. *arXiv preprint arXiv:1809.02156*, 2018.
- Sun et al. (2023) Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L.-Y., Wang, Y.-X., Yang, Y., Keutzer, K., and Darrell, T. Aligning large multimodal models with factually augmented rlhf, 2023. URL [https://arxiv.org/abs/2309.14525](https://arxiv.org/abs/2309.14525).
- Tian et al. (2024) Tian, X., Gu, J., Li, B., Liu, Y., Wang, Y., Zhao, Z., Zhan, K., Jia, P., Lang, X., and Zhao, H. Drivevlm: The convergence of autonomous driving and large vision-language models, 2024. URL [https://arxiv.org/abs/2402.12289](https://arxiv.org/abs/2402.12289).
- Tong et al. (2025) Tong, B., Xia, J., and Zhou, K. Mitigating hallucination in multimodal llms with layer contrastive decoding. *arXiv preprint arXiv:2509.25177*, 2025.
- Wan et al. (2025) Wan, Z., Xie, Y., Zhang, C., Lin, Z., Wang, Z., Stepputtis, S., Ramanan, D., and Sycara, K. Instructpart: Task-oriented part segmentation with instruction reasoning, 2025. URL [https://arxiv.org/abs/2505.18291](https://arxiv.org/abs/2505.18291).
- Wang et al. (2023) Wang, S., Zhao, Z., Ouyang, X., Wang, Q., and Shen, D. Chatcad: Interactive computer-aided diagnosis on medical image using large language models, 2023. URL [https://arxiv.org/abs/2302.07257](https://arxiv.org/abs/2302.07257).
- Wang et al. (2024) Wang, X., Pan, J., Ding, L., and Biemann, C. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. *arXiv preprint arXiv:2403.18715*, 2024.
- Yang et al. (2023) Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.-C., Liu, Z., and Wang, L. The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023. URL [https://arxiv.org/abs/2309.17421](https://arxiv.org/abs/2309.17421).
- Yu et al. (2023) Yu, L., Mao, Y., Wu, J., and Zhou, F. Mixup-based unified framework to overcome gender bias resurgence. In *Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval*, pp. 1755–1759, 2023.
- Yu et al. (2024) Yu, L., Guo, L., Kuang, P., and Zhou, F. Biases mitigation and expressiveness preservation in language models: A comprehensive pipeline (student abstract). In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 38, pp. 23701–23702, 2024.
- Yu et al. (2025a) Yu, L., Guo, L., Kuang, P., and Zhou, F. Bridging the fairness gap: Enhancing pre-trained models with llm-generated sentences. In *ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, pp. 1–5. IEEE, 2025a.
- Yu et al. (2025b) Yu, L., Sun, J., Kuang, P., Zhou, R., Zhou, F., and Feng, Z. Bimodal debiasing for text-to-image diffusion: Adaptive guidance in textual and visual spaces. In *Proceedings of the 33rd ACM International Conference on Multimedia*, pp. 11249–11258, 2025b.
- Yu et al. (2025c) Yu, L., Tian, F., Kuang, P., and Zhou, F. Amplifying commonsense knowledge via bi-directional relation integrated graph-based contrastive pre-training from large language models. *Information Processing & Management*, 62(3):104068, 2025c.
- Yu et al. (2026) Yu, L., Chen, Z., Kuang, P., Feng, Z., Zhou, F., Wang, L., and Dobbie, G. Causally-grounded dual-path attention intervention for object hallucination mitigation in lvlms. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 40, pp. 36021–36029, 2026.
- Zhang et al. (2025) Zhang, C., Wan, Z., Kan, Z., Ma, M. Q., Stepputtis, S., Ramanan, D., Salakhutdinov, R., Morency, L.-P., Sycara, K., and Xie, Y. Self-correcting decoding with generative feedback for mitigating hallucinations in large vision-language models, 2025. URL [https://arxiv.org/abs/2502.06130](https://arxiv.org/abs/2502.06130).
- Zhao et al. (2025) Zhao, J., Zhang, F., Sun, X., and Feng, C. Mitigating hallucination in large vision-language models through aligning attention distribution to information flow, 2025. URL [https://arxiv.org/abs/2505.14257](https://arxiv.org/abs/2505.14257).
- Zhou et al. (2023) Zhou, F., Mao, Y., Yu, L., Yang, Y., and Zhong, T. Causal-debias: Unifying debiasing in pretrained language models and fine-tuning via causal invariant learning. In *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pp. 4227–4241, 2023.
- Zhou et al. (2025) Zhou, G., Yan, Y., Zou, X., Wang, K., Liu, A., and Hu, X. Mitigating modality prior-induced hallucinations in multimodal large language models via deciphering attention causality. In *ICLR*, 2025.
- Zhou et al. (2024) Zhou, Y., Cui, C., Yoon, J., Zhang, L., Deng, Z., Finn, C., Bansal, M., and Yao, H. Analyzing and mitigating object hallucination in large vision-language models, 2024. URL [https://arxiv.org/abs/2310.00754](https://arxiv.org/abs/2310.00754).
- Zhu et al. (2026) Zhu, C., Liu, Y., Zhang, H., Wang, A., Chen, G., Wang, L., Luo, W., and Zhang, K. Alleviating hallucinations in large language models through multi-model contrastive decoding and dynamic hallucination detection. *Advances in Neural Information Processing Systems*, 38:165364–165388, 2026.
- Zou et al. (2024) Zou, X., Wang, Y., Yan, Y., Lyu, Y., Zheng, K., Huang, S., Chen, J., Jiang, P., Liu, J., Tang, C., et al. Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models. *arXiv preprint arXiv:2410.03577*, 2024.

## Appendix A Additional Evidence for Motivation and Uncertain Signal

### A.1 Statistical Setup and Window Definitions

Dataset and Sampling. We conduct a fine-grained head-level statistical analysis on LLaVA-1.5 to provide empirical grounding for our causal assumptions. We utilize two balanced subsets: 1,000 faithful samples and 1,000 hallucinated samples. For each sample, we extract the raw attention probability matrices during autoregressive generation to characterize the internal information flow. We then compute metrics across the layer–head dimension to localize structural misalignment during decision-critical multimodal integration.

Window Definitions. To verify that our identified diagnostic signals are not transient artifacts but reflect robust structural properties of the causal graph, we define two query-position windows, indexed by the window length $\text{tail}$. These windows are defined on the *Text-Tail* query set $\mathcal{Q}_{\text{tail}}$, representing post-image textual positions where language priors typically begin to compete with visual evidence. The *Prefill-Last* decision point, serving as the multimodal handshake, is analyzed separately as the trajectory initialization bottleneck.

- Instantaneous Window ($\text{tail}=1$): Statistics are computed using only the final query position of the current step. This setting captures the mediator behavior at the precise *instant of causal decision*, directly aligning with the "Decision-Critical Steps" identified in our SCM. It is uniquely sensitive to the transient structural "locking" where a head decouples from visual evidence to favor priors.
- Smoothed Window ($\text{tail}=32$): Statistics are aggregated over the 32 most recent text query positions. This setting filters out token-level stochasticity to reveal the *persistent structural bias* of specific attention heads across a broader local context.

Causal Complementarity. These windows provide multi-scale validation of our framework: $\text{tail}=1$ identifies the immediate trigger of pathological shortcuts at decision boundaries, while $\text{tail}=32$ confirms the stability of risky mediators as unreliable information conduits. Reporting both settings demonstrates that while signal magnitude may vary, the underlying structural characteristics of *risky mediators* (specifically their tendency to mediate prior-driven influence) remain invariant to temporal smoothing.

### A.2 Global Differences in Visual Attention

We investigate whether object hallucination originates from a *quantitative* deficit in global visual grounding, as suggested by the prevailing attention intensity assumption. By analyzing the sample-level distribution of visual attention mass, we provide empirical evidence that hallucination is a structural pathology rather than a simple signal-magnitude failure.

Head-level Visual Mass. For each layer $l$ and head $h$, let $\mathbf{L}^{(l,h)}$ denote the pre-softmax attention activations and

$$\mathbf{A}^{(l,h)} = \mathrm{Softmax}\!\left(\mathbf{L}^{(l,h)}\right) \tag{8}$$

denote the resulting attention weights. Given the visual token indices $\mathcal{I}_{vis} = [v_s, v_e)$ and the windowed query set $\mathcal{Q}_{\text{tail}}$ (Appendix [A.1](https://arxiv.org/html/2606.27596#A1.SS1)), the head-level visual attention mass is defined as:

$$m^{(l,h)}_{V,\text{tail}} = \frac{1}{|\mathcal{Q}_{\text{tail}}|} \sum_{q \in \mathcal{Q}_{\text{tail}}} \sum_{k \in \mathcal{I}_{vis}} \mathbf{A}^{(l,h)}_{q,k} \in [0,1]. \tag{9}$$

This quantity represents the total causal weight a specific mediator allocates to the visual path $X_{vis} \to H$ during the generation window.

Sample-level Aggregation. To evaluate the model's global observational state, we aggregate the mass across all latent mediators:

$$m_{V,\text{tail}} = \frac{1}{L \times H} \sum_{l=1}^{L} \sum_{h=1}^{H} m^{(l,h)}_{V,\text{tail}}. \tag{10}$$

Counter-evidence for Intensity-based Explanations. We compare the distributions of $m_{V,\text{tail}}$ for faithful and hallucinated samples in Figure [12](https://arxiv.org/html/2606.27596#A1.F12). Critically, hallucinated outputs do not exhibit a statistically significant reduction in global visual attention mass. As illustrated in Figure [12](https://arxiv.org/html/2606.27596#A1.F12) (a–b), the distributions for both groups overlap substantially across both instantaneous ($\text{tail}=1$) and smoothed ($\text{tail}=32$) windows.
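
The head-level and sample-level masses of Eqs. (9) and (10) can be sketched as follows. This is an illustrative sketch under simplifying assumptions, not the analysis code: each head's attention is represented as a mapping from query position to an already-normalized attention row, and the names `tail_queries`, `head_visual_mass`, `sample_visual_mass` and the toy values are hypothetical.

```python
def tail_queries(text_positions, tail):
    """Window of the `tail` most recent post-image text query positions."""
    return text_positions[-tail:]


def head_visual_mass(attn, queries, vis_start, vis_end):
    """Eq. (9): mean over window queries of the attention mass on visual keys.

    attn maps a query position to its attention row (weights summing to 1);
    the visual token indices form the half-open range [vis_start, vis_end).
    """
    return sum(sum(attn[q][vis_start:vis_end]) for q in queries) / len(queries)


def sample_visual_mass(attn_by_head, queries, vis_start, vis_end):
    """Eq. (10): average of head-level masses over all layers and heads.

    attn_by_head[l][h] is the attention mapping of head h in layer l.
    """
    masses = [head_visual_mass(attn, queries, vis_start, vis_end)
              for layer in attn_by_head for attn in layer]
    return sum(masses) / len(masses)


# Toy sequence: index 0 is a system token, 1..3 are visual tokens, 4..5 are text.
head_a = {4: [0.2, 0.1, 0.2, 0.1, 0.4, 0.0],
          5: [0.5, 0.1, 0.0, 0.1, 0.2, 0.1]}
head_b = {4: [0.6, 0.0, 0.0, 0.0, 0.4, 0.0],   # ignores the image entirely
          5: [0.8, 0.0, 0.0, 0.0, 0.1, 0.1]}

instant = head_visual_mass(head_a, tail_queries([4, 5], 1), 1, 4)
smoothed = sample_visual_mass([[head_a, head_b]], tail_queries([4, 5], 2), 1, 4)
```

In the toy example the sample-level average hides the fact that `head_b` places no mass on the image, which is the kind of head-level structure a global average cannot reveal.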

These results provide empirical evidence against the coarse *global-intensity* hypothesis. They indicate that the model maintains sufficient global visual engagement, yet fails due to *structural misalignment* at decision-critical steps. This supports our shift from global magnitude boosting toward surgical intervention on sparse, high-risk mediators that facilitate pathological shortcuts despite nominal global attention.

![Refer to caption](https://arxiv.org/html/2606.27596v1/x12.png)

Figure 12: Hallucination is not explained by a global reduction in visual attention mass. Comparing the distribution of the sample-level global visual attention mass $m_{V,\mathrm{tail}}$ (averaged over all layers and heads) between 1,000 Faithful (blue) and 1,000 Hallucinated (orange) samples on LLaVA-1.5-7B. (a) Smoothed window ($\mathrm{tail}=32$) aggregating statistics over recent post-image text queries. (b) Instantaneous window ($\mathrm{tail}=1$) measuring the decision-step query. The two distributions largely overlap under both windows, ruling out a coarse global-intensity account and motivating head-level structural diagnosis.

### A.3 Relationship between System Prompts and Hallucinations

This section provides additional evidence that reliance on system/prefix tokens is a structured signal rather than random noise, and exhibits non-trivial separability between Faithful and Hallucinated samples on our analysis set. We report results under both the $\text{tail}=1$ and $\text{tail}=32$ windows defined in Appendix [A.1](https://arxiv.org/html/2606.27596#A1.SS1), with hallucination treated as the positive class.

##### System\-reliance metric\.

For each layer $l$ and head $h$, let $\mathbf{A}^{(l,h)}=\mathrm{Softmax}(\mathbf{L}^{(l,h)})$ denote the attention weights. Let the system/prefix indices be $\mathcal{I}_{sys}=[0,s_{img})$, where $s_{img}$ is the starting index of the visual tokens. Given the windowed query set $\mathcal{Q}_{\text{tail}}$, we define the system-attention mass:

$$m^{(l,h)}_{sys,\text{tail}}=\frac{1}{|\mathcal{Q}_{\text{tail}}|}\sum_{q\in\mathcal{Q}_{\text{tail}}}\sum_{k\in\mathcal{I}_{sys}}\mathbf{A}^{(l,h)}_{q,k}\in[0,1]. \tag{11}$$

A larger $m^{(l,h)}_{sys,\text{tail}}$ indicates stronger reliance on system/prefix priors within the observed window.
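
As a minimal sketch of Eq. (11) for a single head, the following pure-Python function averages the attention mass that each query in the window places on keys `[0, s_img)`. The toy attention matrix, the value of `s_img`, and the function name are illustrative assumptions, not values or identifiers from the paper.

```python
def system_attention_mass(attn, query_idx, s_img):
    """Eq. (11) for one head: mean over window queries of the mass on keys [0, s_img)."""
    if not query_idx:
        raise ValueError("query window must be non-empty")
    total = 0.0
    for q in query_idx:
        total += sum(attn[q][:s_img])  # mass on system/prefix keys for query q
    return total / len(query_idx)


# Toy causal attention for one head: rows are queries, each row sums to 1.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.3, 0.3, 0.4, 0.0],
    [0.1, 0.3, 0.4, 0.2],
]
s_img = 2  # hypothetical: tokens 0 and 1 are system/prefix tokens

m_tail1 = system_attention_mass(attn, [3], s_img)     # instantaneous window
m_tail2 = system_attention_mass(attn, [2, 3], s_img)  # two most recent queries
```

Here the single-query window reads only the last row, while the two-query window averages the last two rows, which mirrors the difference between the instantaneous and smoothed settings.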

##### Head\-wise discriminability\.

We compute the ROC-AUC for each head $(l,h)$ using $m^{(l,h)}_{sys,\text{tail}}$ as the score and report its centered value:

$$\Delta\mathrm{AUC}^{(l,h)}=\mathrm{AUC}^{(l,h)}-0.5. \tag{12}$$

Here, $\Delta\mathrm{AUC}^{(l,h)}>0$ indicates that higher system reliance correlates with hallucination, while $\Delta\mathrm{AUC}^{(l,h)}<0$ indicates the opposite tendency.
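
The centered AUC of Eq. (12) can be illustrated with a rank-based ROC-AUC, which equals the probability that a randomly chosen hallucinated sample scores higher than a randomly chosen faithful one (ties count one half). This is a sketch on made-up scores; the function names are hypothetical, and labels use 1 for hallucinated (the positive class) and 0 for faithful.

```python
def roc_auc(scores, labels):
    """Rank-based ROC-AUC: P(pos score > neg score) + 0.5 * P(tie)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def delta_auc(scores, labels):
    """Centered AUC of Eq. (12); the sign gives the direction of the association."""
    return roc_auc(scores, labels) - 0.5


labels = [1, 1, 0, 0]  # 1 = hallucinated, 0 = faithful
d_separating = delta_auc([0.9, 0.8, 0.3, 0.4], labels)     # higher mass -> hallucination
d_uninformative = delta_auc([0.2, 0.7, 0.5, 0.6], labels)  # no separation
d_reversed = delta_auc([0.1, 0.2, 0.8, 0.9], labels)       # opposite tendency
```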

##### Weighted sample\-level aggregation \(analysis\-only\)\.

To summarize the system-reliance signal into a sample-level score for visualization, we aggregate head-wise masses with $\Delta\mathrm{AUC}$ weighting:

$$s_n(m_{sys,\text{tail}})=\frac{\sum_{l,h}m^{(l,h)}_{sys,\text{tail}}(n)\cdot\Delta\mathrm{AUC}^{(l,h)}}{\sum_{l,h}\left|\Delta\mathrm{AUC}^{(l,h)}\right|}. \tag{13}$$

This aggregation is used *only* as a diagnostic probe on the analysis set; it is not a deployed hallucination detector.
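
A small sketch of the aggregation in Eq. (13) for one sample: head-wise masses are weighted by their signed $\Delta\mathrm{AUC}$ and normalized by the sum of absolute weights. The dictionaries keyed by `(layer, head)` and all numbers are illustrative assumptions.

```python
def aggregate_score(masses, deltas):
    """Eq. (13): signed, delta-AUC-weighted average of head-wise masses for one sample."""
    norm = sum(abs(d) for d in deltas.values())
    if norm == 0.0:
        raise ValueError("all delta-AUC weights are zero")
    return sum(masses[head] * deltas[head] for head in deltas) / norm


# Hypothetical head-wise weights and one sample's system-attention masses.
deltas = {(0, 0): 0.3, (0, 1): -0.1, (1, 0): 0.0}
masses = {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.5}

score = aggregate_score(masses, deltas)  # (0.24 - 0.02 + 0.0) / 0.4
```

Heads with $\Delta\mathrm{AUC}=0$ contribute nothing, and heads with negative $\Delta\mathrm{AUC}$ lower the score when their mass is high, so the result stays oriented toward the hallucination class.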

##### Results and analysis\.

Figure [13](https://arxiv.org/html/2606.27596#A1.F13) plots ROC curves based on $\{s_n(m_{sys,\text{tail}})\}$. The instantaneous window ($\text{tail}=1$) yields an AUC of 0.8226, while the smoothed window ($\text{tail}=32$) yields an AUC of 0.7626, indicating a stable and non-trivial signal. The higher AUC under $\text{tail}=1$ suggests that system reliance is most pronounced at the instantaneous decision step. In our method, this system-reliance cue is combined with visual uncertainty to form the joint risk score used for mediator localization in the main text.

![Refer to caption](https://arxiv.org/html/2606.27596v1/x13.png)

Figure 13: All-head aggregated system reliance separates hallucinated from faithful samples. We compute an analysis-only sample-level score by aggregating the head-wise system attention mass $m^{(l,h)}_{sys,\text{tail}}$ over *all* heads with $\Delta\mathrm{AUC}^{(l,h)}$ weighting (Appendix [A.3](https://arxiv.org/html/2606.27596#A1.SS3)); a higher score indicates stronger reliance on system/prefix tokens. (a) Smoothed window ($\text{tail}=32$), AUC = 0.7626. (b) Instantaneous window ($\text{tail}=1$), AUC = 0.8226. The stronger separability under $\text{tail}=1$ indicates that system reliance peaks at instantaneous decision steps.

### A.4 Visual Uncertainty and Joint Risk Scoring

This section provides complementary evidence for the joint risk formulation used in the main text (Section [4.1](https://arxiv.org/html/2606.27596#S4.SS1)) to localize high-risk mediators. Rather than treating visual uncertainty as an isolated cue, we study how *visual-side instability* and *system-prior reliance* co-occur at the head level, and show that their multiplicative interaction yields a *sparse* and *stable* diagnostic signal for hallucination risk. All statistics follow the same windowed query set $\mathcal{Q}_{\text{tail}}$ and token partitions introduced earlier (Appendix [A.1](https://arxiv.org/html/2606.27596#A1.SS1)–[A.3](https://arxiv.org/html/2606.27596#A1.SS3)), with hallucination treated as the positive class.

##### Visual uncertainty via modality\-specific entropy\.

Given attention weights $\mathbf{A}^{(l,h)}$, we quantify visual-side uncertainty by the dispersion of attention *within the visual subspace*. For each query position $q\in\mathcal{Q}_{\text{tail}}$, we re-normalize attention over the visual keys $\mathcal{I}_{vis}$:

$$p^{(l,h)}_{q,k}=\frac{\mathbf{A}^{(l,h)}_{q,k}}{\sum_{j\in\mathcal{I}_{vis}}\mathbf{A}^{(l,h)}_{q,j}+\varepsilon},\quad k\in\mathcal{I}_{vis}, \tag{14}$$

where $\varepsilon$ is a small constant for numerical stability. We then define the modality-specific visual entropy:

$$H^{(l,h)}_{vis}(q)=-\sum_{k\in\mathcal{I}_{vis}}p^{(l,h)}_{q,k}\log\!\left(p^{(l,h)}_{q,k}+\varepsilon\right), \tag{15}$$

and average it over the tail window:

$$H^{(l,h)}_{vis,\text{tail}}=\frac{1}{|\mathcal{Q}_{\text{tail}}|}\sum_{q\in\mathcal{Q}_{\text{tail}}}H^{(l,h)}_{vis}(q). \tag{16}$$

A larger $H^{(l,h)}_{vis,\text{tail}}$ indicates more dispersed (less certain) routing of visual evidence at decision-critical steps.
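
To make Eqs. (14) to (16) concrete, the sketch below re-normalizes one attention row over the visual keys, computes its entropy, and averages over a query window. The value of `eps`, the index sets, and the toy rows are assumptions chosen for illustration.

```python
import math


def visual_entropy(attn_row, vis_idx, eps=1e-8):
    """Eqs. (14)-(15): entropy of one query's attention re-normalized over visual keys."""
    vis_mass = sum(attn_row[k] for k in vis_idx)
    p = [attn_row[k] / (vis_mass + eps) for k in vis_idx]
    return -sum(pk * math.log(pk + eps) for pk in p)


def windowed_visual_entropy(attn, query_idx, vis_idx, eps=1e-8):
    """Eq. (16): mean visual entropy over the queries in the tail window."""
    if not query_idx:
        raise ValueError("query window must be non-empty")
    return sum(visual_entropy(attn[q], vis_idx, eps) for q in query_idx) / len(query_idx)


vis_idx = [2, 3, 4, 5]  # hypothetical positions of visual tokens
diffuse = [0.3, 0.3, 0.1, 0.1, 0.1, 0.1]  # visual mass spread evenly
peaked = [0.3, 0.3, 0.4, 0.0, 0.0, 0.0]   # visual mass on a single token

h_diffuse = visual_entropy(diffuse, vis_idx)  # close to log(4)
h_peaked = visual_entropy(peaked, vis_idx)    # close to 0
h_window = windowed_visual_entropy([diffuse, peaked], [0, 1], vis_idx)
```

Because the row is re-normalized inside the visual subspace, the entropy depends only on how the visual mass is distributed, not on how much mass the visual tokens receive in total.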

##### Joint risk score for mediator localization\.

Visual uncertainty alone is insufficient to characterize hallucination: a head can be visually uncertain yet benign if it does not propagate language priors into generation. Following the main text, we define a head-level joint risk score as the interaction between visual uncertainty and system reliance:

$$S^{(l,h)}_{\text{tail}}=m^{(l,h)}_{sys,\text{tail}}\cdot H^{(l,h)}_{vis,\text{tail}}. \tag{17}$$

The multiplicative form suppresses benign cases where only one factor is high and highlights heads exhibiting the co-activation pattern most consistent with prior-driven shortcut behavior.
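
The following sketch combines Eq. (17) with the per-layer Top-$\lceil k\cdot H\rceil$ selection described in Appendix B. The nested-list layout `[layer][head]`, the function names, and the toy values are assumptions; only the product form and the per-layer budget come from the text.

```python
import math


def joint_risk(m_sys, h_vis):
    """Eq. (17): product of system-attention mass and visual entropy for one head."""
    return m_sys * h_vis


def select_risky_heads(m_sys, h_vis, k):
    """Pick the ceil(k * H) highest-risk heads in every layer; returns {(layer, head)}."""
    selected = set()
    for layer, (m_row, h_row) in enumerate(zip(m_sys, h_vis)):
        scores = [joint_risk(m, h) for m, h in zip(m_row, h_row)]
        budget = math.ceil(k * len(scores))
        ranked = sorted(range(len(scores)), key=lambda head: scores[head], reverse=True)
        selected.update((layer, head) for head in ranked[:budget])
    return selected


# One layer with four heads (toy values).
m_sys = [[0.9, 0.1, 0.8, 0.2]]  # head 2 relies on the prefix but is visually sharp
h_vis = [[3.0, 3.0, 0.1, 0.2]]  # head 1 is visually diffuse but ignores the prefix

risky_sparse = select_risky_heads(m_sys, h_vis, k=0.25)  # budget of 1 head
risky_wider = select_risky_heads(m_sys, h_vis, k=0.45)   # budget of 2 heads
```

With the smallest budget only head 0 is selected, since it is the only head where both factors are high; heads 1 and 2 each have one high factor and therefore a low product.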

##### TopHeads\-projected joint risk as a sample\-level diagnostic probe\.

To summarize head-wise joint risk into a sample-level score for analysis, we compute a directional weight for each head based on its head-wise ROC-AUC:

$$\Delta\mathrm{AUC}^{(l,h)}=\mathrm{AUC}^{(l,h)}-0.5, \tag{18}$$

rank heads by $|\Delta\mathrm{AUC}^{(l,h)}|$, and denote the Top-$K$ set by $\mathcal{H}_K$. For each sample $n$, we compute the weighted projection

$$s_n(S_{\text{tail}})=\frac{\sum_{(l,h)\in\mathcal{H}_K}\widehat{S}^{(l,h)}_{\text{tail}}(n)\cdot\Delta\mathrm{AUC}^{(l,h)}}{\sum_{(l,h)\in\mathcal{H}_K}|\Delta\mathrm{AUC}^{(l,h)}|}, \tag{19}$$

where $\widehat{S}^{(l,h)}_{\text{tail}}(n)$ is a per-head z-score computed using dataset-level statistics (mean and standard deviation) for that head. This TopHeads projection is used *only* for diagnosing sparsity/stability on the analysis set; it does not prescribe the execution-time intervention, which follows a per-layer proportional budget in the main method.
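
The projection in Eq. (19) can be sketched as follows. The use of the population standard deviation for the z-score, the fallback of zero for constant heads, and all names and numbers are assumptions; the text only states that dataset-level mean and standard deviation are used.

```python
import statistics


def top_k_heads(deltas, k):
    """Heads with the K largest |delta-AUC| values."""
    return sorted(deltas, key=lambda head: abs(deltas[head]), reverse=True)[:k]


def projected_scores(risk, deltas, k):
    """Eq. (19): per-sample weighted projection of z-scored joint risk onto Top-K heads.

    risk maps each head to a list with one joint-risk value per sample.
    """
    heads = top_k_heads(deltas, k)
    norm = sum(abs(deltas[head]) for head in heads)
    if norm == 0.0:
        raise ValueError("selected heads all have zero delta-AUC")
    stats = {head: (statistics.fmean(risk[head]), statistics.pstdev(risk[head])) for head in heads}
    n_samples = len(risk[heads[0]])
    scores = []
    for n in range(n_samples):
        total = 0.0
        for head in heads:
            mean, std = stats[head]
            z = (risk[head][n] - mean) / std if std > 0 else 0.0  # assumed fallback
            total += z * deltas[head]
        scores.append(total / norm)
    return scores


risk = {(0, 0): [1.0, 3.0], (0, 1): [2.0, 2.0], (1, 0): [5.0, 1.0]}
deltas = {(0, 0): 0.3, (0, 1): 0.01, (1, 0): -0.2}
scores = projected_scores(risk, deltas, k=2)  # uses heads (0, 0) and (1, 0)
```

In this toy case the second sample has high risk on a positively weighted head and low risk on a negatively weighted head, so it receives the higher projected score.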

##### Sparsity and a global\-aggregation reference point\.

Figure [14](https://arxiv.org/html/2606.27596#A1.F14) shows that the discriminative power of $s_n(S_{\text{tail}})$ saturates quickly as $K$ increases, indicating that the diagnostic signal is concentrated in a small subset of heads. As a reference point, if we aggregate $S^{(l,h)}_{\text{tail}}$ over *all* heads without Top-$K$ selection (i.e., replacing $\mathcal{H}_K$ by all $(l,h)$), the ROC AUC is 0.7852, which is lower than that of the Top-32 projection used in the main text (ROC AUC 0.8180). This suggests that global aggregation tends to dilute the localized signal rather than amplify it.

![Refer to caption](https://arxiv.org/html/2606.27596v1/x14.png)

Figure 14: Sample-level diagnosis and sparsity analysis of the joint risk score. All statistics are computed on the Text-Tail query set with an instantaneous window ($\text{tail}=1$). We define the head-level joint risk as $S^{(l,h)}_{\text{tail}}=m^{(l,h)}_{sys,\text{tail}}\cdot H^{(l,h)}_{vis,\text{tail}}$ and form a sample-level diagnostic score $s_n(S_{\text{tail}})$ by projecting standardized head-wise scores onto the Top-$K$ heads ranked by $|\Delta\mathrm{AUC}^{(l,h)}|$. (a) Precision–Recall curve of the Top-32 projected joint-risk score. (b) The Top-32 heads ranked by $\Delta\mathrm{AUC}^{(l,h)}$, illustrating that the signal concentrates on a sparse subset of mediators. (c) ROC AUC as a function of $K$, showing rapid saturation with increasing $K$.

### A.5 Visual Evidence of Uncertainty and Structural Reconfiguration Post-Intervention

In this section, we provide head-level visual evidence to support the *Risky Mediator* localization in the main text. We select representative heads ranked highly by the joint risk score $S^{(l,h)}_{\text{tail}}$ (Appendix [A.4](https://arxiv.org/html/2606.27596#A1.SS4)) and visualize how their attention structure changes before and after our logit-level intervention. For each head, we fix a representative sample and a decision-critical query step, and compare the Baseline (left) against the Intervention (right).

##### Visualization setup\.

Figure [15](https://arxiv.org/html/2606.27596#A1.F15) provides a multi-view diagnostic snapshot for each head at the fixed query step: (i) System Reliance, the attention mass on $\mathcal{I}_{sys}$ (i.e., $m^{(l,h)}_{sys,\text{tail}}$); (ii) Visual Attention Map, the distribution over $\mathcal{I}_{vis}$ together with the corresponding visual entropy (i.e., $H^{(l,h)}_{vis,\text{tail}}$); and (iii) Text Reliance, the attention mass on the textual context. Text reliance is shown only to illustrate redistribution after intervention and is not used in risk scoring or head selection.

##### Mechanistic interpretation\.

Our intervention applies a large negative bias to the attention logits $\mathbf{L}^{(l,h)}$ *before* Softmax, which (under finite precision) drives unreliable connections to near-zero probability and forces the remaining mass to be re-normalized. Across the four examples, we observe a consistent reconfiguration pattern: (i) strong system/prefix lock-on is reduced; (ii) the visual map becomes less diffuse and more peak-focused, reflected by a decrease in $H^{(l,h)}_{vis,\text{tail}}$; and (iii) the removed mass is redistributed to more plausible visual or textual anchors. Together, these qualitative cases provide direct evidence that heads with high $S^{(l,h)}_{\text{tail}}=m^{(l,h)}_{sys,\text{tail}}\cdot H^{(l,h)}_{vis,\text{tail}}$ indeed exhibit simultaneous system hijacking and elevated visual uncertainty, and that our intervention reconstructs their routing structure toward more grounded evidence.
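
The effect of a large negative pre-Softmax bias can be reproduced on a single attention row. This is a sketch under explicit assumptions: the bias is applied only to a chosen subset of keys (here the system/prefix keys, which this excerpt does not spell out), the dtype projection is modeled as a clamp to the most negative finite float16 value, and `gamma` is an arbitrary large constant.

```python
import math

FLOAT16_MIN = -65504.0  # most negative finite float16 value, used as an assumed clamp


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def saturate(logits, target_idx, gamma=1e9, floor=FLOAT16_MIN):
    """Subtract gamma from the targeted logits and clamp them to the dtype floor."""
    return [max(x - gamma, floor) if i in target_idx else x for i, x in enumerate(logits)]


logits = [4.0, 3.5, 1.0, 0.5, 2.0]  # toy row: keys 0-1 are system/prefix keys
sys_idx = {0, 1}                    # hypothetical target set

before = softmax(logits)
after = softmax(saturate(logits, sys_idx))

sys_mass_before = sum(before[i] for i in sys_idx)
sys_mass_after = sum(after[i] for i in sys_idx)
```

After saturation the targeted keys receive zero probability in double precision, and the surviving keys keep their relative proportions because Softmax simply re-normalizes over them.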

![Refer to caption](https://arxiv.org/html/2606.27596v1/x15.png) (a) Representative risky head (example 1).
![Refer to caption](https://arxiv.org/html/2606.27596v1/x16.png) (b) Representative risky head (example 2).
![Refer to caption](https://arxiv.org/html/2606.27596v1/x17.png) (c) Representative risky head (example 3).
![Refer to caption](https://arxiv.org/html/2606.27596v1/x18.png) (d) Representative risky head (example 4).

Figure 15: Head-level structural transformation after logit-level intervention (four representative examples). For each risky head (selected by high joint risk $S^{(l,h)}_{\text{tail}}$), we visualize the attention snapshot at a fixed decision-critical step: *system reliance* $m^{(l,h)}_{sys,\text{tail}}$ (left bar), *visual attention map* over $\mathcal{I}_{vis}$ with entropy $H^{(l,h)}_{vis,\text{tail}}$ (center heatmap), and *text reliance* (right bar). Across examples, the intervention consistently suppresses pathological system/prefix lock-on and sharpens visual evidence routing, leading to a lower visual entropy and a more concentrated set of visual peaks. These four cases illustrate that the structural reconfiguration induced by our intervention is not confined to a single head or layer, but recurs across multiple risky mediators.

## Appendix B Method

This appendix specifies how Fox *executes* the causal intervention at inference time. After identifying the risky mediators $H_R$ in our diagnosis stage, we instantiate two coupled next-token distributions at each decoding step $t$: (i) an *observational* branch $P_{obs}(\cdot)$ that follows the original computation graph, and (ii) an *interventional* branch $P_{do}(\cdot)$ that enforces $\mathbf{do}(H_R)$ by editing attention logits before Softmax on the selected heads. Causally, the intervention aims to attenuate the shortcut pathway $\mathbf{X}_{sys}\rightarrow H_R\rightarrow Y_t$ so that the next-token decision relies less on system/prefix priors and becomes more grounded in the multimodal evidence, while the observational branch preserves the model's native linguistic manifold. Our objective is to leverage the improved faithfulness of $P_{do}$ without over-committing to it when the intervention becomes overly restrictive.

##### Step\-wise procedure\.

Concretely, at each generation step, we perform the following operations:

- (1) Dual forward passes (observational vs. interventional). We compute the next-token logits from the original run to obtain $P_{obs}$. In parallel, we run the model again with $\mathbf{do}(H_R)$ applied on the diagnosed heads to obtain $P_{do}$. In practice, $\mathbf{do}(H_R)$ is implemented exactly as in Algorithm 1: on the designated intervention layers (early-to-mid range), we select the Top-$\lceil k\cdot H\rceil$ heads per layer using the joint risk score and apply a negative bias to their *pre-Softmax* attention logits on decision-critical queries, driving unreliable links to near-zero probability after re-normalization. Concretely, for $(l,h)\in H_R$ and $q\in\mathcal{Q}$, we use numerical logit saturation: $\tilde{\mathbf{L}}^{(l,h)}=\Pi_{\text{dtype}}(\mathbf{L}^{(l,h)}-\gamma)$, and then $\tilde{\mathbf{A}}^{(l,h)}=\mathrm{Softmax}(\tilde{\mathbf{L}}^{(l,h)})$.
- (2) Candidate truncation for stable conflict measurement. Measuring divergence over the full vocabulary is dominated by the long tail of near-zero probabilities. We therefore construct a decision-relevant candidate set $\mathcal{V}_t$ induced by $P_{obs}$, retaining only tokens within a fixed ratio of the top-1 probability. This truncation makes the conflict estimate focus on the local decision boundary rather than numerical tail noise.
- (3) Conflict estimation. We quantify the disagreement between the observational and interventional branches using the Jensen–Shannon divergence on the truncated candidates:

  $$d_t=\mathrm{JSD}\!\left(P_{obs}(\cdot\mid\mathcal{V}_t)\,\|\,P_{do}(\cdot\mid\mathcal{V}_t)\right). \tag{20}$$

  A small $d_t$ indicates that the intervention stays close to the observational manifold at step $t$, whereas a large $d_t$ signals a strong causal perturbation that substantially reshapes the next-token preference.
- (4) Conflict-gated injection. We convert $d_t$ into a step-wise injection weight $\lambda_t$. When the two branches are consistent (low conflict), we apply a fixed gain $\alpha$ to strengthen the interventional correction. When they diverge (high conflict), we fall back to a softer, conflict-proportional injection, preventing the interventional branch from overwhelming the observational manifold.
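
Steps (2) to (4) can be sketched on toy next-token distributions as below. Two parts are assumptions rather than statements from the text: the candidate rule is read as keeping tokens with $P_{obs}\ge\beta\cdot\max P_{obs}$, and the high-conflict branch of the gate is written as $\lambda_t=d_t$, which is only one possible reading of "conflict-proportional"; the exact gating rule is given by Algorithm 1, which is not reproduced in this excerpt.

```python
import math


def candidate_set(p_obs, beta):
    """Indices whose observational probability is within a ratio beta of the top-1."""
    top = max(p_obs)
    return [i for i, p in enumerate(p_obs) if p >= beta * top]


def renormalize(p, idx):
    """Restrict a distribution to idx and rescale it to sum to 1."""
    mass = sum(p[i] for i in idx)
    if mass <= 0.0:
        raise ValueError("no probability mass on the candidate set")
    return [p[i] / mass for i in idx]


def kl(p, q):
    """KL divergence in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence in nats, bounded above by log(2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def conflict(p_obs, p_do, beta):
    """Eq. (20): JSD between the two branches on the truncated candidates."""
    idx = candidate_set(p_obs, beta)
    return jsd(renormalize(p_obs, idx), renormalize(p_do, idx))


def injection_weight(d_t, tau_js, alpha):
    """ASSUMED gate: fixed gain alpha under low conflict, otherwise lambda_t = d_t."""
    return alpha if d_t < tau_js else d_t


p_obs = [0.6, 0.3, 0.05, 0.05]
d_low = conflict(p_obs, [0.6, 0.3, 0.1, 0.0], beta=0.1)      # branches agree on candidates
d_high = conflict(p_obs, [0.01, 0.97, 0.01, 0.01], beta=0.1)  # branches disagree
lam_low = injection_weight(d_low, tau_js=0.2, alpha=2.0)
lam_high = injection_weight(d_high, tau_js=0.2, alpha=2.0)
```

With natural logarithms the divergence never exceeds $\log 2\approx 0.693$, so under the assumed gate the high-conflict weight is always smaller than the fixed gain $\alpha=2$.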

##### Logit\-level combination\.

Finally, we couple the two branches at the logit level and select the next token:

$$\mathbf{z}_{final,t}=\mathbf{z}_{obs,t}+\lambda_t\cdot\mathbf{z}_{do,t},\qquad y_t\sim\mathrm{Softmax}(\mathbf{z}_{final,t}), \tag{21}$$

where $\lambda_t$ is determined by the conflict-gating rule described above.
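
Eq. (21) amounts to an element-wise combination of the two logit vectors followed by sampling. The sketch below uses toy three-token logits and samples from the full Softmax; the nucleus truncation used in the experiments is omitted here, and the function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(z_obs, z_do, lam):
    """Eq. (21): z_final = z_obs + lambda_t * z_do, element-wise."""
    return [zo + lam * zd for zo, zd in zip(z_obs, z_do)]


def sample_token(logits, rng):
    """Draw one token index from Softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


z_obs = [2.0, 1.0, 0.0]  # observational branch prefers token 0
z_do = [0.0, 3.0, 0.0]   # interventional branch prefers token 1

z_final = combine_logits(z_obs, z_do, lam=2.0)
token = sample_token(z_final, random.Random(0))
```

With $\lambda_t=0$ the observational logits are returned unchanged, while larger $\lambda_t$ shifts the preferred token toward the interventional branch.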

Algorithm 1 summarizes the step\-wise inference procedure\.

## Appendix C Detailed Configurations and Experimental Results

### C.1 Models and Baselines

We select LVLMs that represent diverse architectural paradigms to ensure the generalizability of our method:

- LLaVA-1.5 (Liu et al., [2023](https://arxiv.org/html/2606.27596#bib.bib18)): A widely used general-purpose baseline that connects a CLIP-ViT-L/14 encoder with the Vicuna LLM via a two-layer MLP projection.
- Shikra (Chen et al., [2023](https://arxiv.org/html/2606.27596#bib.bib4)): A structured LVLM specialized for referential dialogue and fine-grained object grounding, handling bounding box inputs/outputs.
- InstructBLIP (Dai et al., [2023](https://arxiv.org/html/2606.27596#bib.bib7)): An instruction-tuned model utilizing a Q-Former to compress visual features into soft queries for the LLM.

For baselines, we compare against the following inference\-time interventions:

- ICD (Wang et al., [2024](https://arxiv.org/html/2606.27596#bib.bib29)): Constructs contrastive instruction branches to estimate and suppress language priors.
- VCD (Leng et al., [2024b](https://arxiv.org/html/2606.27596#bib.bib14)): Introduces visual noise to amplify hallucination-prone logits via contrastive decoding.
- OPERA (Huang et al., [2024b](https://arxiv.org/html/2606.27596#bib.bib11)): A beam-search-based method that detects "over-trust" attention patterns and applies a rollback penalty.
- SID (Huo et al., [2025](https://arxiv.org/html/2606.27596#bib.bib12)): Reweights candidate tokens dynamically using self-contrastive signals to prevent error amplification.
- CausalMM (Zhou et al., [2025](https://arxiv.org/html/2606.27596#bib.bib40)): Applies counterfactual reasoning on both encoder and decoder sides to disentangle spurious correlations.

### C.2 Hyperparameters and Hardware

Sampling Strategy. To simulate realistic generation scenarios, we use nucleus sampling with top-$p=0.9$, temperature $T=1.0$, and a maximum length of 512 tokens. No repetition or length penalties are applied. Note that OPERA is evaluated using its official beam-search configuration (num\_beams=5) as it is incompatible with standard sampling.

Method-Specific Parameters. Our method involves four key hyperparameters: the per-layer head suppression ratio $k$, the conflict threshold $\tau_{\mathrm{JS}}$, the consensus amplification factor $\alpha$, and the truncation ratio $\beta$ used in conflict estimation.

We fix $\alpha=2$ across all backbones. The remaining hyperparameters $(k,\tau_{\mathrm{JS}},\beta)$ are selected in a model-specific manner via grid search on a held-out validation set. Notably, the search consistently selects the same $\beta$ across all evaluated backbones, and we therefore use $\beta=0.1$ for LLaVA-1.5, InstructBLIP, and Shikra.

The optimal configurations are:

- LLaVA-1.5: $k=0.45$, $\tau_{\mathrm{JS}}=0.2$, $\beta=0.1$.
- InstructBLIP: $k=0.4$, $\tau_{\mathrm{JS}}=0.2$, $\beta=0.1$.
- Shikra: $k=0.4$, $\tau_{\mathrm{JS}}=0.2$, $\beta=0.1$.

We report the best results averaged over 10 independent runs. Statistical significance is determined by a two-sided $t$-test ($p<0.05$). All experiments are conducted on 8× NVIDIA A100 (40GB) GPUs using PyTorch and HuggingFace Transformers.

### C.3 Detailed Metrics and Protocols

##### \(1\) POPE\.

This metric evaluates object existence through a series of binary \(Yes/No\) questions \(e\.g\., “Is there a \[object\] in the image?”\), thereby measuring the model’s propensity to fabricate non\-existent visual evidence\. POPE consists of three distinct sampling configurations to assess different facets of model reliability:

- Random: Targets are sampled randomly with broad category coverage to examine the model's general recognition and grounding capabilities across diverse objects.
- Popular: Categories with higher frequencies in the training distribution are selected to observe whether the model is more stable and less prone to hallucination when dealing with "common and familiar" objects.
- Adversarial: Categories that are frequently misreported or confused by LVLMs are selected, increasing the evaluation difficulty and revealing the model's robustness against hallucination under ambiguous or interfering inputs.

By focusing on binary probing, this benchmark bypasses the complexities associated with parsing open\-ended generated captions, ensuring a stable, fair, and adaptable evaluation process\. We report both Accuracy and the F1\-score to quantify performance\. The F1\-score serves as a balanced harmonic mean between Precision and Recall, defined as:

$$\text{Recall}=\frac{\text{Correctly identified objects}}{\text{Ground-truth objects}}, \tag{22}$$

$$\text{Precision}=\frac{\text{Correctly identified objects}}{\text{Total generated objects}}, \tag{23}$$

$$\text{F1}=2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}. \tag{24}$$

In our experimental framework, Recall characterizes the proportion of ground-truth objects successfully retrieved by the model from the visual evidence $\mathbf{X}_{vis}$, while Precision measures the ratio of generated objects that actually exist in the image rather than being hallucinations. As the harmonic mean of both, the F1-score provides a robust holistic measure of generation quality. Consequently, Accuracy and F1-score constitute our standard baseline framework for assessing the model's overall efficacy in multimodal grounding.
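
For the binary POPE setting, Eqs. (22) to (24) can be computed from Yes/No answers as sketched below. Mapping "correctly identified objects" to true positives, with "yes" as the positive answer, is an assumption about how the object-level wording applies to binary probing; the answer lists are invented.

```python
def pope_metrics(preds, golds):
    """Accuracy, precision, recall, and F1 for yes/no answers ("yes" is positive)."""
    pairs = list(zip(preds, golds))
    tp = sum(1 for p, g in pairs if p == "yes" and g == "yes")
    fp = sum(1 for p, g in pairs if p == "yes" and g == "no")
    fn = sum(1 for p, g in pairs if p == "no" and g == "yes")
    tn = sum(1 for p, g in pairs if p == "no" and g == "no")
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


preds = ["yes", "yes", "no", "no", "yes", "no"]
golds = ["yes", "no", "no", "yes", "yes", "no"]
metrics = pope_metrics(preds, golds)  # tp=2, fp=1, fn=1, tn=2
```

A model that answers "yes" to everything would reach perfect recall but low precision, which is why the F1-score is reported next to Accuracy.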

##### \(2\) CHAIR\.

The CHAIR (Rohrbach et al., [2018](https://arxiv.org/html/2606.27596#bib.bib23)) benchmark consists of two primary metrics, $\text{CHAIR}_I$ and $\text{CHAIR}_S$, which measure object hallucinations in image captioning at the instance and sentence levels, respectively. Specifically, the instance-level metric $\text{CHAIR}_I$ calculates the proportion of hallucinated objects relative to all mentioned objects in the generated captions. The sentence-level metric $\text{CHAIR}_S$ reflects the proportion of generated sentences that contain at least one hallucination.

To ensure that our intervention $\mathbf{do}(H_R)$ does not merely reduce hallucinations by excessively suppressing the generation of fine-grained details, we further incorporate Recall and F1-score as indicators of semantic completeness. These metrics verify that the performance gains are not achieved through an "evasion strategy" but through reliable visual grounding. The metrics follow the formulations below:

$$\text{CHAIR}_I=\frac{|\{\text{hallucinated objects}\}|}{|\{\text{all mentioned objects}\}|}, \tag{25}$$

$$\text{CHAIR}_S=\frac{|\{\text{sentences with hallucinations}\}|}{|\{\text{total sentences}\}|}. \tag{26}$$

The robustness of the captions is measured via:

$$\text{Recall}=\frac{|\{\text{accurately mentioned objects}\}|}{|\{\text{ground-truth objects}\}|}, \tag{27}$$

$$\text{Precision}=\frac{|\{\text{all mentioned objects}\}\cap\{\text{ground-truth objects}\}|}{|\{\text{all mentioned objects}\}|}, \tag{28}$$

$$\text{F1}=2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}. \tag{29}$$

Collectively, $\text{CHAIR}_I$, $\text{CHAIR}_S$, and the F1-score constitute our comprehensive evaluation framework for the captioning task.
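
Eqs. (25) to (29) can be illustrated on object sets that are assumed to be already extracted from captions and annotations. Treating each generated caption as one "sentence", skipping synonym mapping to dataset categories, and the example objects are all simplifications for this sketch.

```python
def chair_metrics(mentioned, ground_truth):
    """CHAIR_I, CHAIR_S, recall, precision, and F1 from per-caption object sets."""
    total_mentions = hallucinated = accurate = total_gt = 0
    captions_with_hallucination = 0
    for m, g in zip(mentioned, ground_truth):
        wrong = m - g  # mentioned objects that are absent from the image
        total_mentions += len(m)
        hallucinated += len(wrong)
        accurate += len(m & g)
        total_gt += len(g)
        captions_with_hallucination += 1 if wrong else 0
    chair_i = hallucinated / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / len(mentioned) if mentioned else 0.0
    recall = accurate / total_gt if total_gt else 0.0
    precision = accurate / total_mentions if total_mentions else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"chair_i": chair_i, "chair_s": chair_s, "recall": recall,
            "precision": precision, "f1": f1}


mentioned = [{"dog", "frisbee", "car"}, {"cat"}]
ground_truth = [{"dog", "frisbee", "tree"}, {"cat", "sofa"}]
result = chair_metrics(mentioned, ground_truth)
```

On this toy input one of four mentions is hallucinated, so $\text{CHAIR}_I=0.25$ and Precision is its complement; Recall drops because the captions omit "tree" and "sofa", which is the evasion effect the additional metrics are meant to expose.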

##### \(3\) MME\.

The MME (Fu et al., [2025](https://arxiv.org/html/2606.27596#bib.bib9)) benchmark pairs each image with two semantically similar questions whose ground-truth answers are "Yes" and "No," respectively. Evaluation is conducted using two metrics: Accuracy and Accuracy+.

- Accuracy is calculated at the question granularity: a correct response to any single question contributes to the score.
- Accuracy+ is calculated at the image granularity: a sample is counted as correct only if the model correctly answers *both* the "Yes" and "No" questions associated with the same image.

This stricter requirement ensures that the model truly perceives the visual evidence $\mathbf{X}_{vis}$ rather than relying on language priors. The final MME Score is defined as the sum of these metrics:

$$\text{Accuracy}=\frac{\sum_{i\in\mathcal{I}}\mathbf{1}[f(i,q_{yes})=\text{``Yes''}]+\sum_{i\in\mathcal{I}}\mathbf{1}[f(i,q_{no})=\text{``No''}]}{2|\mathcal{I}|}, \tag{30}$$

$$\text{Accuracy}^{+}=\frac{\sum_{i\in\mathcal{I}}\mathbf{1}[f(i,q_{yes})=\text{``Yes''}\land f(i,q_{no})=\text{``No''}]}{|\mathcal{I}|}, \tag{31}$$

$$\text{MME Score}=\text{Accuracy}+\text{Accuracy}^{+}. \tag{32}$$
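
A minimal sketch of Eqs. (30) to (32), taking one pair of correctness flags per image. The input layout and function name are assumptions, and the values are returned as fractions exactly as the equations are written, without any percentage scaling.

```python
def mme_metrics(results):
    """results: list of (yes_question_correct, no_question_correct) flags per image."""
    if not results:
        raise ValueError("need at least one image")
    n_images = len(results)
    accuracy = sum(int(a) + int(b) for a, b in results) / (2 * n_images)  # Eq. (30)
    accuracy_plus = sum(1 for a, b in results if a and b) / n_images      # Eq. (31)
    return accuracy, accuracy_plus, accuracy + accuracy_plus              # Eq. (32)


# Four images: two fully correct, one half correct, one fully wrong.
results = [(True, True), (True, False), (False, False), (True, True)]
acc, acc_plus, score = mme_metrics(results)
```

The half-correct image raises Accuracy but not Accuracy+, so a model that always answers "Yes" would score 0.5 on the former and 0 on the latter.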

##### \(4\) GPT\-4V Assisted Evaluation\.

While CHAIR and POPE effectively identify object hallucinations, they provide limited insight into the overall linguistic quality and descriptive richness of open-ended outputs. To complement these automatic metrics, we conduct a GPT-4V–assisted evaluation by following an established protocol from prior work (Huang et al., [2024b](https://arxiv.org/html/2606.27596#bib.bib11); Yang et al., [2023](https://arxiv.org/html/2606.27596#bib.bib30)). Specifically, GPT-4V acts as a multimodal judge and assigns scores under two criteria: Accuracy (factual consistency with respect to objects, attributes, and spatial/relational correctness) and Detailedness (the richness and precision of *correctly grounded* visual details). Leveraging its advanced perception, GPT-4V can capture subtle errors in color, spatial positioning, and logical relationships between objects. We conduct this evaluation on a curated subset of 50 images from the MS-COCO 2014 validation set, used solely as complementary evidence rather than a primary benchmark. For each image–model pair, we generate descriptions with a standardized prompt and evaluate both the original backbone output and the output produced with our method. GPT-4V inference is performed with max\_tokens=512 and temperature=0.2. The full prompt and required output format are provided in Table [3](https://arxiv.org/html/2606.27596#A3.T3).

> GPT-4V Prompt
>
> You are required to score the performance of two AI assistants in describing a given image. You should pay extra attention to the hallucination, which refers to the part of descriptions that are inconsistent with the image content, such as claiming the existence of something not present in the image or describing incorrectly in terms of the counts, positions, or colors of objects in the image. Please rate the responses of the assistants on a scale of 1 to 10, where a higher score indicates better performance, according to the following criteria:
>
> 1\. Accuracy: Evaluate whether the response is accurate and faithful to the actual image content. Focus on identifying any hallucinations including non-existent objects, incorrect attributes (colors, sizes, materials), wrong quantities, false spatial relationships, or activities that are not happening. Responses with fewer hallucinations and higher fidelity to the image should receive higher scores.
>
> 2\. Detailedness: Assess whether the response provides rich and informative details about the image. Consider the completeness of the description, coverage of important visual elements, and the depth of observations. Note that hallucinated content does NOT count as valid details – only accurate information contributes to this score.
>
> Please output the scores for each criterion, containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space. Following the scores, please provide an explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment.
>
> \[Assistant 1\]
>
> \{Response of Assistant 1\}
>
> \[End of Assistant 1\]
>
> \[Assistant 2\]
>
> \{Response of Assistant 2\}
>
> \[End of Assistant 2\]
>
> Output format:
>
> Accuracy: <score\_1\> <score\_2\>
>
> Reason: <your explanation\>
>
> Detailedness: <score\_1\> <score\_2\>
>
> Reason: <your explanation\>

Table 3: The prompt used for GPT-4V evaluation.
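
Since the prompt fixes the output format, the judge's reply can be parsed mechanically. The parser below is a hypothetical helper, not part of the described protocol, and it assumes the reply follows the requested `Accuracy: <score_1> <score_2>` and `Detailedness: <score_1> <score_2>` lines; the sample reply is invented.

```python
import re


def parse_judge_scores(reply):
    """Extract (assistant_1, assistant_2) scores for each criterion from a reply."""
    scores = {}
    for criterion in ("Accuracy", "Detailedness"):
        pattern = rf"{criterion}:\s*(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)"
        match = re.search(pattern, reply)
        if match is None:
            raise ValueError(f"missing scores for {criterion}")
        scores[criterion] = (float(match.group(1)), float(match.group(2)))
    return scores


sample_reply = (
    "Accuracy: 6 8\n"
    "Reason: Assistant 1 mentions a bicycle that is not in the image.\n"
    "Detailedness: 7 7\n"
    "Reason: Both describe the main objects and the background.\n"
)
parsed = parse_judge_scores(sample_reply)
```

Raising an error on a malformed reply, rather than silently skipping it, keeps missing judgments visible when scores are averaged over images.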

### C.4 Additional Ablation Studies

##### Efficiency and Latency Analysis\.

Table [4](https://arxiv.org/html/2606.27596#A3.T4) compares the inference cost on LLaVA-1.5 alongside POPE Adversarial performance. Fox achieves a superior Pareto trade-off, maintaining the same latency regime as VCD/SID ($\approx 200$ ms/token) while reaching the highest accuracy (81.93%). For 10-token generation, Fox incurs only modest overhead (1,040 ms), whereas search-based methods like OPERA exhibit significantly higher latency (2,560 ms) due to beam search and iterative rollback. These results confirm that our gains stem from a logit-level intervention within the same dual-pass contrastive decoding regime as VCD/SID, rather than increased decoding depth or search budgets. In practice, the system-prompt and visual-token KV cache can be shared between branches, keeping the overhead in the same latency class while providing a stronger balance between faithfulness and efficiency.

Table 4: Efficiency comparison. Inference latency (ms) for 1-token and 10-token generation. OPERA is not evaluated in the 1-token setting (marked as -).

#### C.4.1 Ablation Study on $\alpha$

Table 5: Sensitivity of the enhancement factor $\alpha$ on POPE.

As shown in Table [5](https://arxiv.org/html/2606.27596#A3.T5), we conduct a sensitivity analysis of the enhancement factor $\alpha$ on POPE to determine a default setting that can be reused across backbones. Concretely, we vary $\alpha$ on LLaVA-1.5 while keeping all other hyperparameters fixed, and report the averaged Accuracy and F1 over the Random/Popular/Adversarial splits. We observe that performance changes smoothly within a reasonably wide range of $\alpha$, with $\alpha=2$ achieving the best or near-best averaged performance. When further increasing $\alpha$, the gains exhibit diminishing returns and slightly regress on some splits, suggesting that overly strong consensus enhancement may bias the binary verification toward a more *robust but conservative* decision behavior, which is detrimental to overall F1. Based on this trend, we set $\alpha=2$ as the default and keep it fixed for all subsequent backbones (including InstructBLIP and Shikra), reducing model-specific tuning freedom and verifying the transferability and robustness of this choice.

#### C.4.2 Ablation Study on InstructBLIP

![Refer to caption](https://arxiv.org/html/2606.27596v1/x19.png)

Figure 16: Parameter sensitivity analysis on InstructBLIP. Impact of $k$, $\tau_{\mathrm{JS}}$, and $\beta$ on captioning performance (CHAIR$_S$, CHAIR$_I$, and F1), evaluated on 500 COCO samples.

We analyze the sensitivity of key hyperparameters on InstructBLIP. Since $\alpha$ is fixed globally to $\alpha=2$ across all backbones (see Appendix [C.4.1](https://arxiv.org/html/2606.27596#A3.SS4.SSS1)), we focus on the three parameters that govern intervention strength and conflict gating: the per-layer head suppression ratio $k$, the JSD conflict threshold $\tau_{\mathrm{JS}}$, and the truncation ratio $\beta$. Overall, these parameters mainly control the trade-off between *reliability* (lower hallucination, reflected by CHAIR$_S$/CHAIR$_I$) and *informativeness* (semantic coverage, reflected by F1), with $\tau_{\mathrm{JS}}$ serving as the primary knob.

##### Impact of $\tau_{\mathrm{JS}}$.

As $\tau_{\mathrm{JS}}$ increases, the decoding procedure more frequently follows the intervention-dominant path at uncertain steps, which consistently reduces hallucination errors (lower CHAIR$_S$ and CHAIR$_I$). However, overly large $\tau_{\mathrm{JS}}$ yields diminishing returns and may slightly regress F1, indicating a conservative bias that prioritizes safety over semantic coverage. This suggests that $\tau_{\mathrm{JS}}$ should be set within a moderate range to suppress uncertainty-driven deviations without over-stabilizing the output.

##### Impact of $k$.

The ratio $k$ controls the sparsity budget of head-level intervention per layer. Varying $k$ results in relatively smooth changes in both CHAIR and F1: moderate $k$ values typically improve CHAIR$_S$/CHAIR$_I$ without harming F1, while overly aggressive intervention (large $k$) brings limited additional gains and can slightly reduce F1, suggesting over-contraction of the effective generation space.
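To make the per-layer sparsity budget concrete, the following is a minimal sketch of how a suppression ratio could be turned into a head mask. It assumes each head already has a scalar risk score (for example, one derived from the visual attention entropy probe) where higher means riskier. The function name `select_risky_heads`, the score array, and the ceiling-based rounding are hypothetical and are not taken from the Fox codebase.

```python
import math
import numpy as np

def select_risky_heads(risk, k, layer_range):
    """Return a boolean mask (layers x heads) marking heads to suppress.

    risk        : array of shape (num_layers, num_heads); higher = riskier (assumed).
    k           : fraction of heads suppressed per layer, 0 <= k <= 1.
    layer_range : (first, last) layer indices, inclusive.
    """
    risk = np.asarray(risk, dtype=float)
    num_layers, num_heads = risk.shape
    budget = math.ceil(k * num_heads)  # heads suppressed in each selected layer
    mask = np.zeros(risk.shape, dtype=bool)
    if budget == 0:
        return mask
    first, last = layer_range
    for layer in range(first, min(last, num_layers - 1) + 1):
        top = np.argsort(risk[layer])[-budget:]  # indices of the riskiest heads
        mask[layer, top] = True
    return mask
```

With this formulation, increasing the ratio grows the number of suppressed heads linearly in every layer of the window, which is one way to read the "over-contraction" observed for large values.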

##### Impact of $\beta$.

The parameter $\beta$ controls the truncation strength in conflict estimation. InstructBLIP is relatively robust to $\beta$ within a broad range, where performance varies smoothly and improvements on CHAIR metrics remain stable. Extreme truncation may narrow the effective candidate space and lead to diminishing returns, occasionally accompanied by a slight F1 decrease.
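One common way to realize this kind of truncation in contrastive decoding is an adaptive plausibility constraint, which keeps only tokens whose probability is at least $\beta$ times the maximum probability. The sketch below assumes that form. Whether Fox applies exactly this rule inside its conflict estimation is an assumption, and `plausible_mask` and `truncate_and_renormalize` are hypothetical helper names.

```python
import numpy as np

def plausible_mask(probs, beta):
    """Boolean mask of tokens with probability >= beta * max probability."""
    probs = np.asarray(probs, dtype=float)
    return probs >= beta * probs.max()

def truncate_and_renormalize(probs, beta):
    """Zero out implausible tokens and renormalize the remainder to sum to 1."""
    probs = np.asarray(probs, dtype=float)
    kept = np.where(plausible_mask(probs, beta), probs, 0.0)
    return kept / kept.sum()  # the argmax token always survives, so the sum is > 0
```

Under this assumed rule, a larger ratio removes more of the tail, which matches the observation that extreme truncation narrows the effective candidate space.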

##### Selection of Intervention Layers in InstructBLIP\.

For InstructBLIP, we set the intervention range to layers 4–10, adhering to our core finding that intervention must occur in the early\-to\-mid stages to suppress*risky mediators*before erroneous evidence chains become consolidated in subsequent generation\.

The distinction lies in InstructBLIP’s architecture: cross\-modal information is first compressed into a compact set of visual representations via the Q\-Former’s learnable queries before fusion with the LLM\. Consequently, the “effective visual evidence” enters the language decoder in a manner that favors early information aggregation followed by progressive verbalization\.

- Early Layers ($<4$): Intervening too early often acts on the stage before multimodal fusion is fully realized, resulting in unstable gains.
- Later Layers ($>10$): Intervening in later stages primarily affects linguistic expression and decoding convergence, offering limited help in correcting biases formed during the initial fusion stage, and potentially leading to overly conservative outputs.

By targeting the 4–10 layer range, we effectively balance the suppression of uncertainty\-driven shortcuts with the preservation of the model’s inherent linguistic fluency\.

#### C.4.3 Ablation Studies on Shikra

![Refer to caption](https://arxiv.org/html/2606.27596v1/x20.png)

Figure 17: Parameter sensitivity analysis on Shikra. Impact of $k$, $\tau_{\mathrm{JS}}$, and $\beta$ on captioning performance (CHAIR$_S$, CHAIR$_I$, and F1), evaluated on 500 COCO samples.

As shown in Figure [17](https://arxiv.org/html/2606.27596#A3.F17), we examine the sensitivity of key hyperparameters on Shikra. Since $\alpha$ is fixed globally to $\alpha=2$ across all backbones (see Appendix [C.4.1](https://arxiv.org/html/2606.27596#A3.SS4.SSS1)), we focus on the three parameters that directly govern intervention strength and conflict gating: the per-layer head suppression ratio $k$, the JSD conflict threshold $\tau_{\mathrm{JS}}$, and the truncation ratio $\beta$. Overall, these parameters mainly control the trade-off between *reliability* (lower hallucination, reflected by CHAIR$_S$/CHAIR$_I$) and *informativeness* (semantic coverage, reflected by F1), with $\tau_{\mathrm{JS}}$ remaining the most influential knob.

##### Impact of $\tau_{\mathrm{JS}}$.

As $\tau_{\mathrm{JS}}$ increases, the decoding procedure more often follows the intervention-dominant path at uncertain steps, yielding consistent reductions in hallucination metrics (lower CHAIR$_S$ and CHAIR$_I$). However, overly large $\tau_{\mathrm{JS}}$ introduces diminishing returns and may slightly reduce F1, indicating a conservative bias that favors safer generation over semantic coverage. This suggests that $\tau_{\mathrm{JS}}$ should be set in a moderate range to suppress uncertainty-driven deviations without over-stabilizing the output.

##### Impact of $k$.

The ratio $k$ controls the sparsity budget of head-level intervention per layer. Figure [17](https://arxiv.org/html/2606.27596#A3.F17) shows that varying $k$ leads to relatively smooth changes in both CHAIR and F1. Moderate $k$ values typically provide a favorable balance, improving CHAIR$_S$/CHAIR$_I$ without harming F1, while overly aggressive intervention (large $k$) offers limited additional gains and can slightly regress F1.

##### Impact of $\beta$.

The parameter $\beta$ controls the truncation strength in conflict estimation. Performance varies smoothly across a broad range of $\beta$, suggesting that Shikra is relatively robust to this parameter. Moderate $\beta$ values achieve stable improvements on CHAIR metrics while maintaining F1, whereas extreme truncation may narrow the effective candidate space and yield diminishing returns.

##### Selection of Intervention Layers in Shikra\.

For Shikra, we set the intervention range to layers 3–10\. The core rationale remains consistent with our previous findings: the erroneous evidence chains of hallucinations are typically established and progressively amplified during the early\-to\-mid stages of decoding\. Intervening only in the later stages is usually insufficient for timely correction\. Therefore, we place the intervention within the “early\-to\-mid” window to suppress the influence of risky mediators on subsequent generation as early as possible\.

Compared to LLaVA\-1\.5 \(e\.g\., layers 2–15\), Shikra’s optimal window is slightly shifted forward and is narrower, primarily due to differences in model architecture and task format\.

- Early Structural Routing: As a model designed for referential grounding, Shikra’s input sequences contain more explicit region/object-related tokens. Cross-modal alignment forms strong structural routing in earlier layers.
- Generative Convergence: Subsequent layers tend toward the “convergence” of linguistic generation and instruction execution based on the established alignment. Intervening at this stage is more likely to disrupt stable generation while providing limited help in correcting errors formed during early alignment.

Based on this divergence, while maintaining the principle of “early\-to\-mid stage intervention,” we set Shikra’s intervention range to layers 3–10 to better fit its internal dynamics where alignment is established early and generation converges later\.
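The backbone-specific windows described in this appendix can be summarized as a small configuration table. This is only an illustrative sketch: the dictionary keys and the helper `should_intervene` are hypothetical names, the ranges are treated as inclusive, and the text does not state whether layer indices are 0-based or 1-based. The LLaVA-1.5 entry uses the example range quoted above.

```python
# Intervention windows quoted in this appendix, as (first, last) inclusive.
# Keys are hypothetical identifiers; the indexing base is not specified in the text.
INTERVENTION_LAYERS = {
    "llava-1.5": (2, 15),     # quoted as an example range ("e.g., layers 2-15")
    "instructblip": (4, 10),
    "shikra": (3, 10),
}

def should_intervene(backbone, layer):
    """Return True if `layer` falls inside the backbone's intervention window."""
    first, last = INTERVENTION_LAYERS[backbone]
    return first <= layer <= last
```

Keeping the window as data rather than hard-coding it in the attention hook makes the "early-to-mid" principle explicit while leaving the exact bounds as a per-backbone setting.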

Table 6:POPE F1 Score on the Random/Popular/Adversarial splits for three LVLM backbones \(LLaVA\-1\.5, InstructBLIP, and Shikra\)\. Higher is better\.

### C.5 POPE F1 Results

In the main paper, we report Accuracy on POPE as the primary metric to emphasize hallucination suppression. To provide a complementary view of the precision–recall trade-off, we additionally report F1 scores on all POPE splits in Table [6](https://arxiv.org/html/2606.27596#A3.T6). Overall, our method remains *competitive* in F1 across backbones and splits: it achieves the best F1 on LLaVA-1.5 (Random/Adversarial) and on InstructBLIP (Random/Popular), while maintaining comparable performance on the remaining settings. These results suggest that the accuracy gains reported in the main paper are not obtained by a degenerate conservative strategy that trivially avoids positive answers, but are accompanied by a balanced precision–recall behavior.
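The reason F1 complements Accuracy can be seen from how the two are computed on binary yes/no answers. The sketch below is a generic implementation of the standard definitions, with "yes" as the positive class. The function name `pope_metrics` and the toy inputs in the usage note are illustrative and are not results from the paper.

```python
def pope_metrics(preds, labels):
    """Accuracy, precision, recall and F1 for binary answers (True = "yes")."""
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, l in pairs if p and l)
    fp = sum(1 for p, l in pairs if p and not l)
    fn = sum(1 for p, l in pairs if not p and l)
    tn = sum(1 for p, l in pairs if not p and not l)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

On a toy set with five positive and five negative questions, a model that answers "yes" only once (correctly) still reaches 0.6 accuracy, but its recall is 0.2 and its F1 is about 0.33. A strategy that avoids positive answers is therefore penalized much more heavily by F1 than by Accuracy.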

### C.6 Additional MME Results

As shown in Figure [18](https://arxiv.org/html/2606.27596#A3.F18), on Shikra, Fox still achieves an overall improvement on the MME benchmark, increasing the total score from 430.00 to 446.63, which validates the effectiveness of our structural intervention under this backbone. Specifically, Fox yields stable gains on object-level dimensions, including Existence ($180 \to 195$) and Count ($75 \to 90$), and also brings a slight improvement on Color ($96.67 \to 98.33$). In contrast, Position exhibits a noticeable drop ($78.33 \to 63.33$), suggesting that Shikra is more sensitive to attention-structure changes for spatial attribute verification, and the benefits of structural intervention can vary across fine-grained dimensions.

![Refer to caption](https://arxiv.org/html/2606.27596v1/x21.png)

Figure 18: Performance on the MME benchmark. Higher scores indicate better effectiveness.

### C.7 Results under Greedy Decoding

To rule out potential confounding effects introduced by stochastic decoding \(e\.g\., sampling temperature or nucleus sampling\), we further evaluate all methods under greedy decoding on LLaVA\-1\.5\. In this setting, the next token is selected deterministically by maximizing the conditional probability at each step, thereby eliminating randomness from the generation process\.
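For reference, greedy decoding as described here reduces to an argmax at every step. The following sketch assumes a callable `next_token_logits` that maps the current token ids to a vector of next-token logits. That callable, the function name `greedy_decode`, and the optional `eos_id` are hypothetical stand-ins for a real model interface.

```python
import numpy as np

def greedy_decode(next_token_logits, prompt_ids, max_new_tokens, eos_id=None):
    """Deterministically extend `prompt_ids` by taking the argmax token each step."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = np.asarray(next_token_logits(ids))
        token = int(np.argmax(logits))  # no temperature, no sampling
        ids.append(token)
        if eos_id is not None and token == eos_id:
            break
    return ids
```

Because no random number generator is involved, two runs with the same model and prompt produce the same sequence, which is what makes this setting usable as a control for sampling effects.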

Table 7: Greedy decoding results on POPE and CHAIR. We report split-wise POPE Accuracy and F1 on Random/Popular/Adversarial, together with CHAIR$_S$ and CHAIR$_I$. Since greedy decoding is used as a diagnostic control to eliminate sampling stochasticity, we omit averaged scores and focus on split-level robustness across difficulty regimes.

As shown in Table [7](https://arxiv.org/html/2606.27596#A3.T7), our method remains consistently strong in the fully deterministic setting. In particular, it achieves the best performance on the challenging POPE-Adversarial split and simultaneously yields the lowest CHAIR$_S$ and CHAIR$_I$, indicating fewer hallucinated objects at both the sentence and instance levels. These improvements persist without any sampling variance, suggesting that the gains are not attributable to temperature tuning or favorable sampling randomness, but stem from the proposed structural intervention.

Taken together, these results support the conclusion that our improvements stem from *structural intervention on high-risk mediators* rather than from stochastic decoding effects. Even in the absence of sampling diversity, our method suppresses hallucination while maintaining semantic coverage, as reflected by improvements on both POPE and CHAIR metrics.
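As a reminder of what the sentence-level and instance-level CHAIR scores measure, the sketch below computes both from pre-extracted object mentions. It follows the standard definitions: the instance-level score is the fraction of mentioned objects that are not in the image, and the sentence-level score is the fraction of captions containing at least one such object. The real metric also maps words to COCO categories via synonym lists, which is omitted here. The function name `chair_scores` and the toy inputs are illustrative only.

```python
def chair_scores(mentioned_per_caption, truth_per_image):
    """Return (chair_s, chair_i) given object mentions and ground-truth object sets."""
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, truth in zip(mentioned_per_caption, truth_per_image):
        bad = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(bad)
        if bad:
            hallucinated_captions += 1
    num_captions = len(mentioned_per_caption)
    chair_s = hallucinated_captions / num_captions if num_captions else 0.0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    return chair_s, chair_i
```

For three toy captions mentioning `["dog", "frisbee"]`, `["cat", "sofa", "remote"]`, and `["car"]` against ground truths `{"dog", "frisbee"}`, `{"cat", "sofa"}`, and `{"bus"}`, two of three captions contain a hallucinated object and two of six mentions are hallucinated.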

### C.8 Mitigating Generation Degradation via Conflict-Gated Cooperation

As demonstrated in our previous analysis, the structural intervention on risky mediators effectively severs the shortcut path $\mathbf{X}_{sys} \to H_{R} \to Y_{t}$, thereby promoting visual grounding. However, as illustrated in Figure [19](https://arxiv.org/html/2606.27596#A3.F19), this “de-priorization” process can lead to unintended consequences. While the interventional branch $P_{do}$ achieves high factual accuracy by strictly relying on stable visual evidence, it simultaneously compresses the model’s linguistic expressive capacity, occasionally resulting in repetitive patterns or a lack of semantic richness, a common challenge for attention-based suppression methods.

##### Rationale for JSD\-based Cooperative Decoding\.

To address this, we employ a Conflict-Gated Cooperative Decoding strategy (Insight III) based on Jensen-Shannon Divergence ($d_{t}$). The rationale for using JSD as the gating mechanism is twofold:

- Real-time Conflict Detection: $d_{t}$ serves as a sensitive probe to detect when the interventional branch significantly deviates from the observational manifold. High divergence suggests that the model is at a critical decision point where the intervention might be over-suppressing necessary linguistic context.
- Adaptive Manifold Re-injection: By utilizing JSD to regulate the fusion, we effectively re-inject the “high-precision anchor” ($P_{do}$) into the “high-fluency manifold” ($P_{obs}$). As shown in Figure [19](https://arxiv.org/html/2606.27596#A3.F19), compared to pure attention intervention which suffers from context loss and repetition (red text), our Entropy-Guided Causal Decoding (Fox) achieves a superior balance, maintaining factual accuracy (green text) while preserving the natural flow and diversity of the generated response.
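The gating idea in the two points above can be sketched as follows. The Jensen-Shannon divergence is the standard definition (natural logarithm, bounded by $\ln 2$). The gate direction is chosen to match this appendix's description that a larger threshold makes decoding follow the intervention-dominant path more often. The linear mixture used above the threshold, the `mix` weight, and the function names are placeholders for illustration, not the fusion rule used by Fox.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        support = a > 0  # terms with a == 0 contribute 0 by convention
        return float(np.sum(a[support] * np.log(a[support] / b[support])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gated_step(p_do, p_obs, tau_js, mix=0.5):
    """Return (next-token distribution, d_t) for one decoding step.

    Low conflict (d_t <= tau_js): follow the interventional branch.
    High conflict: blend both branches (hypothetical fusion rule).
    """
    p_do = np.asarray(p_do, dtype=float)
    p_obs = np.asarray(p_obs, dtype=float)
    p_do = p_do / p_do.sum()
    p_obs = p_obs / p_obs.sum()
    d_t = jsd(p_do, p_obs)
    if d_t <= tau_js:
        return p_do, d_t
    fused = mix * p_do + (1.0 - mix) * p_obs
    return fused / fused.sum(), d_t
```

In this sketch a threshold of zero blends at almost every step, while a threshold at or above $\ln 2$ always follows the interventional branch, which mirrors the two extremes discussed for the JSD threshold.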

![Refer to caption](https://arxiv.org/html/2606.27596v1/x22.png)

Figure 19: The role of JSD-based conflict gating. Without JSD gating, always applying the intervention leads to severe generation degradation due to excessive context suppression. Conversely, an overly large JSD threshold biases the decoding toward an overly conservative regime, reducing semantic coverage. A moderate JSD threshold enables adaptive cooperation between the interventional and observational branches, achieving a balanced trade-off between factual reliability and generation quality.

### C.9 Additional Performance

We present additional qualitative examples to showcase the practical performance of Fox in reducing hallucinations across different LVLM backbones. As shown in Fig. [20](https://arxiv.org/html/2606.27596#A3.F20), Fox effectively suppresses prior-driven hallucinated attributes and objects on LLaVA-1.5 while preserving visually grounded details. Fig. [21](https://arxiv.org/html/2606.27596#A3.F21) further demonstrates that the proposed intervention generalizes to InstructBLIP. Finally, Fig. [22](https://arxiv.org/html/2606.27596#A3.F22) reports additional cases on InstructBLIP and Shikra, where Fox consistently improves visual grounding under diverse scenes and object configurations. These examples complement the quantitative results in the main paper by providing intuitive evidence of cross-backbone robustness.

![Refer to caption](https://arxiv.org/html/2606.27596v1/x23.png)

Figure 20: Fox’s performance on reducing hallucinations of LLaVA-1.5.

![Refer to caption](https://arxiv.org/html/2606.27596v1/x24.png)

Figure 21: Fox’s performance on reducing hallucinations of InstructBLIP.

![Refer to caption](https://arxiv.org/html/2606.27596v1/x25.png)

Figure 22: Fox’s performance on reducing hallucinations of Shikra.
